Suraj Deshmukh | सुरज देशमुख
@suraj.io
📤 211
📥 301
📝 167
@Microsoft.com
| ex-@kinvolkio ex-@RedHat | bibliophile | He/Him | Opinions are my own. 🟥 🟩 🟦 🟨
Really cool tool to visualize what tokens per second really look like at different speeds, from 0.05 tok/s to 2000 tok/s. Above 800 tok/s you can't really tell the difference; it all feels the same!
mikeveerman.github.io/tokenspeed/?...
tokenspeed — feel LLM tokens-per-second
https://mikeveerman.github.io/tokenspeed/?rate=30&mode=code
2 days ago
0
0
0
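To put numbers on the speeds mentioned in the post above, here is a minimal sketch that converts a tokens-per-second rate into the gap between consecutive tokens. The function names `seconds_per_token` and `time_to_stream` are hypothetical and are not taken from the tokenspeed tool. At 800 tok/s the gap is 1.25 ms and at 2000 tok/s it is 0.5 ms, which may be part of why the two feel alike.

```python
def seconds_per_token(rate_tok_per_s: float) -> float:
    """Return the delay between consecutive tokens at the given rate."""
    if rate_tok_per_s <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate_tok_per_s


def time_to_stream(num_tokens: int, rate_tok_per_s: float) -> float:
    """Return the total seconds needed to stream num_tokens at the given rate."""
    return num_tokens * seconds_per_token(rate_tok_per_s)


# Rates taken from the post: slowest, a typical chat speed, and the two fast ones.
for rate in (0.05, 30, 800, 2000):
    gap_ms = seconds_per_token(rate) * 1000
    print(f"{rate:>7} tok/s -> {gap_ms:.2f} ms per token")
```

For example, `time_to_stream(300, 30)` gives the time a 300-token answer takes at 30 tok/s.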
This post highlights a critical issue with rapid, AI-driven development in team projects: long-term technical debt. If AI is helping you write code twice as fast, it's also doubling the amount of code you have to maintain.
James Shore: You Need AI That Reduces Maintenance Costs
https://www.jamesshore.com/v2/blog/2026/you-need-ai-that-reduces-your-maintenance-costs
9 days ago
1
0
0
NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
developer.nvidia.com/blog/nvidia-...
NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes | NVIDIA Technical Blog
In production inference deployments, demand fluctuates over time, requiring inference replicas to scale elastically. However, cold-starting inference workloads on Kubernetes can take several minutes.
https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/
10 days ago
0
2
0
1/n We need to start talking about "KV Cache Engineering." The efficiency of any LLM serving system hinges on how it manages the KV cache—where to place it, how to discover it, and how long to keep it alive. Yet, most inference systems out there don't give clients the control they need.
10 days ago
1
0
0
Using Claude Code: The unreasonable effectiveness of HTML How and why members of the Claude Code team use HTML instead of Markdown to produce richer, more readable, and easily shareable outputs.
claude.com/blog/using-c...
Using Claude Code: The unreasonable effectiveness of HTML | Claude
How and why members of the Claude Code team use HTML instead of Markdown to produce richer, more readable, and easily shareable outputs.
https://claude.com/blog/using-claude-code-the-unreasonable-effectiveness-of-html
22 days ago
0
0
0
Is it OK to treat Claude-generated code as if it were written by another team you rely on? What if you trust it too much because it worked in the past, but then it bites you some day in the future?
open.substack.com/pub/simonw/p...
Vibe coding and agentic engineering are getting closer than I’d like
Plus updates from Anthropic's Code w/ Claude conference
https://open.substack.com/pub/simonw/p/vibe-coding-and-agentic-engineering?r=ax4pb&utm_medium=ios
about 1 month ago
1
0
0
Disaggregated Inference: 18 Months Later
haoailab.com/blogs/distse...
Disaggregated Inference: 18 Months Later
Eighteen months ago, our lab introduced DistServe with a simple bet: split LLM inference into prefill and decode, and scale them independently on separate compute pools. Today, almost every production...
https://haoailab.com/blogs/distserve-retro/
about 2 months ago
0
1
0
Official™️ Claude Code skill from Anthropic that creates Claude Code skills:
github.com/anthropics/s...
skills/skills/skill-creator at main · anthropics/skills
Public repository for Agent Skills. Contribute to anthropics/skills development by creating an account on GitHub.
https://github.com/anthropics/skills/tree/main/skills/skill-creator
3 months ago
1
2
1
1/n I just spent time reading
@simonwillison.net
’s guide on agentic engineering patterns and it shifted how I think about coding with AI. The mindset isn’t “let the AI figure it out” — that’s vibe-coding™️.
3 months ago
2
1
0
Living dangerously with Claude
simonwillison.net/2025/Oct/22/...
Living dangerously with Claude
I gave a talk last night at Claude Code Anonymous in San Francisco, the unofficial meetup for coding agent enthusiasts. I decided to talk about a dichotomy I’ve been struggling …
https://simonwillison.net/2025/Oct/22/living-dangerously-with-claude/
3 months ago
0
0
0
GitHub has a recommendation on doing dotfiles:
dotfiles.github.io
GitHub does dotfiles - dotfiles.github.io
https://dotfiles.github.io
4 months ago
0
0
0
I just published a new guide on configuring
#OpenClaw
🦀 to run with
#Azure
AI Foundry models. You control your data, so you get more privacy, and you can talk to it from
#Telegram
or using the console! Check it out here:
suraj.io/post/2026/op...
Setting Up OpenClaw with Azure AI Foundry
Learn how to configure OpenClaw to use Azure AI Foundry models, giving you a self-hosted AI assistant accessible from Telegram and other chat apps.
https://suraj.io/post/2026/openclaw-with-azure/
4 months ago
0
2
0
Apple has a new native container CLI for macOS! Run Linux containers without Docker Desktop—with sub-second startup times. 🚀 My guide covers setup, resource limits, and fixing macOS firewall blocks: 🔗
suraj.io/post/2026/us...
#macOS
#Containers
Running Linux Containers Natively on macOS with Apple's Container CLI
Learn how to use Apple's container CLI tool to run Linux containers as lightweight VMs on macOS with sub-second startup times
https://suraj.io/post/2026/using-osx-containerization/
4 months ago
0
2
0
1/n 📚 Made something for fellow book nerds using OpenClaw: a Goodreads skill that lets your AI agent search for books, pull up details & reviews, get personalized recommendations, and manage your reading lists, all through natural language.
goodreads — ClawHub
Search for books, get book details and reviews, discover personalized recommendations, and manage reading lists on Goodreads — all through browser automation.
https://clawhub.ai/surajssd/goodreads
4 months ago
1
1
0
Deploying
#Kimi
K2.5 on
#Azure
: A Complete Guide to Running MoonshotAI's Model
suraj.io/post/2026/de...
Deploying Kimi K2.5 on Azure: A Complete Guide to Running MoonshotAI's Model
Learn how to deploy and configure Kimi K2.5 on Azure AI Foundry with this step-by-step guide.
https://suraj.io/post/2026/deploying-kimi-k2-on-azure/
4 months ago
0
0
0
Running Pydantic’s Monty Rust sandboxed Python subset in WebAssembly
simonwillison.net/2026/Feb/6/p...
Running Pydantic’s Monty Rust sandboxed Python subset in WebAssembly
There’s a jargon-filled headline for you! Everyone’s building sandboxes for running untrusted code right now, and Pydantic’s latest attempt, Monty, provides a custom Python-like language (a subset of ...
https://simonwillison.net/2026/Feb/6/pydantic-monty/
4 months ago
0
2
2
Thanks to
@scott.hanselman.com
for showing me Handy (
handy.computer
) — a free, open-source speech-to-text tool that runs locally on your machine. Push-to-talk, privacy-focused, and just works. Check it out!
Handy
Handy is a cross platform, open-source, speech-to-text application for your computer
https://handy.computer
4 months ago
2
42
13
Running Docker Commands on a Remote Machine via SSH
suraj.io/post/2026/re...
#docker
#ssh
#remote
#containers
#cli
#development
#devops
Running Docker Commands on a Remote Machine via SSH
Learn how to execute Docker commands on a remote machine from your local terminal using SSH and Docker contexts
https://suraj.io/post/2026/remote-machine-as-docker-runner/
4 months ago
0
0
0
Using Claude Code with GitHub-Hosted Anthropic Models
suraj.io/post/2026/us...
#claude
#github-models
#ai
#litellm
#anthropic
Using Claude Code with GitHub-Hosted Anthropic Models
Learn how to use Claude Code CLI with GitHub Models by proxying requests through litellm-proxy
https://suraj.io/post/2026/use-claude-code-with-gh-models/
4 months ago
0
0
0
Meta’s Kubernetes-based Portable AI Research Environment
youtu.be/ts7bI51gRCo?...
Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/ts7bI51gRCo?si=TkhhTSBAHp6jmbU8
7 months ago
0
1
0
Our talk (me & Yuhan Liu) on improving LLM serving efficiency is on YouTube now!
youtu.be/2YCDvZokqnk?...
#vllm
#kubernetes
#kubecon
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/2YCDvZokqnk?si=of1DG2k5dBIBN0I2
7 months ago
0
3
0
Infinite scale: The architecture behind the Azure AI superfactory
blogs.microsoft.com/blog/2025/11...
Infinite scale: The architecture behind the Azure AI superfactory - The Official Microsoft Blog
Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of A...
https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/
7 months ago
0
2
0
Gemini 3, OpenAI KV cache, and much more
open.substack.com/pub/simonw/p...
Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark
Plus what happens if AI labs train for pelicans riding bicycles?
https://open.substack.com/pub/simonw/p/trying-out-gemini-3-pro-with-audio?r=ax4pb&utm_medium=ios
7 months ago
0
1
0
OpenAI shared some details, from the user's point of view, about which KV cache features are available:
platform.openai.com/docs/guides/...
It is interesting to see that they keep caches for 10 minutes and, if no request arrives in that time, remove the hot caches from the GPU.
OpenAI Platform
Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.
https://platform.openai.com/docs/guides/prompt-caching
7 months ago
1
1
0
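The eviction behaviour described in the post above (keep a cache entry for 10 minutes, drop it if no request touches it) can be illustrated with a small idle-timeout cache. This is only a sketch of the general idea, not OpenAI's implementation: the class `IdleTTLCache`, its methods, and the injectable `clock` are all hypothetical names, and the 600-second default simply mirrors the 10 minutes mentioned in the post.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class IdleTTLCache:
    """Toy cache that evicts entries not accessed within ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,  # assumption: 10 minutes, as in the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (value, time of last access)
        self._entries: Dict[Hashable, Tuple[Any, float]] = {}

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value and record the current time as its last access."""
        self._entries[key] = (value, self._clock())

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the value if it is still warm, refreshing its idle timer."""
        self.evict_idle()
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, _ = entry
        self._entries[key] = (value, self._clock())
        return value

    def evict_idle(self) -> int:
        """Drop entries idle for ttl_seconds or longer; return how many."""
        now = self._clock()
        stale = [k for k, (_, last) in self._entries.items() if now - last >= self._ttl]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Passing a fake `clock` makes the idle window easy to exercise without waiting: each `get` inside the window resets the timer, and the first `get` after a full idle window returns `None`.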
From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory
news.microsoft.com/source/featu...
Microsoft AI superfactory
Microsoft unveiled its second Fairwater AI datacenter in Atlanta as part of a new AI superfactory working across states in nearly real time.
https://news.microsoft.com/source/features/ai/from-wisconsin-to-atlanta-microsoft-connects-datacenters-to-build-its-first-ai-superfactory/
7 months ago
0
0
0
Satya Nadella – How Microsoft thinks about AGI
youtu.be/8-boBsWcr5A?...
Satya Nadella – How Microsoft thinks about AGI
YouTube video by Dwarkesh Patel
https://youtu.be/8-boBsWcr5A?si=L15eJ4kpSqJSY4MJ
7 months ago
0
0
0
How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte Scale
www.youtube.com/watch?v=pbOv...
Keynote: How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte... F. Ponce
YouTube video by CNCF [Cloud Native Computing Foundation]
https://www.youtube.com/watch?v=pbOvWxuYPIU
7 months ago
0
0
0
Come see us (me & Yuhan Liu) tomorrow for our talk. Specifically, Wednesday November 12, 2025 5:30pm - 6:00pm EST at Building B | Level 5 | Thomas Murphy Ballroom 1. More info:
sched.co/27FcQ
#kubecon
#vllm
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
7 months ago
0
0
0
Announcing Ray Direct Transport: RDMA Support in Ray Core
www.anyscale.com/blog/ray-dir...
Ray Direct Transport: RDMA Support in Ray Core (Part 1)
Ray Direct Transport enables fast and direct GPU transfers in Ray via RDMA-backed transports. Using RDT, we can achieve up to 1000x faster GPU-GPU transfers than Ray’s native object store with a few l...
https://www.anyscale.com/blog/ray-direct-transport-rdma-support-in-ray-core
7 months ago
0
1
0
Building a tool to copy-paste share terminal sessions using Claude Code for web
open.substack.com/pub/simonw/p...
Building a tool to copy-paste share terminal sessions using Claude Code for web
Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas
https://open.substack.com/pub/simonw/p/building-a-tool-to-copy-paste-share?utm_campaign=post&utm_medium=email
8 months ago
0
2
0
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
arxiv.org/abs/2510.09665
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...
https://arxiv.org/abs/2510.09665
8 months ago
0
1
0
Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog
developer.nvidia.com/blog/underst...
Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog
If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware…
https://developer.nvidia.com/blog/understanding-memory-management-on-hardware-coherent-platforms/
8 months ago
0
1
0
Join me and Yuhan Liu for our talk at the upcoming
#Kubecon
NA 2025 in Atlanta:
sched.co/27FcQ
We will talk about increasing efficiency while serving
#LLMs
using
#vLLM
&
#LMCache
!
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
8 months ago
0
1
0
Using Claude Code, but with GitHub Copilot-hosted Claude models:
github.com/surajssd/dot...
TFS
@nilekh.bsky.social
https://github.com/surajssd/dotfiles/blob/master/local-bin/litellm-proxy.sh
8 months ago
0
1
0
NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog
developer.nvidia.com/blog/nvidia-...
NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog
SemiAnalysis recently launched InferenceMAX v1, a new open source initiative that provides a comprehensive methodology to evaluate inference hardware performance. Published results demonstrate that…
https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/
8 months ago
0
0
0
Claude Code: Tips and Tricks
youtu.be/HSkLeECsBcw?...
Claude Code: Tips and Tricks
YouTube video by Anand Tyagi
https://youtu.be/HSkLeECsBcw?si=MjgHvnKZGmuFA7WQ
8 months ago
0
0
0
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
www.youtube.com/watch?v=4Bef...
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
YouTube video by @Scale
https://www.youtube.com/watch?v=4Beffz-HNsk
9 months ago
0
0
0
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
developer.nvidia.com/blog/how-to-...
#LMCache
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…
https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
9 months ago
0
0
0
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
www.infoq.com/articles/llm...
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving...
https://www.infoq.com/articles/llms-evolution-ai-infrastructure/
9 months ago
0
1
0
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
developer.nvidia.com/blog/cut-mod...
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…
https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
9 months ago
0
0
0
The Only Trait for Success in the AI Era—How to Build It
youtu.be/xWYb7tImErI?...
The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh
YouTube video by EO
https://youtu.be/xWYb7tImErI?si=JU6exneyjb7V724-
9 months ago
0
0
0
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving
youtu.be/WwJvecXOeUA?...
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...
YouTube video by USENIX
https://youtu.be/WwJvecXOeUA?si=pPBbxLak2QcQc5fh
10 months ago
0
0
0
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
youtu.be/S8rq3pYboZY?...
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
YouTube video by USENIX
https://youtu.be/S8rq3pYboZY?si=6_rmrSnAV3eGYao9
10 months ago
0
0
0
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation
youtu.be/YqIHESG0suI?...
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/YqIHESG0suI?si=MzauN6ZbtaELrj-N
10 months ago
0
1
0
Extending Kubernetes for AI | Lessons Learned From Platform Engineering
youtu.be/d9K5PSsHtDg?...
Extending Kubernetes for AI | Lessons Learned From Platform... - Susan, Lucy, Andrea, Etienne, Tim
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/d9K5PSsHtDg?si=XEBcpZXMfIm_fJqe
10 months ago
0
0
0
You Need to Be Bored. Here's Why.
www.youtube.com/watch?v=orQK...
You Need to Be Bored. Here's Why.
YouTube video by Harvard Business Review
https://www.youtube.com/watch?v=orQKfIXMiA8
10 months ago
0
1
0
You can use ChatGPT and other models on a flight using onboard free WiFi via WhatsApp. Use MetaAI out of the box or save these contacts:
- ChatGPT 1800 242 8478
- Microsoft Copilot +1 (877) 224-1042
10 months ago
0
0
0
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
developer.nvidia.com/blog/scaling...
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion | NVIDIA Technical Blog
The exponential growth in AI model complexity has driven parameter counts from millions to trillions, requiring unprecedented computational resources that require clusters of GPUs to accommodate.
https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion
10 months ago
0
0
0
Andrej Karpathy: Software Is Changing (Again)
youtu.be/LCEmiRjPEtQ?...
Andrej Karpathy: Software Is Changing (Again)
YouTube video by Y Combinator
https://youtu.be/LCEmiRjPEtQ?si=vafoLV7HtvyAZ2fX
10 months ago
0
0
0
Claude, Qwen and Google models
open.substack.com/pub/simonw/p...
Reverse engineering some updates to Claude
Plus Qwen 3 Coder Flash, Gemini Deep Think, kimi-k2-turbo-preview
https://open.substack.com/pub/simonw/p/reverse-engineering-some-updates?r=ax4pb
10 months ago
0
0
0