Suraj Deshmukh | सुरज देशमुख
@suraj.io
@Microsoft.com | ex-@kinvolkio ex-@RedHat | bibliophile | He/Him | Opinions are my own. 🟥 🟩 🟦 🟨
Disaggregated Inference: 18 Months Later
haoailab.com/blogs/distse...
Eighteen months ago, our lab introduced DistServe with a simple bet: split LLM inference into prefill and decode, and scale them independently on separate compute pools. Today, almost every production...
https://haoailab.com/blogs/distserve-retro/
3 days ago
Official™️ Claude Code skill from Anthropic that creates Claude Code skills:
github.com/anthropics/s...
skills/skills/skill-creator at main · anthropics/skills
Public repository for Agent Skills. Contribute to anthropics/skills development by creating an account on GitHub.
https://github.com/anthropics/skills/tree/main/skills/skill-creator
about 1 month ago
1/n I just spent time reading @simonwillison.net’s guide on agentic engineering patterns and it shifted how I think about coding with AI. The mindset isn’t “let the AI figure it out” — that’s vibe-coding™️.
about 1 month ago
Living dangerously with Claude
simonwillison.net/2025/Oct/22/...
I gave a talk last night at Claude Code Anonymous in San Francisco, the unofficial meetup for coding agent enthusiasts. I decided to talk about a dichotomy I’ve been struggling …
https://simonwillison.net/2025/Oct/22/living-dangerously-with-claude/
about 1 month ago
GitHub has a recommendation on managing dotfiles:
dotfiles.github.io
GitHub does dotfiles - dotfiles.github.io
https://dotfiles.github.io
2 months ago
I just published a new guide on configuring #OpenClaw 🦀 to run with #Azure AI Foundry models. You stay in control of your data, so more privacy, and you can talk to it from #Telegram or from the console! Check it out here:
suraj.io/post/2026/op...
Setting Up OpenClaw with Azure AI Foundry
Learn how to configure OpenClaw to use Azure AI Foundry models, giving you a self-hosted AI assistant accessible from Telegram and other chat apps.
https://suraj.io/post/2026/openclaw-with-azure/
2 months ago
Apple has a new native container CLI for macOS! Run Linux containers without Docker Desktop—with sub-second startup times. 🚀 My guide covers setup, resource limits, and fixing macOS firewall blocks: 🔗
suraj.io/post/2026/us...
#macOS #Containers
Running Linux Containers Natively on macOS with Apple's Container CLI
Learn how to use Apple's container CLI tool to run Linux containers as lightweight VMs on macOS with sub-second startup times
https://suraj.io/post/2026/using-osx-containerization/
2 months ago
1/n 📚 Made something for fellow book nerds using Openclaw: A Goodreads skill that lets your AI agent search for books, pull up details & reviews, get personalized recommendations, and manage your reading lists — all through natural language.
goodreads — ClawHub
Search for books, get book details and reviews, discover personalized recommendations, and manage reading lists on Goodreads — all through browser automation.
https://clawhub.ai/surajssd/goodreads
2 months ago
Deploying #Kimi K2.5 on #Azure: A Complete Guide to Running MoonshotAI's Model
suraj.io/post/2026/de...
Learn how to deploy and configure Kimi K2.5 on Azure AI Foundry with this step-by-step guide.
https://suraj.io/post/2026/deploying-kimi-k2-on-azure/
3 months ago
Running Pydantic’s Monty Rust sandboxed Python subset in WebAssembly
simonwillison.net/2026/Feb/6/p...
There’s a jargon-filled headline for you! Everyone’s building sandboxes for running untrusted code right now, and Pydantic’s latest attempt, Monty, provides a custom Python-like language (a subset of ...
https://simonwillison.net/2026/Feb/6/pydantic-monty/
3 months ago
Thanks to @scott.hanselman.com for showing me Handy (handy.computer) — a free, open-source speech-to-text tool that runs locally on your machine. Push-to-talk, privacy-focused, and just works. Check it out!
Handy
Handy is a cross platform, open-source, speech-to-text application for your computer
https://handy.computer
3 months ago
Running Docker Commands on a Remote Machine via SSH
suraj.io/post/2026/re...
#docker #ssh #remote #containers #cli #development #devops
Learn how to execute Docker commands on a remote machine from your local terminal using SSH and Docker contexts
https://suraj.io/post/2026/remote-machine-as-docker-runner/
3 months ago
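The linked post has the full walkthrough; the core idea is Docker contexts. A minimal sketch, assuming key-based SSH access to a host reachable as `user@remote-host` (both the context name `remote-box` and the host are placeholders):

```shell
# One-time setup: create a context whose Docker daemon lives on the remote host
docker context create remote-box --docker "host=ssh://user@remote-host"

# Run a single command against the remote daemon
docker --context remote-box ps

# Or make it the default, so every docker command runs remotely
docker context use remote-box
docker run --rm alpine echo "this ran on remote-host"
```

Switch back with `docker context use default` when you want your local daemon again.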
Using Claude Code with GitHub-Hosted Anthropic Models
suraj.io/post/2026/us...
#claude #github-models #ai #litellm #anthropic
Learn how to use Claude Code CLI with GitHub Models by proxying requests through litellm-proxy
https://suraj.io/post/2026/use-claude-code-with-gh-models/
3 months ago
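A rough sketch of the proxy idea from the linked guide, not the guide's exact setup: litellm fronts the GitHub-hosted model and Claude Code is pointed at the local endpoint. The model name, port, and token variable here are illustrative assumptions; see the post for the real configuration.

```shell
# Install litellm with proxy support and start it in front of GitHub Models
# (model name and port are illustrative placeholders)
pip install 'litellm[proxy]'
export GITHUB_API_KEY="<your GitHub Models token>"
litellm --model github/claude-3-5-sonnet --port 4000

# In another shell: point Claude Code at the local proxy instead of api.anthropic.com
export ANTHROPIC_BASE_URL="http://localhost:4000"
claude
```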
Meta’s Kubernetes-based Portable AI Research Environment
youtu.be/ts7bI51gRCo?...
Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/ts7bI51gRCo?si=TkhhTSBAHp6jmbU8
5 months ago
Our talk (me & Yuhan Liu) on improving LLM serving efficiency is on YouTube now!
youtu.be/2YCDvZokqnk?...
#vllm #kubernetes #kubecon
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/2YCDvZokqnk?si=of1DG2k5dBIBN0I2
5 months ago
Infinite scale: The architecture behind the Azure AI superfactory
blogs.microsoft.com/blog/2025/11...
Infinite scale: The architecture behind the Azure AI superfactory - The Official Microsoft Blog
Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of A...
https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/
5 months ago
Gemini 3, OpenAI KV cache, and much more
open.substack.com/pub/simonw/p...
Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark
Plus what happens if AI labs train for pelicans riding bicycles?
https://open.substack.com/pub/simonw/p/trying-out-gemini-3-pro-with-audio?r=ax4pb&utm_medium=ios
5 months ago
OpenAI shared some details, from the user's point of view, of the KV cache features available:
platform.openai.com/docs/guides/...
It is interesting to see that they cache for 10 minutes, and if no matching request arrives in that window they evict the hot cache from the GPU.
OpenAI Platform
Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.
https://platform.openai.com/docs/guides/prompt-caching
5 months ago
From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory
news.microsoft.com/source/featu...
Microsoft AI superfactory
Microsoft unveiled its second Fairwater AI datacenter in Atlanta as part of a new AI superfactory working across states in nearly real time.
https://news.microsoft.com/source/features/ai/from-wisconsin-to-atlanta-microsoft-connects-datacenters-to-build-its-first-ai-superfactory/
5 months ago
Satya Nadella – How Microsoft thinks about AGI
youtu.be/8-boBsWcr5A?...
YouTube video by Dwarkesh Patel
https://youtu.be/8-boBsWcr5A?si=L15eJ4kpSqJSY4MJ
5 months ago
How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte Scale
www.youtube.com/watch?v=pbOv...
Keynote: How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte... F. Ponce
YouTube video by CNCF [Cloud Native Computing Foundation]
https://www.youtube.com/watch?v=pbOvWxuYPIU
5 months ago
Come see us (me & Yuhan Liu) tomorrow for our talk. Specifically, Wednesday November 12, 2025 5:30pm - 6:00pm EST at Building B | Level 5 | Thomas Murphy Ballroom 1. More info:
sched.co/27FcQ
#kubecon #vllm
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
6 months ago
Announcing Ray Direct Transport: RDMA Support in Ray Core
www.anyscale.com/blog/ray-dir...
Ray Direct Transport: RDMA Support in Ray Core (Part 1)
Ray Direct Transport enables fast and direct GPU transfers in Ray via RDMA-backed transports. Using RDT, we can achieve up to 1000x faster GPU-GPU transfers than Ray’s native object store with a few l...
https://www.anyscale.com/blog/ray-direct-transport-rdma-support-in-ray-core
6 months ago
Building a tool to copy-paste share terminal sessions using Claude Code for web
open.substack.com/pub/simonw/p...
Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas
https://open.substack.com/pub/simonw/p/building-a-tool-to-copy-paste-share?utm_campaign=post&utm_medium=email
6 months ago
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
arxiv.org/abs/2510.09665
Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...
https://arxiv.org/abs/2510.09665
6 months ago
Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog
developer.nvidia.com/blog/underst...
If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware…
https://developer.nvidia.com/blog/understanding-memory-management-on-hardware-coherent-platforms/
6 months ago
Join me and Yuhan Liu for our talk at the upcoming #KubeCon NA 2025 in Atlanta:
sched.co/27FcQ
We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
7 months ago
Using Claude Code but with GitHub Copilot-hosted Claude models:
github.com/surajssd/dot...
TFS @nilekh.bsky.social
https://github.com/surajssd/dotfiles/blob/master/local-bin/litellm-proxy.sh
7 months ago
NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog
developer.nvidia.com/blog/nvidia-...
SemiAnalysis recently launched InferenceMAX v1, a new open source initiative that provides a comprehensive methodology to evaluate inference hardware performance. Published results demonstrate that…
https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/
7 months ago
Claude Code: Tips and Tricks
youtu.be/HSkLeECsBcw?...
YouTube video by Anand Tyagi
https://youtu.be/HSkLeECsBcw?si=MjgHvnKZGmuFA7WQ
7 months ago
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
www.youtube.com/watch?v=4Bef...
YouTube video by @Scale
https://www.youtube.com/watch?v=4Beffz-HNsk
7 months ago
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
developer.nvidia.com/blog/how-to-...
#LMCache
As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…
https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
7 months ago
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
www.infoq.com/articles/llm...
Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving...
https://www.infoq.com/articles/llms-evolution-ai-infrastructure/
7 months ago
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
developer.nvidia.com/blog/cut-mod...
Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…
https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
7 months ago
The Only Trait for Success in the AI Era—How to Build It
youtu.be/xWYb7tImErI?...
The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh
YouTube video by EO
https://youtu.be/xWYb7tImErI?si=JU6exneyjb7V724-
8 months ago
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving
youtu.be/WwJvecXOeUA?...
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...
YouTube video by USENIX
https://youtu.be/WwJvecXOeUA?si=pPBbxLak2QcQc5fh
8 months ago
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
youtu.be/S8rq3pYboZY?...
YouTube video by USENIX
https://youtu.be/S8rq3pYboZY?si=6_rmrSnAV3eGYao9
8 months ago
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation
youtu.be/YqIHESG0suI?...
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/YqIHESG0suI?si=MzauN6ZbtaELrj-N
8 months ago
Extending Kubernetes for AI | Lessons Learned From Platform Engineering
youtu.be/d9K5PSsHtDg?...
Extending Kubernetes for AI | Lessons Learned From Platform... - Susan, Lucy, Andrea, Etienne, Tim
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/d9K5PSsHtDg?si=XEBcpZXMfIm_fJqe
8 months ago
You Need to Be Bored. Here's Why.
www.youtube.com/watch?v=orQK...
YouTube video by Harvard Business Review
https://www.youtube.com/watch?v=orQKfIXMiA8
8 months ago
You can use ChatGPT and other models on a flight via WhatsApp over the onboard free WiFi. Use Meta AI out of the box, or save these contacts:
- ChatGPT: 1800 242 8478
- Microsoft Copilot: +1 (877) 224-1042
8 months ago
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
developer.nvidia.com/blog/scaling...
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion | NVIDIA Technical Blog
The exponential growth in AI model complexity has driven parameter counts from millions to trillions, requiring unprecedented computational resources that require clusters of GPUs to accommodate.
https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion
8 months ago
Andrej Karpathy: Software Is Changing (Again)
youtu.be/LCEmiRjPEtQ?...
YouTube video by Y Combinator
https://youtu.be/LCEmiRjPEtQ?si=vafoLV7HtvyAZ2fX
8 months ago
Claude, Qwen and Google models
open.substack.com/pub/simonw/p...
Reverse engineering some updates to Claude
Plus Qwen 3 Coder Flash, Gemini Deep Think, kimi-k2-turbo-preview
https://open.substack.com/pub/simonw/p/reverse-engineering-some-updates?r=ax4pb
9 months ago
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference
www.youtube.com/live/Bh-jlh5...
YouTube video by PyTorch
https://www.youtube.com/live/Bh-jlh5vlF0?si=AW68CxMCp70y2JMD
9 months ago
The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking
arxiv.org/html/2506.23...
https://arxiv.org/html/2506.23628v1
9 months ago
This is a handy database for looking up model pricing, supported input types, and context window sizes:
models.dev
Models.dev — An open-source database of AI models
Models.dev is a comprehensive open-source database of AI model specifications, pricing, and features.
https://models.dev/
9 months ago
Using LLMs to write meaningful commit messages from CLI
suraj.io/post/2025/ll...
Using LLMs to write meaningful commit messages
Learn how to use the llm CLI tool with GitHub Copilot models to generate meaningful commit messages directly from your terminal.
https://suraj.io/post/2025/llm-commit-messages/
10 months ago
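The gist of the linked post fits in one pipeline, assuming the `llm` CLI is installed and configured with a model (the prompt wording and the `aicommit` alias name are my own, not from the post):

```shell
# Pipe the staged diff into llm with a system prompt asking for a commit message
git diff --staged | llm --system "Write a concise, conventional commit message for this diff"

# Or wire it up as a git alias (hypothetical name "aicommit")
git config --global alias.aicommit \
  '!git diff --staged | llm --system "Write a concise commit message for this diff"'
```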
Reverse engineering Claude Code:
simonwillison.net/2025/Jun/2/c...
claude-trace
I've been thinking for a while it would be interesting to run some kind of HTTP proxy against the Claude Code CLI app and take a peek at how it …
https://simonwillison.net/2025/Jun/2/claude-trace/
10 months ago
A Model Context Protocol (MCP) server that provides browser automation capabilities using Playwright. This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models.
github.com/microsoft/pl...
GitHub - microsoft/playwright-mcp: Playwright MCP server
Playwright MCP server. Contribute to microsoft/playwright-mcp development by creating an account on GitHub.
https://github.com/microsoft/playwright-mcp
10 months ago