Suraj Deshmukh | सुरज देशमुख
@suraj.io
📤 202 · 📥 291 · 📝 141
@Microsoft.com | ex-@kinvolkio ex-@RedHat | bibliophile | He/Him | Opinions are my own. 🟥 🟩 🟦 🟨
Meta’s Kubernetes-based Portable AI Research Environment
Meta’s Kubernetes-based Portable AI Research Environment - Shaun Hopper, Meta & Navarre Pratt
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/ts7bI51gRCo?si=TkhhTSBAHp6jmbU8
20 days ago
Our talk (me & Yuhan Liu) on improving LLM serving efficiency is on YouTube now!
#vllm
#kubernetes
#kubecon
LLMs on Kubernetes: Squeeze 5x GPU Efficiency With Cache, Route, Repea... Yuhan Liu & Suraj Deshmukh
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/2YCDvZokqnk?si=of1DG2k5dBIBN0I2
20 days ago
Infinite scale: The architecture behind the Azure AI superfactory
Infinite scale: The architecture behind the Azure AI superfactory - The Official Microsoft Blog
Today, we are unveiling the next Fairwater site of Azure AI datacenters in Atlanta, Georgia. This purpose-built datacenter is connected to our first Fairwater site in Wisconsin, prior generations of A...
https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/
26 days ago
Gemini 3, OpenAI KV cache, and much more
Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark
Plus what happens if AI labs train for pelicans riding bicycles?
https://open.substack.com/pub/simonw/p/trying-out-gemini-3-pro-with-audio?r=ax4pb&utm_medium=ios
26 days ago
OpenAI has shared some details, from the user's point of view, about which KV cache features are available. It is interesting that they cache prompts for about 10 minutes, and if no further request arrives they evict the hot caches from GPU memory.
OpenAI Platform
Explore developer resources, tutorials, API docs, and dynamic examples to get the most out of OpenAI's platform.
https://platform.openai.com/docs/guides/prompt-caching
26 days ago
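Per OpenAI's prompt caching guide, caching kicks in automatically for long prompts, and the response's `usage.prompt_tokens_details.cached_tokens` field reports how much of the prompt was served from cache. A minimal sketch for measuring your cache hit rate from a response payload (the helper name and the example numbers are my own illustration):

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens served from OpenAI's prompt cache.

    `usage` is the usage object from a chat completion response, e.g.:
    {"prompt_tokens": 2006, "prompt_tokens_details": {"cached_tokens": 1920}}
    """
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Illustrative usage payload (numbers are made up):
usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    "prompt_tokens_details": {"cached_tokens": 1920},
}
print(f"{cached_fraction(usage):.0%} of the prompt hit the cache")
```

A low fraction on repeated requests usually means the static part of the prompt isn't a stable prefix, or requests are spaced out longer than the cache lifetime.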
From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory
Microsoft AI superfactory
Microsoft unveiled its second Fairwater AI datacenter in Atlanta as part of a new AI superfactory working across states in nearly real time.
https://news.microsoft.com/source/features/ai/from-wisconsin-to-atlanta-microsoft-connects-datacenters-to-build-its-first-ai-superfactory/
27 days ago
Satya Nadella – How Microsoft thinks about AGI
Satya Nadella – How Microsoft thinks about AGI
YouTube video by Dwarkesh Patel
https://youtu.be/8-boBsWcr5A?si=L15eJ4kpSqJSY4MJ
about 1 month ago
How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte Scale
Keynote: How One Line of Code Freed 30,000 CPU Cores: Deep-Diving Fluent Bit at Petabyte... F. Ponce
YouTube video by CNCF [Cloud Native Computing Foundation]
https://www.youtube.com/watch?v=pbOvWxuYPIU
about 1 month ago
Come see us (me & Yuhan Liu) tomorrow for our talk: Wednesday, November 12, 2025, 5:30-6:00pm EST, at Building B | Level 5 | Thomas Murphy Ballroom 1. More info:
sched.co/27FcQ
#kubecon
#vllm
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
about 1 month ago
Announcing Ray Direct Transport: RDMA Support in Ray Core
Ray Direct Transport: RDMA Support in Ray Core (Part 1)
Ray Direct Transport enables fast and direct GPU transfers in Ray via RDMA-backed transports. Using RDT, we can achieve up to 1000x faster GPU-GPU transfers than Ray’s native object store with a few l...
https://www.anyscale.com/blog/ray-direct-transport-rdma-support-in-ray-core
about 1 month ago
Building a tool to copy-paste share terminal sessions using Claude Code for web
Building a tool to copy-paste share terminal sessions using Claude Code for web
Plus Living dangerously with Claude, and prompt injection risks for ChatGPT Atlas
https://open.substack.com/pub/simonw/p/building-a-tool-to-copy-paste-share?utm_campaign=post&utm_medium=email
about 2 months ago
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant compu...
https://arxiv.org/abs/2510.09665
about 2 months ago
Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog
Understanding Memory Management on Hardware-Coherent Platforms | NVIDIA Technical Blog
If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware…
https://developer.nvidia.com/blog/understanding-memory-management-on-hardware-coherent-platforms/
about 2 months ago
Join me and Yuhan Liu for our talk at the upcoming #Kubecon NA 2025 in Atlanta: sched.co/27FcQ. We will talk about increasing efficiency while serving #LLMs using #vLLM & #LMCache!
KubeCon + CloudNativeCon North America 2025: LLMs on Kubernetes: Squeeze 5x GPU Effic...
View more about this event at KubeCon + CloudNativeCon North America 2025
https://sched.co/27FcQ
2 months ago
Using Claude Code, but with GitHub Copilot-hosted Claude models:
TFS
@nilekh.bsky.social
https://github.com/surajssd/dotfiles/blob/master/local-bin/litellm-proxy.sh
2 months ago
NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog
NVIDIA Blackwell Leads on SemiAnalysis InferenceMAX v1 Benchmarks | NVIDIA Technical Blog
SemiAnalysis recently launched InferenceMAX v1, a new open source initiative that provides a comprehensive methodology to evaluate inference hardware performance. Published results demonstrate that…
https://developer.nvidia.com/blog/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks/
2 months ago
Claude Code: Tips and Tricks
Claude Code: Tips and Tricks
YouTube video by Anand Tyagi
https://youtu.be/HSkLeECsBcw?si=MjgHvnKZGmuFA7WQ
2 months ago
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
YouTube video by @Scale
https://www.youtube.com/watch?v=4Beffz-HNsk
3 months ago
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
#LMCache
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…
https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
3 months ago
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving...
https://www.infoq.com/articles/llms-evolution-ai-infrastructure/
3 months ago
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…
https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
3 months ago
The Only Trait for Success in the AI Era—How to Build It
The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh
YouTube video by EO
https://youtu.be/xWYb7tImErI?si=JU6exneyjb7V724-
3 months ago
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language...
YouTube video by USENIX
https://youtu.be/WwJvecXOeUA?si=pPBbxLak2QcQc5fh
4 months ago
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
YouTube video by USENIX
https://youtu.be/S8rq3pYboZY?si=6_rmrSnAV3eGYao9
4 months ago
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/YqIHESG0suI?si=MzauN6ZbtaELrj-N
4 months ago
Extending Kubernetes for AI | Lessons Learned From Platform Engineering
Extending Kubernetes for AI | Lessons Learned From Platform... - Susan, Lucy, Andrea, Etienne, Tim
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/d9K5PSsHtDg?si=XEBcpZXMfIm_fJqe
4 months ago
You Need to Be Bored. Here's Why.
You Need to Be Bored. Here's Why.
YouTube video by Harvard Business Review
https://www.youtube.com/watch?v=orQKfIXMiA8
4 months ago
You can use ChatGPT and other models on a flight over the onboard free WiFi via WhatsApp. Use Meta AI out of the box, or save these contacts:
- ChatGPT: 1800 242 8478
- Microsoft Copilot: +1 (877) 224-1042
4 months ago
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion | NVIDIA Technical Blog
The exponential growth in AI model complexity has driven parameter counts from millions to trillions, requiring unprecedented computational resources that require clusters of GPUs to accommodate.
https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion
4 months ago
Andrej Karpathy: Software Is Changing (Again)
Andrej Karpathy: Software Is Changing (Again)
YouTube video by Y Combinator
https://youtu.be/LCEmiRjPEtQ?si=vafoLV7HtvyAZ2fX
4 months ago
Claude, Qwen and Google models
Reverse engineering some updates to Claude
Plus Qwen 3 Coder Flash, Gemini Deep Think, kimi-k2-turbo-preview
https://open.substack.com/pub/simonw/p/reverse-engineering-some-updates?r=ax4pb
4 months ago
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference
YouTube video by PyTorch
https://www.youtube.com/live/Bh-jlh5vlF0?si=AW68CxMCp70y2JMD
5 months ago
The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking
The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking
https://arxiv.org/html/2506.23628v1
5 months ago
This is a handy database for looking up pricing, supported input types, and context window sizes:
models.dev
Models.dev — An open-source database of AI models
Models.dev is a comprehensive open-source database of AI model specifications, pricing, and features.
https://models.dev/
5 months ago
Using LLMs to write meaningful commit messages from CLI
Using LLMs to write meaningful commit messages
Learn how to use the llm CLI tool with GitHub Copilot models to generate meaningful commit messages directly from your terminal.
https://suraj.io/post/2025/llm-commit-messages/
5 months ago
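The blog post describes piping a staged diff into the `llm` CLI with GitHub Copilot models. As a rough Python sketch of the same idea (the function name, prompt wording, and truncation limit here are my own illustration, not the post's exact setup):

```python
def build_commit_prompt(diff: str, max_chars: int = 8000) -> str:
    """Turn the output of `git diff --staged` into an LLM prompt,
    truncating very large diffs so they fit in the context window."""
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n... (diff truncated)"
    return (
        "Write a concise git commit message (a short subject line, then a "
        "brief body) for the following staged changes:\n\n" + diff
    )

# Illustrative diff; in practice you would pipe in `git diff --staged`.
demo_diff = "diff --git a/app.py b/app.py\n+print('hello')"
print(build_commit_prompt(demo_diff))
```

In the post's actual workflow the diff goes straight from `git diff --staged` to the `llm` command; this sketch only shows the prompt-building step.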
Reverse engineering Claude Code:
claude-trace
I've been thinking for a while it would be interesting to run some kind of HTTP proxy against the Claude Code CLI app and take a peek at how it …
https://simonwillison.net/2025/Jun/2/claude-trace/
5 months ago
A Model Context Protocol (MCP) server that provides browser automation capabilities using Playwright. This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models.
GitHub - microsoft/playwright-mcp: Playwright MCP server
Playwright MCP server. Contribute to microsoft/playwright-mcp development by creating an account on GitHub.
https://github.com/microsoft/playwright-mcp
5 months ago
InfiniBand Multilayered Security Protects Data Centers and AI Workloads
InfiniBand Multilayered Security Protects Data Centers and AI Workloads | NVIDIA Technical Blog
In today’s data-driven world, security isn’t just a feature—it’s the foundation. With the exponential growth of AI, HPC, and hyperscale cloud computing, the integrity of the network fabric is more…
https://developer.nvidia.com/blog/infiniband-multilayered-security-protects-data-centers-and-ai-workloads
5 months ago
AI for your CI/CD:
GitHub - githubnext/awesome-continuous-ai: An awesome list of Continuous AI Actions and Frameworks
An awesome list of Continuous AI Actions and Frameworks - githubnext/awesome-continuous-ai
https://github.com/githubnext/awesome-continuous-ai
5 months ago
How Susceptible Are You to the Sunk Cost Fallacy?
How Susceptible Are You to the Sunk Cost Fallacy?
Many managers are susceptible to the famous sunk cost effect, whereby they persist investing in a money-losing project even when it makes sense to invest the new money in alternative new projects. The...
https://hbr.org/2021/07/how-susceptible-are-you-to-the-sunk-cost-fallacy
5 months ago
Keep uBlock Origin working on Google Chrome:
Pasko13's comment on "whats currently the best way to force re-enable ublock origin in chrome?"
Explore this conversation and more from the Adblock community
https://www.reddit.com/r/Adblock/comments/1luqxs1/comment/n2kaxwr/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
5 months ago
The Speed of Thought: Navigate LLM Inference Autoscaling for a Gen AI Application Toward Production
The Speed of Thought: Navigate LLM Inference Autoscaling for a Gen AI Application Toward Production DLIT71339 | GTC 2025 | NVIDIA On-Demand
Learn how to choose the autoscaling hyperparameters for your LLM applications by understanding the key metrics during inference
https://www.nvidia.com/en-us/on-demand/session/gtc25-dlit71339/
5 months ago
Making your own MCP server in VS Code
Making your own MCP server in VS Code
YouTube video by Microsoft Developer
https://youtu.be/SYcQXozpb_E?si=BCm5Im00LwXBdvhT
5 months ago
Benchmarking LLM Inference Costs for Smarter Scaling and Deployment
Benchmarking LLM Inference Costs for Smarter Scaling and Deployment | NVIDIA Technical Blog
This is the third post in the large language model latency-throughput benchmarking series, which aims to instruct developers on how to determine the cost of LLM inference by estimating the total cost…
https://developer.nvidia.com/blog/benchmarking-llm-inference-costs-for-smarter-scaling-and-deployment
6 months ago
Use OpenAI's Codex with Grok on Azure:
Deploying Grok-3 on Azure: A Complete Guide to Running xAI's Latest Model
Learn how to deploy and configure Grok-3 on Azure AI Foundry with this step-by-step guide. Set up your own instance of xAI's powerful language model in the cloud.
https://suraj.io/post/2025/deploying-grok-3-on-azure/#using-grok-with-openai-codex
6 months ago
Reference: H100 Inference Performance - Max Throughput Llama v3.1 70B and 8B
Inference Performance for Data Center Deep Learning
Deliver great user experiences by lowering latency.
https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference#:~:text=H100%20Inference%20Performance%20%2D%20Max%20Throughput
6 months ago
LLM Inference Benchmarking: Fundamental Concepts
LLM Inference Benchmarking: Fundamental Concepts | NVIDIA Technical Blog
This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM benchmarking, fundamental concepts…
https://developer.nvidia.com/blog/llm-benchmarking-fundamental-concepts/
6 months ago
The “S” in MCP Stands for Security
The “S” in MCP Stands for Security
Spoiler: it doesn’t. But it should.
https://elenacross7.medium.com/%EF%B8%8F-the-s-in-mcp-stands-for-security-91407b33ed6b
6 months ago
The first copyright ruling on generative AI training is a win for AI labs
The first copyright ruling on generative AI training is a win for AI labs
New ruling provides a blueprint for AI companies to stay on the right side of the law.
https://www.understandingai.org/p/the-first-copyright-ruling-on-generative?r=ax4pb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
6 months ago
TIL: Gemini CLI is almost free to use
Phoenix.new is Fly's entry into the prompt-driven app development space
Plus exploring the system prompts for Gemini CLI and Claude AI artifacts
https://open.substack.com/pub/simonw/p/phoenixnew-is-flys-entry-into-the?r=ax4pb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
6 months ago