Suraj Deshmukh | सुरज देशमुख
@suraj.io
📤 197
📥 290
📝 122
@Microsoft.com | ex-@kinvolkio ex-@RedHat | bibliophile | He/Him | Opinions are my own. 🟥 🟩 🟦 🟨
Gang Scheduling for Llama by Anca Agape and Andre Darabanov
YouTube video by @Scale
https://www.youtube.com/watch?v=4Beffz-HNsk
11 days ago
0
0
0
How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo | NVIDIA Technical Blog
#LMCache
As AI models grow larger and more sophisticated, inference, the process by which a model generates responses, is becoming a major challenge. Large language models (LLMs) like GPT-OSS and DeepSeek-R1…
https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
11 days ago
0
0
0
Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure
Large Language Model (LLM) inference faces a fundamental challenge: the same hardware that excels at processing input prompts struggles with generating responses, and vice versa. Disaggregated serving...
https://www.infoq.com/articles/llms-evolution-ai-infrastructure/
11 days ago
0
1
0
Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap | NVIDIA Technical Blog
Deploying large language models (LLMs) at scale presents a dual challenge: ensuring fast responsiveness during high demand, while managing the costs of GPUs. Organizations often face a trade-off…
https://developer.nvidia.com/blog/cut-model-deployment-costs-while-keeping-performance-with-gpu-memory-swap/
13 days ago
0
0
0
The Only Trait for Success in the AI Era—How to Build It
The Only Trait for Success in the AI Era—How to Build It | Carnegie Mellon University Po-Shen Loh
YouTube video by EO
https://youtu.be/xWYb7tImErI?si=JU6exneyjb7V724-
about 1 month ago
0
0
0
OSDI '24 - DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM serving
YouTube video by USENIX
https://youtu.be/WwJvecXOeUA?si=pPBbxLak2QcQc5fh
about 2 months ago
0
0
0
OSDI '24 - Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
YouTube video by USENIX
https://youtu.be/S8rq3pYboZY?si=6_rmrSnAV3eGYao9
about 2 months ago
0
0
0
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduling with Dynamic Resource Allocation
More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli... John Belamaric & Morten Torkildsen
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/YqIHESG0suI?si=MzauN6ZbtaELrj-N
about 2 months ago
0
1
0
Extending Kubernetes for AI | Lessons Learned From Platform Engineering
Extending Kubernetes for AI | Lessons Learned From Platform... - Susan, Lucy, Andrea, Etienne, Tim
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/d9K5PSsHtDg?si=XEBcpZXMfIm_fJqe
about 2 months ago
0
0
0
You Need to Be Bored. Here's Why.
YouTube video by Harvard Business Review
https://www.youtube.com/watch?v=orQKfIXMiA8
about 2 months ago
0
1
0
You can use ChatGPT and other models on a flight over the free onboard WiFi via WhatsApp. Use Meta AI out of the box, or save these contacts:
- ChatGPT: 1800 242 8478
- Microsoft Copilot: +1 (877) 224-1042
about 2 months ago
0
0
0
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion | NVIDIA Technical Blog
The exponential growth in AI model complexity has driven parameter counts from millions to trillions, requiring unprecedented computational resources that require clusters of GPUs to accommodate.
https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion
about 2 months ago
0
0
0
Andrej Karpathy: Software Is Changing (Again)
YouTube video by Y Combinator
https://youtu.be/LCEmiRjPEtQ?si=vafoLV7HtvyAZ2fX
about 2 months ago
0
0
0
Claude, Qwen and Google models
Reverse engineering some updates to Claude
Plus Qwen 3 Coder Flash, Gemini Deep Think, kimi-k2-turbo-preview
https://open.substack.com/pub/simonw/p/reverse-engineering-some-updates?r=ax4pb
about 2 months ago
0
0
0
DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference
YouTube video by PyTorch
https://www.youtube.com/live/Bh-jlh5vlF0?si=AW68CxMCp70y2JMD
3 months ago
0
0
0
The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking
https://arxiv.org/html/2506.23628v1
3 months ago
0
0
0
A handy database for looking up model pricing, supported inputs, and context window sizes:
Models.dev — An open-source database of AI models
Models.dev is a comprehensive open-source database of AI model specifications, pricing, and features.
https://models.dev/
3 months ago
0
0
0
Using LLMs to write meaningful commit messages from the CLI
Using LLMs to write meaningful commit messages
Learn how to use the llm CLI tool with GitHub Copilot models to generate meaningful commit messages directly from your terminal.
https://suraj.io/post/2025/llm-commit-messages/
3 months ago
0
0
0
Reverse engineering Claude Code:
claude-trace
I've been thinking for a while it would be interesting to run some kind of HTTP proxy against the Claude Code CLI app and take a peek at how it …
https://simonwillison.net/2025/Jun/2/claude-trace/
3 months ago
0
0
0
A Model Context Protocol (MCP) server that provides browser automation capabilities using Playwright. This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models.
GitHub - microsoft/playwright-mcp: Playwright MCP server
Playwright MCP server. Contribute to microsoft/playwright-mcp development by creating an account on GitHub.
https://github.com/microsoft/playwright-mcp
3 months ago
2
1
0
InfiniBand Multilayered Security Protects Data Centers and AI Workloads
InfiniBand Multilayered Security Protects Data Centers and AI Workloads | NVIDIA Technical Blog
In today’s data-driven world, security isn’t just a feature—it’s the foundation. With the exponential growth of AI, HPC, and hyperscale cloud computing, the integrity of the network fabric is more…
https://developer.nvidia.com/blog/infiniband-multilayered-security-protects-data-centers-and-ai-workloads
3 months ago
0
0
0
AI for your CI/CD:
GitHub - githubnext/awesome-continuous-ai: An awesome list of Continuous AI Actions and Frameworks
An awesome list of Continuous AI Actions and Frameworks - githubnext/awesome-continuous-ai
https://github.com/githubnext/awesome-continuous-ai
3 months ago
0
0
0
How Susceptible Are You to the Sunk Cost Fallacy?
Many managers are susceptible to the famous sunk cost effect, whereby they persist investing in a money-losing project even when it makes sense to invest the new money in alternative new projects. The...
https://hbr.org/2021/07/how-susceptible-are-you-to-the-sunk-cost-fallacy
3 months ago
0
0
0
Keep uBlock Origin working on Google Chrome:
Pasko13's comment on "whats currently the best way to force re-enable ublock origin in chrome?"
Explore this conversation and more from the Adblock community
https://www.reddit.com/r/Adblock/comments/1luqxs1/comment/n2kaxwr/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
3 months ago
1
0
0
The Speed of Thought: Navigate LLM Inference Autoscaling for a Gen AI Application Toward Production
The Speed of Thought: Navigate LLM Inference Autoscaling for a Gen AI Application Toward Production DLIT71339 | GTC 2025 | NVIDIA On-Demand
Learn how to choose the autoscaling hyperparameters for your LLM applications by understanding the key metrics during inference
https://www.nvidia.com/en-us/on-demand/session/gtc25-dlit71339/
3 months ago
0
0
0
Making your own MCP server in VS Code
YouTube video by Microsoft Developer
https://youtu.be/SYcQXozpb_E?si=BCm5Im00LwXBdvhT
3 months ago
0
0
0
Benchmarking LLM Inference Costs for Smarter Scaling and Deployment
Benchmarking LLM Inference Costs for Smarter Scaling and Deployment | NVIDIA Technical Blog
This is the third post in the large language model latency-throughput benchmarking series, which aims to instruct developers on how to determine the cost of LLM inference by estimating the total cost…
https://developer.nvidia.com/blog/benchmarking-llm-inference-costs-for-smarter-scaling-and-deployment
3 months ago
0
0
0
Use OpenAI's Codex with Grok on Azure:
Deploying Grok-3 on Azure: A Complete Guide to Running xAI's Latest Model
Learn how to deploy and configure Grok-3 on Azure AI Foundry with this step-by-step guide. Set up your own instance of xAI's powerful language model in the cloud.
https://suraj.io/post/2025/deploying-grok-3-on-azure/#using-grok-with-openai-codex
3 months ago
0
0
0
Reference: H100 Inference Performance - Max Throughput Llama v3.1 70B and 8B
Inference Performance for Data Center Deep Learning
Deliver great user experiences by lowering latency.
https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference#:~:text=H100%20Inference%20Performance%20%2D%20Max%20Throughput
4 months ago
0
0
0
LLM Inference Benchmarking: Fundamental Concepts
LLM Inference Benchmarking: Fundamental Concepts | NVIDIA Technical Blog
This is the first post in the large language model latency-throughput benchmarking series, which aims to instruct developers on common metrics used for LLM benchmarking, fundamental concepts…
https://developer.nvidia.com/blog/llm-benchmarking-fundamental-concepts/
4 months ago
0
0
0
The “S” in MCP Stands for Security
Spoiler: it doesn’t. But it should.
https://elenacross7.medium.com/%EF%B8%8F-the-s-in-mcp-stands-for-security-91407b33ed6b
4 months ago
1
1
0
The first copyright ruling on generative AI training is a win for AI labs
New ruling provides a blueprint for AI companies to stay on the right side of the law.
https://www.understandingai.org/p/the-first-copyright-ruling-on-generative?r=ax4pb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
4 months ago
0
0
0
TIL: Gemini CLI is almost free to use
Phoenix.new is Fly's entry into the prompt-driven app development space
Plus exploring the system prompts for Gemini CLI and Claude AI artifacts
https://open.substack.com/pub/simonw/p/phoenixnew-is-flys-entry-into-the?r=ax4pb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
4 months ago
0
1
0
Deploying #grok 3 on #Azure AI Foundry
Deploying Grok-3 on Azure: A Complete Guide to Running xAI's Latest Model
Learn how to deploy and configure Grok-3 on Azure AI Foundry with this step-by-step guide. Set up your own instance of xAI's powerful language model in the cloud.
https://suraj.io/post/2025/deploying-grok-3-on-azure/
4 months ago
0
0
0
Intercepting Claude Code requests
claude-trace
I've been thinking for a while it would be interesting to run some kind of HTTP proxy against the Claude Code CLI app and take a peek at how it …
https://simonwillison.net/2025/Jun/2/claude-trace/
4 months ago
0
0
0
Yuval Noah Harari on the Dangers of AI
YouTube video by Reid Hoffman
https://youtu.be/uuBLxWowDqI?si=vNEVJijaGeTiqMY9
4 months ago
0
2
0
Seven replies to the viral Apple reasoning paper – and why they fall short
Also: another paper that seals the deal
https://open.substack.com/pub/garymarcus/p/seven-replies-to-the-viral-apple?r=ax4pb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
4 months ago
0
0
0
Speed up any video beyond the fastest setting the player's playback speed control offers. Paste this into your browser devtools console:
```
document.querySelector('video').playbackRate = 2.5
```
playbackrate
Here's a tip that works on YouTube and almost any other web page that shows you a video. You can increase the playback rate beyond the usually-exposed 2x by running …
https://simonwillison.net/2025/Jun/19/playbackrate/?utm_source=substack&utm_medium=email
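The console tip in this post can be wrapped in a small helper. This is a minimal sketch, assuming you want to set the rate on every video on the page rather than only the first match. `setPlaybackRate` is a hypothetical name, not something from the linked post, and browsers may still reject rates outside the range they support.

```javascript
// Hypothetical helper (not from the linked post): set the playback rate on
// every <video> element found in a document-like object.
function setPlaybackRate(doc, rate) {
  // Guard against values that make no sense as a playback speed.
  if (!Number.isFinite(rate) || rate <= 0) {
    throw new RangeError("rate must be a positive, finite number");
  }
  const videos = Array.from(doc.querySelectorAll("video"));
  for (const video of videos) {
    video.playbackRate = rate;
  }
  // Report how many videos were updated, so 0 signals "nothing found".
  return videos.length;
}

// In a browser devtools console you would call: setPlaybackRate(document, 2.5)
```

Passing the document in as an argument keeps the function easy to try outside a browser, and a return value of 0 tells you the page has no `<video>` element in the top-level document (for example, when the player lives inside an iframe).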
4 months ago
0
0
0
Model pricing per input & output tokens
LLM pricing calculator
https://www.llm-prices.com/
4 months ago
0
0
0
Mark Moyou, PhD - Understanding the end-to-end LLM training and inference pipeline
YouTube video by PyData
https://youtu.be/V2L6hufE2X4?si=CMvcGkHt69HpSAlx
4 months ago
0
1
0
vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024
YouTube video by Neural Magic
https://www.youtube.com/watch?v=FPr37jCOvrA
4 months ago
0
0
0
reposted by
Suraj Deshmukh | सुरज देशमुख
Simone
4 months ago
If anyone wants to sign up and join the Azure Terraform community call, this is the form. They also ask for speakers if you want to submit a topic. I usually just catch the recordings, but they sometimes do APAC time slots.
Microsoft Forms
https://forms.office.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbRwMQdLijD21DuWxhs1KqAhlUMlIyOUYwWVQ3VzQzWjVOWEVFNkxONjRVVS4u
0
0
3
reposted by
Suraj Deshmukh | सुरज देशमुख
Rory McCune
4 months ago
This is a good awareness video to show people about the challenges of AI imagery and the scams that are now easier to create.
The Deepfake Scams You're Not Ready For (Made with Google Veo 3)
YouTube video by @soapsoupproductions
https://youtu.be/xyaSVBXF1K8?si=6P9xjp19qYPfFRDo
0
8
8
Deploying OpenAI Text-to-Speech (TTS) Model on Azure: A Step-by-Step Guide
Azure Cognitive Services provides a straightforward way to deploy OpenAI models, including powerful text-to-speech capabiliti...
https://suraj.io/post/2025/opeai-tts-on-azure/
4 months ago
0
0
0
Run Llama 4 Maverick on AKS:
llm-k8s/configs/llama-4/maverick at main · surajssd/llm-k8s
Contribute to surajssd/llm-k8s development by creating an account on GitHub.
https://github.com/surajssd/llm-k8s/tree/main/configs/llama-4/maverick
5 months ago
0
0
0
Interview with NVIDIA Dynamo Architect Kyle Kranen
YouTube video by NVIDIA Developer
https://www.youtube.com/watch?v=02aR_BJROt0
5 months ago
0
0
0
Networking Optimizations for Multi-Node Deep Learning on Kubernetes - Rajat Chopra & Erez Cohen
YouTube video by CNCF [Cloud Native Computing Foundation]
https://youtu.be/CL71kbZ72iU?si=OdZTyQfVYAcG7W0k
5 months ago
0
1
0
Turbocharging AI/ML workloads: Revving Up Speed and Resilience | Lerna Ekmekcioglu
YouTube video by @Scale
https://youtu.be/eL8kw1y-SJ8?si=WgK4GHSWjIDYS2yo
5 months ago
0
0
0
www.mcpoogle.com
McPoogle: Search engine for MCP Servers
Search engine for MCP (Model Context Protocol) Servers and Tools. Powered by Graphlit.
https://www.mcpoogle.com/
5 months ago
0
1
0
LLM Benchmarking: Fundamental Concepts
LLM Benchmarking: Fundamental Concepts | NVIDIA Technical Blog
The past few years have witnessed the rise in popularity of generative AI and large language models (LLMs), as part of a broad AI revolution. As LLM-based applications are rolled out across…
https://developer.nvidia.com/blog/llm-benchmarking-fundamental-concepts/?mkt_tok=MTU2LU9GTi03NDIAAAGZnDzGtwe53HGei2zFCrpyTzNJ0JiHkAH4BnWOqREOKMNcRCMqVra3_fRNZSd7fKFQrWay09HArHAlZ69cnoza2pOsGJEmbXKF9LKCbMMkNDXB3B-AjBUL
5 months ago
0
0
0