#VLLM

Latest posts tagged with #VLLM on Bluesky

Paged Attention - vLLM

```
float accs[NUM_ROWS_PER_THREAD];
for ... { // Iteration over different blocks.
    logits_vec = ...
    for ... { // Iteration over different rows.
        v_vec =
```

PagedAttention dramatically changed the memory efficiency of LLM inference. Solving memory fragmentation with the concept of virtual memory is a brilliant idea.

・Cuts KV-cache waste to theoretically zero
・Maximizes batch size via dynamic block allocation

A landmark technique that changed the design philosophy of inference engines.

#vLLM #LLM
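To make the paging analogy concrete, here is a minimal sketch of the idea, not vLLM's actual implementation (class and variable names are made up, though 16 tokens per block matches vLLM's default): a block table maps each sequence's logical token positions to physical KV-cache blocks, so waste is bounded by at most one partially filled block per sequence.

```
BLOCK_SIZE = 16  # tokens per physical KV-cache block (vLLM's default)

class BlockAllocator:
    """Pool of physical block ids; any free block can back any sequence."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the last one fills up,
        # so memory is committed on demand rather than reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(seq.block_table)  # 3 blocks cover 40 tokens: waste < 1 block
```

Because blocks are fixed-size and non-contiguous, blocks freed by finished sequences are immediately reusable, which is what lets the scheduler pack more sequences into a batch.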


vLLM’s new PagedAttention slashes latency, boosts GPU inference throughput, and enables continuous batching for production LLM workloads. Curious how it beats the OpenAI API? Dive in! #vLLM #PagedAttention #GPUInference

🔗 aidailypost.com/news/vllm-bo...

vLLM Triton Attention Backend Deep Dive
This article is adapted from a Red Hat-hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend. Explore p...

Maintaining separate attention kernels for every GPU platform doesn't scale.

Hence, for the #vLLM #Triton #attention backend, we took a different approach: ~800 LoC of Triton covering NVIDIA and AMD GPUs, with SOTA performance on both.

📖 Deep dive: blog.vllm.ai/2026/03/04/v...

@pytorch.org #OpenSourceAI
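The portability claim is easy to see in miniature. Here is a toy Triton kernel (a vector add, nowhere near the actual ~800-LoC attention backend, and purely illustrative): one Python-level kernel that Triton JIT-compiles for whichever supported GPU is present, NVIDIA or AMD.

```
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    return out

x = torch.randn(4096, device="cuda")  # "cuda" also covers AMD GPUs via ROCm
print(torch.allclose(add(x, x), 2 * x))
```

The same source lowers to PTX on NVIDIA and to AMD's ISA under ROCm, which is what lets a single kernel file stand in for separate CUDA and HIP implementations.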

vLLM 0.17 Ships FlashAttention 4 and Live MoE Scaling
vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

awesomeagents.ai/news/vllm-0-17-0-flashat...

#Vllm #Inference #OpenSource


Your own cloud LLM on 16 GB of VRAM, part 1: the base build, tools, and MCP. Hi, Habr! Amid all the hype around...

#langchain #langgraph #python #vllm #qwen3 #localai #selectel #MCP #AI-agents #API-service

It doesn't work out of the box: running the latest big LLMs
Lately, open models of extra-large size have multiplied beyond belief, and not just the models but the vendors. Variants of GLM, Kimi, and DeepSeek each occupy several lines in the top...

#Kimi-K2.5 #DeepSeek-v3.2 #GLM-5 #Qwen3.5 #vllm #B200


🚀 Docker Model Runner brings vLLM to macOS with Apple Silicon

vLLM, the leading inference engine, now on macOS thanks to vllm-metal.

www.docker.com/blog/docker-model-runner...

#vLLM #AppleSilicon #MLOps #Docker #RoxsRoss

LLM Hosting in 2026: Local, Self-Hosted & Cloud Infrastructure Compared
Complete guide to LLM hosting in 2026. Compare Ollama, vLLM, Docker Model Runner, LocalAI and cloud providers. Learn cost, performance, and infrastructure trade-offs:
www.glukhov.org/llm-hosting/
#AI #LLM #hosting #Self-Hosting #SelfHosting #ollama #vllm #infrastructure


Set up an open source model with #Ollama or #vLLM, but unsure how to connect it to Claude Code?

Don't worry, we've got you covered 💪


Then run 'gpu llm run' from your terminal of choice, select whether you want to use #Ollama or #vLLM for inference, and choose the model you want to use.

Here we're opting for the #Z.ai model GLM-4.7 Flash.

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works
A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching: the clever tricks that make modern LLMs respond in milliseconds instead of minutes.

techlife.blog/posts/llm-in...

#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache
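Of those tricks, continuous batching is the simplest to sketch. Below is a toy scheduler loop (illustrative only, not vLLM's scheduler; all names are invented): requests join the running batch at every decode step and free their slot the moment they finish, instead of waiting for the whole batch to drain.

```
from collections import deque

def decode_step(batch):
    # Stand-in for one forward pass: every running sequence emits a token.
    for seq in batch:
        seq["generated"] += 1

def serve(requests, max_batch=4):
    waiting = deque(requests)
    running, done = [], []
    while waiting or running:
        # Admit waiting requests whenever a batch slot is free.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished sequences immediately, freeing slots mid-batch.
        done += [s for s in running if s["generated"] >= s["max_tokens"]]
        running = [s for s in running if s["generated"] < s["max_tokens"]]
    return done

reqs = [{"id": i, "generated": 0, "max_tokens": 3 + i} for i in range(6)]
print([r["id"] for r in serve(reqs)])  # short requests finish first
```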

Synthetic | Run LLMs, privately
Chat with open-source models privately

Where can you #Dev test leading #OpenSource #LLM #AI models that aren't 'walled garden' and US-monitored #AmericanAI?
Synthetic.new has #PrivacyFirst runnable model choices like #KIMIK2-Thinking, #MiniMax2.1, #Qwen3 and more, with #vLLM support, usable in #OpenAI tools via #Roo, #Cline, and more.


Remote #GPU network volumes shouldn't require a config file, a cloud console, and 20 minutes of your life.

With GPU CLI, adding a volume is as simple as yes or no.

#Ollama #vLLM #ComfyUI


Today kicks off @jfokus.se in Stockholm 🇸🇪 and we just delivered our workshop on building with open source AI models using:

⚡️ #vLLM to serve local LLMs as a local API endpoint

🦜 @langchain4j.dev for adding LLM capabilities in our Java application

Was a huge hit! Slides ⬇️
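For anyone recreating the serving half of this setup: vLLM's server speaks the OpenAI API, so any OpenAI client (langchain4j included) can point at it. A minimal Python sketch, assuming a model was already started with `vllm serve` on the default port (the model name below is illustrative):

```
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default serve address
    api_key="EMPTY",                      # no key required by default
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # whichever model `vllm serve` loaded
    messages=[{"role": "user", "content": "Say hello from a local LLM."}],
)
print(resp.choices[0].message.content)
```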

"Training only spends money; the real money is made on inference": a jab from the CEO of 'Inferact', which just secured 200 billion won in funding. Inferact, founded by the key people behind vLLM, has locked in $150 million and declared a paradigm shift in AI economics. Model training that swallows trillions of won is ultimately just a 'cost'; the only moment AI delivers information to users and creates value is 'inference'. AIPost key takeaways ✅

📉 "Is training just pouring water into a bottomless jar?"
Blunt words from 'Inferact', which pocketed 218.2 billion won in seed funding alone

Inferact, built by the team behind vLLM, the definitive open-source inference engine, has entered the arena. From now on, the winner in the AI industry will be decided not by who has the bigger model but by who runs inference more efficiently.
www.aipostkorea.com/news/article...

#Inferact #vLLM #SimonMo #AIInfrastructure #EconomicsOfInference #SeedFunding #a16z #TechTrends


Inferact raises $150M to commercialize vLLM, enhancing AI inference efficiency. Backed by Andreessen Horowitz & Lightspeed. #AI #Inference #TechFunding #vLLM #Inferact Link: thedailytechfeed.com/inferact-rai...


Andreessen Horowitz just pumped $150M into Inferact’s seed round, pushing its valuation to $800M. The startup’s open‑source vLLM engine could reshape AI model inference. Curious? Dive in. #Inferact #vLLM #SeedFunding

🔗 aidailypost.com/news/andrees...


Nice example of a production #vLLM setup on 𝗡𝗲𝗯𝗶𝘂𝘀 with Terraform, managed K8s, inference, and observability all in one place.

This can serve as a reference stack builders can use without reinventing the basics 💡.
👨🏻‍💻 Full code in our repo.
github.com/CloudThrill/vllm-production-stack-terraform

vLLM Production Stack on Nebius K8s with Terraform🧑🏼‍🚀 - Cloudthrill
This Terraform stack delivers a production-ready vLLM serving environment on Nebius Cloud managed Kubernetes, supporting highly optimized GPU inference with operational best practices.

📢 𝗡𝗲𝘄 𝘁𝗲𝗿𝗿𝗮𝗳𝗼𝗿𝗺 #vLLM 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗮𝗰𝗸 𝗔𝗰𝗿𝗼𝘀𝘀 𝗖𝗹𝗼𝘂𝗱𝘀 🧑🏼‍🚀 | 𝗣𝗮𝗿𝘁 𝟰: 𝗡𝗲𝗯𝗶𝘂𝘀 𝗖𝗹𝗼𝘂𝗱 💚

🔎 𝗪𝗵𝗮𝘁 𝘆𝗼𝘂'𝗹𝗹 𝗱𝗲𝗽𝗹𝗼𝘆:
✅ Enterprise-grade GPU inference
✅ Secure vllm endpoints (LetsEncrypt)
✅ Full observability: Grafana + vLLM dashboards
✅ Lightning-fast deployment

👉 read the guide: tinyurl.com/Nebiusvllm

Open Responses: What you need to know
We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Opinion: another step forward for scalable agentic workloads in 2026

#huggingface #vllm #openai #llm #ai #artificial-intelligence #langchain #llama-index #sglang


@jfokus.se is BACK for its 20th year and I’m so happy to be hosting a workshop on open source models & how to scale them up on #Kubernetes! We’ll feature projects including #vLLM + @langchain4j.dev + @promptfoo.bsky.social and more for enterprise AI deployment, app dev, and testing 🔥


[Translation] How prompt caching works: PagedAttention and automatic prefix caching, plus practical...

#prompt #caching #prefill #decoding #inference #LLM #vLLM #PagedAttention #prefix #caching #fragmentation
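For context on the article: automatic prefix caching in vLLM is a one-flag feature of the offline API. A minimal sketch with an illustrative model name and prompts: when enable_prefix_caching is on, prompts sharing a prefix reuse its already-computed KV-cache blocks instead of re-running that part of the prefill.

```
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=32)

shared = "You are a support bot for ACME Corp. Answer briefly.\n\n"
prompts = [shared + "Q: How do I reset my password?",
           shared + "Q: Where can I find my invoice?"]

# The second prompt's prefill reuses the cached blocks for `shared`.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```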


🏆Ranked #2 most-read in 2025 - #vLLM for Beginners (Key features)
2️⃣ Here’s the most exhaustive list of vLLM features you wish you knew. 👇
📖 cloudthrill.ca/what-is-vllm...

Learn what makes #vllm the 𝗥𝗼𝗹𝗹𝘀 𝗥𝗼𝘆𝗰𝗲 of Inference in production✨. #vLLM #AIForBeginners


The LLM inference landscape is exploding.

Should you use the data-center standard #vLLM, the local favorite #Ollama, or the radical newcomer #ZML?
I applied the rigorous #QSOS method to compare these engines on features, performance, and operational ease.
Link to the full article in the comments.
#TechAtWorldline


Meeting-LLM: transcription + AI analysis of meetings in one window, built by hand (T-One + GPT-OSS-20B). On the internet there is a huge amount...

#Season #AI #development #GPT-OSS-20B #transcription #STT #T-One #vLLM #LLM #meetings


Docker Model Runner just got two big upgrades:
- Run vLLM on Windows with WSL2 + NVIDIA GPUs
- Now included in Universal Blue (Bluefin + Aurora)

Read more: https://bit.ly/3Y7WabG

Run LLMs with a single command: no setup, no GPU headaches.

#vLLM #UniversalBlue #Bluefin #Aurora
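Model Runner exposes an OpenAI-compatible endpoint, so once a model is pulled it can be queried like any hosted API. A rough sketch only: the base URL, port, and model name below are assumptions that vary by setup (host TCP access has to be enabled first).

```
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/v1",  # assumed Model Runner endpoint
    api_key="not-needed",
)
resp = client.chat.completions.create(
    model="ai/smollm2",  # assumed model, pulled beforehand via Docker Model Runner
    messages=[{"role": "user", "content": "Hello from a local model!"}],
)
print(resp.choices[0].message.content)
```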

Docker Model Runner Adds vLLM Support on Windows | Docker
Run vLLM with GPU acceleration on Windows using Docker Model Runner and WSL2. Fast AI inference is here.

www.docker.com/blog/docker-... - setting up #vLLM on #Windows with #Docker Model Runner. Great tutorial, Dorin Geman.

vLLM Production Stack on GCP GKE with Terraform🧑🏼‍🚀 - Cloudthrill
This Terraform stack delivers a production-ready vLLM serving environment on Google Cloud GKE, supporting both CPU and GPU inference with operational best practices embedded in the Terraform Kubernetes Engine modules.

📢 𝗡𝗲𝘄 𝘁𝗲𝗿𝗿𝗮𝗳𝗼𝗿𝗺 #vLLM 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗦𝘁𝗮𝗰𝗸 𝗔𝗰𝗿𝗼𝘀𝘀 𝗖𝗹𝗼𝘂𝗱𝘀
𝗣𝗮𝗿𝘁 𝟭: GCP 𝗚𝗞𝗘 🔵🔴🟢
🔎 𝗪𝗵𝗮𝘁 𝘆𝗼𝘂'𝗹𝗹 𝗱𝗲𝗽𝗹𝗼𝘆:
✅ Enterprise-grade infra
✅ Switch between CPU/GPU inference with a single flag
✅ Full observability: Grafana + vLLM dashboards
✅ OpenAI-compatible API

👉 read the guide: cloudthrill.ca/vllm-product...


Want to run self-hosted AI agents on Kubernetes? 🛠️

Check out this guide on deploying #ADK #Agents on #GKE #Autopilot using #vLLM to serve #Llama3.

boredabdel.medium.com/adk-agents-o...

#Kubernetes #GKE #AI #vLLM #Llama3 #ADK
