Yes, in some cases CPU and GPU users can be different. However, it wasn't the issue. Oversubscribing CPU memory was a concern initially, but again we never encountered such an issue.
Yes, that was the main setup. But soon the question came up: if Project X (container/person) isn't using GPUs this week, shouldn't we allocate them to other projects? From then on, allocation/monitoring moved from node level to container level, tied directly to a person or project.
Yup, you're right! The team used to run `--gpus device=0,2`, and we would flag each container with the GPU ID in its name. But as the workload grew, they started to do `--gpus all`. That's when things started getting messy.
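For anyone who hasn't seen the two styles side by side, here's a sketch of what that shift looks like (image and container names are made up for illustration; the multi-device quoting follows Docker's documented form):

```shell
# Pin specific GPUs and encode the mapping in the container name,
# so "which container uses which GPU" is answerable from the name alone:
docker run --gpus '"device=0,2"' --name projectx-gpu0-2 my-train-image

# "--gpus all" hands every GPU to the container. Convenient, but it
# erases the container-to-GPU mapping from the run command entirely:
docker run --gpus all --name projectx my-train-image
```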
You can find the tool that I built here: github.com/nawafalageel...
I hope it helps you squeeze more out of your expensive GPUs.
#Nvidia #GPU #Docker #MachineLearning #MLOps #CloudDev #DataScience #OpenSource #LLM #AI #GenAI #AIagents
Now, instead of guessing or jumping through hoops to find the answer, the tool can tell us:
"Container X is occupied 12GB on GPU #1 with Y memory utilization"
We went from blindfolded resource management to actual insight.
And our question is finally answered!
- Nvidia tools (e.g., nvidia-smi) show processes, but not container names.
- Docker tools (e.g., docker stats) show CPU and memory, but no GPU data.
We would still be blindfolded. And our question is not answered yet!
When it comes to monitoring GPU usage in containerized environments, Nvidia and Docker both provide good out-of-the-box tools, but they don't talk to each other.
Neither of them can answer my simple question:
"Which container uses which GPU?"
If you're training ML models with Docker containers and Nvidia GPUs, especially on-prem, you've likely seen wasted compute.
GPUs sit idle while still occupying memory (aka VRAM), and without observability tools, that leads to poor utilization and wasted compute.
#Nvidia #GPU #Docker #MachineLearning
Next, a lightning talk from Nawaf Alageel!
Wasted Compute: Why GPU Utilization Is Slipping Through the Cracks
Docker and Nvidia's tools don't work well together, but this solution lets you know which container uses which GPU, massively increasing observability for his team.