Yes, in some cases CPU and GPU users can be different. However, it wasn't the issue. Oversubscribing CPU memory was a concern initially, but again we never encountered such an issue.
Yes, that was the main setup. But soon the question came up: if Project X (container/person) isn't using GPUs this week, shouldn't we allocate them to other projects? From then on, allocation/monitoring moved from node level to container level, tied directly to a person or project.
Yup, you're right! The team used to run `--gpus device=0,2`, and we would flag each container with the GPU ID in its name. But as the workload grew, they started to do `--gpus all`. That's when things started getting messy.
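For anyone who hasn't seen the two styles side by side, here's a sketch of what that shift looks like (image and container names are made up for illustration; the multi-device quoting follows Docker's documented form):

```shell
# Pin specific GPUs and encode the mapping in the container name,
# so "which container uses which GPU" is answerable from the name alone:
docker run --gpus '"device=0,2"' --name projectx-gpu0-2 my-train-image

# "--gpus all" hands every GPU to the container. Convenient, but it
# erases the container-to-GPU mapping from the run command entirely:
docker run --gpus all --name projectx my-train-image
```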
You can find the tool that I built here: github.com/nawafalageel...
I hope it helps you squeeze more out of your expensive GPUs.
#Nvidia #GPU #Docker #MachineLearning #MLOps #CloudDev #DataScience #OpenSource #LLM #AI #GenAI #AIagents
Now, instead of guessing or jumping through hoops to find the answer, the tool can tell us:
"Container X is occupied 12GB on GPU #1 with Y memory utilization"
We went from blindfolded resource management to actual insight.
And our question is finally answered!
- Nvidia tools (e.g., nvidia-smi) show processes, but not container names.
- Docker tools (e.g., docker stats) show CPU and memory, but no GPU data.
We would still be blindfolded. And our question is not answered yet!
When it comes to monitoring GPU usage in containerized environments, Nvidia and Docker both provide good out-of-the-box tools, but they don't talk to each other.
Neither of them can answer my simple question:
"Which container uses which GPU?"
If you're training ML models with Docker containers and Nvidia GPUs, especially on-prem, you've likely seen wasted compute.
GPUs sit idle while still occupying memory (aka VRAM), and without observability tools, that leads to poor utilization and wasted compute.
#Nvidia #GPU #Docker #MachineLearning
Next, a lightning talk from Nawaf Alageel!
Wasted Compute: Why GPU Utilization Is Slipping Through the Cracks
Docker and Nvidia's tools don't work well together, but this solution lets you know which container uses which GPU, massively increasing observability for his team.