Darryl Ruggles

@darryl-ruggles.cloud

Principal Cloud Solutions Architect @ Ciena - AWS Community Builder Serverless, Event-Driven Architecture, AWS, Kubernetes, Rust, Terraform, Security, DevOps, FinOps, MLOps, Maker https://darryl-ruggles.cloud https://www.linkedin.com/in/darryl-ruggles

1,250
Followers
406
Following
4,776
Posts
22.10.2024
Joined

Latest posts by Darryl Ruggles @darryl-ruggles.cloud

https://lckhd.eu/K32CEi

🆕 Here is my latest blog article! I built a voice assistant for my family recipe collection using Amazon Nova Sonic v2 and #Strands.

07.03.2026 15:00 👍 1 🔁 1 💬 1 📌 0

Cenkay Yaman has put together a thorough walkthrough that discusses each in detail, including persistence across reboots and common troubleshooting fixes. A useful reference for anyone bringing GPU nodes into Kubernetes.

10.03.2026 20:25 👍 0 🔁 0 💬 0 📌 0

It covers the full stack: nouveau blacklisting, NVIDIA driver and CUDA setup, containerd configuration for RKE2, device plugin deployment, and MIG profile creation for multi-tenant isolation on 141 GB of HBM3e memory.

10.03.2026 20:25 👍 0 🔁 0 💬 1 📌 0

These all need to line up correctly. The article here covers the setup from start to finish and is worth a read.

From kernel drivers and container runtimes to device plugins and MIG partitioning, it is a journey. Missing one step can leave expensive hardware sitting idle.

10.03.2026 20:25 👍 0 🔁 0 💬 1 📌 0

https://lckhd.eu/Ci0r7Y

Setting up a large GPU with sharing in a local #Kubernetes cluster is something many teams would like to do. Getting an #H200 NVL #GPU properly working inside a bare-metal #RKE2 cluster involves a lot of layers.

10.03.2026 20:25 👍 0 🔁 0 💬 1 📌 0

production at scale. (4/4)

10.03.2026 16:37 👍 0 🔁 0 💬 0 📌 0

built-in metrics endpoint for request concurrency, token throughput, latency percentiles, and KV cache pressure.

Abhinav Pratap has created a clear approach that connects the infrastructure plumbing to the metrics that help when you are running inference workloads in (3/4)

10.03.2026 16:37 👍 0 🔁 0 💬 1 📌 0

The article below describes setting up observability for both GPU hardware and #vLLM inference metrics using Prometheus and Grafana.

The guide covers two distinct monitoring layers: DCGM Exporter for GPU telemetry like temperature, power, and utilization, and vLLM's (2/4)

10.03.2026 16:37 👍 0 🔁 0 💬 1 📌 0

https://lckhd.eu/EyeHQG

LLMs are being used everywhere and can cost a lot of $$$s. Many teams are trying to run them on their own #Kubernetes clusters, and having #observability into them is key. 🧵

10.03.2026 16:37 👍 0 🔁 0 💬 1 📌 0

https://lckhd.eu/glVDwD

🆕 My new blog post is live! I wrote up everything I've learned building with Amazon Elastic Container Service (ECS) - the service I default to for running containers on #AWS.

10.03.2026 03:00 👍 0 🔁 1 💬 1 📌 0

It walks through the full setup including exit code handling, scheduled runs, and notification logic.

BISINET has put together a good guide. If you manage infrastructure at any real scale, the approach outlined is worth looking at and adapting to your own tooling and workflows.

10.03.2026 05:30 👍 0 🔁 0 💬 0 📌 0
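
The exit code handling mentioned in the drift post above can be sketched roughly as follows. This is not taken from the linked article; it only assumes the pipeline runs `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes (possible drift) are present. The names `DriftStatus`, `classify_plan_exit_code`, and `should_notify` are hypothetical.

```python
from enum import Enum


class DriftStatus(Enum):
    """Hypothetical status labels for a drift-detection run."""
    IN_SYNC = "in_sync"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> DriftStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return DriftStatus.IN_SYNC  # plan succeeded, no changes
    if code == 2:
        return DriftStatus.DRIFT    # plan succeeded, changes are present
    return DriftStatus.ERROR        # 1 (or anything unexpected) is a failure


def should_notify(status: DriftStatus) -> bool:
    """Assumption: both drift and errors are worth a chat notification."""
    return status is not DriftStatus.IN_SYNC
```

A scheduled job could feed the process return code into `classify_plan_exit_code` and post to a channel only when `should_notify` returns true.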

This is a solid breakdown on a topic that a lot of AWS teams are quietly wrestling with right now. (5/5)

10.03.2026 05:30 👍 0 🔁 0 💬 0 📌 0

The longer drift goes unnoticed, the harder it becomes to trust the state of your environments.

The article covers what drift is, why it happens, and how to build a practical detection pipeline using Terraform, Jenkins, and Slack.

10.03.2026 05:30 👍 0 🔁 0 💬 1 📌 0

detail from the start. (4/5)

10.03.2026 05:30 👍 0 🔁 0 💬 1 📌 0

This article from Brian McNamara walks through the #Lambda vs #Fargate cost comparison with a practical calculator and a clear framework for thinking about total cost of ownership. It also covers how to reduce switching risk by treating compute as an implementation (3/5)

10.03.2026 05:30 👍 0 🔁 0 💬 1 📌 0

https://lckhd.eu/lgkX5E

Infrastructure drift is one of those quiet risks that grows over time. Manual changes during incidents, overlapping Infrastructure as Code (IaC) configs, and gaps in process all contribute.

10.03.2026 05:30 👍 1 🔁 0 💬 2 📌 0

conversation. (2/5)

10.03.2026 05:30 👍 0 🔁 0 💬 1 📌 0

https://lckhd.eu/YuiO5l

Choosing between Lambda and Fargate often comes down to cost. The real question is what "cost" actually means. Cloud spend is only part of the equation, and the engineering hours behind each get overlooked in the 🧵

10.03.2026 05:30 👍 2 🔁 0 💬 1 📌 0

The article includes #Terraform examples and real-world architecture patterns. The highlights of using ECS are: no control plane cost, solid AWS integration, and a lot less operational overhead than Kubernetes for most workloads. Check it out!

10.03.2026 03:00 👍 0 🔁 0 💬 0 📌 0

This covers the practical stuff: how the compute options compare, what networking mode to use, how deployments actually work, where the hidden costs are, and when #ECS makes more sense than EKS or Lambda.

10.03.2026 03:00 👍 0 🔁 0 💬 1 📌 0

If you manage more than a couple of #AWS accounts and want a cleaner way to handle credentials from the terminal, I hope this gives you something useful to work with. I would love to hear how others are approaching this.

09.03.2026 16:30 👍 0 🔁 0 💬 0 📌 0

The walkthrough covers configuring your #SSO session, linking multiple account and role profiles in your config file, and using simple shell aliases to switch contexts in seconds. It's nothing fancy, just a practical workflow that holds up day to day. I've been using this approach for years.

09.03.2026 16:30 👍 0 🔁 0 💬 1 📌 0
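
For readers who want a feel for the config layout described above, here is a minimal sketch, not the walkthrough's own code. It generates an AWS CLI v2 style config with one `sso-session` block and one profile per account and role. The helper name `build_aws_config`, the session name, start URL, account ID, and role name are all placeholders.

```python
import configparser
import io


def build_aws_config(session, start_url, sso_region, profiles):
    """Render AWS CLI config text: one SSO session shared by many profiles.

    `profiles` maps a profile name to (account_id, role_name, region).
    All values passed in here are illustrative placeholders.
    """
    config = configparser.ConfigParser(interpolation=None)
    config[f"sso-session {session}"] = {
        "sso_start_url": start_url,
        "sso_region": sso_region,
        "sso_registration_scopes": "sso:account:access",
    }
    for name, (account_id, role_name, region) in profiles.items():
        config[f"profile {name}"] = {
            "sso_session": session,        # links the profile to the session
            "sso_account_id": account_id,
            "sso_role_name": role_name,
            "region": region,
        }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_aws_config(
        "my-sso",
        "https://example.awsapps.com/start",
        "us-east-1",
        {"dev": ("111111111111", "DeveloperAccess", "us-east-1")},
    ))
```

With a file like this in place, one `aws sso login --sso-session my-sso` covers every linked profile, and a shell alias that sets `AWS_PROFILE` (for example a hypothetical `alias awsdev='export AWS_PROFILE=dev'`) is enough to switch contexts.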

https://lckhd.eu/bpM5JR

I wrote this to show how I handle switching between accounts and roles using IAM Identity Center and AWS #CLI. If you spend your day moving between envs, this can quietly save you lots of repetitive work.

09.03.2026 16:30 👍 0 🔁 0 💬 1 📌 0

David B Chase lays out each scenario with examples and practical commands to help you identify these issues in your own clusters. Check it out if you're working with Kubernetes.

09.03.2026 05:30 👍 1 🔁 0 💬 0 📌 0

It covers the full range of scale-down blockers, from stuck pods and PDB misconfigurations to hostPath volumes and anti-affinity rules. It also addresses node sprawl scenarios like topology constraints and max pod limits that quietly inflate your infrastructure over time.

09.03.2026 05:30 👍 1 🔁 0 💬 1 📌 0

https://lckhd.eu/yC9Jjx

Most teams tuning #Kubernetes costs focus on right-sizing requests and limits. But if your node #autoscaler can't actually scale down nodes, those optimizations may never translate into real savings. This article digs into why that gap exists and what to do about it.

09.03.2026 05:30 👍 1 🔁 0 💬 2 📌 0

It uses ML and pattern matching to identify PII and PHI automatically.

The article below from Vikas Banage shows all three policy types with sample JSON and a Lambda function to test them.

09.03.2026 03:30 👍 1 🔁 0 💬 0 📌 0

Message Data Protection policies address this. You can audit and log findings to CloudWatch, de-identify by masking fields before delivery, or deny and block the publish entirely.

09.03.2026 03:30 👍 1 🔁 0 💬 1 📌 0
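
As a rough illustration of the "deny" option in the post above, and not the sample JSON from the linked article, the sketch below builds a data protection policy document as a Python dict. The field layout follows the documented SNS policy format as I understand it; the helper name `build_deny_policy`, the policy name, and the choice of identifiers are assumptions.

```python
import json

# Prefix for AWS managed data identifiers.
DATA_IDENTIFIER_PREFIX = "arn:aws:dataprotection::aws:data-identifier/"


def build_deny_policy(name, identifiers):
    """Build a policy that blocks inbound publishes containing the given data types."""
    return {
        "Name": name,
        "Description": "Block inbound messages that contain the listed data types",
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "DenyInboundSensitiveData",
                "DataDirection": "Inbound",
                "Principal": ["*"],
                "DataIdentifier": [DATA_IDENTIFIER_PREFIX + i for i in identifiers],
                "Operation": {"Deny": {}},  # reject the publish outright
            }
        ],
    }


if __name__ == "__main__":
    policy = build_deny_policy("block-pii", ["CreditCardNumber", "EmailAddress"])
    print(json.dumps(policy, indent=2))
```

The resulting JSON string is the kind of document you would attach to a topic, for example through the SNS `PutDataProtectionPolicy` API; the audit and de-identify variants swap in a different `Operation` block.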

https://lckhd.eu/ySmqO9

Event-driven architectures using #SNS are great for scalability, but sensitive data such as patient details, credit card numbers, and dates of birth can accidentally leak into messages. Encryption and access controls help, but they don't prevent PII in the payload itself.

09.03.2026 03:30 👍 1 🔁 0 💬 2 📌 0