https://lckhd.eu/K32CEi
Here is my latest blog article! I built a voice assistant for my family recipe collection using Amazon Nova Sonic v2 and #Strands.
@darryl-ruggles.cloud
Principal Cloud Solutions Architect @ Ciena - AWS Community Builder Serverless, Event-Driven Architecture, AWS, Kubernetes, Rust, Terraform, Security, DevOps, FinOps, MLOps, Maker https://darryl-ruggles.cloud https://www.linkedin.com/in/darryl-ruggles
Cenkay Yaman has put together a thorough walkthrough that discusses each in detail, including persistence across reboots and common troubleshooting fixes. A useful reference for anyone bringing GPU nodes into Kubernetes.
The full stack is covered: nouveau blacklisting, NVIDIA driver and CUDA setup, containerd configuration for RKE2, device plugin deployment, and MIG profile creation for multi-tenant isolation on 141 GB of HBM3e memory.
These all need to line up correctly. The article here covers the setup from start to finish and is worth a read.
From kernel drivers and container runtimes to device plugins and MIG partitioning, it is a journey. Missing one step can leave expensive hardware sitting idle.
https://lckhd.eu/Ci0r7Y
Setting up a large GPU with sharing in a local #Kubernetes cluster is something many teams would like to do. Getting an #H200 NVL #GPU working properly inside a bare-metal #RKE2 cluster involves a lot of layers.
Abhinav Pratap has created a clear approach that connects the infrastructure plumbing to the metrics that help when you are running inference workloads in production at scale. (4/4)
The guide covers two distinct monitoring layers: DCGM Exporter for GPU telemetry like temperature, power, and utilization, and vLLM's built-in metrics endpoint for request concurrency, token throughput, latency percentiles, and KV cache pressure. (3/4)
The article below describes setting up observability for both GPU hardware and #vLLM inference metrics using Prometheus and Grafana. (2/4)
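A rough sketch of what reading both layers can look like once the endpoints are scraped. This is not the article's code: the sample text is made up, and the metric names are assumptions modeled on names DCGM Exporter and vLLM have exposed, which vary by version.

```python
from collections import defaultdict

# Made-up sample in Prometheus text exposition format (assumed metric names).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 64
vllm:num_requests_running{model_name="example-model"} 12
vllm:num_requests_waiting{model_name="example-model"} 3
"""


def parse_metrics(text: str) -> dict[str, list[float]]:
    """Collect sample values per metric name, ignoring labels and comments.

    Simplification: assumes no trailing timestamp after the value.
    """
    samples: dict[str, list[float]] = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name].append(float(value))
    return dict(samples)


def queue_ratio(samples: dict[str, list[float]]) -> float:
    """Waiting requests per running request: one simple saturation signal."""
    running = sum(samples.get("vllm:num_requests_running", [0.0]))
    waiting = sum(samples.get("vllm:num_requests_waiting", [0.0]))
    return waiting / max(running, 1.0)


metrics = parse_metrics(SAMPLE)
ratio = queue_ratio(metrics)
```

In practice Prometheus does the scraping and Grafana the charting; the point here is only that hardware telemetry and inference metrics arrive as two separate series you correlate yourself.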
https://lckhd.eu/EyeHQG
LLMs are being used everywhere and can cost a lot of $$$s. Many teams are trying to run them in their own #Kubernetes clusters, and having #observability into them is key. 🧵
It walks through the full setup including exit code handling, scheduled runs, and notification logic.
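A minimal sketch of the exit code handling idea, not the article's pipeline. It assumes the scheduled job runs `terraform plan -detailed-exitcode`, which returns 0 for no changes, 1 for an error, and 2 when changes are present. The function names, workspace name, and message wording are hypothetical.

```python
from typing import Optional


def classify_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` result to a status label."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"  # 1, or anything unexpected


def build_slack_payload(workspace: str, code: int) -> Optional[dict]:
    """Return a Slack incoming-webhook body, or None when nothing needs reporting."""
    status = classify_plan_exit_code(code)
    if status == "in_sync":
        return None  # stay quiet when live infrastructure matches the code
    if status == "drift":
        text = f"Drift detected in {workspace}: plan shows pending changes."
    else:
        text = f"Drift check failed for {workspace}: terraform plan exited with {code}."
    return {"text": text}


payload = build_slack_payload("example-prod", 2)
```

Treating exit code 2 as a finding rather than a failure is the key design choice: the scheduled job should stay green on drift and let the notification carry the news.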
BISINET has put together a good guide. If you manage infrastructure at any real scale, the approach outlined is worth looking at and adapting to your own tooling and workflows.
This is a solid breakdown on a topic that a lot of AWS teams are quietly wrestling with right now. (5/5)
The longer drift goes unnoticed, the harder it becomes to trust the state of your environments.
The article covers what drift is, why it happens, and how to build a practical detection pipeline using Terraform, Jenkins, and Slack.
This article from Brian McNamara walks through the #Lambda vs #Fargate cost comparison with a practical calculator and a clear framework for thinking about total cost of ownership. It also covers how to reduce switching risk by treating compute as an implementation detail from the start. (3/5, 4/5)
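To make the total cost of ownership idea concrete, here is a small sketch that adds engineering time to cloud spend. It is not the article's calculator, and every number in it is a placeholder, not a real AWS price or measured effort.

```python
from dataclasses import dataclass


@dataclass
class ComputeOption:
    name: str
    monthly_cloud_spend: float        # placeholder figure, not a real bill
    monthly_engineering_hours: float  # placeholder estimate of upkeep effort


def total_monthly_cost(option: ComputeOption, hourly_rate: float) -> float:
    """Cloud spend plus the cost of the engineering hours behind it."""
    return option.monthly_cloud_spend + option.monthly_engineering_hours * hourly_rate


def cheaper_option(a: ComputeOption, b: ComputeOption, hourly_rate: float) -> ComputeOption:
    """Pick the option with the lower total cost (ties go to the first)."""
    return a if total_monthly_cost(a, hourly_rate) <= total_monthly_cost(b, hourly_rate) else b


# Hypothetical inputs chosen only to show how the answer can flip.
lambda_option = ComputeOption("lambda", monthly_cloud_spend=400.0, monthly_engineering_hours=10.0)
fargate_option = ComputeOption("fargate", monthly_cloud_spend=250.0, monthly_engineering_hours=20.0)

by_spend_only = cheaper_option(lambda_option, fargate_option, hourly_rate=0.0)
with_engineering = cheaper_option(lambda_option, fargate_option, hourly_rate=100.0)
```

With these made-up inputs the cheaper option on the bill alone is not the cheaper option once hours are priced in, which is the gap the thread is pointing at.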
https://lckhd.eu/lgkX5E
Infrastructure drift is one of those quiet risks that grows over time. Manual changes during incidents, overlapping Infrastructure as Code (IaC) configs, and gaps in process all contribute.
https://lckhd.eu/YuiO5l
Choosing between Lambda and Fargate often comes down to cost. The real question is what "cost" actually means. Cloud spend is only part of the equation, and the engineering hours behind each option get overlooked in the conversation. 🧵 (2/5)
The article includes #Terraform examples and real-world architecture patterns. The highlights of using ECS are: no control plane cost, solid AWS integration, and a lot less operational overhead than Kubernetes for most workloads. Check it out!
This covers the practical stuff: how the compute options compare, what networking mode to use, how deployments actually work, where the hidden costs are, and when #ECS makes more sense than EKS or Lambda.
https://lckhd.eu/glVDwD
My new blog post is live! I wrote up everything I've learned building with Amazon Elastic Container Service (ECS) - the service I default to for running containers on #AWS.
If you manage more than a couple of #AWS accounts and want a cleaner way to handle credentials from the terminal, I hope this gives you something useful to work with. I would love to hear how others are approaching this.
The walkthrough covers configuring your #SSO session, linking multiple account and role profiles in your config file, and using simple shell aliases to switch contexts in seconds. It's nothing fancy, just a practical workflow that holds up day to day. I've been using this approach for years.
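For anyone who wants to picture the layout, here is a small sketch that renders an `sso-session` block, linked profiles, and matching aliases. It is not the exact setup from the post: the session name, start URL, account IDs, and role names are all made-up placeholders.

```python
import configparser
import io


def render_aws_config(session_name: str, start_url: str, sso_region: str,
                      profiles: list[dict]) -> str:
    """Render ~/.aws/config text: one sso-session plus profiles that reference it."""
    config = configparser.ConfigParser()
    config[f"sso-session {session_name}"] = {
        "sso_start_url": start_url,
        "sso_region": sso_region,
        "sso_registration_scopes": "sso:account:access",
    }
    for profile in profiles:
        config[f"profile {profile['name']}"] = {
            "sso_session": session_name,
            "sso_account_id": profile["account_id"],
            "sso_role_name": profile["role_name"],
            "region": profile.get("region", sso_region),
        }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def render_aliases(profiles: list[dict]) -> str:
    """One shell alias per profile that switches context via AWS_PROFILE."""
    return "\n".join(
        f"alias {p['name']}='export AWS_PROFILE={p['name']}'" for p in profiles
    )


# Placeholder values for illustration only.
example_profiles = [
    {"name": "dev-admin", "account_id": "111111111111", "role_name": "AdministratorAccess"},
    {"name": "prod-readonly", "account_id": "222222222222", "role_name": "ReadOnlyAccess"},
]
config_text = render_aws_config(
    "example-sso", "https://example.awsapps.com/start", "us-east-1", example_profiles
)
alias_text = render_aliases(example_profiles)
```

After a single `aws sso login --sso-session example-sso`, each alias only has to change `AWS_PROFILE`, since every profile shares the same session.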
https://lckhd.eu/bpM5JR
I wrote this to show how I handle switching between accounts and roles using IAM Identity Center and AWS #CLI. If you spend your day moving between envs, this can quietly save you lots of repetitive work.
David B Chase lays out each scenario with examples and practical commands to help you identify these issues in your own clusters. Check it out if you're working with Kubernetes.
It covers the full range of scale-down blockers, from stuck pods and PDB misconfigurations to hostPath volumes and anti-affinity rules. It also addresses node sprawl scenarios like topology constraints and max pod limits that quietly inflate your infrastructure over time.
https://lckhd.eu/yC9Jjx
Most teams tuning #Kubernetes costs focus on right-sizing requests and limits. But if your node #autoscaler can't actually scale down nodes, those optimizations may never translate into real savings. This article digs into why that gap exists and what to do about it.
It uses ML and pattern matching to identify PII and PHI automatically.
The article below from Vikas Banage shows all three policy types with sample JSON and a Lambda function to test them.
Message Data Protection policies address this. You can audit and log findings to CloudWatch, de-identify by masking fields before delivery, or deny and block the publish entirely.
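A hedged sketch of how the three operations differ in a policy document. This is not the article's sample JSON; it follows the policy structure as I understand it from the SNS documentation, and the policy names, log group, and choice of data identifiers are illustrative assumptions to verify before use.

```python
import json

# AWS-managed data identifier ARNs; which ones to include is an assumption here.
IDENTIFIER_PREFIX = "arn:aws:dataprotection::aws:data-identifier/"
IDENTIFIERS = [IDENTIFIER_PREFIX + name for name in ("CreditCardNumber", "EmailAddress", "Name")]


def build_policy(operation: str, log_group: str = "/aws/vendedlogs/example-sns-findings") -> dict:
    """Build a data protection policy for one of: audit, deidentify, deny."""
    operations = {
        # Audit: sample messages and log findings to CloudWatch Logs.
        "audit": {"Audit": {"SampleRate": "99",
                            "FindingsDestination": {"CloudWatchLogs": {"LogGroup": log_group}}}},
        # De-identify: mask matched fields before delivery.
        "deidentify": {"Deidentify": {"MaskConfig": {"MaskWithCharacter": "#"}}},
        # Deny: block the publish entirely.
        "deny": {"Deny": {}},
    }
    if operation not in operations:
        raise ValueError(f"unknown operation: {operation}")
    return {
        "Name": f"example-{operation}-policy",  # hypothetical name
        "Description": f"Example {operation} policy for sensitive data",
        "Version": "2021-06-01",
        "Statement": [{
            "Sid": f"example-{operation}",
            "DataDirection": "Inbound",
            "Principal": ["*"],
            "DataIdentifier": IDENTIFIERS,
            "Operation": operations[operation],
        }],
    }


# The JSON string is what would be passed as DataProtectionPolicy to
# boto3's sns.put_data_protection_policy(ResourceArn=..., DataProtectionPolicy=...).
deny_policy_json = json.dumps(build_policy("deny"), indent=2)
```

Only the `Operation` block changes between the three modes, so starting with audit and moving to de-identify or deny later is a small edit.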
https://lckhd.eu/ySmqO9
Event-driven architectures using #SNS are great for scalability, but sensitive data such as patient details, credit card numbers, and dates of birth can accidentally leak into messages. Encryption and access controls help, but they don't prevent PII in the payload itself.