Latest posts tagged with #ParallelCluster on Bluesky
If you do #HPC on #AWS via #ParallelCluster then check your versions. A ton of old HPC clusters are gonna fall over in June 2026 when AWS deprecates an old Python runtime in Lambda. I've done 3 client upgrades this month and expect more to come ...
Used the US long holiday weekend to run #AWS #ParallelCluster build-pipeline and cluster deploy through a logging squid HTTP proxy to document internet destinations used -- this helps set firewall rules for orgs that block egress by default in research VPCs
I do a lot of #HPC clusters running #AWS #Parallelcluster for computational chemistry, specifically integrating it with the www.schrodinger.com small molecule & molecular dynamics tools.
Wallclock time to create a new on-demand GPU node & scale it in for a job was 4min 9sec. Fastest yet!
That's the thread. If you ever see a strange #AWS #Parallelcluster failure where the HeadNode works but the compute fleet fails instantly with no useful logs .. then go to #CloudTrail and look for KMS AccessDenied errors!
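A minimal sketch of the CloudTrail filtering I mean, assuming you've already pulled events (e.g. via `aws cloudtrail lookup-events`) and parsed them into dicts. The sample records below are made up; their shape mirrors CloudTrail's JSON event fields:

```python
def kms_access_denied(events):
    """Return CloudTrail events that are KMS calls rejected with AccessDenied."""
    return [
        e for e in events
        if e.get("eventSource") == "kms.amazonaws.com"
        and "AccessDenied" in e.get("errorCode", "")
    ]

# Illustrative events -- field names mirror CloudTrail JSON, values are fake
sample = [
    {"eventSource": "kms.amazonaws.com", "eventName": "CreateGrant",
     "errorCode": "AccessDenied",
     "errorMessage": "... is not authorized to perform: kms:CreateGrant ..."},
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances"},
]

for e in kms_access_denied(sample):
    print(e["eventName"], e["errorCode"])  # -> CreateGrant AccessDenied
```

The same filter works whether the events come from the CLI, the console's event history export, or an Athena query over the trail bucket.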
The solution is twofold - an IAM policy on the HeadNode that allows KMS actions AND similar permissions loaded into the KMS CMK key policy itself
And you gotta do wildcard-style matching in the key policy to handle the uuid-style roles that #Parallelcluster creates by default:
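Roughly the kind of key-policy statement involved, sketched as a Python dict. The account ID and the role-name pattern are placeholders, not from the posts -- match the pattern to whatever uuid-suffixed roles your own cluster stack actually creates. `ArnLike` is what permits wildcards against the calling role's ARN:

```python
import json

# Placeholder account ID and role pattern -- adjust to your stack's roles.
statement = {
    "Sid": "AllowParallelClusterGeneratedRoles",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKey*",
        "kms:ReEncrypt*",
        "kms:CreateGrant",
    ],
    "Resource": "*",
    "Condition": {
        # Wildcard match so the uuid-suffixed cluster roles qualify
        "ArnLike": {
            "aws:PrincipalArn": "arn:aws:iam::111122223333:role/*Role*"
        }
    },
}

print(json.dumps(statement, indent=2))
```

Scoping the `Principal` to the account root and then narrowing with an `aws:PrincipalArn` condition is one common way to avoid hard-coding role names that change on every cluster rebuild.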
So it's UNUSUAL for me to find a #Parallelcluster deploy error that is not instantly obvious in the logs
But this is one I've seen 3x now:
- HeadNode deploys fine
- Zero compute nodes deploy (instant failure)
- Zero helpful log entries
- Stack fails due to CloudFormation WaitCondition timeout
First off, let's talk about how awesome #AWS #Parallelcluster is -- log output is VERY complete & it's easy to ship log streams to a CloudWatch log group
Bootstrap error? -- Clear in the logs
"ec2 insufficient capacity" error? -- Clear in log files
"ec2 insufficient quota" error? -- Clear in log files
I used to post google bait on that other site that would allow people experiencing the same problem I had to find a potential solution.
Let's see if this works on bsky. Gonna post a multi-post thread on the most opaque #AWS #Parallelcluster #HPC debugging hassle I (very occasionally) encounter ...
I miss the "job arrays seem complicated so I just scripted a loop to sbatch 80,000 individual jobs ..." conversations.
Today most of my requests are due to #aws #parallelcluster auto-scaling failing for quota or "insufficient ec2 capacity" errors -- both error modes not easily visible to users
SUCCESSFUL HPC!
Note to self: "Cross-account sharing of an #aws #parallelcluster custom AMI hosted in the SharedServices AWS account to all other workload AWS accounts should be fast and simple .."
WRONG. I don't do enough cross-account KMS CMK crypto stuff to nail the IAM/policy bits correctly
Spoiled by how good #aws #parallelcluster is at telling me exactly how I've fucked up an HPC config, so it was a surprise when I couldn't find an easy log msg describing the instant compute fleet deploy failure.
Had to dig into CloudTrail to find the KMS cross-account shared AMI "kms:ReEncrypt" permission error.
Interested in how others bootstrap #AWS #parallelcluster #HPC environments.
I tend to hack bash customAction scripts together to create a few SSM parameters that ansible later queries and then we clone an ansible repo and run ansible against localhost inventory to complete final config steps ...
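For context, roughly how that customAction wiring looks in a ParallelCluster v3 config. The bucket, script name, and policy are illustrative placeholders, not from my actual setups:

```yaml
# Illustrative only -- bucket and script names are made up.
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bootstrap-bucket/headnode-postinstall.sh
  Iam:
    AdditionalIamPolicies:
      # SSM access so the bootstrap script can write parameters
      # and ansible can read them back later
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```

The referenced script is where the bash hackery lives: drop a few SSM parameters, clone the ansible repo, then run ansible-playbook against a localhost inventory to finish configuration.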
#CMASconf2023 #UNCresearchweek Ready to run AQMs on the Cloud! CMAQ on AWS Workshop! Let’s go! #AWS #ParallelCluster #HPC