Latest posts tagged with #ParallelCluster on Bluesky
If you do #HPC on #AWS via #ParallelCluster then check your versions. A ton of old HPC clusters are gonna fall over in June 2026 when AWS deprecates an old Python runtime in Lambda. I've done 3 client upgrades this month and expect more to come ...
Used the US long holiday weekend to run #AWS #ParallelCluster build-pipeline and cluster deploy through a logging squid HTTP proxy to document internet destinations used -- this helps set firewall rules for orgs that block egress by default in research VPCs
I do a lot of #HPC clusters running #AWS #Parallelcluster for computational chemistry, specifically integrating it with the www.schrodinger.com small molecule & molecular dynamics tools.
Wallclock time to create a new on-demand GPU node & scale it in for a job was 4min 9sec. Fastest yet!
That's the thread. If you ever see a strange #AWS #Parallelcluster failure where the HeadNode works but the compute fleet fails instantly with no useful logs .. then go to #CloudTrail and look for KMS AccessDenied errors!
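A minimal sketch of the CloudTrail filtering I mean, assuming you've already pulled events (e.g. via `aws cloudtrail lookup-events`) and parsed them into dicts. The sample records below are made up; their shape mirrors CloudTrail's JSON event fields:

```python
def kms_access_denied(events):
    """Return CloudTrail events that are KMS calls rejected with AccessDenied."""
    return [
        e for e in events
        if e.get("eventSource") == "kms.amazonaws.com"
        and "AccessDenied" in e.get("errorCode", "")
    ]

# Illustrative events -- field names mirror CloudTrail JSON, values are fake
sample = [
    {"eventSource": "kms.amazonaws.com", "eventName": "CreateGrant",
     "errorCode": "AccessDenied",
     "errorMessage": "... is not authorized to perform: kms:CreateGrant ..."},
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances"},
]

for e in kms_access_denied(sample):
    print(e["eventName"], e["errorCode"])  # -> CreateGrant AccessDenied
```

The same filter works whether the events come from the CLI, the console's event history export, or an Athena query over the trail bucket.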
The solution is twofold - an IAM policy on the HeadNode that allows KMS actions AND similar permissions loaded into the KMS CMK key policy itself
And you gotta do wildcard-style matching in the key policy to handle the uuid-style roles that #Parallelcluster creates by default:
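Roughly the kind of key-policy statement involved, sketched as a Python dict. The account ID and the role-name pattern are placeholders, not from the posts -- match the pattern to whatever uuid-suffixed roles your own cluster stack actually creates. `ArnLike` is what permits wildcards against the calling role's ARN:

```python
import json

# Placeholder account ID and role pattern -- adjust to your stack's roles.
statement = {
    "Sid": "AllowParallelClusterGeneratedRoles",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
    "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKey*",
        "kms:ReEncrypt*",
        "kms:CreateGrant",
    ],
    "Resource": "*",
    "Condition": {
        # Wildcard match so the uuid-suffixed cluster roles qualify
        "ArnLike": {
            "aws:PrincipalArn": "arn:aws:iam::111122223333:role/*Role*"
        }
    },
}

print(json.dumps(statement, indent=2))
```

Scoping the `Principal` to the account root and then narrowing with an `aws:PrincipalArn` condition is one common way to avoid hard-coding role names that change on every cluster rebuild.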
So it's UNUSUAL for me to find a #Parallelcluster deploy error that is not instantly obvious in the logs
But this is one I've seen 3x now:
- HeadNode deploys fine
- Zero compute nodes deploy (instant failure)
- Zero helpful log entries
- Stack fails due to CloudFormation WaitCondition timeout
First off, let's talk about how awesome #AWS #Parallelcluster is -- log output is VERY complete & it's easy to ship log streams to a CloudWatch log group
Bootstrap error? -- Clear in the logs
"ec2 insufficient capacity" error? -- Clear in log files
"ec2 insufficient quota" error? -- Clear in log files
I used to post google bait on that other site that would allow people experiencing the same problem I had to find a potential solution.
Let's see if this works on bsky. Gonna post a multi-post thread on the most opaque #AWS #Parallelcluster #HPC debugging hassle I (very occasionally) encounter ...
I miss the "job arrays seem complicated so I just scripted a loop to sbatch 80,000 individual jobs ..." conversations.
Today most of my requests are due to #aws #parallelcluster auto-scaling failing for quota or "insufficient ec2 capacity" errors -- both error modes not easily visible to users
SUCCESSFUL HPC!
Note to self: "Cross-account sharing of an #aws #parallelcluster custom AMI hosted in the SharedServices AWS account to all other workload AWS accounts should be fast and simple .."
WRONG. I don't do enough cross-account KMS CMK crypto stuff to nail the IAM/policy bits correctly
Spoiled by how good #aws #parallelcluster is at telling me exactly how I've fucked up an HPC config, so it was a surprise when I couldn't find an easy log msg describing the instant compute fleet deploy failure.
Had to dig into CloudTrail to find the KMS cross-account shared AMI "kms:ReEncrypt" permission error.
Interested in how others bootstrap #AWS #parallelcluster #HPC environments.
I tend to hack bash customAction scripts together to create a few SSM parameters that ansible later queries and then we clone an ansible repo and run ansible against localhost inventory to complete final config steps ...
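For context, roughly how that customAction wiring looks in a ParallelCluster v3 config. The bucket, script name, and policy are illustrative placeholders, not from my actual setups:

```yaml
# Illustrative only -- bucket and script names are made up.
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bootstrap-bucket/headnode-postinstall.sh
  Iam:
    AdditionalIamPolicies:
      # SSM access so the bootstrap script can write parameters
      # and ansible can read them back later
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
```

The referenced script is where the bash hackery lives: drop a few SSM parameters, clone the ansible repo, then run ansible-playbook against a localhost inventory to finish configuration.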
#CMASconf2023 #UNCresearchweek Ready to run AQMs on the Cloud! CMAQ on AWS Workshop! Let’s go! #AWS #ParallelCluster #HPC