Scale is an illusion. Every system runs the same relational physics.
#MicroMacro #SystemsAtScale #RelationalAnthropology #SurvivorLiteracy
Latest posts tagged with #SystemsAtScale on Bluesky
Nobody livetweeted or liveblogged it, but that's okay! There's now video of me giving my talk at #SystemsAtScale about #o11y and how singular metrics and graphs need the ability to be mashed up with their context to be successful! https://t.co/SZYXr7vp2W
TIL there were reporters at #SystemsAtScale who tried to get the highlights of the talks -- www.cnet.com/news/how-facebook-operat...
And #SystemsAtScale hosted by @fb_engineering is a wrap! Hope people enjoyed the livetweeting and that I held down the fort okay; few other folks here were heavy Twitter users, for obvious reasons ;)
Q: why not k8s? A: "yes, we do read." but K8s won't scale to the number of machines, and trying to force scaling changes in may not meet the k8s community's needs. (same reason that Netflix released its own orchestrator instead of using k8s or mesos) #SystemsAtScale
But the work for now is just to get everything standardized and out of private pools to remove human element from fleet operations. [fin] #SystemsAtScale
Also have a desire to create, in the far future, update domains that allow spreading services evenly across a standardized set of failure domains. #SystemsAtScale
Don't accept maintenance events until schedulers are confident that things are sufficiently out of the way that they won't suffer impact; however, potentially years out from being able to do that fully automatically. #SystemsAtScale
New datacenter tooling to reason about and schedule maintenances to not conflict with each other and not cause undue impact on services. #SystemsAtScale
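[ed: the conflict-avoidance idea above can be sketched as a simple interval check — a maintenance gets booked only if it doesn't overlap an already-booked window on the same domain. All names and the time model here are my own illustration, not FB's actual tooling.]

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Maintenance:
    """A proposed maintenance window on one failure domain (hypothetical model)."""
    domain: str  # e.g. a rack or power-domain id
    start: int   # window start (epoch seconds)
    end: int     # window end (epoch seconds)

def conflicts(a: Maintenance, b: Maintenance) -> bool:
    """Two maintenances conflict if they touch the same domain and overlap in time."""
    return a.domain == b.domain and a.start < b.end and b.start < a.end

def schedule(booked: list[Maintenance], proposed: Maintenance) -> bool:
    """Accept the proposed maintenance only if it conflicts with nothing already booked."""
    if any(conflicts(proposed, m) for m in booked):
        return False
    booked.append(proposed)
    return True
```

[ed: the real problem also has to reason about *service* impact, not just domain overlap, which is why L12's "years out" caveat applies.]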
Resource allowance system to be created to avoid humans being involved in horse-trading individual racks and maintaining overall capacity limits; need a fleet ledger to know what's assigned to whom.
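[ed: a minimal sketch of what such a fleet ledger might look like — per-team rack allowances instead of humans horse-trading individual racks. Class and field names are hypothetical, not the real system.]

```python
class FleetLedger:
    """Hypothetical ledger: which team owns which rack, under a per-team allowance."""

    def __init__(self, allowances: dict[str, int]):
        self.allowances = allowances            # team -> max racks it may hold
        self.assignments: dict[str, str] = {}   # rack -> owning team

    def owned(self, team: str) -> int:
        """How many racks a team currently holds."""
        return sum(1 for t in self.assignments.values() if t == team)

    def assign(self, rack: str, team: str) -> bool:
        """Grant a rack only if it is free and the team is under its allowance."""
        if rack in self.assignments or self.owned(team) >= self.allowances.get(team, 0):
            return False
        self.assignments[rack] = team
        return True

    def release(self, rack: str) -> None:
        """Return a rack to the free pool."""
        self.assignments.pop(rack, None)
```

[ed: the point of the ledger is that capacity limits get enforced mechanically at assignment time, not negotiated per-rack by people.]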
Shard by failure domains, stay consistent for each shard #SystemsAtScale
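[ed: "shard by failure domains, stay consistent per shard" can be sketched like this — group machines by their failure domain, and let each shard be owned by a single, strongly consistent scheduler instance. A toy model, assuming a flat machine → domain mapping.]

```python
from collections import defaultdict

def shard_by_failure_domain(machines: dict[str, str]) -> dict[str, list[str]]:
    """Group machines (name -> failure domain) into one shard per domain.

    Each shard can then be managed by one scheduler with strong consistency,
    while cross-shard coordination stays loose.
    """
    shards: dict[str, list[str]] = defaultdict(list)
    for machine, domain in sorted(machines.items()):
        shards[domain].append(machine)
    return dict(shards)
```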
but multiple schedulers allowed as long as allocation is centrally handled. Most will still use Tupperware but others may want to vary. #SystemsAtScale
Goal to have a container allocation frontend that lets you request any size of container (whether whole machine or less than machine). #SystemsAtScale
Binpacking multiple unrelated containers onto a single machine is not a priority due to concerns about noisy neighbors; many teams are using entire single machines and optimized for it. #SystemsAtScale
Showing an incomplete architecture diagram, pieces of which *will* be built over time. [ed: interested to see how this does turn out over time; this is definitely an unvarnished look into FB's state of infra] #SystemsAtScale
Bare metal, bit twiddling, and no virtualization are starting to run headlong into the problem of trying to migrate. #SystemsAtScale
Phases: moving everything tupperware into the twshared pool, then migrating the large customized scheduling solutions into tupperware. #SystemsAtScale
Good news is that Tupperware has existed 7+ years. Most services by count (but not by fleet size) use it; however, ~45% of the fleet is split across assorted private pools, ~45% is non-Tupperware, and only ~10% is on the "twshared" shared pool. #SystemsAtScale
Compute as a service needs to bridge between service and batch scheduling on one side, and maintenance/network on the other.
Storage regarded as out of scope for now. #SystemsAtScale
It shouldn't matter to services why they're being moved, as long as they can be moved elsewhere with intent-based automation. #SystemsAtScale
Goal is to avoid many:many interactions between constraints by having a single set of schedulers, with compute and maintenance as a service. #SystemsAtScale
There's no automated long-term migration scheme for longer-term removal of capacity. #SystemsAtScale
Still need an orchestration layer on top to avoid creating SPOFs and tripping over them. #SystemsAtScale
For instance, services have to implement their own handlers for being drained by automation, but those handlers may not be implemented correctly, or across the board. #SystemsAtScale
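[ed: what "a handler for being drained" means in practice, sketched as a toy interface — stop taking new work, let in-flight work finish, then report the instance safe to move. Entirely illustrative; not FB's actual API.]

```python
class DrainHandler:
    """Hypothetical per-instance drain protocol for intent-based automation."""

    def __init__(self):
        self.accepting = True
        self.inflight = 0

    def handle(self, _request) -> bool:
        """Serve a request, unless this instance is draining."""
        if not self.accepting:
            return False  # caller should retry on another instance
        self.inflight += 1
        # ... do the actual work here ...
        self.inflight -= 1
        return True

    def drain(self) -> bool:
        """Stop accepting new work; return True once safe to evacuate."""
        self.accepting = False
        return self.inflight == 0
```

[ed: the failure mode the tweet describes is exactly a service shipping a `drain` that returns True while work is still in flight — correctness here is on each team.]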
"You don't wake up one day and realize that having humans do it by hand is a bad idea, you already know." -- but it's difficult to prioritize doing automation in a consistent manner. #SystemsAtScale
Rack network hotspotting and saturation are also a limit, because network-heavy loads congregate on the same rack. #SystemsAtScale
Solving it involves manually arranging machine swaps with other services OR re-racking machines [ed: ow ow ow ow] #SystemsAtScale
Currently Facebook doesn't spread services across fault domains; services often have contiguous racks within a power domain [ed: ow ow ow ow] #SystemsAtScale
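[ed: the fix for the above is classic spread placement — put replicas on racks in the least-loaded power domain so one power failure takes out as few replicas as possible. A greedy toy version, with all names my own.]

```python
from collections import Counter

def spread_placement(replicas: int, racks: dict[str, str]) -> list[str]:
    """Pick racks for `replicas` copies of a service, one rack per replica.

    `racks` maps rack name -> power domain. Greedily prefer a free rack in the
    power domain that currently holds the fewest replicas (ties broken by name),
    so replicas end up spread across fault domains instead of contiguous.
    """
    used: Counter = Counter()          # power domain -> replicas placed there
    available = dict(racks)            # racks not yet used
    placement: list[str] = []
    for _ in range(min(replicas, len(available))):
        rack = min(available, key=lambda r: (used[available[r]], r))
        domain = available.pop(rack)
        used[domain] += 1
        placement.append(rack)
    return placement
```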
Trying to move from hundreds of dedicated pools of machines to having a single fleet. [ed: it's remarkable that Facebook has gotten to where it is today *without* Borg-equivalent] #SystemsAtScale
Key problems: resizing, moving, turnups, and disaster testing requires too much time and attention. #SystemsAtScale
Finally, @benjchristensen of @fb_engineering on Facebook's compute IaaS. #SystemsAtScale