#SystemsAtScale

Latest posts tagged with #SystemsAtScale on Bluesky

Relational Anthropology – A Bigger Picture
Chapter Eighteen explores the interconnectedness of various systems, asserting that internal, familial, political, and ecological structures operate under the same relational laws. It emphasizes that dysfunctions arise from misalignments within these systems, highlighting the importance of coherence, truth, and adaptability in fostering healthy relationships and ecosystems.

Scale is an illusion. Every system runs the same relational physics.
#MicroMacro #SystemsAtScale #RelationalAnthropology #SurvivorLiteracy

Systems @Scale 2018 - Resolving Outages Faster with Better Debugging Strategies | At Scale Conferences
Liz Fong-Jones, Staff Site Reliability Engineer at Google, explains why building more dashboards isn’t the solution — using dynamic query evaluation and integrating tracing is.

Nobody livetweeted or liveblogged it, but that's okay! There is now video of me giving my talk at #SystemsAtScale about #o11y and how singular metrics and graphs need to be mashed up with their context to be successful! https://t.co/SZYXr7vp2W

Facebook updates its service 10 times faster than a decade ago, even though it's 10 times bigger
Turns out it's hard to run a social network with 2.2 billion people.

TIL there were reporters at #SystemsAtScale who tried to get the highlights of the talks -- www.cnet.com/news/how-facebook-operat...

And #SystemsAtScale hosted by @fb_engineering is a wrap! Hope people enjoyed the livetweeting and that I held down the fort okay; few other folks here were heavy Twitter users, for obvious reasons ;)

Q: why not k8s? A: "yes, we do read." But k8s won't scale to the number of machines, and trying to force scaling changes into it may not meet the k8s community's needs. (same reason that Netflix released its own orchestrator instead of using k8s or mesos) #SystemsAtScale

But the work for now is just to get everything standardized and out of private pools to remove the human element from fleet operations. [fin] #SystemsAtScale

Also have a desire to create, in the far future, update domains that allow spreading services evenly across a standardized set of failure domains. #SystemsAtScale
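
[ed: a rough sketch, in Python, of what round-robin spreading across a standardized set of failure domains could look like; the function and domain names are my own invention, not Facebook's API]

```python
# Hypothetical sketch: spread a service's tasks evenly across failure
# domains so no domain holds more than one task above the minimum.
from collections import defaultdict
from itertools import cycle

def spread_tasks(task_ids, failure_domains):
    """Round-robin tasks across failure domains."""
    placement = defaultdict(list)
    for task, domain in zip(task_ids, cycle(failure_domains)):
        placement[domain].append(task)
    return dict(placement)

# 7 tasks across 3 (made-up) power domains -> shard sizes 3, 2, 2
print(spread_tasks([f"task-{i}" for i in range(7)],
                   ["pd-1", "pd-2", "pd-3"]))
```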

Don't accept maintenance events until schedulers are confident that things are sufficiently out of the way that they won't suffer impact; however, potentially years out from being able to do that fully automatically. #SystemsAtScale

New datacenter tooling to reason about and schedule maintenances to not conflict with each other and not cause undue impact on services. #SystemsAtScale
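
[ed: a minimal sketch of the idea in this and the previous post: a calendar that defers maintenance requests which overlap the same failure domain, or that would exceed a crude impact budget; all names and the budget knob are hypothetical, not the actual tooling]

```python
# Toy maintenance calendar: accept a maintenance only if it doesn't
# conflict with already-accepted ones; otherwise the caller must defer.
from dataclasses import dataclass, field

@dataclass
class Maintenance:
    domain: str   # e.g. a rack or power domain
    start: int    # epoch seconds
    end: int

@dataclass
class MaintenanceCalendar:
    accepted: list = field(default_factory=list)
    max_concurrent_domains: int = 1   # crude stand-in for an impact budget

    def conflicts(self, req):
        overlapping = [m for m in self.accepted
                       if m.start < req.end and req.start < m.end]
        # Reject same-domain overlap, or too many domains down at once.
        return (any(m.domain == req.domain for m in overlapping)
                or len({m.domain for m in overlapping})
                >= self.max_concurrent_domains)

    def try_accept(self, req):
        if self.conflicts(req):
            return False   # defer: not confident it's safe yet
        self.accepted.append(req)
        return True
```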

Resource allowance system to be created to avoid humans being involved in horse-trading individual racks and maintaining overall capacity limits; need a fleet ledger to know what's assigned to whom.

Shard by failure domains, stay consistent for each shard #SystemsAtScale
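
[ed: a toy version of such a fleet ledger, sharded by failure domain so each shard's assignments stay consistent on their own; the class and method names are illustrative, not real FB tooling]

```python
# Sketch of a fleet ledger: track which machines are assigned to which
# service, with one independent shard per failure domain.
from collections import defaultdict

class FleetLedger:
    def __init__(self):
        self._shards = defaultdict(dict)   # domain -> {machine: owner}

    def assign(self, domain, machine, owner):
        shard = self._shards[domain]
        if machine in shard:
            raise ValueError(f"{machine} already assigned to {shard[machine]}")
        shard[machine] = owner

    def capacity_by_owner(self, domain):
        """Overall capacity per owner within one failure-domain shard."""
        counts = defaultdict(int)
        for owner in self._shards[domain].values():
            counts[owner] += 1
        return dict(counts)
```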

but multiple schedulers allowed as long as allocation is centrally handled. Most will still use Tupperware but others may want to vary. #SystemsAtScale

Goal to have a container allocation frontend that lets you request any size of container (whether a whole machine or less than a machine). #SystemsAtScale
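
[ed: a hedged sketch of such a frontend, where a whole machine is just the largest container size and allocation stays in one central place, as the post above describes; ContainerRequest, Machine, and allocate are names I made up]

```python
# Illustrative first-fit allocator: whole-machine requests are just a
# special case of a sized container request.
from dataclasses import dataclass

@dataclass
class ContainerRequest:
    cpus: float
    memory_gb: float
    whole_machine: bool = False   # "give me the full box"

@dataclass
class Machine:
    name: str
    cpus: float
    memory_gb: float
    free_cpus: float
    free_memory_gb: float

def allocate(req, machines):
    for m in machines:
        if req.whole_machine:
            # Only an entirely idle machine can be handed out whole.
            if m.free_cpus == m.cpus and m.free_memory_gb == m.memory_gb:
                m.free_cpus = m.free_memory_gb = 0
                return m.name
        elif m.free_cpus >= req.cpus and m.free_memory_gb >= req.memory_gb:
            m.free_cpus -= req.cpus
            m.free_memory_gb -= req.memory_gb
            return m.name
    raise RuntimeError("no capacity")
```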

Binpacking multiple unrelated containers onto a single machine is not a priority due to concerns about noisy neighbors; many teams use entire single machines and have optimized for that. #SystemsAtScale

Showing an incomplete architecture diagram, pieces of which *will* be built over time. [ed: interested to see how this does turn out over time; this is definitely an unvarnished look into FB's state of infra] #SystemsAtScale

Bare metal, bit twiddling, and no virtualization are starting to run headlong into the problem of trying to migrate. #SystemsAtScale

Phases: moving everything already on Tupperware into the twshared pool, then migrating the large customized scheduling solutions into Tupperware. #SystemsAtScale

Good news is that Tupperware has existed for 7+ years. Most services by count, but not by fleet size, are using it: ~45% of the fleet is Tupperware in assorted private pools, ~45% is non-Tupperware, and only ~10% is on the "twshared" shared pool. #SystemsAtScale

Compute as a service needs to bridge between service and batch scheduling on one side, and maintenance/network on the other.

Storage regarded as out of scope for now. #SystemsAtScale

It shouldn't matter to services why they're being moved, as long as they can be moved elsewhere with intent-based automation. #SystemsAtScale

Goal is to avoid many:many interactions between constraints by having a single set of schedulers, with compute and maintenance as a service. #SystemsAtScale

There's no automated migration scheme for longer-term removal of capacity. #SystemsAtScale

Still need an orchestration layer on top to avoid creating SPOFs and tripping over them. #SystemsAtScale

For instance, services have to implement their own handlers for being drained by automation, but those handlers may not be implemented correctly or across the board. #SystemsAtScale
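
[ed: the kind of drain hook being described might look something like this; the DrainHandler interface and method names here are hypothetical, just to make the idea concrete]

```python
# Sketch of a per-service drain hook that automation would call before
# taking a machine away from a service.
import abc

class DrainHandler(abc.ABC):
    @abc.abstractmethod
    def prepare_drain(self, machine: str, deadline_s: int) -> bool:
        """Move work off `machine`; return True once it is safe to
        proceed, or False to defer the drain."""

class StatelessServiceHandler(DrainHandler):
    def prepare_drain(self, machine, deadline_s):
        # A stateless service can simply stop routing to the machine.
        print(f"removing {machine} from the load balancer")
        return True
```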

"You don't wake up one day and realize that having humans do it by hand is a bad idea, you already know." -- but it's difficult to prioritize doing automation in a consistent manner. #SystemsAtScale

Rack network hotspotting and saturation is also a limit, because network-heavy loads congregate on the same rack. #SystemsAtScale

Solving it involves manually arranging machine swaps with other services OR re-racking machines [ed: ow ow ow ow] #SystemsAtScale

Currently Facebook doesn't spread services across fault domains; services often have contiguous racks within a power domain [ed: ow ow ow ow] #SystemsAtScale

Trying to move from hundreds of dedicated pools of machines to having a single fleet. [ed: it's remarkable that Facebook has gotten to where it is today *without* a Borg equivalent] #SystemsAtScale

Key problems: resizing, moving, turnups, and disaster testing require too much time and attention. #SystemsAtScale

Finally, @benjchristensen of @fb_engineering on Facebook's compute IaaS. #SystemsAtScale
