Scale is an illusion. Every system runs the same relational physics.
#MicroMacro #SystemsAtScale #RelationalAnthropology #SurvivorLiteracy
Latest posts tagged with #SystemsAtScale on Bluesky
Nobody livetweeted or liveblogged it, but that's okay! There's now video of me giving my talk at #SystemsAtScale about #o11y and how singular metrics and graphs need the ability to be mashed up with their context to be successful! https://t.co/SZYXr7vp2W
TIL there were reporters at #SystemsAtScale who tried to get the highlights of the talks -- www.cnet.com/news/how-facebook-operat...
And #SystemsAtScale hosted by @fb_engineering is a wrap! Hope people enjoyed the livetweeting and that I held down the fort okay; few other folks here were heavy Twitter users, for obvious reasons ;)
Q: why not k8s? A: "yes, we do read." but K8s won't scale to the number of machines, and trying to force scaling changes in may not meet the k8s community's needs. (same reason that Netflix released its own orchestrator instead of using k8s or mesos) #SystemsAtScale
But the work for now is just to get everything standardized and out of private pools to remove human element from fleet operations. [fin] #SystemsAtScale
Also have a desire to create, in the far future, update domains that allow spreading services evenly across a standardized set of failure domains. #SystemsAtScale
Don't accept maintenance events until schedulers are confident that things are sufficiently out of the way that they won't suffer impact; however, potentially years out from being able to do that fully automatically. #SystemsAtScale
New datacenter tooling to reason about and schedule maintenances to not conflict with each other and not cause undue impact on services. #SystemsAtScale
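[ed: the conflict-avoidance idea above can be sketched as a simple interval check — a maintenance gets booked only if it doesn't overlap an already-booked window on the same domain. All names and the time model here are my own illustration, not FB's actual tooling.]

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Maintenance:
    """A proposed maintenance window on one failure domain (hypothetical model)."""
    domain: str  # e.g. a rack or power-domain id
    start: int   # window start (epoch seconds)
    end: int     # window end (epoch seconds)

def conflicts(a: Maintenance, b: Maintenance) -> bool:
    """Two maintenances conflict if they touch the same domain and overlap in time."""
    return a.domain == b.domain and a.start < b.end and b.start < a.end

def schedule(booked: list[Maintenance], proposed: Maintenance) -> bool:
    """Accept the proposed maintenance only if it conflicts with nothing already booked."""
    if any(conflicts(proposed, m) for m in booked):
        return False
    booked.append(proposed)
    return True
```

[ed: the real problem also has to reason about *service* impact, not just domain overlap, which is why L12's "years out" caveat applies.]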
Resource allowance system to be created to avoid humans being involved in horse-trading individual racks and maintaining overall capacity limits; need a fleet ledger to know what's assigned to whom.
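[ed: a minimal sketch of what such a fleet ledger might look like — per-team rack allowances instead of humans horse-trading individual racks. Class and field names are hypothetical, not the real system.]

```python
class FleetLedger:
    """Hypothetical ledger: which team owns which rack, under a per-team allowance."""

    def __init__(self, allowances: dict[str, int]):
        self.allowances = allowances            # team -> max racks it may hold
        self.assignments: dict[str, str] = {}   # rack -> owning team

    def owned(self, team: str) -> int:
        """How many racks a team currently holds."""
        return sum(1 for t in self.assignments.values() if t == team)

    def assign(self, rack: str, team: str) -> bool:
        """Grant a rack only if it is free and the team is under its allowance."""
        if rack in self.assignments or self.owned(team) >= self.allowances.get(team, 0):
            return False
        self.assignments[rack] = team
        return True

    def release(self, rack: str) -> None:
        """Return a rack to the free pool."""
        self.assignments.pop(rack, None)
```

[ed: the point of the ledger is that capacity limits get enforced mechanically at assignment time, not negotiated per-rack by people.]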
Shard by failure domains, stay consistent for each shard #SystemsAtScale
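[ed: "shard by failure domains, stay consistent per shard" can be sketched like this — group machines by their failure domain, and let each shard be owned by a single, strongly consistent scheduler instance. A toy model, assuming a flat machine → domain mapping.]

```python
from collections import defaultdict

def shard_by_failure_domain(machines: dict[str, str]) -> dict[str, list[str]]:
    """Group machines (name -> failure domain) into one shard per domain.

    Each shard can then be managed by one scheduler with strong consistency,
    while cross-shard coordination stays loose.
    """
    shards: dict[str, list[str]] = defaultdict(list)
    for machine, domain in sorted(machines.items()):
        shards[domain].append(machine)
    return dict(shards)
```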
but multiple schedulers allowed as long as allocation is centrally handled. Most will still use Tupperware but others may want to vary. #SystemsAtScale
Goal to have a container allocation frontend that lets you request any size of container (whether whole machine or less than machine). #SystemsAtScale
Binpacking multiple unrelated containers onto a single machine is not a priority due to concerns about noisy neighbors; many teams are using entire single machines and optimized for it. #SystemsAtScale
Showing an incomplete architecture diagram, pieces of which *will* be built over time. [ed: interested to see how this does turn out over time; this is definitely an unvarnished look into FB's state of infra] #SystemsAtScale
Bare metal, bit twiddling, and no virtualization are starting to run headlong into the problem of trying to migrate. #SystemsAtScale
Phases: moving everything tupperware into the twshared pool, then migrating the large customized scheduling solutions into tupperware. #SystemsAtScale
Good news is that Tupperware has existed 7+ years. Most services by count (but not by fleet size) use it; however, ~45% of the fleet is split across assorted private pools, ~45% is non-Tupperware, and only ~10% is on the "twshared" shared pool. #SystemsAtScale
Compute as a service needs to bridge between service and batch scheduling on one side, and maintenance/network on the other.
Storage regarded as out of scope for now. #SystemsAtScale
It shouldn't matter to services why they're being moved, as long as they can be moved elsewhere with intent-based automation. #SystemsAtScale
Goal is to avoid many:many interactions between constraints by having a single set of schedulers, with compute and maintenance as a service. #SystemsAtScale
There's no automated long-term migration scheme for longer-term removal of capacity. #SystemsAtScale
Still need an orchestration layer on top to avoid creating SPOFs and tripping over them. #SystemsAtScale
For instance, services have to implement their own handlers for being drained by automation, but those handlers may not be implemented correctly, or across the board. #SystemsAtScale
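[ed: what "a handler for being drained" means in practice, sketched as a toy interface — stop taking new work, let in-flight work finish, then report the instance safe to move. Entirely illustrative; not FB's actual API.]

```python
class DrainHandler:
    """Hypothetical per-instance drain protocol for intent-based automation."""

    def __init__(self):
        self.accepting = True
        self.inflight = 0

    def handle(self, _request) -> bool:
        """Serve a request, unless this instance is draining."""
        if not self.accepting:
            return False  # caller should retry on another instance
        self.inflight += 1
        # ... do the actual work here ...
        self.inflight -= 1
        return True

    def drain(self) -> bool:
        """Stop accepting new work; return True once safe to evacuate."""
        self.accepting = False
        return self.inflight == 0
```

[ed: the failure mode the tweet describes is exactly a service shipping a `drain` that returns True while work is still in flight — correctness here is on each team.]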
"You don't wake up one day and realize that having humans do it by hand is a bad idea, you already know." -- but it's difficult to prioritize doing automation in a consistent manner. #SystemsAtScale
Rack network hotspotting and saturation are also a limit, because network-heavy loads congregate on the same rack. #SystemsAtScale
Solving it involves manually arranging machine swaps with other services OR re-racking machines [ed: ow ow ow ow] #SystemsAtScale
Currently Facebook doesn't spread services across fault domains; services often have contiguous racks within a power domain [ed: ow ow ow ow] #SystemsAtScale
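[ed: the fix for the above is classic spread placement — put replicas on racks in the least-loaded power domain so one power failure takes out as few replicas as possible. A greedy toy version, with all names my own.]

```python
from collections import Counter

def spread_placement(replicas: int, racks: dict[str, str]) -> list[str]:
    """Pick racks for `replicas` copies of a service, one rack per replica.

    `racks` maps rack name -> power domain. Greedily prefer a free rack in the
    power domain that currently holds the fewest replicas (ties broken by name),
    so replicas end up spread across fault domains instead of contiguous.
    """
    used: Counter = Counter()          # power domain -> replicas placed there
    available = dict(racks)            # racks not yet used
    placement: list[str] = []
    for _ in range(min(replicas, len(available))):
        rack = min(available, key=lambda r: (used[available[r]], r))
        domain = available.pop(rack)
        used[domain] += 1
        placement.append(rack)
    return placement
```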
Trying to move from hundreds of dedicated pools of machines to having a single fleet. [ed: it's remarkable that Facebook has gotten to where it is today *without* Borg-equivalent] #SystemsAtScale
Key problems: resizing, moving, turnups, and disaster testing requires too much time and attention. #SystemsAtScale
Finally, @benjchristensen of @fb_engineering on Facebook's compute IaaS. #SystemsAtScale