“Shoutout to Brad, who’s still using Perl” #Monitorama
any of you weird nerds gonna be at #monitorama next week
It was a drunk guy, and our neighbor seated at the bar (also a #monitorama attendee!) and the bartender both had our backs.
The dude got cut off, and would have gotten thrown out had he pushed it any further. Good times.
How bad of an outage do we need to have before people are happy to pay for the software we build? [fin] #monitorama
(5) but there's a lot of demand for good product design: large-scale, heterogeneous data, and low-latency feedback paths (including over longer periods of time, not just instantaneous), and nobody wants to pay. #monitorama
(3) Use boring technology and combine it in innovative ways. [ed: although I... worry about older storage engines, and think that their cost economics may not be up to snuff]
(4) It's a crowded market, and the "best" product may not win. #monitorama
Business models and technical innovation are evolving. A lot of things build on each other, e.g. Lucene, ELK stacks, etc., rather than being invented out of whole cloth. #monitorama
(2) New query languages are rarely the solution; a new one is not likely to succeed. Everyone uses SQL, so use it too unless you have a good reason not to. [ed: and this is why I made @honeycombio's builder say VISUALIZE _ WHERE _ GROUP BY _ ORDER BY _ LIMIT _] #monitorama
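A hedged illustration of that lesson (the `build_query` function and the example columns are my invention, not Honeycomb's actual API): a slot-based builder can compile one-to-one to SQL, so users never have to learn a new query language.

```python
# Hypothetical sketch: a constrained query builder whose slots compile
# directly to SQL. Names and columns are illustrative only.
def build_query(visualize, where, group_by, order_by, limit):
    """Fill the VISUALIZE _ WHERE _ GROUP BY _ ORDER BY _ LIMIT _ slots
    and emit the equivalent SQL."""
    return (
        f"SELECT {group_by}, {visualize} AS value "
        f"FROM events WHERE {where} "
        f"GROUP BY {group_by} ORDER BY {order_by} LIMIT {limit}"
    )

sql = build_query(
    visualize="COUNT(*)",
    where="status_code >= 500",
    group_by="service",
    order_by="value DESC",
    limit=10,
)
```

Because each slot maps onto a familiar SQL clause, there is exactly one new thing to learn (the slot order), not a whole grammar.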
So the lessons:
(1) logs (inverted indices, little aggregation) and timeseries (data loss tolerant, compresses well, just a bucket of numbers) are different challenges for storage. Often you need separate engines. [ed: although he thinks @honeycombio is interesting] #monitorama
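To make lesson (1)'s contrast concrete, here's a toy sketch (my own illustration, not any particular engine): log search wants an inverted index so every term stays findable, while timeseries wants cheap numeric buckets that aggregate well and can tolerate a lost sample.

```python
from collections import defaultdict

# Inverted index for logs: term -> set of doc ids, little aggregation.
logs = {1: "disk full on db-3", 2: "request timeout", 3: "disk error"}
inverted = defaultdict(set)
for doc_id, line in logs.items():
    for term in line.split():
        inverted[term].add(doc_id)

# Timeseries: just a bucket of numbers per (metric, minute);
# compresses well, and dropping one sample barely moves the average.
samples = [(60, 0.2), (65, 0.4), (130, 0.9)]   # (unix seconds, value)
series = defaultdict(list)
for ts, value in samples:
    series[("latency_p99", ts // 60)].append(value)
avg = {k: sum(v) / len(v) for k, v in series.items()}
```

The access patterns barely overlap, which is why one storage engine rarely serves both well.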
[ed: because I think we're finally at the end of the journey.]
So now we're at Google. So what does Stackdriver use to measure itself/GCP and do planet-scale observability? "mostly good," he says, [ed: and I'd agree based on my 8-month-stale knowledge] #monitorama
So enter InfluxDB, and suddenly being able to measure everything and high cardinality dimensions. (which could have solved loggly's problems)
every generation can monitor the previous generation, but not itself...
[ed: & why I demoed Honeycomb debugging itself] #monitorama
so he wound up going to work at Loggly in 2012. There was a huuuge volume of logs to index, but at least there was partial visibility.
And by 2014 it still wasn't solved :/ #monitorama
In this thread, I help a Google engineer use Google Slides to project slides while keeping separate speaker notes.
This happens repeatedly and is one of the most painful UX issues in Google Slides, one that bites every non-professional presenter. :( :( #monitorama
Finally is @general_order24 on building o11y platforms over the past 10 years, and how it brought him eventually to Stackdriver... #monitorama
Homin concedes: instead of pre-training on your data, maybe just label everything explicitly and correctly to begin with. This functionality is for situations where you have org drift, differing schemas, missing labels, etc. #monitorama
They automatically build the graphs but don't surface them directly to the user; instead, they intend to make them part of Datadog's next generation of ML-powered assistive features. #monitorama
[ed: by that point...to me, that says, have fewer alerts/alarms...]
We then use an o11y graph to get to the bottom of what's happening and expose relevant info.
conclusion: we need to study how people interact with data to improve system o11y. [fin] #monitorama
[ed: I think this talk is... probably a fine talk for people with many potential-cause-based metrics/alerts, but I'm finding it hard to translate to my symptom-and-trace based debugging world]
We're trying to hypothesize why all our 5+ alerts are going off... #monitorama
"Dashboards have gotten a bad rap recently, but they're one of the more useful tools out there if well constructed c.f. @gphat's talk" --Homin
We can find out what metrics matter by how often they're seen/dashboarded. #monitorama
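A minimal sketch of that idea (the dashboards and metric names are invented): how often a metric appears across dashboards is a cheap proxy for how much it matters.

```python
from collections import Counter

# Hypothetical dashboards, each listing the metrics it charts.
dashboards = [
    ["http.request.count", "http.request.latency"],
    ["http.request.latency", "db.connections"],
    ["http.request.latency", "host.cpu"],
]
# Count how many dashboards chart each metric.
popularity = Counter(m for dash in dashboards for m in dash)
top_metric, views = popularity.most_common(1)[0]
```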
There's some useful data on which service is implicated in an alert, which team it goes to, and what metric it's on.
And so on for all our alerts. And we might be able to find temporal leading indications/correlations. #monitorama
Homin says we need to apply ML to messy data with unsupervised/semi-supervised training, combine that with our existing known nodes and the relationships between them, and construct a knowledge graph.
One concrete example: #monitorama
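As a hedged sketch of what such a graph might look like before any ML fills in the gaps (all names invented, not Datadog's implementation): known nodes and typed relationships can be stored as triples and walked, e.g. to route an alert to the team that owns the implicated service.

```python
# Knowledge graph as (subject, relation, object) triples.
triples = [
    ("alert:high-5xx", "fires_on", "metric:http.5xx.rate"),
    ("metric:http.5xx.rate", "emitted_by", "service:checkout"),
    ("service:checkout", "owned_by", "team:payments"),
]

def neighbors(node, relation=None):
    """Objects reachable from node, optionally filtered by relation."""
    return [o for s, r, o in triples
            if s == node and (relation is None or r == relation)]

# Walk alert -> metric -> service -> owning team.
metric = neighbors("alert:high-5xx", "fires_on")[0]
service = neighbors(metric, "emitted_by")[0]
team = neighbors(service, "owned_by")[0]
```

The ML part would propose new triples from messy data; the walk itself stays simple.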
How can we do RCA with our imperfect, unlabeled data? [ed: my Allspaw introject is objecting]
Our messy data does have lots of data on real user interactions, and patterns of how components behave. #monitorama
The hypothesis from Gore (the Gore-Tex company): there's an upper limit to how many working relationships you can efficiently juggle, similar to Dunbar's number.
If you have >150 employees, your standards for o11y will diverge, says Homin. #monitorama
Legacy instrumentation and new instrumentation often have to live side by side, you can't just magically snap fingers and replace it all. #monitorama
Often our observability data looks like a pile of garbage.
So a segue to organizational dynamics, do we ship our org chart in our o11y a la Conway?
Far away teams' metrics don't intuitively make sense to your team! #monitorama
[ed: yay <3 contributing factorS rather than RCA]
but this is imperfect because... not all services are real services! Automatic inference can fail to find gaps in our trace propagation, point out disconnected things, or flag red-herring intern projects as services. #monitorama
So how can we trace an increase in failures to something downstream in the stack (e.g. ooming database)? [ed: so far, what's being demonstrated is timing correlation rather than _trace_ causation/dependency analysis]
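A sketch of the trace-derived dependency analysis the ed note is asking for (services, call edges, and failing set all invented): walk the call graph from the alerting service down to the deepest failing dependency, rather than relying on timing correlation alone.

```python
# Caller -> callees, as derived from trace spans (hypothetical data).
calls = {
    "frontend": ["api"],
    "api": ["cache", "database"],
    "cache": [],
    "database": [],
}
# Services currently showing elevated error rates.
failing = {"frontend", "api", "database"}

def deepest_failing(root):
    """Follow failing callees downward; the failing service with no
    failing callees is the likeliest downstream contributor
    (e.g. an OOMing database)."""
    for callee in calls.get(root, []):
        if callee in failing:
            return deepest_failing(callee)
    return root

suspect = deepest_failing("frontend")
```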
so ideally we surface all the contributing factors #monitorama
[ed: I... am dubious about high cardinality metrics, but okay :)]
and then you have detailed tracing data that are hard to roll up, plus operator events (e.g. deploys), logs, etc.
but how do we get from total information awareness/overload to RCA? #monitorama
Starting off with ye olde o11y definition -- what are we observing?
Homin argues we're observing our services and containers... and our services might be distributed across microservices, serverless, part-monolith, etc.
"Metrics should reflect high cardinality..." #monitorama
Next up is Homin Lee of Datadog, on automated infrastructure graphs. [ed: I'm looking forward to this one, because I'm hoping it will change my mind about service maps] #monitorama