“Shoutout to Brad, who’s still using Perl” #Monitorama
any of you weird nerds gonna be at #monitorama next week
It was a drunk guy, and our neighbor seated at the bar (also a #monitorama attendee!) and the bartender both had our backs.
The dude got cut off, and would have gotten thrown out had he pushed it any further. Good times.
How bad of an outage do we need to have before people are happy to pay for the software we build? [fin] #monitorama
(5) but there's a lot of demand for good product design: large-scale, heterogeneous data, and low-latency feedback paths (including over longer periods of time, not just instantaneous), and nobody wants to pay. #monitorama
(3) Use boring technology and combine it in innovative ways. [ed: although I... worry about older storage engines, and think that their cost economics may not be up to snuff]
(4) It's a crowded market, and the "best" product may not win. #monitorama
Business models and technical innovation are evolving. A lot of things build on each other, e.g. Lucene, ELK stacks, etc., rather than being invented out of whole cloth. #monitorama
(2) New query languages are rarely the solution; a new one is not likely to succeed. Everyone uses SQL, so use it too unless you have a good reason not to. [ed: and this is why I made @honeycombio's builder say VISUALIZE _ WHERE _ GROUP BY _ ORDER BY _ LIMIT _] #monitorama
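A hedged illustration of that lesson (the `build_query` function and the example columns are my invention, not Honeycomb's actual API): a slot-based builder can compile one-to-one to SQL, so users never have to learn a new query language.

```python
# Hypothetical sketch: a constrained query builder whose slots compile
# directly to SQL. Names and columns are illustrative only.
def build_query(visualize, where, group_by, order_by, limit):
    """Fill the VISUALIZE _ WHERE _ GROUP BY _ ORDER BY _ LIMIT _ slots
    and emit the equivalent SQL."""
    return (
        f"SELECT {group_by}, {visualize} AS value "
        f"FROM events WHERE {where} "
        f"GROUP BY {group_by} ORDER BY {order_by} LIMIT {limit}"
    )

sql = build_query(
    visualize="COUNT(*)",
    where="status_code >= 500",
    group_by="service",
    order_by="value DESC",
    limit=10,
)
```

Because each slot maps onto a familiar SQL clause, there is exactly one new thing to learn (the slot order), not a whole grammar.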
So the lessons:
(1) logs (inverted indices, little aggregation) and timeseries (data loss tolerant, compresses well, just a bucket of numbers) are different challenges for storage. Often you need separate engines. [ed: although he thinks @honeycombio is interesting] #monitorama
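To make lesson (1)'s contrast concrete, here's a toy sketch (my own illustration, not any particular engine): log search wants an inverted index so every term stays findable, while timeseries wants cheap numeric buckets that aggregate well and can tolerate a lost sample.

```python
from collections import defaultdict

# Inverted index for logs: term -> set of doc ids, little aggregation.
logs = {1: "disk full on db-3", 2: "request timeout", 3: "disk error"}
inverted = defaultdict(set)
for doc_id, line in logs.items():
    for term in line.split():
        inverted[term].add(doc_id)

# Timeseries: just a bucket of numbers per (metric, minute);
# compresses well, and dropping one sample barely moves the average.
samples = [(60, 0.2), (65, 0.4), (130, 0.9)]   # (unix seconds, value)
series = defaultdict(list)
for ts, value in samples:
    series[("latency_p99", ts // 60)].append(value)
avg = {k: sum(v) / len(v) for k, v in series.items()}
```

The access patterns barely overlap, which is why one storage engine rarely serves both well.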
[ed: because I think we're finally at the end of the journey.]
So now we're at Google. So what does Stackdriver use to measure itself/GCP and do planet-scale observability? "mostly good," he says, [ed: and I'd agree based on my 8-month-stale knowledge] #monitorama
So enter InfluxDB, and suddenly being able to measure everything and high cardinality dimensions. (which could have solved loggly's problems)
every generation can monitor the previous generation, but not itself...
[ed: & why I demoed Honeycomb debugging itself] #monitorama
so he wound up going to work at Loggly in 2012. There was a huuuge volume of logs to index, but at least there was partial visibility.
And by 2014 it still wasn't solved :/ #monitorama
In this thread, I help a Google engineer use Google Slides to project slides while keeping separate speaker notes.
This happens repeatedly and is one of the most painful UX issues in Google Slides, one that bites every non-professional presenter. :( :( #monitorama
Finally is @general_order24 on building o11y platforms over the past 10 years, and how it brought him eventually to Stackdriver... #monitorama
Homin concedes: instead of pre-training on your data, maybe just label everything explicitly and correctly to begin with. This functionality is for situations where you have org drift, differing schemas, missing labels, etc. #monitorama
They automatically build the graphs but don't surface them directly to the user; instead, they intend to make them part of Datadog's next generation of ML-powered assistive features. #monitorama
[ed: by that point...to me, that says, have fewer alerts/alarms...]
We then use an o11y graph to get to the bottom of what's happening and expose relevant info.
conclusion: we need to study how people interact with data to improve system o11y. [fin] #monitorama
[ed: I think this talk is... probably a fine talk for people with many potential-cause-based metrics/alerts, but I'm finding it hard to translate to my symptom-and-trace based debugging world]
We're trying to hypothesize why all our 5+ alerts are going off... #monitorama
"Dashboards have gotten a bad rap recently, but they're one of the more useful tools out there if well constructed c.f. @gphat's talk" --Homin
We can find out what metrics matter by how often they're seen/dashboarded. #monitorama
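A minimal sketch of that idea (the dashboards and metric names are invented): how often a metric appears across dashboards is a cheap proxy for how much it matters.

```python
from collections import Counter

# Hypothetical dashboards, each listing the metrics it charts.
dashboards = [
    ["http.request.count", "http.request.latency"],
    ["http.request.latency", "db.connections"],
    ["http.request.latency", "host.cpu"],
]
# Count how many dashboards chart each metric.
popularity = Counter(m for dash in dashboards for m in dash)
top_metric, views = popularity.most_common(1)[0]
```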
There's some useful data on which service is implicated in an alert, which team it goes to, and what metric it's on.
And so on for all our alerts. And we might be able to find temporal leading indications/correlations. #monitorama
Homin says we need to apply ML to messy data with unsupervised/semi-supervised training, combine that with our existing known nodes and the relationships between them, and construct a knowledge graph.
One concrete example: #monitorama
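As a hedged sketch of what such a graph might look like before any ML fills in the gaps (all names invented, not Datadog's implementation): known nodes and typed relationships can be stored as triples and walked, e.g. to route an alert to the team that owns the implicated service.

```python
# Knowledge graph as (subject, relation, object) triples.
triples = [
    ("alert:high-5xx", "fires_on", "metric:http.5xx.rate"),
    ("metric:http.5xx.rate", "emitted_by", "service:checkout"),
    ("service:checkout", "owned_by", "team:payments"),
]

def neighbors(node, relation=None):
    """Objects reachable from node, optionally filtered by relation."""
    return [o for s, r, o in triples
            if s == node and (relation is None or r == relation)]

# Walk alert -> metric -> service -> owning team.
metric = neighbors("alert:high-5xx", "fires_on")[0]
service = neighbors(metric, "emitted_by")[0]
team = neighbors(service, "owned_by")[0]
```

The ML part would propose new triples from messy data; the walk itself stays simple.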
How can we do RCA with our imperfect, unlabeled data? [ed: my Allspaw introject is objecting]
Our messy data does have lots of data on real user interactions, and patterns of how components behave. #monitorama
The hypothesis from Gore (the Gore-Tex company): there's an upper limit to how many working relationships you can efficiently juggle, similar to Dunbar's number.
If you have >150 employees, your standards for o11y will diverge, says Homin. #monitorama
Legacy instrumentation and new instrumentation often have to live side by side, you can't just magically snap fingers and replace it all. #monitorama
Often our observability data looks like a pile of garbage.
So a segue to organizational dynamics, do we ship our org chart in our o11y a la Conway?
Far away teams' metrics don't intuitively make sense to your team! #monitorama
[ed: yay <3 contributing factorS rather than RCA]
but this is imperfect because... not all services are real services! Automatic inference can fail to find gaps in our trace propagation, point out disconnected things, or flag red-herring intern projects as services. #monitorama
So how can we trace an increase in failures to something downstream in the stack (e.g. ooming database)? [ed: so far, what's being demonstrated is timing correlation rather than _trace_ causation/dependency analysis]
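A sketch of the trace-derived dependency analysis the ed note is asking for (services, call edges, and failing set all invented): walk the call graph from the alerting service down to the deepest failing dependency, rather than relying on timing correlation alone.

```python
# Caller -> callees, as derived from trace spans (hypothetical data).
calls = {
    "frontend": ["api"],
    "api": ["cache", "database"],
    "cache": [],
    "database": [],
}
# Services currently showing elevated error rates.
failing = {"frontend", "api", "database"}

def deepest_failing(root):
    """Follow failing callees downward; the failing service with no
    failing callees is the likeliest downstream contributor
    (e.g. an OOMing database)."""
    for callee in calls.get(root, []):
        if callee in failing:
            return deepest_failing(callee)
    return root

suspect = deepest_failing("frontend")
```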
so ideally we surface all the contributing factors #monitorama
[ed: I... am dubious about high cardinality metrics, but okay :)]
and then you have detailed tracing data that are hard to roll up, plus operator events (e.g. deploys), logs, etc.
but how do we get from total information awareness/overload to RCA? #monitorama
Starting off with ye olde o11y definition -- what are we observing?
Homin argues we're observing our services and containers... and our services might be distributed across microservices, serverless, part-monolith, etc.
"Metrics should reflect high cardinality..." #monitorama
Next up is Homin Lee of Datadog, on automated infrastructure graphs. [ed: I'm looking forward to this one, because I'm hoping it will change my mind about service maps] #monitorama