Part 3/5: Dear Operations Engineers ...

It’s time to shrug off the last vestiges of that martyr complex we’ve been trudging around with since the bad old days of the BOFH. We’ve got better things to do with our lives than being assholes to everyone.

Stop trying to predict every possible failure — you can’t, anyhow — and stop toiling away half your life creating dashboards for people, dashboards of dashboards, and ways to auto-generate dashboards and metrics (that nobody can ever seem to find when they most need them).

Honeycomb grew out of the best of the operations and data disciplines. We are grounded in a fervent belief in the power and necessity of raw events, and in the conviction that it’s better to be whip-fast, interactive, exploratory, and “close enough” than to claim or aim for 100% accuracy.

Systems engineers should have nice things

Honeycomb was built by ops engineers, for ops engineers. Because we love you, and we want your lives to be better.

Business Intelligence teams have had nice things for years, because the fiscal consequences have always been clear. You could always draw a straight line from “more business intelligence” to “making more money.”

We believe this is increasingly clear for systems too, and we can start making that argument convincingly. How much money gets lost when a site is down for 5 minutes, when shopping cart conversions fail for 5 minutes, or when a page takes two seconds to load? A lot, actually.

Old Way ❌: Over-page yourselves on symptoms, because you don’t trust yourselves to debug complex problems without paging on lots of low-level signals, which are inherently flappy and unreliable.

New Way: Create two lanes for alerts – things that are worth waking people up for at 3 am, and things that need to be dealt with eventually … like when you roll in to work at 11 am after a good night’s sleep. Have enough confidence in yourselves and your debugging tools that you don’t need to wake up and judge every flap for yourself; align your pain with customer pain.

Operations teams are chronically over-paging themselves and burning themselves out. Often this is because they are paging on low-level symptoms rather than on end-to-end code paths or top-level metrics. But engineering pain should be strongly aligned with customer pain; if customers are unaffected, engineers shouldn’t be woken up either.

Honeycomb solves this by giving you the confidence to debug complex problems in a fraction of the time. Teams over-page themselves because they lack confidence in their tooling: they have to roll the dice and monitor every symptom, because they don’t or can’t trust their tools to page them only when customers are affected.
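
What might those two lanes look like in practice? Here is a minimal sketch in Python; the alert names, the customer_impact flag, and the page()/ticket() helpers are all hypothetical stand-ins for whatever your real paging and ticketing integrations provide.

```python
# A minimal sketch of two-lane alert routing. The alert names, the
# customer_impact flag, and the page()/ticket() helpers are hypothetical;
# substitute your own paging and ticketing integrations.

from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    customer_impact: bool  # does this alert reflect actual customer pain?

def page(alert: Alert) -> None:
    print(f"PAGE (3 am lane): {alert.name}")     # wake a human up

def ticket(alert: Alert) -> None:
    print(f"TICKET (11 am lane): {alert.name}")  # deal with it after a good night's sleep

def route(alert: Alert) -> None:
    # The only thing worth interrupting sleep for is customer pain.
    if alert.customer_impact:
        page(alert)
    else:
        ticket(alert)

route(Alert("api-error-rate-above-slo", customer_impact=True))          # pages
route(Alert("disk-75-percent-full-on-replica", customer_impact=False))  # tickets
```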

We experienced this firsthand at Parse, with Facebook’s Scuba. By getting our events into datasets built for ad hoc queryability, we were able to systematically eliminate category after category of unreproducible errors that collectively added up to a huge impact on our reliability. And we were able to debug practically anything in seconds or minutes, not minutes or hours.

Drop MTTR With One Weird Trick

We know that most system problems happen as a result of a human taking action on the system. That’s why we made vertical markers one of Honeycomb’s earliest and best features.

Old Way ❌: Spend a lot of time puzzling over an unexpected spike or change that has no apparent cause; hours later, almost inevitably, track it down to a human action.

New Way: Draw colorful, dotted vertical lines any time a person runs a command or a script gets run from cron, etc. Wrap your command lines and cron jobs so it’s easy to trace human actions.

Any time a script runs from cron, any time you canary a deploy, any time you run a one-off — Honeycomb can draw a vertical line with a URL to the code tarball, include your name, post it to a Slack room, whatever.
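
For instance, a deploy wrapper might drop a marker with one small API call. Here is a minimal sketch in Python against Honeycomb's markers endpoint; the API key, dataset name, and tarball URL are placeholders for your own values.

```python
# A minimal sketch of wrapping a deploy or cron job so it leaves a marker
# in Honeycomb. The endpoint and field names follow Honeycomb's markers
# API; the API key, dataset name, and tarball URL are placeholders.

import os
import time
import requests

API_KEY = os.environ["HONEYCOMB_API_KEY"]  # your team write key
DATASET = "production"                     # hypothetical dataset name

def create_marker(message, marker_type, url):
    resp = requests.post(
        f"https://api.honeycomb.io/1/markers/{DATASET}",
        headers={"X-Honeycomb-Team": API_KEY},
        json={
            "message": message,            # e.g. who deployed what
            "type": marker_type,           # e.g. "deploy", "cron", "one-off"
            "start_time": int(time.time()),
            "url": url,                    # link back to the code tarball
        },
        timeout=5,
    )
    resp.raise_for_status()

# Call this from your deploy wrapper, right before the rollout begins.
create_marker("charity deployed api v1.2.3", "deploy",
              "https://builds.example.com/api-v1.2.3.tar.gz")
```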

This is really helpful for distributed teams, too. Honeycomb contains a lot of features aimed at helping distributed teams collaborate on a debugging problem, share their work, hand off between on-call shifts, and so on.

Part 2/5: Dear Software Engineers ...

Observability is not a thing for operations or some other team to care about on your behalf. Software engineers, you are increasingly the primary owners of your own services … and this is a great thing.

Developers, meet my friend devops

The devops movement is years old now. Since the very beginning, devops has spent a lot of energy wagging its finger at ops, telling ops to get better at writing software. We’ve spent far less energy helping software engineers internalize the need to own, instrument, and understand the consequences of the software they write and the services they own.

It’s only recently that we’re really starting to bring the dev to devops … finally.

  • “Isn’t exception tracking enough?” Nope. Exception tracking is awesome, but it keeps you thinking about syntax errors and lines of code, not about systems that interact with each other in complicated and unpredictable ways. Catching your exceptions is good, but its scope is limited.
  • “Aren’t metrics enough?” Nope. With the statsd-type model, you only get to work with counters and gauges — which is fast and scalable, but achieves speed by sacrificing context and detail. A pretty heartbreaking choice to make. With Honeycomb you don’t have to (see the sketch below).
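
To make that contrast concrete, here is a sketch in Python: a bare statsd-style counter next to a wide, context-rich event sent with Honeycomb's libhoney SDK. The write key, dataset, and field values are placeholders.

```python
# Contrast sketch: a context-free counter vs. a wide structured event.
# The statsd and libhoney calls follow each library's documented API;
# the write key, dataset, and field values are placeholders.

import statsd
import libhoney

# The metrics way: one number, no context. Fast, but you can never ask
# "WHICH requests erred, for whom, on what endpoint, running which build?"
metrics = statsd.StatsClient("localhost", 8125)
metrics.incr("api.errors")

# The event way: one wide event carrying all the context you have.
libhoney.init(writekey="YOUR_WRITE_KEY", dataset="api-requests")
ev = libhoney.new_event()
ev.add({
    "endpoint": "/1/widgets",
    "status": 500,
    "duration_ms": 238.4,
    "user_id": "8a2f-42c1",  # high-cardinality fields are fine
    "build_id": "v1.2.3",
    "error": "deadline exceeded talking to widget-store",
})
ev.send()
libhoney.close()  # flush before the process exits
```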

In the glorious future, observability is just as important as unit tests, and operational skills – debugging between services and components, degrading gracefully, writing maintainable code, valuing simplicity – are non-negotiable for senior software engineers. Even mobile SWEs and front-end SWEs.

Who owns your availability?

YOU own your availability.

This means that software engineers need to feel comfortable and confident breaking systems, understanding them, experimenting, and fixing them. Instead of being scared to break things, be hungry to risk breaking them … 1% at a time, under controlled circumstances, where you can watch what happens or add a trigger that rolls back or disables the feature.
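
A controlled 1% rollout does not require heavy machinery. Here is a sketch of a deterministic percentage rollout with a kill switch; the flag name and threshold are hypothetical.

```python
# A minimal sketch of a percentage rollout with a kill switch. The flag
# name, the kill-switch variable, and the 1% threshold are hypothetical.

import hashlib

ROLLOUT_PERCENT = 1.0   # start at 1%, dial up as confidence grows
FLAG_DISABLED = False   # flip to True to roll back instantly

def flag_enabled(user_id: str) -> bool:
    if FLAG_DISABLED:
        return False
    # Hash the user id into a stable bucket in [0, 100) so the same
    # users stay in the experiment across requests.
    digest = hashlib.sha256(f"new-checkout-flow:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < ROLLOUT_PERCENT

# Watch the instrumented behavior of the 1% before widening the blast radius.
print(flag_enabled("user-42"))
```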

Because you know what sucks? Having all the responsibility and none of the tools to do your job, or having to maintain two technical and cultural stacks.

We make it way too hard for software engineers to own their services when they have to learn one codebase, environment, and toolset for infra and another for core services. If you have to learn how to use Chef, AWS, and Graphite just to add a metric, is it really reasonable to expect all your SWEs to provision their own metrics? Particularly when a mistake is prone to spilling over and causing an outage.

Old Way ❌: Pick one: context or speed. You could pick raw logs, which are horrifyingly slow and don’t scale but let you track things like latency, lock percentages, raw queries, and query families … or you could pick counters, which tell you the count of errors per time window, or other ticks of data.

New Way: Don’t pick. <3 Have the best of both worlds. The reason context used to be slow is that you were dealing with unstructured logs and string processing, and not wielding vertical sampling to compensate for the increased data of horizontally wide events.

Adding a new detail about an event should be just as trivial as adding a new comment to your code — and just as risk-free. Get rid of the friction that prevents understanding by instrumenting your own systems. Take ownership for yourselves, and become way more badass engineers in the process. Honeycomb empowers you to take control over your own availability.

High- or low-cardinality fields, sparse or rich datasets

Forget your hangups when it comes to cardinality and just log everything, because we can take it and you can use it. Filter on high-cardinality fields (like millions of unique UUIDs), run aggregates, and dip down to the original raw events when you need to. Generate a unique request id and trace it up and down the stack, even if it loops back in multiple times.
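
For example, minting a request id at the edge and stamping it onto every event downstream can be this light (a sketch; the header name, dataset, and field names are illustrative):

```python
# A sketch of minting a request id at the edge and stamping it onto every
# event a service emits. The libhoney calls follow the SDK's documented
# API; the header name, dataset, and field names are illustrative.

import uuid
import libhoney

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="api-requests")

def handle_request(headers):
    # Reuse the caller's id if this request looped back in; mint one otherwise.
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())

    ev = libhoney.new_event()
    ev.add_field("request_id", request_id)  # millions of unique values are fine
    ev.add_field("service", "widget-api")
    ev.send()

    # Hand the same id to every downstream call so each hop stamps its own
    # events with it -- then you can trace one request across the whole stack.
    return {"X-Request-ID": request_id}

downstream_headers = handle_request({})
print(downstream_headers)
```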

Distributed grep is no longer your sole tool for tracking down edge cases. Honeycomb is much faster than traditional log aggregation because of our column-oriented datastore and read-time aggregation — we read only what we need to get you answers ASAP.
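
To see why the columnar layout matters, here is a toy illustration in Python (emphatically not Honeycomb's actual engine): answering "p99 latency for one endpoint" touches only two columns, no matter how wide each event is.

```python
# Toy illustration (not Honeycomb's actual engine) of why a columnar
# layout plus read-time aggregation is fast: this query reads only the
# endpoint and duration columns, never the full events.

events_endpoint = ["/1/widgets", "/1/login", "/1/widgets", "/1/widgets"]
events_duration = [120.0, 30.0, 480.0, 95.0]
# ...dozens of other columns (user_id, build_id, ...) stay untouched on disk.

matching = [d for e, d in zip(events_endpoint, events_duration)
            if e == "/1/widgets"]
matching.sort()
p99 = matching[min(len(matching) - 1, int(len(matching) * 0.99))]
print(f"p99 duration for /1/widgets: {p99}ms")
```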

If you’re undertaking a project with a high degree of difficulty and subtle problems, like an API rewrite or a massive migration, surface health checks simply aren’t enough. Having deep confidence in your power to debug and instrument your own software is transformational. Relying on another team to detect and inform you when your own code has problems is not good enough.

Make data-driven decisions and know the ripple effects of the code that you write, by diving straight in and adding instrumentation without fear.

Part 1/5: Asking Better Questions

Any mature production system is likely to have hundreds of thousands if not millions of metrics, most of which never get looked at by a human.

Metrics and logs are no longer human-scale; they are machine-scale.

We have typically coped with this madness by crafting lots of human-scale dashboards — artisanal, bespoke, handcrafted dashboards made of metrics. And then dashboards of dashboards to track your dashboards … sigh.

Dashboards all the way down

Hey, dashboards are awesome. We totally need dashboards. They tell us lots of important things at a glance, and they help us cognitively process lots of extremely detailed information about the state of the world.

But most dashboards are fixed, static views that were built to surface particular failures. You predict a component is going to break, so you do what? You make a dashboard to help you visualize it or debug it when it does. Every postmortem you’ve ever been in has probably had an action item called “create a dashboard to find this problem faster.”

Dashboards are a terrific view on reality, but they are not a good debugging tool, because they lock you into a set of assumptions.

So what happens when you can no longer predict all (or even most) of the failures? Well … stop trying. Instead, get used to asking questions about your systems. Interactivity is no longer an optional, nice-to-have feature.

Start asking questions

Instead of scrolling through static dashboards, get used to interacting with your systems — asking questions, refining them, and treating it like an interactive service instead of a flat view on a TV screen.

Microservices, containers, ephemeral instances, schedulers, serverless models, functions as a service, third-party hosted databases, polyglot persistence, platforms connected to other platforms with a variety of glue and load balancers. Today’s infrastructure is exponentially more complex than yesterday’s.

Honeycomb addresses this by inviting you to play around and explore your data with wide, rich events. We want you to toss in as many attributes and as much context as possible for every dataset — you don’t pay a performance penalty for more detail, even tens of thousands of fields per event, whether sparse or full.

If you can’t predict what convergence of problems will cause a user-impacting event, you definitely can’t predict what you’re going to need in order to diagnose and solve it, either. So just store everything! Sample vertically to control costs. And get used to asking questions.
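
Vertical sampling can be as simple as keeping every interesting event and a fixed fraction of the boring ones, while recording the sample rate so aggregates stay honest. A sketch, with illustrative rates and field names (libhoney applies the sampling for you when an event's sample_rate is set):

```python
# A sketch of vertical sampling: keep every error, sample 1-in-100 of
# successes, and record the rate so read-time aggregation can re-weight.
# Rates and field names are illustrative.

import libhoney

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="api-requests")

def send_event(fields):
    ev = libhoney.new_event()
    ev.add(fields)
    # Errors are rare and precious: keep them all. Successes are plentiful:
    # keep 1 in 100 and let the stored sample_rate scale counts back up.
    ev.sample_rate = 1 if fields.get("status", 200) >= 500 else 100
    ev.send()

send_event({"endpoint": "/1/widgets", "status": 500, "duration_ms": 238.4})
send_event({"endpoint": "/1/widgets", "status": 200, "duration_ms": 41.0})
libhoney.close()
```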

For example, let’s contrast the processes:

Old Way ❌: Scroll down and skim page after page of auto-generated or handwritten dashboards … or, in the best possible case, copy/paste a custom query in the vendor’s proprietary query language, or type in a bunch of dot-delimited metric names to construct a dashboard or graph.

New Way: Using Honeycomb, start with simple entry points (like req/sec) and start adding more attributes to aggregate on; perform calculations, sort, limit, and filter. You can always get back to the raw events behind the current query and eyeball the results, looking for correlative patterns to explore visually.
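
That iteration loop operates on queries shaped roughly like this, expressed here as a plain Python dict; the column names are illustrative, and the exact spec fields may differ from Honeycomb's current query API.

```python
# Roughly the shape of the kind of query you iterate on in Honeycomb,
# expressed as a plain dict. Column names are illustrative, and the
# exact query-spec fields may differ from Honeycomb's current API.

query = {
    "time_range": 3600,                      # the last hour
    "calculations": [
        {"op": "COUNT"},                     # start from simple req/sec...
        {"op": "P99", "column": "duration_ms"},
    ],
    "filters": [
        {"column": "status", "op": ">=", "value": 500},
    ],
    "breakdowns": ["endpoint", "build_id"],  # ...then slice by more attributes
    "orders": [{"op": "COUNT", "order": "descending"}],
    "limit": 10,
}
```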

You might be groaning and thinking this sounds harder, but OMG no, it is not! It may not be what you’re used to, but it’s not harder, and it saves SO much time and energy once you’re used to it. It prevents dashboard blindness, where we tend to forget a thing exists if we haven’t visualized it.

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Honeycomb can make you a better engineer.

Interacting with your systems this way will make you a better engineer. It builds your spidey-sense about how your complex systems are going to interact and behave. And this is why you should start thinking this way from the beginning.

Yeah, you can go back and reinstrument your code and build pipelines to ship your logs to Honeycomb later. You can convince your teams to learn a new way of interacting with their systems later. But you don’t have to wait.

We get so wrapped up in telling you why this is more powerful for complex systems that we sometimes forget to say: it’s easier to start out this way from the beginning, too! SO much easier than constructing a massive edifice of an ELK stack, a Graphite install, or OpenTSDB and complicated static dashboards, or some other great-great-descendant of RRD, with tons of plugins to configure along the way.

Get used to interrogating your systems from day one. Everyone who joins your team will have a rich set of reference points to start composing harder questions and looking for tricky black swan events later on.

Why Honeycomb? Black Swans, Unknown-Unknowns, and the Glorious Future of Doom

Hello friends! We need to talk – about Honeycomb, you, and the future.

We’ve built this thing to help ourselves and one another deal with the future of software. It isn’t Yet Another Monitoring Tool, or Another Metrics Tool, or Another Log Aggregator. Frankly the world doesn’t need any more of those. The world does need Honeycomb, and rather badly.

We spend a fairly large percentage of our time obsessing over the future of technology and how we can help people future-proof their systems with better tools. We are somewhat opinionated (ha!), and believe quite modestly that Honeycomb is better than anything else out there for preparing your services and your teams to meet the future.

But it doesn’t really do anyone else much good to have all this locked away in our heads. So … let’s get swinging. :)

What is Honeycomb?

Honeycomb is an event-driven observability tool for debugging systems, application code, and databases. Honeycomb uses structured data and read-time query aggregation to support ultra-rich datasets (no indexes or schemas required) with a fast, interactive interface.

Honeycomb is for debugging systems the way gdb or pprof are for debugging code. Only instead of stepping through lines and between modules, you’re now instrumenting services and tracing the full life-cycle of events between services and code and systems and storage layers.[1]

“Systems: we have a problem.”

Software is becoming exponentially more complex. On the infra side we have a convergence of patterns like microservices, polyglot persistence, and containers that continues to decompose monoliths into agile, complex little systems. Great for products; hard on humans.

On the product side we have an explosion of platforms and creative new ways of empowering humans to do cool new stuff. Great for users; hard to build out of stable, predictable infra components.

Our team has worked at Google, Facebook, Dropbox, and other platforms that were consistently years ahead of the pack. We’ve seen the future, and frankly it looks a lot like a freight train of complexity bearing down on you just around the next curve. And we can help.

Honeycomb is designed to help your team answer unpredictable new questions — quickly, accurately, painlessly. It provides real-time, interactive introspection for your data at a scale that would drown or bankrupt other apps.

Because the fundamental difference between predictable systems and complex systems is the number of new questions you need to craft and answer on a regular basis.[2]

Old Way: With a classic LAMP stack, you might have one big database, an app tier, a web layer, and a caching layer, with software load balancing. You can predict most of the failures and craft a few expert dashboards that will answer nearly every performance root-cause analysis you run over the course of a year. Great! Your team isn’t going to spend a lot of time chasing unknown-unknowns, and that’s what matters. ❌

New Way: With a platform, or a microservices architecture, or millions of unique users or apps, you may have a long, fat tail of unique questions to answer every week. They may be variations on a theme (e.g. a user writes in and incorrectly reports “the site is down”), but you need the ability to drill down and answer unpredictable questions accurately and rapidly, without expending a lot of cognitive energy.

At Honeycomb, all of us have been through that rocketship growth phase into uncharted territory and managed chaos repeatedly, so we know how to bridge the gap with tooling and techniques.[3]

A key factor in our success — and yours — is how you adjust your mental model from reliance on a fixed set of questions and checks (“monitoring”) to the more fluid approach of systems observability.

What is “Observability”?

“Observability” is an awesome concept, borrowed from control theory. It describes the kind of tooling we all need once systems outpace our ability to predict what’s going to break.

“In control theory, observability is a measure of how well internal states of a system can be inferred by knowledge of its external outputs. The observability and controllability of a system are mathematical duals.”

Observability is what we need in a world where most problems are either caused by humans or black swan events, the convergence of three, five, 10+ different things failing at once. Platforms that incorporate multiple components will always produce a long, fat tail of new questions to ask your systems about on a regular basis.

Let’s compare it to some classic options such as monitoring, metrics, and log aggregation.

  • “Monitoring” is an umbrella term for operational visibility. It generally means you have a set of automated checks (often centralized) that run against systems to ensure none of the things that signify trouble are happening (in any of the ways you predicted). Honeycomb can couple with monitoring and alerting in lots of ways.
  • “Metrics” are usually a tick or datapoint, often a number, with optional tags. Metrics are usually bucketed by rollups over intervals, which sacrifices precious detail about individual events in exchange for cheap storage. Most companies are drowning in metrics, most of which never get looked at again. You cannot track down complex intersectional root causes without context, and metrics lack context.
  • “Log aggregation” is the most like Honeycomb, because logs are clumsy little linear stories about events. But log aggregation involves a lot of string processing (not getting any faster), regexps (not getting any more maintainable), and the need to predictively index on anything you might want to search on (or you’re straight back to distributed grep). Structured data is what gives Honeycomb its unparalleled flexibility and performance; see the toy contrast below.
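
Here is the toy contrast promised above: fishing a latency number out of an unstructured log line with a regex, versus reading it straight off a structured event. The log format and field names are invented for illustration.

```python
# Toy contrast: unstructured log line vs. structured event. The log
# format and field names are invented for illustration.

import json
import re

# The log-aggregation way: string processing and a brittle regex that
# breaks the moment someone rewords the log line.
line = "Jun 12 03:14:07 api42 app: request to /1/widgets took 238.4ms status=500"
match = re.search(r"request to (\S+) took ([\d.]+)ms status=(\d+)", line)
endpoint = match.group(1)
duration_ms = float(match.group(2))
status = int(match.group(3))

# The structured way: the event already IS data. No parsing, no predictive
# indexing; every field is queryable as-is.
event = json.loads('{"endpoint": "/1/widgets", "duration_ms": 238.4, "status": 500}')
print(endpoint == event["endpoint"], duration_ms == event["duration_ms"])
```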

These can all be nice things, but they’re not what we mean by observability. Observability is about building systems you can reason about.

Observability: not your mama’s monitoring

In the future, instrumentation is just as important as unit tests. Running complex systems means you can’t model the whole thing in your head, and you shouldn’t even try; that crutch is becoming impossible to lean on anyway. Instead, focus on making every component consistently understandable.

Yes, of course you should have dashboards. Your dashboards must be flexible and interactive, focused on helping you tease out breadcrumbs and follow the trail. If your dashboards lock you into a rigid set of pre-defined lanes, they’ve cut off your creative problem-solving superpowers.

And to provide this kind of ad hoc questioning, you need rich, wide, event-driven data stores that incentivize and empower you to store as much context as possible for each event.

Just about anything can help you find the known-unknown problems; observability tooling helps you tease out the unknown-unknowns. And results on their own aren’t very interesting; what’s interesting is how you got there from the problem statement, preserving the context of the run, and carefully stashing that nugget of wisdom away in your library for future-you to learn from.

The future is awesome. Welcome.

~ The Honeys.


  1. “But what about Zipkin?? Have you heard about opentracing.io or LightStep???” Yeah!! Big fans! We love power tools that focus on tracing unique request IDs first. They’re really neat tools for some scenarios. The request ID tracing methodology is inherently depth-first, whereas ours is breadth-first. [return]
  2. Something to think about: why are you still building your own metrics? It’s harder than ever to attract and retain world-class engineers. If you have gotten engineers to join your team, why would you waste them on projects that are ancillary to your core business value? Why waste your team’s time building out Yet Another Dashboard when you can outsource the job to someone else who can do it better and cheaper? We gave up maintaining our own postfix and imap systems a decade ago, and the same transition is underway for metrics, albeit more slowly. [return]
  3. Does it sound like we’re promising impossible magical things? No, there are tradeoffs. This is already quite meaty though, we’ll have to give you a look under the hood separately. [return]