MySQL and Honeycomb: My First 10 Minutes

As part of the process of building our RDS connector for Honeycomb, we ran it on our own database logs.

A few neat things came out of the first 10 minutes of looking at the graphs.

Lock times and normalized queries

Our most frequent queries all come from the API server (rather than the UI or other jobs). This makes sense, as the API receives a high sustained rate of events. We have some caching for these queries, and we can actually tell that the caching is working based on the periodic queries run on cache expiration.

For example, if we dive into a specific type of query (select... from bounces, which tracks rate limit/blacklist status) and break it down by client IP address, we can see a clear pattern of individual API servers asking the same question every 30 seconds (the period of the in-memory cache).

Bounce queries by client IP (As an aside, we also got to file a bug against ourselves for using SELECT * …)
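That 30-second cadence falls straight out of the cache design. Here is a minimal sketch of the idea (the schema and names are hypothetical, not our actual code): with an in-memory cache like this, each API server re-issues the underlying SELECT at most once per 30-second window, which is exactly the periodic pattern in the graph:

```python
import time

CACHE_TTL_SECONDS = 30  # the period of the in-memory cache

_cache = {}  # team_id -> (expires_at, row)

def bounce_status(db, team_id):
    """Return rate limit/blacklist status for a team, cached for 30 seconds."""
    now = time.time()
    entry = _cache.get(team_id)
    if entry and entry[0] > now:
        return entry[1]  # cache hit: no query reaches MySQL
    # Cache miss or expired entry: ask the database. (Explicit column list,
    # since we filed a bug against ourselves for using SELECT * here.)
    cursor = db.cursor()
    cursor.execute(
        "SELECT blacklisted, rate_limited FROM bounces WHERE team_id = %s",
        (team_id,),
    )
    row = cursor.fetchone()
    _cache[team_id] = (now + CACHE_TTL_SECONDS, row)
    return row
```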

Next up, we took a look at the slowest queries by examining P95(query_time).

Slow queries p95

As the label indicates, that query is flush logs. That’s really not very interesting, and it pollutes the overall data. Let’s just filter those out so that we can see the relevant slow queries.

Slow queries 95th percentile, minus flush logs

The P95(query_time) graph is much less cluttered now, and the spikes it shows are real and relevant queries.

It’s also interesting that there really seems to be one query dominating the P95(lock_time). Let’s take a closer look by swapping out the P95 for a SUM(lock_time), which gives a better sense of the overall load on the server. And if we order by SUM(lock_time) desc, our summary table pops the culprit right to the top:

Sum lock times by normalized query family

The query holding the most lock time (by 5x!!) is actually only the 5th most frequent query, at 1/40th the volume of the most frequent one. When we need to optimize our MySQL usage in the future, this gives us some terrific places to start.
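To make that aggregation concrete, here is a rough sketch of the same analysis done offline, assuming events parsed from the MySQL slow query log with query_time, lock_time, and a normalized_query field: group by query family, compute P95(query_time) and SUM(lock_time), and sort descending on the latter:

```python
import statistics
from collections import defaultdict

def p95(values):
    # 95th percentile; statistics.quantiles needs at least two data points
    return statistics.quantiles(values, n=100)[94] if len(values) > 1 else values[0]

def lock_time_report(events, exclude=("flush logs",)):
    """events: dicts parsed from the MySQL slow query log."""
    by_family = defaultdict(lambda: {"query_time": [], "lock_time": []})
    for ev in events:
        family = ev["normalized_query"]
        if family in exclude:  # drop the noisy `flush logs` entries
            continue
        by_family[family]["query_time"].append(ev["query_time"])
        by_family[family]["lock_time"].append(ev["lock_time"])
    rows = [
        (family, p95(v["query_time"]), sum(v["lock_time"]), len(v["query_time"]))
        for family, v in by_family.items()
    ]
    # ORDER BY SUM(lock_time) DESC: the culprit pops to the top
    return sorted(rows, key=lambda r: r[2], reverse=True)
```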

Learn More

Check out all of the ways you can send data to Honeycomb.

Honeycomb and the Five Whys: Summary Post

This is the anchor post, pinned at the top of this week’s long series of “vision” posts so that everything doesn’t just appear backwards. :)

This week we set forth the case for why Honeycomb is the future. We’ve seen the future, we’ve built massive multi-tenant platforms, and we’ve now built the observability product that we, as users, never want to live without again.

And we want to be the first thing you think of when you’re yanking your infra into the future, too. Or when you have tough problems with your visibility into your databases and storage tiers, or unexplained behavior to debug, or you just really want to make it easier for your teams to collaborate and gain more systems-y skills.

Five ways this matters

In the next five (short!) posts, I will lay out what is unique about Honeycomb, the problems it solves, and the tradeoffs we’ve had to make to fulfill our promises.

Cheers. We look forward to seeing your data flow. :)

~ The Honeys

Part 5/5: Building Badass Engineers and Badass Teams

No matter how much we love technology, it is always a means to an end. The mission comes first – we don’t do tech for its own sake, we use tech to get the mission done.

What we really care about is building badass engineers who love their work, who love collaborating, who have tools that help them succeed and get out of the way. We love creating the circumstances necessary for powerful teams to emerge.

Engineers are humans too (99.96% likelihood)

We hold the somewhat controversial opinion that engineers are people too. (Hang on … hear us out.)

We’re never going to brag that we have some magical anomaly detection or machine-generated automated root cause analysis, because that would be stupid (thanks, Allspaw!).

We are telling you we can empower your humans and make them better at engineering, and make your teams stronger too. We are big believers in Kathy Sierra’s approach to building Badass Users by building tools that quite literally increase User Badassity the longer they are used.

Consumer-quality developer tools

This isn’t just a theory; we’ve seen it play out repeatedly. Developer tooling has a sordid history of us building tools for ourselves as though we weren’t prone to the same biases and nudges as everyone else. Building for utility and building for delight are not opposing objectives.

We are also building ways to capture and curate an institutional “brain” for your team … because people forget things, people leave, people get distracted. Honeycomb always helps you follow not only what the answer is, but how you or your team members arrived at it.

At Honeycomb we have very intentionally chosen against creating yet another query language, or making you type in lengthy dotted metrics. These might be shortcuts to building a power tool, but they are fierce obstacles to natural learning. Our user experience places playfulness and exploration front and center.

Old Way: Developer tools have historically been ugly, off-putting things that you were forced to use because there was no other option. ❌
New Way: We want to build tools that you become hooked on, that feel like a part of you, because you are so much more powerful when you use them. We want them to be as intuitive, usable, and delightful as any consumer product.

For example: Every team I’ve ever been on has had metrics nerds, a couple of people who really love geeking out over graphs. (The rest of us are normal human beings.)

You’ll see them up late at night peering over graphs and constructing elaborate views. I’ve never been one of those people … but I’ve always appreciated that working with them made me better. I would carefully bookmark and save some of the work that they did, and it made me better at my own work.

At Honeycomb, we’ve baked this into the product. Sharing, reusing, iterating, bookmarking, and so forth are all first-class features. I’m so excited to be building tools that don’t fight this behavior or make it harder, but actually make it easier for those awesome metrics nerds to expand their reach and increase their impact.

Old Way: Everybody has to learn to debug every single thing from scratch, for themselves, even if they are the 20th person on the team to do so, even if it’s 4 am and the service is down. ❌
New Way: Leverage each other’s work so your time is spent on the highest-impact problems. Empower the metrics nerds to have wider impact by creating all those views and attributes you can all reuse and learn from.

Can you imagine? A tool that …

  • Lets you gracefully hand off from one on-call shift to the other, so your buddy can go back and see how you debugged critical problems from beginning to end
  • Creates a production runbook just through everyday use, one you can consult to see what everyone else on your team is trying to understand and debug
  • Lets you capture and trace every appearance of a unique request ID as it traverses the stack, hits various services and data stores, and returns to the user (see the sketch just after this list)
  • Creates a shared reality for everyone – from C-level to mobile engineers, to backend software engineers or SREs, even marketing, sales and technical support
  • Helps you onboard and train new hires and junior engineers … at their own speed, as needed
  • Lets your senior engineers actually go on vacation by capturing their thought process around critical processes
  • Posts a complete debugging run back to a Jira ticket or Asana task once you finish debugging … so the person who filed it can debug it themselves next time.
  • Helps distributed teams collaborate effectively, by posting results and comments to Slack channels, where other people can click to iterate off any point in the run.
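To make that request-ID bullet concrete, here is a minimal sketch, not our actual implementation: the header name and the send_event helper are hypothetical. Each hop reuses the incoming request ID (or mints one at the edge), passes it downstream, and emits one event tagged with it, so querying on request_id reconstructs the whole path:

```python
import time
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # hypothetical header name

def send_event(fields):
    """Stand-in for shipping a JSON event to your event store."""
    print(fields)

def handle(request_headers, service_name, work):
    # Reuse the caller's request ID, or mint one at the edge of the stack.
    request_id = request_headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    start = time.time()
    result = work(request_id)  # pass the same ID along to downstream calls
    send_event({
        "request_id": request_id,  # query on this field to see every hop
        "service": service_name,
        "duration_ms": (time.time() - start) * 1000,
    })
    return result
```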

We want to make it easy to get up and going, to integrate with any tools you already love and gradually replace the ones you use and don’t love, by piggybacking on the work you’ve already done. Our API just takes JSON blobs, from any source whatsoever. We value your time, so you should have to redo as little work as possible.
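As a sketch of how little that requires (the dataset name and write key below are placeholders for your own), sending an event is a single authenticated POST of a JSON object to Honeycomb’s events endpoint:

```python
import requests

# Any JSON blob works as an event; the field names are entirely up to you.
event = {"service": "billing", "endpoint": "/charge", "duration_ms": 12.3}

resp = requests.post(
    "https://api.honeycomb.io/1/events/my-dataset",  # placeholder dataset
    json=event,
    headers={"X-Honeycomb-Team": "YOUR_WRITE_KEY"},  # placeholder write key
)
resp.raise_for_status()
```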

Old Way: Build everything custom and in-house. Spend your precious engineering cycles building the same solutions over and over again at every job, from email to log aggregation to monitoring and alerting and metrics. Hell, why not write your own storage and filesystems while you’re at it? Good lord. ❌
New Way: Engineering talent is rare; don’t waste it on anything that isn’t core to your business, that isn’t a key differentiator. Spend your core cycles on your own products, and let experts in the field handle everything off the critical path. I guarantee they can do it better and more cost-effectively than you can.

In summary …

The key to becoming a better engineer is to get your hands dirty. Don’t be terrified to break things. Know how to break them in controlled ways: in small chunks, reversibly.

Build guard rails, not fences.

Get used to interacting with your observability tooling every day. As part of your release cycle, or just out of curiosity. Honestly, things are broken all the time — you don’t even know what normal looks like unless you’re also interacting with your observability tooling under “normal” circumstances.

The best engineers I’ve ever worked with are people who spent nearly as much time in their tooling and their observability as in their IDE or their code. I’d be cool with a slogan like “code less, think more.”

Consider the children.

It has never been harder than it is now to recruit, hire, and retain top engineering talent. So why do companies insist on squandering those skills on engineering problems that have nothing to do with their core differentiators?

Honeycomb is about the future: a version of the future where most people have woken up to this reality, and have stopped trying to waste scarce engineering cycles on their own observability tooling, just like most companies no longer run their own email and spam filters.

Let us help you build better engineers, and take challenging problems off your hands so you can focus on your key work.

Part 4/5: Everyone is a DBA

DBAs may be the last remaining priesthood in our industry. But software engineers and operations engineers are increasingly finding themselves responsible for precious company data, and DBAs are increasingly adopting generalist skillsets and best practices.

Not everyone is thrilled about this

There is a surprising amount of fear and trembling among non-DBA engineers when it comes to managing data, because it has an alien feel and errors can be so permanent.

Honeycomb can help. :) Storage is one of our specialties. We’ve run massive, world-class MongoDB and MySQL clusters, and we have extensive experience with plenty of other databases.

This might feel like an oddly practical entry among the other idealistic messages in the list, but we’re incredibly excited about it because there’s such a big gap in the market around this problem set.

We don’t even have to build anything special for it: just log connectors, plus a bunch of examples that show you how to track down things like:

  • which queries are eating up your lock time, fighting each other to acquire the lock
  • when each query family was deployed and what line of code it lives on
  • which tables or collections are doing full table scans and are going to tip over a limit soon unless you add an index
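The only mildly clever piece here is the “query family” normalization itself. A rough sketch of the idea (a real connector handles far more of MySQL’s grammar than this):

```python
import re

def normalize_query(sql):
    """Collapse a raw SQL statement into its query family by replacing
    literals with '?', so structurally identical queries aggregate together."""
    q = sql.strip().lower()
    q = re.sub(r"'(?:[^'\\]|\\.)*'", "?", q)  # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)  # numeric literals
    q = re.sub(r"\s+", " ", q)                # collapse whitespace
    return q

# Both normalize to: select * from bounces where team_id = ?
normalize_query("SELECT * FROM bounces WHERE team_id = 101")
normalize_query("SELECT * FROM bounces WHERE team_id = 202")
```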

This is going to be so much fun. You’ll see!

Old Way: DBAs exist in silos with their own tools, languages, and alerts that nobody else understands. Nobody can pinch-hit for them, or vice versa, because it feels like a foreign land. ❌
New Way: DBAs act more like Database Reliability Engineers, sharing tools and techniques with the other engineering teams. Software engineers and systems engineers frequently find themselves in charge of the company’s data as part of their normal jobs.

Example:

One of the hardest problems in distributed systems is … “everything’s getting a little bit slower.” When you have hundreds of backend services and storage clusters, where do you even start?

Is it a particular deploy, or a single AWS availability zone, or a particular instance type, or linked to a deploy of an ancillary service? Is it one user dominating the slow queries? Is it only slow during garbage collection? Is there an offline process being launched from cron that’s piling up? Disk saturation, NIC saturation, a bad link on a switch? Is it write endpoints, or read endpoints, or writes to a particular collection sharded across multiple replica sets? Does the slowness correspond to a snapshot process that is still serving read queries? Some intersection of the above?

Who knows? Who cares? Log everything. Just toss it in as more attributes to your data set. :)
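Concretely, “toss it in as more attributes” just means making the event wider. A sketch, with field names invented to match the hypotheses above:

```python
# One slow-query event, wide enough to test every hypothesis by slicing:
event = {
    "query_time_ms": 840,
    "normalized_query": "select * from widgets where user_id = ?",
    "build_id": "2016-08-11-1432",      # was it a particular deploy?
    "availability_zone": "us-east-1c",  # a single AWS AZ?
    "instance_type": "m4.2xlarge",      # a particular instance type?
    "user_id": 4211,                    # one user dominating the slow queries?
    "during_gc": False,                 # only slow during garbage collection?
    "cron_job": None,                   # offline process piling up?
    "disk_util_pct": 92.5,              # disk or NIC saturation?
    "endpoint_type": "write",           # write endpoints vs. read endpoints?
    "shard": "rs3",                     # one replica set in a sharded collection?
    "snapshot_in_progress": True,       # snapshot host still serving reads?
}
# Send it like any other JSON blob, then slice on any attribute later.
```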

Data isn’t a silo

Data doesn’t have to be a silo anymore. You can all use each other’s tools. It’s soooo much easier and more friendly when it feels familiar. And it hooks right back into your instrumented code, too.

With Honeycomb, you can take the familiarity you’ve built debugging your own code or stateless services and transfer it directly to finding outliers and issues in your storage systems, just by ingesting logs and tailing them into Honeycomb.
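Here is a sketch of what “tailing them in” amounts to, assuming the parsing and event-posting pieces sketched earlier (Honeycomb’s connectors handle this for you):

```python
import time

def tail(path):
    """Follow a log file, yielding new lines as they are appended."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(0.5)  # wait for the log to grow

# for line in tail("/var/log/mysql/slow.log"):
#     event = parse_slow_log_line(line)  # hypothetical parser
#     post_to_honeycomb(event)           # e.g. the POST shown earlier
```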

Unlike metric-based analytics, with Honeycomb you always have access to the original raw query and associated events — not just the normalized query family. This is an underappreciated superpower: once you’ve tried it, you will never, ever want to live without it.