Honeycomb Customer Profile: Airtime

We recently connected with Honeycomb users over at Airtime, a new social video platform for iOS, Android, and desktop. Like many companies using Honeycomb, Airtime relies on a complex infrastructure and a wide variety of tools for monitoring their systems.

Jesse R. (VP of Engineering) & Dana M. (Lead Backend Engineer) were nice enough to share why they turn to Honeycomb when their monitoring tools only take them so far:

HNY: What is your mission at Airtime?

JR & DM: Founded by Sean Parker and backed by several major VCs, Airtime is a new social video platform for iOS, Android and desktop (coming soon). Designed for genuine togetherness, Airtime is a group video app where you can simultaneously have conversations and share media in real time.

At the core of the experience is the ‘room’, an intimate space for real friends to feel like they’re actually together. In this room, you can do just about everything you’d do with your friends in real life - talk about weekend plans, share photos from last night, watch a YouTube livestream, vibe out to music together, and so much more.

HNY: What’s in your stack today?

JR & DM: We use quite a few technologies in our stack. We’ll try to keep it to the high-level items and focus on the backend: Node.js, Socket.IO, Docker, AWS (ECS/Container Registry, S3, SNS, SQS, Lambda, EC2, Elastic Transcoder, and CloudFront), and WebRTC.

HNY: What tools or services do you use for metrics, monitoring, logging and debugging?

JR & DM: Sumo Logic, New Relic, CloudFront, Indicative, PagerDuty… and now, Honeycomb.

HNY: What are your hard problems, from a technical point of view?

JR & DM: Building backend services in such a way that we can iterate quickly on product features while still keeping them robust.

We had the luxury of throwing one away, so our microservices-based backend scales well, and errors only cause partial service degradation in the rare instances they occur.

HNY: What kind of user problems do you have to untangle and/or track down?

JR & DM: Most of our user problems come through our community management team, and they’re typically questions about how to do things in the app. We have built admin consoles that our community management team uses for ToS enforcement.

HNY: Why did you decide to start using Honeycomb?

JR & DM: In one word: Observability

We didn’t have a solution for combing through MongoDB’s Management Service logs for performance issues with MongoDB queries. Before – when we had performance issues on, let’s say, a certain endpoint – New Relic would tell us that Mongo was the bottleneck and which collection was involved. From there on, we were pretty much on our own.

HNY: Who uses Honeycomb at Airtime and how do you ship data to Honeycomb? Is it mostly coming from software you’ve written yourselves, or existing logs/software?

JR & DM: Our backend team, which does both ops and dev. We use the Honeycomb agent.

HNY: How has Honeycomb helped solve Airtime’s problems?

JR & DM: Thanks to Honeycomb, we are able to drill down on specific queries, collections, and metrics to see exactly what’s causing an issue and push a fix confidently.

In addition, Honeycomb helps us proactively prevent issues. We’ve set up triggers on metrics like docs examined and average query duration, so we know ahead of time when things are heading in the wrong direction.

Big thanks to Jesse & Dana of Airtime for sharing their Honeycomb experience!

If Airtime’s story resonates with you and you’d like to give Honeycomb a spin, grab yourself an account or contact us anytime at support@honeycomb.io.

Cheers!

Ben Explains Things: how your nginx logs can magically turn query strings into columns

If you’ve ever corresponded with us over email or Intercom, chances are you’ve chatted with Ben Hartshorne. Sometimes we see him send off pearls of wisdom and we think “damn! why is this buried in email, it should really be a blog post!”. Now it is.

SCENE: did you know that if you JUST stream your logs into Honeycomb from whatever load balancer you are running at the edge (haproxy/nginx/etc), and pass some timing information back from your app via HTTP headers, you basically have a functional full stack observability view with very little work??!? IT’S TRUE.

But most people don’t know the crazy fun tricks you can do with edge proxies like nginx. For example, a poor man’s structured data pipeline looks like this:

To: Celebrated Honeycomb Customer
From: Ben
Subject: Re: Your nginx config

Ooh, if I may suggest, as a first step, add some values to the RequestQueryKeys option in the config file (or --request_query_keys flag) to enumerate some of the query parameters in your URL structure. Or, if you're brave, change the RequestParseQuery (--request_parse_query flag) from whitelist to all (only do this on a dataset you can delete - it can quickly blow up the number of columns to something unusable).

This change will pull fields out of the query strings in your URL structure and create columns for them, allowing you to do all the filters, breakdowns, and calculations on them as you would other fields. For example, adding lang to the RequestQueryKeys parameter would create a column named request_query_lang with the value 'en' when it sees a URL like /foo/bar?lang=en. Once there, you could break down by language.

This is a super cheap way to get all sorts of information about your application into Honeycomb without making any changes to your existing configuration. For more details on the URL parsing bits of honeytail, see https://honeycomb.io/docs/connect/agent/#parsing-url-patterns

Let us know how it goes!
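For reference, Ben’s suggestion translated into a honeytail invocation might look roughly like this; --request_query_keys is the flag mentioned above, the write key, dataset, and log file are placeholders you’d swap for your own, and flag spellings can vary by honeytail version:

    honeytail --parser=nginx \
      --writekey=YOUR_WRITE_KEY \
      --dataset=edge-nginx \
      --file=/var/log/nginx/access.log \
      --request_query_keys=lang

With that in place, a request for /foo/bar?lang=en lands in your dataset with a request_query_lang column set to 'en', ready for filtering and breakdowns.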

… and a followup, explaining more of the landscape:

To: Most Celebrated Honeycomb Customer #1
From: Ben
Subject: Re: Re: Your nginx config

Hooray for good timing!! Yeah, I was really hopeful for the 'all' setting that just says "take every query parameter you see and make it a column" until the terrible chaff of the internet hits your website with every webserver exploit known in hopes that one sticks and suddenly you have a hundred extra columns with names like ';cat../../../../../../etc/passwd'. That doesn't help anybody. So the alternative is to just list out the query parameter keys that you want and go from there. More annoying to config, but protects your data.

I will say though, a really easy way to get a list of the most common query parameter fields is to enable all and send the data to a temporary dataset, look at the data, choose the right keys, and then delete the temp dataset.

Thanks Ben!!

Sign up and try Honeycomb today!

Save Useful Queries with Honeycomb Playlists

Here at the Hive, we’ve been hard at work on a better way for folks to collaborate while debugging their services. We’re on a mission to help teams move faster with better data, and a lot of that comes down to making shared knowledge accessible, like:

  • How did Ben figure out which customers would be most impacted by yesterday’s unrolling-nested-JSON fix?
  • How is Eben tracking performance for that one customer struggling with Logstash delays?
  • What queries did Charity run last week to dig into that sudden increase in MySQL latency?

To that end, we’re introducing Playlists: a way to create collections of useful Honeycomb queries for both your own use and your teammates’ benefit.

[Image: the new home page, with playlists and recent team activity]

We think Playlists are particularly helpful for two use cases:

  1. Post-mortems: capturing the trail produced by an active investigation
  2. Entry points: capturing a set of high-level “how’s my system doing?” query results.

Queries can be added to a playlist within the query interface itself, or by editing a playlist directly. Playlist permalinks are eminently shareable, just like query permalinks themselves!

[Image: add-to-playlist options, from the sandbox and the edit-playlist UI]

Create ‘em, send ‘em around, dig in and get your hands dirty. Honeycomb isn’t for passive consumption :) so use these playlists to capture interesting jumping-off points and dive into queries to iterate and ask new questions.

Sign up and try Honeycomb today!

Instrumenting High Volume Services: Part 3

This is the last of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data. The first one was an introduction to sampling and the second described simple methods to explore dynamic sampling.

In part 2, we explored partitioning events based on HTTP response codes, and assigning sample rates to each response code. That worked because of the small key space of HTTP status codes and because it’s known that errors are less frequent than successes. What do you do when the key space is too large to easily enumerate, or varies in a way you can’t predict ahead of time? The final step in discussing dynamic sample rates is to build in server logic to identify a key for each incoming event, then dynamically adjust the sample rate based on the volume of traffic for that key.

In all the following examples, the key used to determine the sample rate can be as simple (e.g. HTTP status code or customer ID) or as complicated (e.g. a concatenation of HTTP method, status code, and user-agent) as is appropriate to select samples that give you the most useful view into your traffic. For example, here at Honeycomb we want to make sure that, despite a small set of customers sending us enormous volumes of traffic, we’re still able to see our long-tail customers’ traffic in our status graphs. We use a combination of the Dataset ID (to differentiate between customers) and the HTTP method, URL, and status code (to identify the different types of traffic they send).
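As a rough illustration, here’s a small Go sketch of how such a sampling key might be assembled from an incoming request; the Event type and its fields are made up for this example, not part of any Honeycomb API:

    package main

    import "fmt"

    // Event carries the request fields we want to fold into the sampling key.
    // The struct and its field names are hypothetical, for illustration only.
    type Event struct {
        DatasetID string
        Method    string
        URL       string
        Status    int
    }

    // sampleKey concatenates the fields that define "one kind of traffic":
    // dataset (customer), HTTP method, URL, and HTTP status code.
    func sampleKey(e Event) string {
        return fmt.Sprintf("%s_%s_%s_%d", e.DatasetID, e.Method, e.URL, e.Status)
    }

    func main() {
        e := Event{DatasetID: "1234", Method: "POST", URL: "/1/events", Status: 200}
        fmt.Println(sampleKey(e)) // 1234_POST_/1/events_200
    }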

The following methods all work by looking at historical traffic for a key and using that historical pattern to calculate the sample rate for that key. Specifically, the library linked at the end of this post implements all of these examples—it takes snapshots of current traffic for a short time period (say, 30 seconds) and uses the pattern of that traffic to determine the sample rates for the next period.
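To make that snapshot-and-recalculate cycle concrete, here’s a minimal Go sketch of its general shape; the type and function names are ours, not the library’s:

    package dynsample

    import (
        "sync"
        "time"
    )

    // sampler counts events per key during the current window and hands out
    // the sample rates computed from the previous window.
    type sampler struct {
        mu     sync.Mutex
        counts map[string]float64 // traffic seen per key this window
        rates  map[string]int     // rates computed from the last window
    }

    // newSampler returns a sampler ready to count traffic.
    func newSampler() *sampler {
        return &sampler{counts: make(map[string]float64), rates: make(map[string]int)}
    }

    // GetSampleRate records one incoming event for key and returns the rate
    // computed from the previous window (1, i.e. "keep it", for unseen keys).
    func (s *sampler) GetSampleRate(key string) int {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.counts[key]++
        if r, ok := s.rates[key]; ok {
            return r
        }
        return 1
    }

    // run recomputes the rates every interval (e.g. 30 seconds) using calc,
    // which is one of the methods described below.
    func (s *sampler) run(interval time.Duration, calc func(map[string]float64) map[string]int) {
        for range time.Tick(interval) {
            s.mu.Lock()
            s.rates = calc(s.counts)
            s.counts = make(map[string]float64)
            s.mu.Unlock()
        }
    }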

Enough with the preamble. Let’s sample!

Constant Throughput

For the constant throughput method, you specify the maximum number of events per time period you want to send for analysis. The algorithm then looks at all the keys detected over the snapshot and gives each key an equal portion of the throughput limit. It sets a minimum sample rate of 1, so that no key is completely ignored.

Example: for a throughput limit of 100, given 3 keys, each key should get to send about 33 samples. Based on the level of traffic, the sample rate is calculated to get each key as close as possible to sending 33 events. For key a, 900 events divided by 33 rounds down to a sample rate of 27; for key b, 90 events divided by 33 gives a sample rate of 3; and for key c, 10 events divided by 33 is less than one, so it rounds up to 1. During the next cycle, assuming the incoming event numbers are the same, these sample rates are used: 900 events at a sample rate of 27 will actually send 33 events, 90 events at a sample rate of 3 will send 30 events, and 10 events at a sample rate of 1 will send 10 events.

I’ll use a table like this in all the following examples. The key and traffic are used to calculate the sample rate. During the following iteration, the key, traffic, and sample rate will determine the number of actual events sent.

key    traffic    sample rate    events sent
a      900        27             33
b      90         3              30
c      10         1              10
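Here’s a minimal Go sketch of the constant throughput calculation; the function name is ours and the rounding behavior is an assumption that happens to reproduce the table above:

    package main

    import (
        "fmt"
        "math"
    )

    // constantThroughput gives each key an equal share of the total event limit
    // and turns last window's counts into per-key sample rates (minimum 1).
    func constantThroughput(counts map[string]float64, limit float64) map[string]int {
        rates := make(map[string]int, len(counts))
        share := limit / float64(len(counts)) // e.g. 100 / 3 keys ≈ 33 events per key
        for key, seen := range counts {
            rate := int(math.Round(seen / share))
            if rate < 1 {
                rate = 1 // never drop a key entirely
            }
            rates[key] = rate
        }
        return rates
    }

    func main() {
        counts := map[string]float64{"a": 900, "b": 90, "c": 10}
        for key, rate := range constantThroughput(counts, 100) {
            fmt.Printf("key=%s rate=%d sends≈%d\n", key, rate, int(counts[key])/rate)
        }
        // key=a rate=27 sends≈33, key=b rate=3 sends≈30, key=c rate=1 sends≈10
    }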

Advantages: If you know you have a relatively even split of traffic among your keys, and that you have fewer keys than your desired throughput rate, this method does a great job of capping the amount of resources you will spend sending data to your analytics.

Disadvantages: This approach doesn’t scale at all. As your traffic increases, the number of events you’re sending into your analytics doesn’t, so your view into the system gets more and more coarse, to the point where it will barely be useful. If you have keys with very little traffic (key c in the table above), you wind up under-sending the allotted samples for those keys and wasting some of your throughput limit. If your keyspace is very wide, you’ll end up sending more than the allotted throughput due to the minimum sample rate for each key.

Overall, this method can be useful as a slight improvement over the static map method because you don’t need to enumerate the sample rate for each key. It lets you contain your costs by sacrificing resolution into your data. It breaks down as traffic scales in volume or in the size of the key space.

Constant Throughput Per Key

This is a minor tweak on the previous method to let it scale a bit more smoothly as the size of the key space increases (though not as volume increases). Instead of defining a limit on the total number of events to be sent, this algorithm’s goal is a maximum number of events sent per key. If there are more events than the desired number, the sample rate will be set to correctly collapse the actual traffic into the fixed volume.

Example: set the desired throughput per key to 50. Each key will send up to 50 events per time cycle, with the sample rate set to approximate the actual amount of traffic. The table here is the same as before: key and traffic are used to compute the sample rate, then the traffic and sample rate are used to show how many events would be sent during the next iteration (assuming the incoming traffic is the same):

key    traffic    sample rate    events sent
a      900        18             50
b      90         2              45
c      10         1              10
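A similar Go sketch for the per-key version; again the name and rounding choices are ours, chosen to match the table above:

    package main

    import (
        "fmt"
        "math"
    )

    // perKeyThroughput caps each key at perKeyLimit events per window, deriving
    // a key's sample rate from last window's count (minimum rate 1).
    func perKeyThroughput(counts map[string]float64, perKeyLimit float64) map[string]int {
        rates := make(map[string]int, len(counts))
        for key, seen := range counts {
            rate := int(math.Round(seen / perKeyLimit))
            if rate < 1 {
                rate = 1
            }
            rates[key] = rate
        }
        return rates
    }

    func main() {
        counts := map[string]float64{"a": 900, "b": 90, "c": 10}
        for key, rate := range perKeyThroughput(counts, 50) {
            fmt.Printf("key=%s rate=%d sends≈%d\n", key, rate, int(counts[key])/rate)
        }
        // key=a rate=18 sends≈50, key=b rate=2 sends≈45, key=c rate=1 sends≈10
    }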

Advantages: Because the throughput is fixed per key, you retain detail for each key as the key space grows. When it’s simply important to get a minimum number of samples for every key, this method is a good way to meet that requirement.

Disadvantages: In order to avoid blowing out your total event volume as your keyspace grows, you may need to set the per-key limit relatively low, which gives you very poor resolution into the high volume keys. And as traffic grows within an individual key, you lose visibility into the details for that key.

This would be a good algorithm for something like an exception tracker, where more copies of the same exception don’t give you additional information (except that it’s still happening), but you want to make sure that you catch each different type of exception. When the presence of each key is the most important aspect, this works well.

Average Sample Rate

With this method, we’re starting to get fancier. The goal for this strategy is to achieve a given overall sample rate across all traffic, while capturing more of the infrequent traffic to retain high-fidelity visibility into it. We accomplish both of these goals by increasing the sample rate on high volume traffic and decreasing it on low volume traffic such that the overall sample rate remains constant. This gets us the best of both worlds - we catch rare events and still get a good picture of the shape of frequent events.

Here’s how the sample rate is calculated for each key: we count the total number of events that came in and divide by the goal sample rate to get the total number of events to send along to the analytics system. We then give each key an equal portion of that total, and work backwards to determine what each key’s sample rate should be.

Sticking with the same example traffic as the previous two methods, we have keys a, b, and c with 900, 90, and 10 events coming in. Let’s use a goal sample rate of 20: (900+90+10) / 20 = 50, so our goal for the total number of events to send in to Honeycomb is 50. We have 3 keys, so each key should get 50 / 3 ≈ 17 events. What sample rate would key a need to send 17 events? 900 / 17 = 52 (rounded down). For key b, 90 / 17 = 5, and for key c, 10 / 17 is less than 1, so it gets the minimum sample rate of 1. We now have our sample rates.

Here is the same table layout as in the two previous examples:

key    traffic    sample rate    events sent
a      900        52             17
b      90         5              18
c      10         1              10
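Here’s the same kind of Go sketch for the average sample rate calculation; the rounding and truncation choices are ours, picked to reproduce the worked example above:

    package main

    import (
        "fmt"
        "math"
    )

    // avgSampleRate targets an overall goal sample rate across all traffic,
    // splits the resulting event budget evenly across keys, and works
    // backwards to a per-key sample rate (minimum rate 1).
    func avgSampleRate(counts map[string]float64, goalRate float64) map[string]int {
        var total float64
        for _, seen := range counts {
            total += seen
        }
        budget := total / goalRate                          // (900+90+10)/20 = 50 events to send
        perKey := math.Round(budget / float64(len(counts))) // 50/3 ≈ 17 events per key

        rates := make(map[string]int, len(counts))
        for key, seen := range counts {
            rate := int(seen / perKey) // 900/17 = 52, 90/17 = 5, 10/17 = 0
            if rate < 1 {
                rate = 1
            }
            rates[key] = rate
        }
        return rates
    }

    func main() {
        counts := map[string]float64{"a": 900, "b": 90, "c": 10}
        for key, rate := range avgSampleRate(counts, 20) {
            fmt.Printf("key=%s rate=%d sends≈%d\n", key, rate, int(counts[key])/rate)
        }
        // key=a rate=52 sends≈17, key=b rate=5 sends≈18, key=c rate=1 sends≈10
    }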

Advantages: When rare events are more interesting than common events, and the volume of incoming events across the key spectrum is wildly different, the average sample rate is an excellent choice. Picking just one number (the target sample rate) is as easy as constant sampling but you magically get wonderful resolution into the long tail of your traffic while still keeping your overall traffic volume manageable.

Disadvantages: High volume traffic is sampled very aggressively.

At Honeycomb (and in the library below) we actually apply one additional twist to the Average Sample Rate method. The description above weights all keys equally. But shouldn’t high volume keys actually have more representation than low volume keys? We choose a middle ground by using the logarithm of the count per key to influence how much of the total number of events sent into Honeycomb is assigned to each key—a key with 10^x the volume of incoming traffic will have x times the representation in the sampled traffic. For more details, take a look at the implementation for the average sample rate method linked below.
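A small Go sketch of that logarithmic weighting, under the same assumptions as the earlier examples (this is our simplified rendering, not the library’s exact code); it returns each key’s share of the event budget, from which a sample rate follows as before:

    package main

    import (
        "fmt"
        "math"
    )

    // logWeightedShares splits an event budget so a key's share grows with the
    // logarithm of its traffic: 10x the volume earns a larger share, but not
    // 10x as large.
    func logWeightedShares(counts map[string]float64, budget float64) map[string]float64 {
        var logSum float64
        for _, seen := range counts {
            logSum += math.Log10(seen + 1) // +1 guards against log10(0)
        }
        shares := make(map[string]float64, len(counts))
        for key, seen := range counts {
            shares[key] = budget * math.Log10(seen+1) / logSum
        }
        return shares
    }

    func main() {
        fmt.Println(logWeightedShares(map[string]float64{"a": 900, "b": 90, "c": 10}, 50))
        // key a gets a bigger slice of the budget than b or c, but far from a
        // proportional one; divide each key's count by its share to get its rate.
    }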

Average Sample Rate with Minimum Per Key

To really mix things up, let’s combine two methods! Since we’re choosing the sample rate for each key dynamically, there’s no reason why we can’t also choose which method we use to determine that sample rate dynamically!

One disadvantage of the average sample rate method is that if you set a high target sample rate but have very little traffic, you will wind up over-sampling traffic you could actually send with a lower sample rate. For example, consider setting a target sample rate of 50 but then only actually having 30 events total! Clearly there’s no need to sample so heavily when you have very little traffic. So what should you do when your traffic patterns are such that one method doesn’t always fit? Use two!

When your traffic is below 1,000 events per 30 seconds, don’t sample. When you exceed 1,000 events during your 30 second sample window, switch to average sample rate with a target sample rate of 100.

By combining different methods, you mitigate each one’s disadvantages: you keep full detail when you have the capacity, and gradually drop more of your traffic as volume grows.
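In Go, that rule is just a thin wrapper around the earlier sketch; this reuses the avgSampleRate function from the average sample rate example, and the threshold and goal rate are simply the example numbers above:

    // combinedSampleRates keeps everything (rate 1) while total traffic in the
    // window stays under threshold, and switches to the average sample rate
    // method once traffic exceeds it. Add this alongside the avgSampleRate
    // example from earlier.
    func combinedSampleRates(counts map[string]float64, threshold, goalRate float64) map[string]int {
        var total float64
        for _, seen := range counts {
            total += seen
        }
        if total < threshold {
            rates := make(map[string]int, len(counts))
            for key := range counts {
                rates[key] = 1 // low traffic: send every event
            }
            return rates
        }
        return avgSampleRate(counts, goalRate)
    }

    // e.g. combinedSampleRates(counts, 1000, 100) implements "don't sample below
    // 1,000 events per window; above that, target an average sample rate of 100".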

Conclusion

Sampling is great. It is the only reasonable way to keep high-value, contextually aware information about your service while still being able to scale to high volume. As your service grows, you’ll find yourself sampling at 100:1, 1,000:1, 50,000:1. At these volumes, statistics will let you be sure that any problem will eventually make it through your sample selection, and using a dynamic sampling method will make sure the odds are in your favor.

We’ve implemented the sampling methods mentioned here as a Go library and released them at https://github.com/honeycombio/dynsampler-go. We would love contributions of additional sampling methods!
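If you’d like to see how the library slots into a service, here’s a rough sketch; the type and field names below reflect our reading of the repository and may differ from the current API, so treat the README there as the source of truth:

    package main

    import (
        "fmt"
        "math/rand"

        dynsampler "github.com/honeycombio/dynsampler-go"
    )

    func main() {
        // Recompute sample rates every 30 seconds, aiming for an overall
        // sample rate of 20 across all traffic.
        sampler := &dynsampler.AvgSampleRate{
            ClearFrequencySec: 30,
            GoalSampleRate:    20,
        }
        if err := sampler.Start(); err != nil {
            panic(err)
        }

        // For each event: build a key, ask for its current rate, keep roughly
        // 1-in-rate events, and record the rate on the events you do send.
        key := "1234_POST_/1/events_200"
        rate := sampler.GetSampleRate(key)
        if rate < 1 {
            rate = 1 // defensive guard for the sketch
        }
        if rand.Intn(rate) == 0 {
            fmt.Printf("sending event for %s with samplerate=%d\n", key, rate)
        }
    }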

Instrument your service to create wide, contextual events. Sample them in a way that lets you get good visibility into the areas of your service that need the most introspection, and sign up for Honeycomb to slice up your data today!