Deprecating the StatsD Telemeter


#1

As of the upcoming Linkerd 1.2.0 release, we are deprecating the io.l5d.statsd telemeter.

We’ve been considering removing the statsd telemeter because it doesn’t work the way most people expect, which can cause big surprises in production. Specifically, the way that the statsd telemeter samples counter increments means that you can miss out on very important information in your runtime metrics (like failures). To avoid missing important events, the sample rate on the telemeter can be increased, but this will lead to increased latency.

We’ve migrated most of our other Datadog/statsd users onto the influxdb telemeter, which is much less lossy. These users tend to use Telegraf to export higher-fidelity stats to Datadog (https://github.com/influxdata/telegraf/tree/release-1.3/plugins/outputs/datadog).


#2

Thanks for announcing this ahead of time. Could you please give me a bit more context on what this means? Specifically, why does this cause a loss of data?

We’re actually not ingesting metrics from linkerd into this pipeline right now, but we were planning to in the near future. Having linkerd’s metrics outside our “normal” metrics system is a bit problematic so we’re keen to unify them, but now I see this is being removed…


#3

I’m happy to give a bit more context. The statsd telemeter is sampled, which means that for each event (such as a counter increment) a message is pushed to statsd only some fraction of the time, determined by the sample rate. Linkerd metrics contain high-velocity stats (such as request count), which must be sampled at a very low rate to avoid an excessive number of network requests to statsd. At the same time, Linkerd metrics also contain very low-velocity stats (such as individual failure type counts), where any amount of sampling dramatically decreases their usefulness or may cause you to miss the event altogether.
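To make the trade-off concrete, here is a rough simulation of a sampled counter. It is only a sketch and not Linkerd's actual code: it assumes the usual statsd convention that each increment is sent with probability equal to the sample rate and that the receiver scales what it gets by 1 / rate. The function names, the 0.01 rate, and the event counts are made up for illustration.

```python
import random


def sampled_count(true_events: int, sample_rate: float, rng: random.Random) -> float:
    """Estimate a counter total as a statsd-style receiver would.

    Each increment is sent with probability `sample_rate`; every increment
    that arrives is scaled by 1 / sample_rate.
    """
    sent = sum(1 for _ in range(true_events) if rng.random() < sample_rate)
    return sent / sample_rate


def miss_probability(true_events: int, sample_rate: float) -> float:
    """Probability that not a single increment is sent for this counter."""
    return (1.0 - sample_rate) ** true_events


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs match
    rate = 0.01              # hypothetical sample rate
    # Hypothetical workloads: one high-velocity stat, one low-velocity stat.
    for name, events in [("requests", 100_000), ("failures", 3)]:
        estimate = sampled_count(events, rate, rng)
        missed = miss_probability(events, rate)
        print(f"{name}: true={events} estimated={estimate:.0f} "
              f"P(nothing sent)={missed:.3f}")
```

At a rate that keeps the request counter cheap, the high-velocity estimate comes out close to the true value, while a counter that only fired three times has roughly a 97% chance (0.99 cubed) of never being reported at all.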

Because of this, we recommend using the influxdb telemeter instead, which can be adapted to a statsd backend using Telegraf. I hope this helps!


#4

Can’t you just use different sampling strategies for the different types of events?


#5

@Kosta Potentially! But it’s complicated. The telemeter itself has no knowledge of the semantics of the metrics that it tracks, so it doesn’t have enough information to determine a sampling strategy.

