<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bébhinn Egan</title>
    <description>The latest articles on Forem by Bébhinn Egan (@bbhnn).</description>
    <link>https://forem.com/bbhnn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F203424%2F9cdf34cb-97b4-445a-9595-aaffabe06a07.jpg</url>
      <title>Forem: Bébhinn Egan</title>
      <link>https://forem.com/bbhnn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bbhnn"/>
    <language>en</language>
    <item>
      <title>Pandora's Flask: Monitoring a Python web app with Prometheus</title>
      <dc:creator>Bébhinn Egan</dc:creator>
      <pubDate>Mon, 19 Aug 2019 11:08:37 +0000</pubDate>
      <link>https://forem.com/bbhnn/pandora-s-flask-monitoring-a-python-web-app-with-prometheus-540p</link>
      <guid>https://forem.com/bbhnn/pandora-s-flask-monitoring-a-python-web-app-with-prometheus-540p</guid>
      <description>&lt;p&gt;Originally posted on &lt;a href="https://www.metricfire.com/prometheus-tutorials/monitoring-python-web-app"&gt;Metricfire&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We eat lots of our &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food#Origin_of_the_term"&gt;own dogfood&lt;/a&gt; at &lt;a href="https://www.metricfire.com/"&gt;Metricfire&lt;/a&gt;, monitoring our services with a dedicated cluster running the same software.&lt;/p&gt;

&lt;p&gt;This has worked out really well for us over the years: as our own customer, we quickly spot issues in our various ingestion, storage and rendering services. It also drives the &lt;a href="https://www.atlassian.com/blog/statuspage/system-metrics-transparency"&gt;service status transparency&lt;/a&gt; our customers love.&lt;/p&gt;

&lt;p&gt;Recently we’ve been working on a new &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; offering. Eating the right dogfood here means integrating Prometheus monitoring with our (mostly &lt;a href="https://www.hostedgraphite.com/blog/developing-and-deploying-python-in-private-repos"&gt;Python&lt;/a&gt;) backend stack.&lt;/p&gt;

&lt;p&gt;This post describes how we’ve done that in one instance, with a fully worked example of monitoring a simple &lt;a href="http://flask.pocoo.org/"&gt;Flask&lt;/a&gt; application running under &lt;a href="https://uwsgi-docs.readthedocs.io/"&gt;uWSGI&lt;/a&gt; + &lt;a href="https://www.nginx.com/"&gt;nginx&lt;/a&gt;. We’ll also discuss why it remains surprisingly involved to get this right.&lt;/p&gt;

&lt;h1&gt;A little history&lt;/h1&gt;

&lt;p&gt;Prometheus' ancestor and main inspiration is &lt;a href="https://landing.google.com/sre/sre-book/chapters/practical-alerting/"&gt;Google's Borgmon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In its native environment, Borgmon relies on ubiquitous and straightforward &lt;a href="https://en.wikipedia.org/wiki/Service_discovery"&gt;service discovery&lt;/a&gt;: monitored services are managed by &lt;a href="https://ai.google/research/pubs/pub43438"&gt;Borg&lt;/a&gt;, so it’s easy to find e.g. all jobs running on a cluster for a particular user; or for more complex deployments, all sub-tasks that together make up a job.&lt;/p&gt;

&lt;p&gt;Each of these might become a single target for Borgmon to scrape data from via /varz endpoints, analogous to Prometheus’ /metrics. Each is typically a multi-threaded server written in C++, Java, Go, or (less commonly) Python.&lt;/p&gt;

&lt;p&gt;Prometheus inherits many of Borgmon's assumptions about its environment. In particular, client libraries assume that metrics come from various libraries and subsystems, in multiple threads of execution, running in a shared address space. On the server side, Prometheus assumes that one target is one (probably) multi-threaded program.&lt;/p&gt;

&lt;h1&gt;Why did it have to be snakes?&lt;/h1&gt;

&lt;p&gt;These assumptions break in many non-Google deployments, particularly in the Python world. Here it is common (e.g. using &lt;a href="https://www.djangoproject.com/"&gt;Django&lt;/a&gt; or &lt;a href="http://flask.pocoo.org/"&gt;Flask&lt;/a&gt;) to run under a &lt;a href="https://wsgi.readthedocs.io/en/latest/what.html"&gt;WSGI&lt;/a&gt; application server that spreads requests across multiple workers, each of which is a process rather than a thread.&lt;/p&gt;

&lt;p&gt;In a naive deployment of the Prometheus &lt;a href="https://github.com/prometheus/client_python"&gt;Python client&lt;/a&gt; for a Flask app running under uWSGI, each request from the Prometheus server to /metrics can hit a different worker process, each of which exports its own counters, histograms, etc. The resulting monitoring data is garbage.&lt;/p&gt;

&lt;p&gt;For example, each scrape of a specific counter will return the value for one worker rather than the whole job: the value jumps all over the place and tells you nothing useful about the application as a whole.&lt;/p&gt;
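
&lt;p&gt;A toy sketch of the symptom (worker names and request counts below are invented for illustration): each worker keeps its own counter, each scrape lands on an arbitrary worker, and no scrape ever sees the per-job total.&lt;/p&gt;

```python
import random

# Hypothetical per-worker request counters for a 4-worker uWSGI app.
workers = {"w1": 1042, "w2": 987, "w3": 1130, "w4": 1011}

def scrape():
    # Each request to /metrics hits an arbitrary worker and returns
    # only that worker's view of requests_total.
    return workers[random.choice(list(workers))]

# The job-wide total is 4170, but individual scrapes bounce between
# the four per-worker values and never report it.
job_total = sum(workers.values())
samples = [scrape() for _ in range(5)]
```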

&lt;h1&gt;A solution&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://echorand.me/"&gt;Amit Saha&lt;/a&gt; discusses the same problems and various solutions in a &lt;a href="https://echorand.me/your-options-for-monitoring-multi-process-python-applications-with-prometheus.html"&gt;detailed writeup&lt;/a&gt;. We follow option #2: the Prometheus Python client includes a &lt;a href="https://github.com/prometheus/client_python#multiprocess-mode-gunicorn"&gt;multiprocess mode&lt;/a&gt; intended to handle this situation, with &lt;a href="https://gunicorn.org/"&gt;gunicorn&lt;/a&gt; being the motivating example of an application server.&lt;/p&gt;

&lt;p&gt;This works by sharing a directory of &lt;a href="https://github.com/prometheus/client_python/blob/624bb61e6f15e0c3739fec853edaea2b91d1674f/prometheus_client/mmap_dict.py"&gt;mmap()'d dictionaries&lt;/a&gt; across all the processes in an application. Each process then does the maths to return a shared view of the whole application's metrics when it is scraped by Prometheus.&lt;/p&gt;
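
&lt;p&gt;A stdlib-only sketch of the idea (a simplified model, not the client's actual mmap file format): each worker writes its own counts to a file in the shared directory, and whichever worker gets scraped sums across every file.&lt;/p&gt;

```python
import json
import os
import tempfile

SHARED_DIR = tempfile.mkdtemp()  # stands in for prometheus_multiproc_dir

def record(pid, metric, value):
    # Each worker owns one file keyed by its pid, so writers never race.
    path = os.path.join(SHARED_DIR, "counter_%d.json" % pid)
    counts = {}
    if os.path.exists(path):
        with open(path) as f:
            counts = json.load(f)
    counts[metric] = counts.get(metric, 0) + value
    with open(path, "w") as f:
        json.dump(counts, f)

def collect(metric):
    # The scraped worker "does the maths": sum the metric across all
    # workers' files to present one job-wide view.
    total = 0
    for name in os.listdir(SHARED_DIR):
        with open(os.path.join(SHARED_DIR, name)) as f:
            total += json.load(f).get(metric, 0)
    return total

# Two workers each count some requests...
record(101, "requests_total", 3)
record(102, "requests_total", 5)
```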

&lt;p&gt;This has some "headline" &lt;a href="https://github.com/prometheus/client_python#multiprocess-mode-gunicorn"&gt;disadvantages listed in the docs&lt;/a&gt;: no per-process Python metrics for free, lack of full support for certain metric types, a slightly complicated Gauge type, etc.&lt;/p&gt;

&lt;p&gt;It's also difficult to configure end-to-end. Here's what's necessary &amp;amp; how we achieved each part in our environment; hopefully this &lt;a href="https://github.com/hostedgraphite/pandoras_flask"&gt;full example&lt;/a&gt; will help anyone doing similar work in the future.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The shared directory must be passed to the process as an environment variable, &lt;em&gt;prometheus_multiproc_dir&lt;/em&gt;. No problem: we use uWSGI's &lt;a href="https://uwsgi-docs.readthedocs.io/en/latest/Options.html#env"&gt;env&lt;/a&gt; option to pass it in; see &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/conf/uwsgi.ini#L13"&gt;uwsgi.ini&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The client’s shared directory must be cleared across application restarts. This was a little tricky to figure out. We use one of uWSGI's &lt;a href="https://uwsgi-docs.readthedocs.io/en/latest/Hooks.html#the-hookable-uwsgi-phases"&gt;hardcoded hooks&lt;/a&gt;, exec-asap, to &lt;a href="https://uwsgi-docs.readthedocs.io/en/latest/Hooks.html#exec-run-shell-commands"&gt;exec a shell script&lt;/a&gt; right after reading the configuration file and before doing anything else. See &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/conf/uwsgi.ini#L14"&gt;uwsgi.ini&lt;/a&gt;. Our script &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/bin/clear_prometheus_multiproc"&gt;removes &amp;amp; recreates&lt;/a&gt; the Prometheus client's shared data directory. In order to be sure of the right permissions, we run uwsgi under &lt;a href="http://supervisord.org/"&gt;supervisor&lt;/a&gt; as root and &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/conf/uwsgi.ini#L18"&gt;drop privs&lt;/a&gt; within uwsgi.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The application must set up the Python client’s multiprocess mode. This is mostly a matter of &lt;a href="https://github.com/prometheus/client_python#multiprocess-mode-gunicorn"&gt;following the docs&lt;/a&gt;, which we did via &lt;a href="https://echorand.me/your-options-for-monitoring-multi-process-python-applications-with-prometheus.html"&gt;Saha's post&lt;/a&gt;: see &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/pandoras_flask/metrics.py"&gt;metrics.py&lt;/a&gt;. Note that this includes some neat &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/pandoras_flask/metrics.py#L17"&gt;middleware&lt;/a&gt; exporting Prometheus metrics for response status and latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;uWSGI must set up the application environment so that applications load after &lt;em&gt;fork()&lt;/em&gt;. By default, uWSGI attempts to save memory by loading the application and then &lt;em&gt;fork()'ing&lt;/em&gt;. This indeed has &lt;a href="https://en.wikipedia.org/wiki/Copy-on-write"&gt;copy-on-write&lt;/a&gt; advantages and might save a significant amount of memory. However, it appears to interfere with the operation of the client's multiprocess mode, possibly because some &lt;a href="https://rachelbythebay.com/w/2011/06/07/forked/"&gt;locking happens prior to fork()&lt;/a&gt; this way. uWSGI's &lt;a href="https://uwsgi-docs.readthedocs.io/en/latest/articles/TheArtOfGracefulReloading.html?highlight=lazy-apps#preforking-vs-lazy-apps-vs-lazy"&gt;lazy-apps option&lt;/a&gt; allows us to load the application after forking, which gives us a cleaner environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
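
&lt;p&gt;Pieced together, the uWSGI side of steps 1, 2 and 4 looks roughly like this (the paths and script name are illustrative; the demo repo's uwsgi.ini is the authoritative version):&lt;/p&gt;

```ini
[uwsgi]
; 1. hand the shared directory to every worker
env = prometheus_multiproc_dir=/var/run/myapp/prometheus
; 2. wipe stale metric files before anything else happens
exec-asap = /usr/local/bin/clear_prometheus_multiproc
; 4. load the application after fork() for a clean client state
lazy-apps = true
```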

&lt;p&gt;So altogether, this results in a working &lt;em&gt;/metrics&lt;/em&gt; endpoint for our Flask app running under uWSGI. You can try out the full worked example in our &lt;a href="https://github.com/hostedgraphite/pandoras_flask/"&gt;pandoras_flask demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that in our demo we expose the metrics endpoint on a &lt;a href="https://github.com/hostedgraphite/pandoras_flask/blob/master/conf/nginx.conf#L28"&gt;different port&lt;/a&gt; to the app proper: this makes it easy to allow our monitoring access without users being able to hit it.&lt;/p&gt;
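
&lt;p&gt;The nginx side of that can be sketched with two server blocks (the ports and uwsgi socket address here are illustrative, not copied from the demo config):&lt;/p&gt;

```nginx
# Public listener: the app proper.
server {
    listen 80;
    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:3031;
    }
}

# Monitoring-only listener: firewall this port off from end users.
server {
    listen 9091;
    location /metrics {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:3031;
    }
}
```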

&lt;p&gt;In our deployments, we also use the &lt;a href="https://github.com/timonwong/uwsgi_exporter"&gt;uwsgi_exporter&lt;/a&gt; to get more stats out of uWSGI itself.&lt;/p&gt;

&lt;h1&gt;Futures&lt;/h1&gt;

&lt;p&gt;Saha's &lt;a href="https://echorand.me/your-options-for-monitoring-multi-process-python-applications-with-prometheus.html"&gt;blog post&lt;/a&gt; lays out a series of alternatives, with pushing metrics via a local &lt;a href="https://github.com/etsy/statsd"&gt;statsd&lt;/a&gt; as the favoured solution. That’s an extra hop we’d rather not take.&lt;/p&gt;

&lt;p&gt;Ultimately, running everything under container orchestration like &lt;a href="https://kubernetes.io/"&gt;kubernetes&lt;/a&gt; would provide the native environment in which Prometheus shines, but that’s a big step just to get its other advantages in an existing Python application stack.&lt;/p&gt;

&lt;p&gt;Probably the most Promethean intermediate step is to register each sub-process separately as a scraping target. This is the approach taken by &lt;a href="https://github.com/korfuri/django-prometheus/blob/master/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes-per-process"&gt;django-prometheus&lt;/a&gt;, though the suggested “port range” approach is a bit hacky.&lt;/p&gt;

&lt;p&gt;In our environment, we could (and may yet) implement this idea with something like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Running a webserver inside a thread in each process, listening on an ephemeral port and serving &lt;em&gt;/metrics&lt;/em&gt; queries;&lt;/li&gt;
&lt;li&gt;Having the webserver register and regularly refresh its address (e.g. hostname:32769) in a short-TTL &lt;em&gt;etcd&lt;/em&gt; path—we use &lt;a href="https://github.com/etcd-io/etcd"&gt;etcd&lt;/a&gt; already for most of our service discovery needs;&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://prometheus.io/docs/guides/file-sd/"&gt;file-based service discovery&lt;/a&gt; in Prometheus to locate these targets and scrape them as individuals.&lt;/li&gt;
&lt;/ol&gt;
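
&lt;p&gt;For step 3, the registered workers would end up in a file-based service discovery target list along these lines (hostnames, ports and labels invented for illustration):&lt;/p&gt;

```json
[
  {
    "targets": ["app-host-1:32769", "app-host-1:32770"],
    "labels": {"job": "pandoras_flask", "worker": "per-process"}
  }
]
```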

&lt;p&gt;We think this approach is less involved than using the Python client’s multiprocess mode, but it comes with its own complexities.&lt;/p&gt;

&lt;p&gt;It’s worth noting that having one target per worker contributes to something of a time series explosion. For example, in this case a single default Histogram metric to track response times from the Python client across 8 workers would produce around 140 individual time series, before multiplying by other labels we might include. That’s not a problem for Prometheus to handle, but it does add up (or likely, multiply) as you scale, so be careful!&lt;/p&gt;
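
&lt;p&gt;The "around 140" figure is easy to check against the Python client's default histogram buckets:&lt;/p&gt;

```python
# The Python client's default histogram buckets (15 including +Inf);
# each bucket is one time series, plus the _sum and _count series.
DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75,
                   1.0, 2.5, 5.0, 7.5, 10.0, float("inf"))

series_per_worker = len(DEFAULT_BUCKETS) + 2  # buckets + _sum + _count
workers = 8
total_series = series_per_worker * workers    # 17 * 8 = 136
```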

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;For now, exporting metrics to Prometheus from a standard Python web app stack is a bit involved no matter which road you take. We hope this post will help people who just want to get going with their existing nginx + uwsgi + &lt;em&gt;Flask&lt;/em&gt; apps.&lt;/p&gt;

&lt;p&gt;As we run more services under container orchestration—something we intend to do—we expect it will become easier to integrate Prometheus monitoring with them.&lt;/p&gt;

&lt;p&gt;Established Prometheus users might like to look at our &lt;a href="https://www.metricfire.com/hosted-prometheus"&gt;Hosted Prometheus&lt;/a&gt; offering—if you’d like a demo, please &lt;a href="https://www.metricfire.com"&gt;get in touch&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Written by Cian Synnott, Programmer at &lt;a href="https://www.metricfire.com"&gt;Metricfire&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>Continuous self-testing at Hosted Graphite — why we send external canaries, every second</title>
      <dc:creator>Bébhinn Egan</dc:creator>
      <pubDate>Wed, 31 Jul 2019 09:43:20 +0000</pubDate>
      <link>https://forem.com/bbhnn/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second-4hh7</link>
      <guid>https://forem.com/bbhnn/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second-4hh7</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on the &lt;a href="https://www.hostedgraphite.com/blog/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second"&gt;Hosted Graphite blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Hosted Graphite, internal system monitoring is something our engineers are always concerned about. To detect any degradation we continuously self-test all our endpoints — every second, every day. This gives us early warning of small changes that could indicate larger problems, and tells us how well our service is running. We try to measure what our customers actually experience, and this is one of the main metrics of service quality that we grade ourselves on.&lt;/p&gt;

&lt;h1&gt;How does it work?&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Each canary service sends a value of '1' once per second with the appropriate timestamp to each endpoint that we accept metric traffic on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For TCP-based protocols we establish a new connection each time (and record the time taken), allowing us to measure network delays and overloaded services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We can then continuously monitor these metrics for any drop in value.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
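
&lt;p&gt;For the TCP case, steps 1 and 2 can be sketched in a few lines of Python (the host, port and metric path are invented; our production canaries are more involved):&lt;/p&gt;

```python
import socket
import time

def canary_payload(metric_path, ts):
    # Graphite plaintext protocol: "path value timestamp", newline-
    # terminated; the canary always sends a value of 1.
    return ("%s 1 %d\n" % (metric_path, ts)).encode()

def send_canary(host, port, metric_path, deadline=1.0):
    """Open a fresh TCP connection per datapoint (so connection setup
    is measured too) and give up past the deadline: a slow accept
    counts as a dropped datapoint."""
    try:
        with socket.create_connection((host, port), timeout=deadline) as sock:
            sock.settimeout(deadline)
            sock.sendall(canary_payload(metric_path, int(time.time())))
        return True
    except OSError:  # covers timeouts and refused connections
        return False
```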

&lt;p&gt;This system lets us know if something’s wrong: when the amount of traffic we record drops below an acceptable loss amount for a particular service, our alerting tools raise the alarm. We can then check on the exact number of failed data-points by looking at our &lt;a href="https://www.hostedgraphite.com/docs/advanced/data-views.html"&gt;sum dataviews&lt;/a&gt; presented on a dashboard. These figures are aggregated and grouped by protocol and source network, allowing us to determine the extent of the degradation at a glance. We can answer the "Just one AWS zone?" or "All UDP services?" questions extremely quickly because of this continuous, distributed monitoring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UJclhCfH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/l8s9jv3t8zsehhldysmn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UJclhCfH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/l8s9jv3t8zsehhldysmn.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;A snippet of the canaries dashboard — all green right now!&lt;/h6&gt;

&lt;h2&gt;The juicy details&lt;/h2&gt;

&lt;p&gt;We test all our endpoints: TCP, TCP w/ TLS, UDP, Pickle, Pickle w/ TLS, and HTTPS. Each test sends a datapoint every second so that we pick up on any degradation of service. To make sure we're not masking a delay in accepting the data or establishing a connection, the canary sends exactly one datapoint per second and abandons any attempt that takes longer than a second, ensuring a reliable, continuous rate of one value per second.&lt;/p&gt;

&lt;p&gt;If it takes longer than a second to connect and submit a datapoint, we'd record that as a dropped datapoint, because we care about the data delivery rate as well as how long it takes for us to accept it.&lt;/p&gt;

&lt;p&gt;A single datapoint gets sent for each protocol and these go to our Graphite-compatible endpoints. The sum rate of these metrics is then rendered in our alerting service and if the value drops below acceptable loss for a particular service, an on-call engineer is paged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wqtFU8ke--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/oc1sdiv6qegpe8r51pmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wqtFU8ke--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/oc1sdiv6qegpe8r51pmb.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;Canary sumrates dashboard — we offset the line at 2.0 from 1.0 for a clear view of both on the same graph&lt;/h6&gt;

&lt;h1&gt;Locations&lt;/h1&gt;

&lt;p&gt;Our canary services are located in three places:&lt;/p&gt;

&lt;h2&gt;External&lt;/h2&gt;

&lt;p&gt;The external canaries are all located outside of the main Hosted Graphite architecture/pipeline. They’re tagged and sent from different places, all over the world. We do this so that we can understand if, for example, there’s a problem with traffic from a specific region.&lt;/p&gt;

&lt;h2&gt;Internal&lt;/h2&gt;

&lt;p&gt;We run more canaries that sit within our own data centre. These test our local network, letting us distinguish failures likely caused by a mishap in our own service from external connectivity issues.&lt;/p&gt;

&lt;h2&gt;On-machine&lt;/h2&gt;

&lt;p&gt;While we send canary data from many locations outside and inside our network, we also send a continuous stream of canary data locally from each of our ingestion services to itself so failures of a single machine or service are immediately apparent. This data is also used automatically by our clustering and service discovery tools, allowing auto-remediation of most common individual machine failure types.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qTxdvQ6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tm0d41ew2gwhrpjyyfsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qTxdvQ6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/tm0d41ew2gwhrpjyyfsp.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;Status page (includes full incident history)&lt;/h6&gt;

&lt;p&gt;For transparency, &lt;a href="https://status.hostedgraphite.com/"&gt;we publish all the details on our status page&lt;/a&gt; so our customers have full visibility of any blip in our service and real time status information. We include a full breakdown of what caused the issue, who it affected, and how the issue has been resolved.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Dan Fox, SRE at Hosted Graphite.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>saas</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why we’re teaching our staff how to get a pay rise</title>
      <dc:creator>Bébhinn Egan</dc:creator>
      <pubDate>Tue, 30 Jul 2019 18:59:04 +0000</pubDate>
      <link>https://forem.com/bbhnn/why-we-re-teaching-our-staff-how-to-get-a-pay-rise-54i8</link>
      <guid>https://forem.com/bbhnn/why-we-re-teaching-our-staff-how-to-get-a-pay-rise-54i8</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on the &lt;a href="https://www.hostedgraphite.com/blog/surviving-on-call-tips-from-a-hosted-graphite-sre"&gt;Hosted Graphite blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At Hosted Graphite, we’re open about how we do things: both internally and externally. We publish the &lt;a href="https://status.hostedgraphite.com/"&gt;status of our internal systems&lt;/a&gt; (and &lt;a href="https://status.hostedgraphite.com/history"&gt;share a full history of all incidents&lt;/a&gt;), we share our weekly &lt;a href="https://baremetrics.com/"&gt;Baremetrics&lt;/a&gt; reports on revenue and churn rate with everyone in the company and, most recently, we told all our staff how to ask for a pay rise.&lt;/p&gt;

&lt;p&gt;For us, being transparent about pay doesn’t mean publishing everyone’s salary (&lt;a href="https://open.buffer.com/introducing-open-salaries-at-buffer-including-our-transparent-formula-and-all-individual-salaries/"&gt;though some companies do go that far&lt;/a&gt;). It’s more about making sure everyone understands how salary increases are decided and what steps to take to prepare for a review. Every company periodically reviews and adjusts salaries; we just make it clear to everyone what our approach is and how best to prepare.&lt;/p&gt;

&lt;p&gt;To do this, we laid out a clear process for salary reviews and wrote down the steps involved to make sure everyone gets the best deal. It includes:&lt;/p&gt;

&lt;p&gt;Preparing a list in advance with points on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What went well this year, and the impact this had on the company.&lt;/li&gt;
&lt;li&gt;What went badly, and what you learned from it.&lt;/li&gt;
&lt;li&gt;The salary range you think is fair, and why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remembering every project you've contributed to and every bit of impact you've had over a whole year is impossible. To help with this, we encourage everyone to write "snippets" — a short daily or weekly note about what you've done and what decisions were made. When it's time for an annual review, often the only source of data needed is these short activity logs. The key is doing it at regular intervals.&lt;/p&gt;

&lt;p&gt;Then we meet, discuss what’s been written, and talk about the proposal. The intention here is to teach every employee how to critically review what they've done in the last year, to identify weak spots, to think about how to counter those, and how to use this data to justify the salary they deserve.&lt;/p&gt;

&lt;p&gt;Learning to negotiate a salary is a very difficult thing, and nobody is going to go out of their way to teach you. That’s why being clear on how to ask for an increase fits in with our goal of making sure we develop the people we have, especially those who come to us early in their careers. Most people learn to negotiate by getting a raw deal a few times early in their career, and that’s something we’d like to avoid.&lt;/p&gt;

&lt;p&gt;It also ties in to our commitment to tackling &lt;a href="https://www.newyorker.com/business/currency/why-cant-silicon-valley-solve-its-diversity-problem"&gt;the tech diversity problem&lt;/a&gt; (in our own small way). We think that just saying we’re committed to equality and diversity isn’t enough and have put considerable energy into being &lt;a href="https://www.hostedgraphite.com/blog/no-brogrammers-practical-tips-for-writing-inclusive-job-ads"&gt;more inclusive in our hiring process&lt;/a&gt; and the way we operate. Being open about salary negotiations is another way we battle inequality. It’s our way of subverting the current system &lt;a href="https://www.payscale.com/career-news/2015/02/why-women-dont-negotiate-salary-and-what-to-do-about-it"&gt;where women and minorities are more likely to be uncomfortable negotiating salary&lt;/a&gt; and therefore tend to be paid less. It’s about having systems in place to make sure everyone gets a fair deal.&lt;/p&gt;

&lt;p&gt;Of course, it works in our interest too. If we don't adjust salaries correctly, people will feel undervalued and will leave. We could be sneaky or strict about this process like other companies are, but that just leads to an odd and adversarial environment where people can feel undervalued for a long time, or where people who aren't sure about the right way to justify their new salary won't get a good deal. A very happy team is worth a lot more than shaving a bit off everyone's salary and being Grinches about it. We're convinced that if we teach our staff this skill, and generally work on developing the already fantastic people we work with, they’ll be happier, produce better work, stick around longer and invite their talented friends to work with us too. It's not selfless, but we think being nicer and more human than other companies is going to benefit everyone in the long run, even if in the short term we're basically telling our employees how to get more money out of us.&lt;/p&gt;

</description>
      <category>hiring</category>
      <category>inclusion</category>
      <category>saas</category>
    </item>
    <item>
      <title>Surviving On-Call: Tips from a Hosted Graphite SRE</title>
      <dc:creator>Bébhinn Egan</dc:creator>
      <pubDate>Mon, 29 Jul 2019 18:23:28 +0000</pubDate>
      <link>https://forem.com/bbhnn/surviving-on-call-tips-from-a-hosted-graphite-sre-32hk</link>
      <guid>https://forem.com/bbhnn/surviving-on-call-tips-from-a-hosted-graphite-sre-32hk</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on the &lt;a href="https://www.hostedgraphite.com/blog/surviving-on-call-tips-from-a-hosted-graphite-sre"&gt;Hosted Graphite blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On-call is pain, and anyone who says otherwise is trying to sell you something. That said, there are lots of ways to make on-call a better experience, necessary evil that it is. I’ve been an SRE at &lt;a href="https://www.hostedgraphite.com/"&gt;Hosted Graphite&lt;/a&gt; since 2016, so I’ve done my fair share of on-call. A lot has already been written about how companies can make on-call a better experience for teams, and luckily for us we get a mandatory day off after an on-call shift, flexible working hours and a remote-friendly office, all of which go a long way towards making the on-call experience suck less. These are obviously organisation-specific, but for the purposes of this post I’d like to focus on things you can do to make your individual experience better, based on what I’ve learned.&lt;/p&gt;

&lt;h1&gt;Notification Hygiene&lt;/h1&gt;


&lt;p&gt;&lt;strong&gt;“It's important to practice good hygiene&lt;br&gt;
At least if you wanna run with my team”&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Del The Funky Homosapien (If You Must)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Notification hygiene is something that most of us are pretty terrible at – in and outside of work. Cleaning up your notifications is a very useful tactic for on-call, as chances are you'll be answering the pager and spending more time looking at screens than you normally would. If you use &lt;a href="https://www.pagerduty.com/"&gt;Pagerduty&lt;/a&gt;, I’d recommend grabbing a relevant &lt;a href="https://support.pagerduty.com/docs/notification-phone-numbers"&gt;Pagerduty V-Card&lt;/a&gt; for your region. Next you’ll want to set up your Do Not Disturb exceptions: for me that’s close family, my team-mates’ phone numbers and all of the numbers from the Pagerduty V-Card. After that, I turn off all other push notifications and pop my phone into Do Not Disturb mode. This setup makes sure the only audible notifications I get are from close family, phone alarms, Pagerduty notifications or emergency work calls. We use &lt;a href="https://slack.com/intl/en-ie/"&gt;Slack&lt;/a&gt; at Hosted Graphite, so I also enable Slack notifications during working hours and generally allow all calls from 7am-9pm to account for things like doctor’s appointments, package deliveries, or most importantly, takeaway deliveries.&lt;/p&gt;

&lt;p&gt;This might seem like a lot of effort, particularly if you use your phone pretty much as is (where you’re inundated with a barrage of notifications). If that works for you, that’s cool, but the goal I have during on-call shifts is to reduce the time I’m actively thinking about getting paged, unless the pager is currently yelling at me.&lt;/p&gt;

&lt;p&gt;Bonus tip: When you’re setting up Pagerduty notification channels, make sure to leave a minute or two delay before your first notification. This means that if you see a page happen in real time while you’re working, you can acknowledge it through the web app before it blows up your phone.&lt;/p&gt;

&lt;p&gt;Note: this depends entirely on the response times required for the services you’re on call for. If it’s an incredibly high-priority service that requires responses within seconds, this strategy isn’t going to work. However, if your services are resilient enough to allow longer response times for incidents, this strategy can make your incident management more effective, as you’re able to deal with issues properly without your phone constantly going off.&lt;/p&gt;

&lt;p&gt;In that case I’d recommend making your initial notification a push directly to your laptop, combined with one going to your phone: then you’ll get instant feedback before your phone goes off, because you’re looking directly at the screen and can acknowledge it from there.&lt;/p&gt;

&lt;h1&gt;Personal Care&lt;/h1&gt;


&lt;p&gt;&lt;strong&gt;“I know you've had a rough time&lt;br&gt;
Here I've come to hijack you (hijack you), I'll love you while&lt;br&gt;
I'm making the most of the night”&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Carly Rae Jepsen (Making the Most of the Night)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Like it or not, most on-call shifts are going to tire you out. The exception is when absolutely nothing happens at all, which we aspire to by doing proper incident follow-up, post-mortems, and focusing on infrastructure and software improvements to make on-call suck less. That said, the best laid plans of mice and men are nothing when faced with a faulty router in a data-centre at 3am.&lt;/p&gt;

&lt;p&gt;So you’ve been woken up at 3am, you’ve sorted your incident, but now you’re awake and can’t get back to sleep. The temptation is to get up, power through, finish work early, and try to grab some sleep later. However, that way lies danger, as it’s easy to fall into a pattern of staying up late and missing most of the first half of the day. Prioritising regular sleep is vital, as is sticking to a reasonable routine. I’m a night owl, and because we have flexible hours at Hosted Graphite, my hours tend to be something like 11am to 7pm, and I usually fall asleep around 2am. So even if I get paged at 8am and could just get up and start my day, I’m always going to take that sleep.&lt;/p&gt;

&lt;p&gt;When I first started doing on-call I didn’t do this, and tried to power through...exactly once. You would not think that an hour or two of extra sleep makes that much difference, but it does. I remember spending that day making tonnes of small mistakes and feeling like my brain was working at quarter-speed (and for the people who know me, that’s definitely saying something).&lt;/p&gt;

&lt;p&gt;An incident-heavy week can also reach its tendrils into other aspects of your personal life: if you find yourself working late into the evening, the temptation emerges to order in some food or grab a pizza on your way home. I’ve found that if I don’t do some prep in advance of being on-call, I end up mostly eating trash for the week and feeling way worse than I would otherwise. When combined with potentially disturbed sleep, the temptation to skip breakfast so you can spend a little longer in bed is another decision that will likely put a damper on your on-call week.&lt;/p&gt;

&lt;p&gt;To avoid this, before going on-call I’ll buy a box of some sort of cereal or breakfast bars and whatever fruit I feel like eating, and set myself the rule that I’m not leaving the house until I eat breakfast. If I make it as easy as possible to eat a decent breakfast, there’s no excuse not to. I have the usual work lunch, and at dinner time I make twice as much as I normally would, to cover me for the next day in case I get hit with a page. As for those extreme incident-laden weeks, anything you can set and forget until it’s done cooking is ideal, and it means you’ll still be eating well even if you’re dealing with a lot of incidents.&lt;/p&gt;

&lt;p&gt;The last, and probably most important aspect of looking after yourself on-call is to take stock of your mental health. You need to do whatever it is that lets you relax, be it video games, quiet time with a book, or a trip to the gym. Whatever it is, you need to make time for something that isn’t thinking about work, so you can recharge your batteries. You may be on-call for the week, but the pager isn’t your life.&lt;/p&gt;

&lt;p&gt;It’s always good to remember that you have a team behind you to support and back you up. At Hosted Graphite, the company wants well-rested and happy engineers, because well-rested and happy engineers make fewer mistakes, do better work, and make for a much stronger team.&lt;/p&gt;

&lt;h1&gt;
  
  
  Out and About on Call
&lt;/h1&gt;


&lt;p&gt;&lt;strong&gt;“If you go down in the woods today&lt;br&gt;
You're sure of a big surprise”&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Anne Murray (Teddy Bear’s Picnic)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There’s a perception that when you’re doing an on-call shift, you need to be at home, bound to your phone and laptop, hunched over in anticipation of the next notification so you can spring into action. This is not the case. Although I’d generally advise against a trip to the cinema or an evening at a fancy French restaurant, you can pretty much go about your regular routine, so long as you have a couple of things prepared:&lt;/p&gt;

&lt;h2&gt;
  
  
  A good 4G/LTE connection and phone to use as a hotspot
&lt;/h2&gt;

&lt;p&gt;If you’re doing on-call, you should be provided with this. At Hosted Graphite, our phone bills are covered by the office and we have a choice between using our own devices or a company-provided phone. Whatever way your company does it, you’re going to need ready access to the internet, both for notification delivery and for actual incident response.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your laptop
&lt;/h2&gt;

&lt;p&gt;Obviously, you’ll need a way to respond to incidents, and you probably don’t have much choice here. If you’re lucky enough to be able to pick and choose the device you use for your on-call purposes, I’d recommend the 13” MacBook Air for most SRE-type work, which is heavily biased towards work on remote servers. It’s lightweight and easily portable, and excellent battery life is also a big plus. For the Windows-inclined (or if you just want to run some straight-up Linux with no faff), I’ve also heard very good things about the ThinkPad X series.&lt;/p&gt;

&lt;h2&gt;
  
  
  A power bank and assorted chargers
&lt;/h2&gt;

&lt;p&gt;You don’t want to get caught with a dead phone or laptop, so a beefy power bank is worth picking up, for both peace of mind and to fill up any devices you may have if you’re caught somewhere without plug access, like a bus or train.&lt;/p&gt;

&lt;h2&gt;
  
  
  A solid backpack to hold all your on-call related stuff
&lt;/h2&gt;

&lt;p&gt;The ideal here is something that’s big enough to comfortably hold all of the above, but not so big that you’re hauling a huge backpack everywhere. When I’m on-call I tend to use a simple leather messenger bag for trips to and from work, and for everywhere else I use &lt;a href="https://www.amazon.co.uk/gp/product/B076LQ8WT8/ref=oh_aui_search_detailpage?ie=UTF8&amp;amp;psc=1"&gt;this MATEIN backpack (amazon uk)&lt;/a&gt;. It holds everything I need and collapses down fairly thin when all the space isn’t in use. It also has a handy USB port on the side, so you can plug in your power bank without needing to take it out.&lt;/p&gt;

&lt;p&gt;You’ll probably have other specific needs, like a YubiKey for auth or maybe a swipe-card reader for VPN access, so I’ve tried to focus on the bare minimum here. The most important thing is to do your best to forget the pager is a thing until it actually pages, and go about your life.&lt;/p&gt;

&lt;h1&gt;
  
  
  Final note
&lt;/h1&gt;

&lt;p&gt;If you only take one thing away from this post, it’s that you need to put your own well-being first, and once you do that other aspects of on-call will become easier. These mobile meat sacks we inhabit are fragile at best, and it is both the responsibility of the person on-call and also the company’s leadership to ensure that people who are doing on-call on a regular basis are given the resources they need to succeed, and the time they need to stay well-rested, healthy and happy.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written by Dave Fennell, SRE at Hosted Graphite.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>mentalhealth</category>
      <category>saas</category>
    </item>
  </channel>
</rss>
