Scaling Web Applications with NGINX, Part 2: Caching and Monitoring

At nginx.conf 2015, Matt Williams of Datadog discussed scaling of web applications using NGINX load balancing and caching

This post is adapted from a talk given by Matt Williams at nginx.conf 2015, held in San Francisco in September. It is the second of two parts, and focuses on caching and monitoring; the first part focuses on load balancing. You can view the presentation slides and watch the complete talk on the NGINX, Inc. YouTube channel.

Table of Contents – Part 1, Load Balancing (previous post)

1:45 Benefits of Load Balancing/Caching
2:58 Load Balancing Methods
7:02 Which Method Should You Choose?
10:58 FYI Load Balancing
15:05 How to Ensure Session Persistence

Table of Contents – Part 2, Caching and Monitoring (this post)

17:08 Caching
19:33 FYI Caching
21:09 FYI Tuning
23:46 How to Find the Right Configuration
25:17 Why Monitor?
26:30 Datadog
27:50 NGINX Monitoring Tools
28:40 Tools to Test With
29:45 Key Metrics
30:29 Active Connections (Total and per Upstream)
31:00 Dropped Connections
31:33 Requests per Second
32:14 Error Rates
33:44 Request Processing Time
34:28 Available Servers per Upstream
35:15 Scaling Web Applications

17:08 Caching

Caching offloads serving of static content from web servers; NGINX stores cached items on disk [presentation by Matt Williams of Datadog at nginx.conf 2015]

Caching offloads the serving of static content from the upstream web servers. NGINX caches those objects to disk so that it can retrieve and serve them more efficiently.

The 'proxy_cache_path' directive enables caching, and there are many directives available for fine-tuning [presentation by Matt Williams of Datadog at nginx.conf 2015]

Enabling caching is pretty easy. proxy_cache_path specifies the path on the file system to store all of my cached objects. Then, I can specify parameters such as keys_zone=name:size.

keys_zone is an area in shared memory where my cache keys and metadata are stored. I give it a name and a size; the size specifies how much memory to allocate for storing cache keys. After that I can limit the size of the cache itself with max_size – this specifies how big the cache on disk can get before NGINX starts getting rid of stuff.

Then, how does NGINX identify each cached asset? That’s done with proxy_cache_key. By default the key is something similar to $scheme$proxy_host$uri$is_args$args, so for example http://www.datadog.com/myfavoriteintegrations?arguments. The key should contain everything that is needed to determine a unique response.

proxy_cache_valid lets me specify how long a cached response stays valid. For example, if the request is OK and I have a 200 response code, I can make that valid for up to 10 minutes or so. But if the file is not found and I get a 404, I might want to drop that cache validity to just 10 seconds.

proxy_cache_min_uses is pretty straightforward and just specifies how many times this asset needs to be requested before it gets cached. So if I say proxy_cache_min_uses 2, then the first time this asset gets served, we don’t cache it, but the second time, it gets cached.

proxy_cache_methods specifies which HTTP request methods to cache. GET and HEAD are always cached, and you can add POST.

proxy_cache_path itself goes in the http block; the other directives can go in the http block or a server block. Then in the location block I’ll have proxy_cache and specify which zone I want to use. This is the name from the keys_zone parameter that I defined in that first line.

I can also specify when I want this to expire. The location that contains the proxy_cache directive matches a specific path or a specific type of asset: maybe CSS files, or PNG files, or JPEGs, or whatever it is. Then everything that matches that location can expire after 10 hours, 5 hours, years, or whatever time frame you want.
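As a rough illustration, the directives described above might fit together as in the following sketch. The cache path, the zone name static_cache, the sizes, the upstream group app_servers with its addresses, and the file-extension pattern are all placeholder assumptions, not values from the talk.

```nginx
# This fragment belongs inside the http block.
# Path, zone name, and sizes are illustrative placeholders.
proxy_cache_path /var/cache/nginx keys_zone=static_cache:10m max_size=1g;

# Hypothetical group of upstream web servers.
upstream app_servers {
    server 10.0.0.11;
    server 10.0.0.12;
}

server {
    listen 80;

    # Close to the default key; it should capture everything that
    # makes a response unique.
    proxy_cache_key $scheme$proxy_host$uri$is_args$args;

    proxy_cache_valid 200 10m;   # successful responses stay valid for 10 minutes
    proxy_cache_valid 404 10s;   # "not found" responses only for 10 seconds
    proxy_cache_min_uses 2;      # cache an asset once it has been requested twice
    proxy_cache_methods GET HEAD;

    # Static assets only: CSS, PNG, and JPEG files.
    location ~* \.(css|png|jpe?g)$ {
        proxy_cache static_cache;    # zone name from keys_zone above
        expires 10h;                 # browser-side expiry for matching assets
        proxy_pass http://app_servers;
    }
}
```

With proxy_cache_min_uses 2, a response is stored only once the same key has been requested twice, which keeps rarely requested assets out of the cache.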

19:33 FYI Caching

Don't cache personal/private content and do set permissions; you can cache PHP results, override headers, and purge the cache [presentation by Matt Williams of Datadog at nginx.conf 2015]

Some things to keep in mind while caching.

Don’t cache personal or private content! Hopefully that’s obvious. If someone comes in and visits their personal account page, the last thing you want to do is cache that page so that the next user sees it. That would be bad.
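One possible way to keep personal pages out of the cache is sketched below. It assumes a hypothetical /account/ path, a backend listening on 127.0.0.1:8080, a zone named page_cache, and a session cookie named sessionid; none of these come from the talk.

```nginx
# This fragment belongs inside the http block; all names are hypothetical.
proxy_cache_path /var/cache/nginx/pages keys_zone=page_cache:10m;

server {
    listen 80;

    # Personal account pages are never cached.
    location /account/ {
        proxy_cache off;
        proxy_pass http://127.0.0.1:8080;
    }

    location / {
        proxy_cache page_cache;

        # If the request carries a non-empty session cookie, do not answer
        # from the cache and do not store the response in it.
        proxy_cache_bypass $cookie_sessionid;
        proxy_no_cache     $cookie_sessionid;

        proxy_pass http://127.0.0.1:8080;
    }
}
```

Both proxy_cache_bypass and proxy_no_cache take effect when their argument is non-empty and not equal to 0, so anonymous requests without the cookie are still cached.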

Another thing to check is that permissions are set correctly on the cache path. You define the path in the NGINX configuration file, and then you have to create that directory. Just make sure that the user and group running NGINX own that path.

So far we’ve talked about caching static assets. But you can also cache, for instance, the results of a PHP page. To do that, you still use all the same directives, but replace the proxy_ prefix with fastcgi_ or uwsgi_ as appropriate. So for caching you use fastcgi_cache_* or uwsgi_cache_* directives instead of proxy_cache_* directives.
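A minimal sketch of the FastCGI variant might look like this. The cache path, the zone name php_cache, the document root, and the PHP-FPM address 127.0.0.1:9000 are assumptions for illustration, and the fastcgi_params file is assumed to be the one shipped with a standard NGINX install.

```nginx
# This fragment belongs inside the http block; all names are hypothetical.
fastcgi_cache_path /var/cache/nginx/php keys_zone=php_cache:10m max_size=512m;

server {
    listen 80;
    root /var/www/html;

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass 127.0.0.1:9000;     # assumed PHP-FPM listener

        fastcgi_cache php_cache;
        # Unlike proxy_cache_key, fastcgi_cache_key has no default value,
        # so it has to be set explicitly.
        fastcgi_cache_key $scheme$request_method$host$request_uri;
        fastcgi_cache_valid 200 10m;
    }
}
```

The same caution about personal content applies here: pages generated per user should be excluded, for example with fastcgi_no_cache and fastcgi_cache_bypass.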

With any cached asset you can override the caching headers sent by the upstream server. NGINX also runs the cache loader process, which loads cache metadata into memory when NGINX starts, and the cache manager process, which purges old assets automatically. With NGINX Plus you can also purge specific assets with proxy_cache_purge.

21:09 FYI Tuning

NGINX provides many tools for tuning performance [presentation by Matt Williams of Datadog at nginx.conf 2015]

There’s a lot of tuning that you can do with NGINX, and there’s a great NGINX blog post [Tuning NGINX for Performance] that goes through pretty much all of it.

  • Backlog Queue – One of the settings you can tune for performance. By default the maximum number of connections in the backlog queue is really, really low. Normally NGINX responds to requests super quickly – that’s the whole point of it – so you don’t normally need a big backlog, but maybe you’re getting high traffic and reaching that maximum. In that case you might want to increase the number of connections that can wait in the backlog queue.
  • Ephemeral Ports – Every time a request comes into the load balancer and is proxied to an upstream web server, the load balancer sends the request out using another port. If you have many active connections at once, you could run out of ports. You can deal with that by changing the ephemeral port settings, which control how these ports are allocated and managed.
  • Worker Processes – A really easy setting to tune. Generally speaking, worker processes should be equal to the number of CPU cores on that box. Pretty easy to figure out and change, and you should get a little bit better performance out of it.
  • Logging – Logging takes a little bit of processing time, so if you turn that off for some requests or all requests, it will give you a little bump in performance.
  • Sendfile – Another setting to tune with NGINX. There is one super‑specific scenario where sendfile is really important: you happen to be using a development environment on your Mac and you happen to be doing it in Docker, which is running on top of VirtualBox, and you’re doing a shared volume and that shared volume is being served out on NGINX. Then, if you make any changes to any files that are being shared on that volume, NGINX won’t see those changes until you restart the Docker container, which totally sucks. So in order to avoid that, turn sendfile off and oh my God, everything is magic again and it just works!
  • Limits – Another great way to tune performance. I can limit the number of connections and other settings to control how many resources clients use.
  • Compression – Turning on compression will send responses to clients in compressed form to save bandwidth, but adds some processing overhead.
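Several of the settings in this list map onto a handful of directives. The following is only a sketch of where they live in nginx.conf; every number, path, and zone name is a placeholder to be tuned and tested against your own traffic, not a recommendation from the talk.

```nginx
# Worker Processes: "auto" starts one worker per available CPU core.
worker_processes auto;

events {
    worker_connections 1024;
}

http {
    # Sendfile: turned off here for the shared-volume development
    # scenario described above.
    sendfile off;

    # Compression: saves bandwidth at the cost of some CPU time.
    gzip on;
    gzip_types text/css application/javascript application/json;

    # Limits: track concurrent connections per client address.
    limit_conn_zone $binary_remote_addr zone=per_ip:10m;

    server {
        # Backlog Queue: a larger accept queue for this listening socket.
        # The operating system also caps this value (on Linux, the
        # net.core.somaxconn sysctl).
        listen 80 backlog=4096;

        limit_conn per_ip 20;

        # Logging: skip access logging for static assets.
        location ~* \.(css|png|jpe?g)$ {
            access_log off;
            root /var/www/html;
        }
    }
}

# Ephemeral Ports are an operating system setting rather than an NGINX
# directive (on Linux, the net.ipv4.ip_local_port_range sysctl).
```

Changing one setting at a time and re-running the same test makes it easier to tell which change actually helped.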

23:46 How to Find the Right Configuration

Finding the right configuration for your NGINX servers is an iterative process of learning about NGINX, deploying a server and monitoring it, setting up load balancing and monitoring it, and so on [presentation by Matt Williams of Datadog at nginx.conf 2015]

Now that we’ve talked about all these configuration options, how do you know if you’ve got the right setup? It’s basically an 8‑step process… well, an 8‑ to 800‑step process.

Step One: Read the documentation! Reading the user manual is always a good idea. Then read what’s on the rest of the web, because there’s all sorts of amazing stuff out there.

And then [Step 3] don’t just start off configuring the whole environment. Focus just on one web server. Just make sure that one web server’s working really well for your specific environment. Don’t go solving everything. Solve that one little problem first. Monitor it. Turn on some sort of monitoring solution, such as Datadog. Test it.

Once you’ve tested your web server, go back to Step 3, configure the server, and repeat that process. Keep repeating and iterating until you get something that really works for your environment.

Now once you’ve got that one web server going, go on and replicate that out for all your web servers. Then do the same thing with the load balancer. Monitor and test, and repeat those two steps. And keep doing that and iterating until you’ve got a really great environment.

When do you stop monitoring? You don’t. You keep monitoring because that monitoring is going to be super valuable in the future, because there will be a problem at some point.

25:17 Why Monitor?

Why do we monitor? You’ve got to know whether things are improving or not.

If you don’t monitor, all you’re relying on is some customer or CXO at some point saying “Hey the website’s broken!” which is not what you want to hear. So you want to have a monitoring solution that’s constantly monitoring your server and your environment to verify that it’s working well.

The built-in NGINX Plus dashboard provides detailed statistics about NGINX and the backend servers it is proxying and load balancing [presentation by Matt Williams of Datadog at nginx.conf 2015]

One existing, basic monitoring solution is the dashboard that’s part of NGINX Plus. It’s a beautiful dashboard and pretty simple, but it’s pretty awesome and there’s a lot of information here. However, it only shows me the current status of the NGINX site. I’m just looking at what’s going on right now, and sometimes you need to see a little history.

26:30 Datadog

Datadog provides a dashboard for monitoring NGINX deployments [presentation by Matt Williams of Datadog at nginx.conf 2015]

So here’s Datadog, another monitoring solution. In the top right I’ve got connections to the load balancer which I’ve been testing. I’ve been hitting my server with a bunch of connections.

Ideally I want to make sure my tests last fairly long – an hour, two hours, or a day. Set your tests up and let them run. Make sure that things are working well and then make changes to the configuration, and then continue monitoring as it goes on.

You might be wondering what these vertical pink parts are within Datadog. Those vertical lines each represent some event. I’m saying, “show me all the events that have to do with benchmark” and every time I do a benchmark test, right before that I send an event to Datadog saying, “I’m starting a new benchmark and here are the parameters”. That way I can see why this spike is there, and I’ve got some explanation of what’s going on. Then I can correlate connections to each web server, load balancer, average response time, and so forth.

27:50 NGINX Monitoring Tools

There are many monitoring tools for NGINX, including its own live activity monitoring dashboard, Datadog, ngxtop, and luameter [presentation by Matt Williams of Datadog at nginx.conf 2015]

You don’t have to use Datadog. There are lots of other tools. ngxtop is a pretty cool‑looking one on GitHub. luameter is another neat one which looks pretty close to the previous generation of the NGINX Plus dashboard. It’s got sparklines to show hash performance – pretty cool stuff. And there are lots of others.

28:40 Tools to Test With

Tools for testing NGINX include ab, siege, curl-loader, blitz.io, httperf, jmeter, and tsung [presentation by Matt Williams of Datadog at nginx.conf 2015]

So what tools can you use to test?

There are lots of testing tools available. There’s ab (ApacheBench), there’s siege, and there are lots of other tools as well, and just as many opinions about which ones to use or avoid. These tools basically pound your server with a ton of extra requests.

Two more interesting ones are Blitz and Tsung. Blitz is an online solution that’s gonna pound your server from lots of different testing servers. Tsung is a way of setting up a cluster of testing boxes, and they’re all managed from one place. I can start a job and all ten of my testing servers are going to start pounding my web server with a bunch of requests. Pretty cool.

You could use real customers. For instance, you make a configuration change and let it sit there for a day, while monitoring and watching real customers use your site. Has it improved or not?

29:45 Key Metrics

So now I know the general process of monitoring and testing, I know what testing tools I’m going to use, I know what kind of monitoring tools I’m going to use to verify things are working. And I know all the different options that are available to me to set up my load balancing and caching server.

Now, of the things that are being monitored, which metrics do I need to look at to verify whether things are going well? For NGINX there are potentially twenty or thirty different metrics being updated every second.

That’s a lot of stuff to look at. So what’s really important?

30:29 Active Connections (Total and per Upstream)

An important metric to monitor is active connections; deviations from normal can indicate server overload or the wrong choice of load-balancing method [presentation by Matt Williams of Datadog at nginx.conf 2015]

Well, we think that the primary metrics you’re gonna want to look at are active connections – total connections overall and per upstream.

If there are deviations from what’s normal, it could indicate one of the servers is struggling to process requests, or you’re reaching saturation on one of the servers. Maybe that’s because the load balancing method you’re using is not the right one.

31:00 Dropped Connections

An important metric to monitor is dropped connections, which represents the difference between accepted and handled connections; a value other than zero is bad and might indicate resource saturation [presentation by Matt Williams of Datadog at nginx.conf 2015]

Another great metric to look at is dropped connections. Ideally this will be zero, meaning you don’t have dropped connections. But hey, dropped connections happen sometimes, so just try to keep this close to zero.

If this number rises, look out for resource saturation. Resource saturation is never a good thing. You always want to be sure that you can handle the load.
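In open source NGINX, the connection counters discussed here are exposed by the stub_status module, assuming your NGINX build includes it. A minimal sketch follows; the listen address, port, and /nginx_status path are arbitrary choices for illustration.

```nginx
# This fragment belongs inside the http block.
# A small status endpoint reachable only from the local machine.
server {
    listen 127.0.0.1:8080;

    location /nginx_status {
        stub_status;        # older releases need "stub_status on;"
        allow 127.0.0.1;    # only local monitoring agents may read it
        deny all;
    }
}
```

The status page reports the current number of active connections plus cumulative accepts, handled, and requests counters. Dropped connections are accepts minus handled, and requests per second can be derived by sampling the requests counter at a fixed interval.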

31:33 Requests per Second

An important metric to monitor is requests per second; in particular, a big drop might indicate a problem downstream from NGINX [presentation by Matt Williams of Datadog at nginx.conf 2015]

At a glance, this doesn’t tell you that much. Oh, I have 500 requests per second right now. That doesn’t give me a lot of information.

If there’s a spike, that could be good, that could be bad. It depends on what caused the spike. But if there’s constant flow and then all of a sudden a big drop, that’s something I should definitely be alerted to and check out. Those drastic changes could indicate a problem – probably not with NGINX, but maybe something before NGINX, such as your connection to the web, or something else.

32:14 Error Rates

An important metric to monitor is error rates (for 4xx and 5xx errors, for example), particularly as a percent of all connections [presentation by Matt Williams of Datadog at nginx.conf 2015]

Another metric to look out for is the error rate by response code – so 4xx and 5xx errors.

Look out for those, but don’t just look at the raw numbers. If I see there are five hundred 500 errors, that doesn’t tell me that much. I want to see the error count divided by total requests so I can see what percentage of my requests results in 500 errors. If that rate is climbing, that’s probably worth investigating. And if it’s a sharp increase, that’s going to need urgent attention.

It would be really cool if you had that available as a metric in NGINX, but you don’t. You only have it in the log files, so you’ll have to parse those log files to figure out the number of 4xx errors and 5xx errors. You can do that in Datadog with a tool called Dogstream, or you can use other tools like Splunk, which is a great one, or Sumo Logic. There are lots of other great tools to process logs and bring that data into Datadog or into other monitoring software as well.
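If you go the log-parsing route, it helps to put the status code in a predictable position in each log line. The sketch below assumes a hypothetical format name and log path; whichever parser you use would then count 4xx and 5xx lines and divide by the total number of lines for the same period.

```nginx
# This fragment belongs inside the http block.
# "status_log" and the file path are hypothetical names.
log_format status_log '$remote_addr [$time_local] "$request" '
                      '$status $body_bytes_sent';

access_log /var/log/nginx/access.log status_log;
```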

With NGINX Plus, those error rates are available as a metric. So that’s another cool thing with NGINX Plus.

33:44 Request Processing Time

An important metric to monitor is request processing time; an increase above the average can indicate an upstream problem [presentation by Matt Williams of Datadog at nginx.conf 2015]

How long is each request taking?

You probably don’t care about how long each single request is taking. You probably care more about the average for all the requests coming in within a certain time period, or all the requests going to a certain server. How long on average do these requests take to process? If this is going up, it could point to some issue upstream on one of the web servers. You might be getting too many requests, and as the number of requests goes up, the request processing time might also increase.

The request processing time might also go up depending on what the server is doing. So that could point to some sort of problem, possibly with the configuration of that server.
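One way to collect this from open source NGINX is to log the timing variables and average them in your log pipeline. This is a sketch with a hypothetical format name and log path; $request_time covers the whole request as seen by NGINX, while $upstream_response_time covers only the time spent waiting on the upstream server.

```nginx
# This fragment belongs inside the http block.
# "timing" and the file path are hypothetical names.
log_format timing '$remote_addr [$time_local] "$request" $status '
                  'rt=$request_time urt=$upstream_response_time '
                  'upstream=$upstream_addr';

access_log /var/log/nginx/timing.log timing;
```

Logging $upstream_addr alongside the timings lets you average per upstream server, which helps show whether a slowdown is confined to one machine.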

34:28 Available Servers per Upstream

An important metric to monitor is the percentage of upstream servers that are functioning correctly [presentation by Matt Williams of Datadog at nginx.conf 2015]

If one of my servers has a problem, that kind of sucks, especially if I only have a few servers. But if I have ten upstream web servers, and one of them has a problem, we should fix it, but it’s not as big of a deal.

Now if 50% or 80% of my servers are having a problem, that’s a big deal – I’d better fix that. So available servers per upstream is another important metric to keep an eye on.

35:15 Scaling Web Applications

In this session I wanted to make sure that you know what the options are around scaling, load balancing, and caching.

We talked about how to verify that the changes you make are having a positive impact: monitor and test, hitting the server and load balancer to verify things are working as you expect. Put real users on it to verify that they’re seeing what they should see. And once you do that, look at some key metrics to verify that things really are as good as they should be, or at least heading in the right direction.

As I mentioned, my name is Matt Williams, and I work at Datadog. You can reach me on Twitter at @technovangelist and my email is matt.williams@datadoghq.com
