The open source NGINX software performs basic checks on responses from upstream servers, retrying failed requests where possible. NGINX Plus adds out‑of‑band application health checks (also known as synthetic transactions) and a slow‑start feature to gracefully add new and recovered servers into the load‑balanced group.

These features enable NGINX Plus to detect and work around a much wider variety of problems, significantly improving the reliability of your HTTP and TCP applications.

Health Checks in NGINX Plus

Health checks continually test your upstream servers and instruct NGINX Plus to avoid servers that have failed, thus ensuring that your users don’t see error pages when servers fail or are taken down for maintenance.

With application health checks, you can check against a wide range of failure types, and NGINX Plus can probe custom pages and applications. Common uses include:

  • Regular testing of the home page of a website to verify the response code, content type, and web content
  • Regular execution of health tests that run on the upstream servers to verify that key services (database, filesystem) are connected and accessible, and resources are not exhausted (disk, memory)

You can even use health checks to automate the removal of upstream servers from, and their reintroduction into, the load‑balanced group in a clean and non‑disruptive fashion.

For more information, see In Detail – Health Checks.

Slow-Start in NGINX Plus

Slow‑start carefully ramps up traffic to a new or recovered server, avoiding a barrage of traffic that could overwhelm it.

Slow‑start is an important measure when adding a new server or reintroducing a recovered one to the load‑balanced group. By increasing the load gradually, NGINX Plus ensures the server is not overwhelmed by connections, reducing the risk associated with failover and maintenance and improving the reliability of your website.

For more information, see In Detail – Server Slow‑Start.


In Detail – Health Checks

HTTP health checks are out‑of‑band HTTP requests that probe each server. You configure the simplest health check by including the health_check directive in a location block. The following example configures the default health check and satisfactory response – a request for the / (slash) URI is sent to each server in the upstream group every 5 seconds. Servers that return a well‑formed 2xx or 3xx response are considered healthy; otherwise they are marked as failed. The mandatory parameter requires newly added servers to first pass this health check before they are sent any traffic.

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Enable simple health checks; 'mandatory' requires new servers
        # to pass the check before they receive traffic
        health_check mandatory;
    }
}

upstream backend {
    # Health-monitored upstream groups need a shared memory zone
    zone backend 64k;

    server web-server1;
    server web-server2;
}

TCP and UDP “connect” health checks (and more sophisticated “send/expect” versions) are defined in the stream configuration block.
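
As a minimal sketch of a TCP “connect” health check (the server names and port 3306 here are illustrative placeholders, not taken from the examples above):

stream {
    upstream tcp_backend {
        # A shared memory zone is also required for health-monitored
        # upstream groups in the stream context
        zone tcp_backend 64k;

        server db-server1:3306;
        server db-server2:3306;
    }

    server {
        listen 3306;
        proxy_pass tcp_backend;

        # By default, attempt a TCP connection to each server every
        # 5 seconds; a failed connection marks the server as failed
        health_check;
    }
}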

Tuning for More Sophisticated Health Checks

You can include optional parameters in the health_check directive for HTTP to define the frequency and content of the probing request. By including the match parameter and a corresponding match directive, you can define multiple conditions that servers in the upstream group must meet for the health check to succeed. Match blocks can test the response status code, arbitrary headers, and the first 256 KB of body data.

server {
    listen 80;
    location / {
        proxy_pass http://backend;

        health_check mandatory interval=2s fails=1 passes=5 uri=/test.php match=statusok;

        # The health check inherits other proxy settings
        proxy_set_header Host www.foo.com;
    }
}

match statusok {
    # Used for /test.php health check
    status 200;
    header Content-Type = text/html;
    body ~ "Server[0-9]+ is alive";
}

upstream backend {
    zone backend 64k;

    server web-server1;
    server web-server2;
}

Note: Take care when including runtime configuration variables in location blocks for which health checks are enabled. Such variables are generally set based on data in client requests, and they might not be set when a health‑check request is sent. If necessary, you can define a location block specifically for health checks, mark it as internal and define the request parameters unambiguously there.
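
For example, you might move the health check into a dedicated internal location where every parameter is set explicitly. In this sketch, the /health-probe path and the fixed Host value are placeholders; the uri and match parameters reuse the /test.php check and statusok block defined above:

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        # Depends on the client request; undefined during a health probe
        proxy_set_header Host $host;
    }

    # Dedicated health-check location; 'internal' blocks external requests
    location /health-probe {
        internal;
        proxy_pass http://backend;

        # Set explicitly, so the probe never depends on request variables
        proxy_set_header Host backend.example.com;

        health_check uri=/test.php match=statusok;
    }
}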

The TCP upstream module supports sophisticated send/expect‑style health checks.
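
A send/expect check in the stream context might look like the following sketch, which probes an HTTP service behind a TCP load balancer (the listen port and expected response text are assumptions):

stream {
    upstream http_backend {
        zone http_backend 64k;

        server web-server1:80;
        server web-server2:80;
    }

    # Send a raw HTTP request and expect a 200 status line in reply
    match http_ok {
        send "GET / HTTP/1.0\r\nHost: localhost\r\n\r\n";
        expect ~ "200 OK";
    }

    server {
        listen 8080;
        proxy_pass http_backend;

        health_check match=http_ok;
    }
}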

Use Case – Automating Maintenance

One common use case for health checks is automating the removal of servers from an upstream group. For example, if you routinely need to remove servers for maintenance (say, for software upgrades), configure a health check that tests for the presence of a particular file in the document root directory (docroot), for example /alive.txt. When you delete the file before shutting down the server, NGINX Plus marks the server as down. Any existing connections are maintained until they complete, but NGINX Plus does not send new connections to the down server.
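
A minimal sketch of such a check, reusing the backend group from the earlier examples:

location / {
    proxy_pass http://backend;

    # The default match treats 2xx/3xx responses as healthy, so the
    # 404 returned after /alive.txt is deleted marks the server as failed
    health_check uri=/alive.txt;
}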

NGINX Plus will continue to probe the server. Once the maintenance is complete, you can “touch” (re‑create) the missing file and the health check will succeed. NGINX Plus will then gradually reintroduce the server into the load‑balanced cluster using slow‑start.

For more details, see the reference documentation for upstream health checks.

In Detail – Server Slow‑Start

Server slow‑start is an important measure when a failed server has recovered and is reintroduced into an upstream group, as well as when a new server is added if health_check mandatory is configured. NGINX Plus slowly ramps up the load to the server over the defined period, allowing applications to “warm up” (populate caches, run just‑in‑time compilations, establish database connections, and so on). This prevents the server from being overwhelmed by a sudden spike of connections, which might time out and cause the server to fail.

To configure slow‑start, include the slow_start parameter on the server directive in the upstream context.

upstream backend {
    zone backend 64k;

    server web-server1 slow_start=30s;
    server web-server2 slow_start=15s;
}

NGINX Plus applies slow‑start regardless of why the server was marked as failed (that is, whether live transactions failed or a health check failed).