Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 3: Application Version

This is the third of three articles in our series about using NGINX Plus to upgrade backend servers with zero downtime. In the first article, we describe the two NGINX Plus features you can use for backend upgrades with zero downtime – the on-the-fly reconfiguration API and application‑aware health checks – and discuss the advantages of each method.

In this third article, we cover several use cases of upgrading the version of an application on a group of upstream servers. For use cases around upgrading the software or hardware on individual server machines, see the second article, Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 2 – Individual Servers.

When we switch to a new version of an application by adding a new set of servers to run it, we need a controlled way of taking the servers with the old version offline and bringing the servers with the new version online. We’ll go over a number of ways to achieve this. Two factors help in choosing among them: whether it is acceptable for clients to reach both versions during the upgrade, and whether the new servers join the existing upstream group or form a new one.

Base Configuration for the Use Cases

For the API examples, we make the API calls from the NGINX Plus instance itself, so they are sent to localhost.

The base configuration for the use cases starts with two servers in a single upstream configuration block called demoapp. In the first server configuration block, we configure a virtual server listening on port 80 that load balances all requests to the demoapp upstream group.

We’re configuring an application health check, which is a best practice for reducing the impact of backend server errors on the user experience and for improving monitoring. Here we configure the health check to succeed if the server returns the file healthcheck.html with an HTTP 2xx or 3xx response code (the default success criterion for health checks).

Though it’s not strictly necessary for basic health checks, we’re putting the health_check directive in its own location block. This is a good practice as it allows us to configure different settings, such as timeouts and headers, for health checks versus regular traffic. For a use case where a separate location for the health_check directive is required, see Doing a Dark Launch.

# In the HTTP context
upstream demoapp {
    zone demoapp 64k;
    server 172.16.210.81:80;
    server 172.16.210.82:80;
}

server {
    listen 80;
    status_zone demoapp;

    location / {
        proxy_pass http://demoapp;
    }

    location @healthcheck {
        internal;
        proxy_pass http://demoapp;
        health_check uri=/healthcheck.html;
    }
}

We also configure a second virtual server that listens on port 8080 for requests to locations corresponding to the dynamic reconfiguration API (/upstream_conf), the NGINX Plus status dashboard (/status.html), and the NGINX Plus status API (/status). Note that these location names are the conventional ones, but you can choose different names if you wish.

It is a best practice to secure all traffic to the reconfiguration and status APIs and the dashboard, which we do here by granting access only to users on internal IP addresses in the range 192.168.100.0 to 192.168.100.255. For stronger security, use client certificates, HTTP Basic authentication, or the Auth Request module to integrate with external authorization systems like LDAP.

# In the HTTP context
server {
    listen 8080;
    allow 192.168.100.0/24;
    deny all;

    location = / {
        return 301 /status.html;
    }

    location /upstream_conf {
        upstream_conf;
    }

    location = /status.html {
        root /usr/share/nginx/html;
    }

    location /status {
        status;
    }
}
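
As a quick check of the address range mentioned above, the following Python sketch uses the standard ipaddress module to show which clients a rule such as allow 192.168.100.0/24 admits. It illustrates CIDR matching only, not how NGINX Plus evaluates access rules internally, and the function name is_allowed is our own.

```python
import ipaddress

# The network from the "allow" directive in the configuration above.
ALLOWED = ipaddress.ip_network("192.168.100.0/24")

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside the allowed /24 network."""
    return ipaddress.ip_address(client_ip) in ALLOWED

# First and last addresses covered by the /24, and the size of the range.
first, last = ALLOWED[0], ALLOWED[-1]
print(first, last, ALLOWED.num_addresses)
```

A client at 192.168.100.42 is inside the range, while one at 192.168.101.1 falls through to deny all.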

With this configuration in place, the base command for the API commands in this article is

http://localhost:8080/upstream_conf?upstream=demoapp
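
Because every API call in this article is the base command plus a few appended parameters, a small helper can keep scripts readable. The following is a minimal Python sketch; the helper name api_url is hypothetical, and it only builds the URL string (sending it with curl or an HTTP client is left to the reader).

```python
# Base command from the article; adjust host, port, and upstream name as needed.
BASE = "http://localhost:8080/upstream_conf?upstream=demoapp"

def api_url(*params, base=BASE):
    """Append parameters such as "add=", "server=...", or "down=" to the base command."""
    return base + "".join("&" + p for p in params)

# Example: the command that adds a server in the down state.
print(api_url("add=", "server=172.16.211.83:80", "down="))
```

Passing no parameters returns the base command itself, which lists the servers in the upstream group.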

Using a Single Upstream Group

When we use a single upstream group, NGINX Plus forwards requests to the same upstream group for the new application version as for the old. One advantage is that we can continue to monitor the same upstream group name throughout the upgrade process. There are a few approaches to upgrading with a single upstream:

Using the API when Overlap Is Acceptable

If it is acceptable for clients to reach either version of the application for a short time, we follow these steps:

  1. Add new servers to the upstream group by sending an HTTP command with the &add= parameter for each one. By default they are immediately accessible to clients, but we can include the &down= parameter until we are ready to bring them online, at which point we run a command with the &up= parameter.

    In this example we add two servers which are running the new application version by appending these strings to the base command:

    ...&add=&server=172.16.211.83:80&down=
    ...&add=&server=172.16.211.84:80&down=
  2. Send the base command to see the IDs assigned to the new servers (as well as the IDs of the old servers):

    http://localhost:8080/upstream_conf?upstream=demoapp
    server 172.16.210.81:80; # id=0
    server 172.16.210.82:80; # id=1
    server 172.16.211.83:80 down; # id=2
    server 172.16.211.84:80 down; # id=3
  3. Mark the new servers up and the old servers down by appending their IDs and the appropriate parameter to the base command:

    ...&id=2&up=
    ...&id=3&up=
    ...&id=0&down=
    ...&id=1&down=
  4. Monitor the old servers and when their connection counts are zero, remove them from the upstream group.

    ...&id=0&remove=
    ...&id=1&remove=
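
The steps above depend on reading server IDs from the listing, which is easy to get wrong by hand. Here is a rough Python sketch that parses a listing and builds the Step 3 commands. It assumes the listing has exactly the line format shown in Step 2 (no weight or other parameters), and the names parse_listing and swap_urls are our own.

```python
import re

BASE = "http://localhost:8080/upstream_conf?upstream=demoapp"

# One line of the listing, e.g. "server 172.16.211.83:80 down; # id=2"
LINE = re.compile(r"server\s+(\S+?)(\s+down)?;\s*#\s*id=(\d+)")

def parse_listing(text):
    """Map each server address to its (id, is_down) pair."""
    servers = {}
    for m in LINE.finditer(text):
        servers[m.group(1)] = (int(m.group(3)), m.group(2) is not None)
    return servers

def swap_urls(listing, new_servers):
    """URLs for Step 3: mark the new servers up, then the old servers down."""
    servers = parse_listing(listing)
    new_ids = [servers[s][0] for s in new_servers]
    old_ids = [i for s, (i, _) in servers.items() if s not in new_servers]
    return ([f"{BASE}&id={i}&up=" for i in new_ids] +
            [f"{BASE}&id={i}&down=" for i in old_ids])

listing = """server 172.16.210.81:80; # id=0
server 172.16.210.82:80; # id=1
server 172.16.211.83:80 down; # id=2
server 172.16.211.84:80 down; # id=3"""

urls = swap_urls(listing, ["172.16.211.83:80", "172.16.211.84:80"])
```

Generating the new-server commands first preserves the order used in Step 3, so capacity is added before it is taken away.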

Using the API when Overlap Is Not Acceptable

If it is not acceptable for clients to reach both old and new servers during the upgrade, one way to avoid it is by sending the API requests in reverse order, first marking the old servers as down and then adding the new servers. But there are two downsides to this approach:

  • Between the time that the last old server is marked down and the first new server is added, no servers are available and client requests are rejected with an error.
  • As we mark the old servers as down, capacity decreases. If the system is under heavy load, it might not be able to handle the load, causing errors until all the new servers are online.

To get around these downsides, we can use server weights to reduce – and in all likelihood eliminate – requests to the old servers even though they’re still up. For a complete discussion of weights, see Choosing an NGINX Plus Load‑Balancing Technique.

We still bring the new servers up before taking the old servers down and offline, as in the previous section, but we set very high weights on the new servers to make it very unlikely the old servers will receive any requests before we take them down. If the old servers don’t already have the default weight of 1, we start by resetting their weights to that value.

The following commands use the same ID numbers for servers as the previous example. As before, we’re showing only the string to append to the base command.

  1. Set the weights on the current (old) servers back to the default of 1, if that is not already the value:

    ...&id=0&weight=1
    ...&id=1&weight=1
  2. Set the weights on the new servers to a very high value as they are added in the down state:

    ...&add=&server=172.16.211.83:80&down=&weight=100000
    ...&add=&server=172.16.211.84:80&down=&weight=100000
  3. Mark the new servers as up and the old servers as down:

    ...&id=2&up=
    ...&id=3&up=
    ...&id=0&down=
    ...&id=1&down=
  4. When there are no connections to the old servers, remove them from the upstream group:

    ...&id=0&remove=
    ...&id=1&remove=
  5. Set the weights on the new servers to lower values. The following commands assume they have equal capacity, and return each one’s weight to 1.

    ...&id=2&weight=1
    ...&id=3&weight=1
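
To get a feel for how unlikely it is that an old server receives a request during the overlap, we can work out the expected traffic share from the weights. This Python sketch assumes requests are distributed in proportion to weight, which is a simplification of weighted load balancing; the function name expected_share is hypothetical.

```python
def expected_share(weights, group):
    """Fraction of requests expected to reach the servers in `group`,
    assuming requests are distributed in proportion to weight."""
    total = sum(weights.values())
    return sum(weights[s] for s in group) / total

# Weights while all four servers are up (between Step 3's "up" and "down" commands).
weights = {
    "172.16.210.81:80": 1,       # old
    "172.16.210.82:80": 1,       # old
    "172.16.211.83:80": 100000,  # new
    "172.16.211.84:80": 100000,  # new
}

old_share = expected_share(weights, ["172.16.210.81:80", "172.16.210.82:80"])
print(f"{old_share:.6%}")
```

With these numbers the old servers together are expected to see about 2 of every 200,002 requests, so a brief overlap window is very unlikely to route a client to the old version.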

Using Semaphore Health Checks

We recommend the configuration API as the cleanest and most efficient way to manage upstream groups, but your existing infrastructure might already rely on health checks for this. Here’s how to use health checks to switch from the old set of servers to the new set without overlap. The procedure assumes that the presence or absence of a file called healthcheck.html causes health checks to succeed or fail, respectively.

In the following steps we use the API to change the upstream servers (add and remove them, and set weights), but we could also use the live activity monitoring dashboard, or manually edit and reload the configuration file.

  1. Before adding the new servers to the upstream group, rename the health‑check file on each one (to fail-healthcheck.html, for example), so that when the new server is added it fails the health check and NGINX Plus takes it out of the load‑balancing rotation.
  2. Set the weights on the current (old) servers to very high values.

    By default, NGINX Plus immediately starts sending traffic to a newly added server. It also immediately sends a health check, but if the system is under heavy load and all servers have equal weights, NGINX Plus might send requests to a new server during the time it takes for the health check to complete. By setting high weights on the old servers and leaving the weights on the new servers at the default of 1, we divert most traffic away from the new servers as they are added. The appropriate weight to set on the old servers depends on the amount of load and can be found with testing.

    ...&id=0&weight=100000
    ...&id=1&weight=100000
  3. Add the new servers, and use the NGINX Plus live activity monitoring dashboard or status API to verify they are not receiving traffic (because they are marked as unhealthy).

    ...&add=&server=172.16.211.83:80
    ...&add=&server=172.16.211.84:80
  4. Reduce the weights of the current (old) servers back to their previous values.

    ...&id=0&weight=previous-value
    ...&id=1&weight=previous-value
  5. Set the weights on the new servers to very high values.

    ...&id=2&weight=100000
    ...&id=3&weight=100000
  6. Rename the health‑check files on the new servers to healthcheck.html so that they pass the health checks and start to receive traffic. Because their weights are so high compared to the old servers, NGINX Plus sends most traffic to them.
  7. Rename the health‑check files on the old servers to fail-healthcheck.html, so that they fail the health checks.
  8. Once the old servers have no active connections (as verified with the NGINX Plus live activity monitoring dashboard or status API), we can remove them from the upstream group.

    ...&id=0&remove=
    ...&id=1&remove=
  9. Reduce the weights on the new servers to their normal values (here we use the default of 1).

    ...&id=2&weight=1
    ...&id=3&weight=1
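
The renaming in Steps 1, 6, and 7 happens on the backend servers themselves and is easy to script. The following Python sketch shows one way to toggle the semaphore file; it assumes the file lives directly in the server's document root, and the function name set_health and the temporary directory used for the demonstration are our own.

```python
import os
import tempfile

PASS_NAME = "healthcheck.html"
FAIL_NAME = "fail-healthcheck.html"

def set_health(docroot, healthy):
    """Rename the semaphore file so health checks pass (True) or fail (False).

    Returns True if the desired file name is in place afterward.
    """
    want = PASS_NAME if healthy else FAIL_NAME
    other = FAIL_NAME if healthy else PASS_NAME
    src = os.path.join(docroot, other)
    if os.path.exists(src):
        os.rename(src, os.path.join(docroot, want))
    return os.path.exists(os.path.join(docroot, want))

# Demonstration in a throwaway directory standing in for a document root.
docroot = tempfile.mkdtemp()
open(os.path.join(docroot, PASS_NAME), "w").close()

failing = set_health(docroot, healthy=False)  # as in Step 1 or Step 7
passing = set_health(docroot, healthy=True)   # as in Step 6
```

Calling the function when the file already has the desired name is harmless, which makes it safe to rerun from a deployment script.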

Using a Version Number in the Health Check

So far we’ve been relying on the default criterion for a successful health check – that the resource is returned with status code 2xx or 3xx – but we can define many other kinds of additional or alternative requirements for a health check to succeed.

We’ll take advantage of this feature to control the upgrade to a new application version, by requiring a specific version number in the body of the page returned by the server. In this example, we use the string Version: x.0 for the success criterion as we upgrade from version 1.0 to version 2.0, but you can define any text string you want.

To start, we add a match configuration block in the http context, to define the two criteria for a successful health check: the server returns a page with status 200, and the page includes the string Version: 1.0.

# In the HTTP context
match matchstring {
    status 200;
    body ~ "Version: 1.0";
}

We also modify the health_check directive in the first server block to refer to the match conditions:

# In the first server block
location @healthcheck {
    internal;
    proxy_pass http://demoapp;
    health_check uri=/healthcheck.html match=matchstring;
}

We don’t immediately reload the configuration, but instead do so as the second step in the following procedure. As in the other use cases, we’re using the API to modify the upstream servers, but we could instead use the live activity monitoring dashboard or manually edit and reload the configuration.

  1. Set the version string in the healthcheck.html file on each current (old) server to Version: 1.0.
  2. Reload the configuration.
  3. Before adding the new servers to the upstream group, set the version string in the healthcheck.html file on each one to Version: 2.0, so that when the new server is added it fails the health check and NGINX Plus takes it out of the load‑balancing rotation.
  4. Add the new servers to the upstream group.

    ...&add=&server=172.16.211.83:80
    ...&add=&server=172.16.211.84:80
  5. When ready to do the upgrade, change the string in the match block to Version: 2.0.

    # In the HTTP context
    match matchstring {
        status 200;
        body ~ "Version: 2.0";
    }
  6. Reload the configuration. The new servers now pass the health check while the old servers fail it.
  7. Once the old servers have no active connections (as verified with the NGINX Plus live activity monitoring dashboard or status API), remove them from the upstream group.

    ...&id=0&remove=
    ...&id=1&remove=

For the brief time until the old servers fail their health checks, client requests can be sent to either an old or a new server. To skew the request distribution to the new servers, in Step 4 set high weights on the new servers and also set low weights on the old servers if they are not already low. After Step 7, when the old servers have been removed, reduce the weights on the new servers to the appropriate values. For detailed instructions for setting weights, see the previous section, Using Semaphore Health Checks.
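
To make the two match conditions concrete, here is a small Python sketch that mimics them: the status must equal 200 and the regular expression must be found somewhere in the body. It is an illustration only, not the NGINX Plus implementation, and the function name and sample pages are hypothetical.

```python
import re

def passes_health_check(status, body, pattern, expected_status=200):
    """Mimic the two match conditions: exact status and a regex found in the body."""
    return status == expected_status and re.search(pattern, body) is not None

# Sample health-check pages for an old and a new server.
old_page = "<html><body>Version: 1.0</body></html>"
new_page = "<html><body>Version: 2.0</body></html>"

# Before the cutover the match block requires "Version: 1.0".
print(passes_health_check(200, old_page, "Version: 1.0"))
print(passes_health_check(200, new_page, "Version: 1.0"))
```

Note that the body condition is a regular expression, so the dot in Version: 1.0 matches any character; escaping it (Version: 1\.0) makes the check stricter.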

Using a New Upstream Group

Now we look at options that utilize a new upstream group for the new servers. Compared to a single upstream group, we get more flexibility and can cut over to all the new servers at the same time. A downside is that we have to reconfigure our monitoring tools to direct them to the new upstream group. Again, there are a few approaches to choose from:

Doing a Simple Cutover

Except for the need to change which servers are monitored, a cutover is definitely the cleanest way to migrate to a new application version.

  1. Edit the configuration we created in Base Configuration for the Use Cases, creating a new upstream group of servers (demoapp-v2) that are running the new application version.

    # In the HTTP context
    upstream demoapp {
        zone demoapp 64k;
        server 172.16.210.81:80;
        server 172.16.210.82:80;
    }

    upstream demoapp-v2 {
        zone demoapp-v2 64k;
        server 172.16.210.83:80;
        server 172.16.210.84:80;
    }

  2. Change the status_zone and proxy_pass directives in the first server block to point to the new upstream group (demoapp-v2).

    # In the HTTP context
    server {
        listen 80;
        status_zone demoapp-v2;

        location / {
            proxy_pass http://demoapp-v2;
        }

        location @healthcheck {
            internal;
            proxy_pass http://demoapp-v2;
            # The base configuration defines no match block, so the default
            # success criterion (2xx or 3xx) applies here.
            health_check uri=/healthcheck.html;
        }
    }

  3. Reload the configuration. NGINX Plus immediately starts directing client traffic to the new servers.
  4. When the old servers no longer have any active connections, take them offline and (optionally) remove the upstream group from the configuration.
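
Deciding when the old servers have no active connections can be automated by polling the status API. The Python sketch below covers only the decision logic. It assumes each peer is reported as a record with a server address and an active connection count; the field names "server" and "active", the function name drained, and the sample data are assumptions to verify against the status output of your NGINX Plus release.

```python
def drained(peers, old_servers):
    """True when none of the old servers has an active connection."""
    return all(p["active"] == 0 for p in peers if p["server"] in old_servers)

OLD = {"172.16.210.81:80", "172.16.210.82:80"}

# Hypothetical snapshots of per-peer status data.
still_busy = [
    {"server": "172.16.210.81:80", "active": 3},
    {"server": "172.16.210.82:80", "active": 0},
    {"server": "172.16.210.83:80", "active": 12},
]
idle = [
    {"server": "172.16.210.81:80", "active": 0},
    {"server": "172.16.210.82:80", "active": 0},
    {"server": "172.16.210.83:80", "active": 12},
]

print(drained(still_busy, OLD), drained(idle, OLD))
```

Connections to the new servers are ignored, so the check becomes true as soon as the old servers finish their in-flight requests.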

Doing a Dark Launch

Sometimes it’s safest to test the new version of an application on a small set of users to see how it performs in production, then gradually ramp up the proportion of traffic to the new servers until eventually all traffic is going to them. The split-clients feature in NGINX Plus (and NGINX) is perfect for this.

The split_clients configuration block directs fixed percentages of traffic to different upstream groups. In this example we start by directing 5% of the incoming requests to the new upstream group. If all goes well we can increase to 10%, then to 20%, and so on. When we decide it’s safe to move completely to the new version, we simply remove the split_clients block and change the proxy_pass directive to point to the new upstream group.

Note that this method is not compatible with session persistence, which requires that NGINX Plus direct traffic from a particular client to the same server that processed the client’s first request. The split_clients directive sends a strict proportion of traffic to each upstream group without considering its source, so it might send a client request to an upstream group that doesn’t include the correct server.

  1. Create a new upstream group, demoapp-v2, for the new application version (as in the previous section).

    # In the HTTP context
    upstream demoapp {
        zone demoapp 64K;
        server 172.16.210.81:80 slow_start=30s;
        server 172.16.210.82:80 slow_start=30s;
    }

    upstream demoapp-v2 {
        zone demoapp-v2 64K;
        server 172.16.210.83:80 slow_start=30s;
        server 172.16.210.84:80 slow_start=30s;
    }

  2. In the first server block we created in Base Configuration for the Use Cases, change the proxy_pass directive to use a variable to represent the upstream group name instead of a literal like demoapp (the variable gets set in the split_clients block, which we define in the next step).

    # In the first server block
    location / {
        proxy_pass http://$app_upstream;
    }
  3. Add a split_clients block in the http context. Here we tell NGINX Plus to set the variable $app_upstream to demoapp-v2 for 5% of incoming requests and to demoapp for all remaining requests. The variable value is passed to the proxy_pass directive (defined in Step 2) to determine which upstream group the request goes to.

    The first parameter to split_clients defines the request characteristics that are hashed to determine how requests are distributed, here the client IP address ($remote_addr) and port ($remote_port).

    # In the HTTP context
    split_clients $remote_addr$remote_port $app_upstream {
        5% demoapp-v2;
        * demoapp;
    }
  4. Previously we mentioned that in some cases the health check must be defined in a location block separate from the one for regular traffic, and this is such a case. NGINX Plus sets up health checks as it initializes and must know at that point which upstream groups it will send health checks to. When the configuration uses a runtime variable to select the upstream group, as in this case, NGINX Plus can’t determine the upstream group names. To provide the needed information at initialization, we create a separate location block for each upstream group that explicitly names it. In the current case, we have two upstream groups, so for each we have a location block in the server block.

    # In the first server block
    location @healthcheck {
        internal;
        proxy_pass http://demoapp;
        health_check uri=/healthcheck.html match=matchstring-v1;
    }

    location @healthcheck-v2 {
        internal;
        proxy_pass http://demoapp-v2;
        health_check uri=/healthcheck.html match=matchstring-v2;
    }

  5. In the http context we add a match block to define the match conditions for each health check.

    # In the HTTP context
    match matchstring-v1 {
        status 200;
        body ~ "Version: 1.0 Status: OK";
    }

    match matchstring-v2 {
        status 200;
        body ~ "Version: 2.0 Status: OK";
    }
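
To see how hashing the client address and port yields a fixed traffic split, consider this Python simulation of the idea behind the split_clients block. NGINX hashes the string with MurmurHash2; the sketch substitutes the first four bytes of an MD5 digest purely for illustration, so individual decisions will differ from those NGINX makes. The function name and the generated client list are hypothetical.

```python
import hashlib

def pick_upstream(remote_addr, remote_port, percent_v2=5.0):
    """Choose an upstream group from a hash of the client address and port."""
    key = f"{remote_addr}{remote_port}".encode()
    # Stand-in hash: first 4 bytes of MD5 as a 32-bit integer (NOT NGINX's hash).
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    # Hash values below the threshold fall into the first percent_v2 of the range.
    threshold = int(0xFFFFFFFF * percent_v2 / 100)
    return "demoapp-v2" if h < threshold else "demoapp"

# 10,000 made-up client address and port pairs.
clients = [(f"10.0.{i // 250}.{i % 250 + 1}", 1024 + i) for i in range(10000)]
picks = [pick_upstream(addr, port) for addr, port in clients]
share_v2 = picks.count("demoapp-v2") / len(picks)
print(f"share sent to demoapp-v2: {share_v2:.1%}")
```

The same address and port always hash to the same group, but a client that reconnects from a new port may land in the other group, which is why this method does not combine with session persistence.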

Scheduling the Launch

With just a bit of Lua scripting we can schedule an upgrade for a specific time. Once we have set up the new upstream group, the script returns a different upstream group name depending on the system time: the old upstream group name before the cutover time and the new upstream group name afterward.

Using the same upstream groups as in Doing a Dark Launch, we can add the following Lua script to the main location block ( / ) to make the cutover happen at 10:00 pm local time on June 21, 2016. All requests received prior to that time are sent to the demoapp upstream group and all requests received at or after that time are sent to the demoapp-v2 upstream group.

# In the first server block
location / {
    rewrite_by_lua '
        if ngx.localtime() >= "2016-06-21 22:00:00" then
            ngx.var.app_upstream = "demoapp-v2"
        else
            ngx.var.app_upstream = "demoapp"
        end
    ';

    proxy_pass http://$app_upstream;
}
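
The Lua script compares timestamps as plain strings, which works because the fixed-width format (year, month, day, hour, minute, second) sorts in chronological order. The Python sketch below reproduces the decision so the boundary behavior can be checked offline; the function name pick_upstream is hypothetical.

```python
from datetime import datetime

CUTOVER = "2016-06-21 22:00:00"

def pick_upstream(local_time):
    """Same decision as the Lua snippet: compare timestamps as strings."""
    return "demoapp-v2" if local_time >= CUTOVER else "demoapp"

# The current local time in the same "YYYY-MM-DD HH:MM:SS" format.
now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print(now, pick_upstream(now))
```

A request arriving one second before the cutover goes to demoapp, and one arriving exactly at the cutover time goes to demoapp-v2.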

Controlling Access to the New Version Based on Client IP Address

In Doing a Dark Launch, we covered one way to test a new application with a small number of users before opening it to everyone. Here we select a small number of users based on their IP address and allow only them access to the new application version. Specifically, we set up a map block that sets the upstream group name based on the $remote_addr variable, which contains the client IP address. We can specify a single client IP address or a range of IP addresses.

As an example, using the same upstream groups described in Doing a Dark Launch, we create a regular expression to direct requests from internal IP addresses in the range between 172.16.210.1 and 172.16.210.19 to the demoapp-v2 upstream group (where the servers are running the new application version) while sending all other requests to the demoapp upstream group:

# In the HTTP context
map $remote_addr $app_upstream {
    # Final octet 1 through 19
    ~^172\.16\.210\.([1-9]|1[0-9])$ demoapp-v2;
    default demoapp;
}

As before, the value of the $app_upstream variable is passed to the proxy_pass directive in the first server block, and so determines which upstream group receives the request.

# In the first server block
location / {
    proxy_pass http://$app_upstream;
}
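
Address-range regular expressions are easy to get subtly wrong, so it is worth testing one before deploying it. The Python sketch below checks a pattern that admits only final octets 1 through 19 in the 172.16.210 network, the range described in the text. The function name upstream_for is hypothetical, and Python's re module stands in for the regex engine NGINX uses.

```python
import re

# Final octet 1-9 or 10-19, anchored at both ends.
PATTERN = re.compile(r"^172\.16\.210\.([1-9]|1[0-9])$")

def upstream_for(remote_addr):
    """Return the upstream group a client address would be mapped to."""
    return "demoapp-v2" if PATTERN.search(remote_addr) else "demoapp"

for addr in ["172.16.210.1", "172.16.210.19", "172.16.210.20", "172.16.211.5"]:
    print(addr, upstream_for(addr))
```

Checking the boundary addresses (1, 19, and 20) confirms that the pattern covers exactly the intended range.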

Conclusion

NGINX Plus provides operations and DevOps engineers with several options for upgrading the version of an application on a group of upstream servers while continuing to provide a good customer experience by avoiding downtime.

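Check out the other two articles in this series: the first article, which describes the two NGINX Plus features used for upgrades with zero downtime, and Using NGINX Plus for Backend Upgrades with Zero Downtime, Part 2 – Individual Servers.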

Try NGINX Plus out for yourself and see how it makes upgrades easier and more efficient – start a 30-day free trial today or contact us for a live demo.
