In October 2016, the Court of Justice of the European Union ruled that IP addresses are “personal information” and as such fall under the Data Protection Directive and General Data Protection Regulation (GDPR). For many website owners, this presents challenges for archiving and analyzing log files if the data leaves the EU. For data moving into the US, the EU‑US Privacy Shield provides some protection but faces legal challenges by privacy groups and governments who do not believe the level of protection is adequate.
However, protecting personal data in log files is not just an EU problem. For organizations with security certifications like ISO/IEC 27001, moving log files outside of the security realm where they were generated, say from network ops to marketing, can compromise the scope and compliance of the certification.
In this blog post, we describe some simple solutions for sanitizing NGINX Plus and NGINX log files so that they can be safely exported without exposing what is often called personally identifiable information (PII).
This post has been updated to use the js_import directive.
Simple Approaches Do Not Work
The simplest approach to personal data protection is to strip IP addresses from logs before they are exported. This is easy to achieve with standard Linux command line tools, but log analysis systems may expect log files in a standard format and fail to import logs that omit the IP address field. Even if logs are successfully imported, the value of log processing may be significantly reduced if the analysis system relies on IP addresses to track a user across a site.
Another potential approach, substituting fake or random values for real IP addresses, results in log files that look complete, but the quality of log analysis is compromised because each log entry appears to originate from a different randomly generated IP address.
Masking the Client IP Address
The most effective solution is to use a technique called data masking to transform the real IP address into one that does not identify the end user but still allows correlation of website activity for a particular user. Data masking algorithms always produce the same pseudorandom value for a given input value in a way that ensures it cannot be converted back to the original input value. Every occurrence of an IP address is always transformed to the same pseudorandom value.
NGINX and NGINX Plus Configuration for IP Address Masking
The log_format directive controls which information appears in the access logs. NGINX and NGINX Plus ship with a default log format called combined, which produces log files that can be processed by most log‑processing tools.
For this configuration we create a new log format, masked, which is identical to the combined format except for the first field, where the $remote_addr variable is replaced with a masked value.
Notice that we use two access_log directives. The first uses the default log format to produce access logs that can be used by administrators for operational purposes. The second specifies the masked log format. With this configuration we write two access logs for each request: one for sysadmins and DevOps, and one for export.
The location block defines a very simple response using the return directive to show that data masking is working. In production, this would most likely contain a proxy_pass directive to direct requests to a backend server.
The essence of the data masking solution is to use a one‑way hashing algorithm to transform the client IP address. In this example we are using the FNV‑1a hash algorithm, which is compact, fast, and has reasonably good distribution characteristics. Its other advantage is that it returns a positive 32‑bit integer (the same size as an IPv4 address), which makes it trivial to present as an IP address.
The i2ipv4 function converts a 32‑bit integer to an IPv4 address in quad‑dotted notation. It takes the hashed values from fnv32a() and provides a representation that “looks right” in our access log. Both IPv6 addresses and IPv4 addresses are represented in IPv4 format.
Finally, we have the maskRemoteAddress function, which is referenced by the js_set directive in the NGINX and NGINX Plus configuration above. It has a single parameter, the request object, whose remoteAddress property contains the value of the client IP address.
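The three functions described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the original post: it assumes the JavaScript runtime provides Math.imul for 32‑bit multiplication, and the fakeRequest object at the end is a hypothetical stand‑in for the request object that NGINX passes to a js_set handler. A file loaded with js_import would also need to export maskRemoteAddress, which is omitted here so that the sketch runs unchanged under Node.js.

```javascript
// FNV-1a, 32-bit: XOR each character code into the hash, then multiply
// by the FNV prime (16777619), keeping only the low 32 bits.
function fnv32a(str) {
  var hval = 2166136261; // FNV-1a 32-bit offset basis
  for (var i = 0; i < str.length; i++) {
    hval ^= str.charCodeAt(i);
    hval = Math.imul(hval, 16777619); // assumption: Math.imul is available
  }
  return hval >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Present a 32-bit integer in quad-dotted notation, one octet per byte.
function i2ipv4(i) {
  return [
    (i >>> 24) & 255,
    (i >>> 16) & 255,
    (i >>> 8) & 255,
    i & 255
  ].join(".");
}

// Handler for js_set: hash the client address and format it like an IP.
function maskRemoteAddress(r) {
  return i2ipv4(fnv32a(r.remoteAddress));
}

// Hypothetical stand-in for the request object, for local experimentation.
var fakeRequest = { remoteAddress: "127.0.0.1" };
console.log(fakeRequest.remoteAddress + " -> " + maskRemoteAddress(fakeRequest));
```

Because the hash depends only on the input string, repeated requests from the same client always map to the same masked address, and an IPv6 client address is hashed to an IPv4‑style value in exactly the same way.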
IP Address Masking in Action
With the above configuration in place, we can make a simple request to our server and check the response and the resultant access log entries.
$ curl http://localhost/
127.0.0.1 -> 220.127.116.11
$ sudo tail --lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <==
127.0.0.1 - - [16/Mar/2017:19:08:19 +0000] "GET / HTTP/1.1" 200 26 "-" "curl/7.47.0"
==> /var/log/nginx/access_masked.log <==
18.104.22.168 - - [16/Mar/2017:19:08:19 +0000] "GET / HTTP/1.1" 200 26 "-" "curl/7.47.0"
Masking Personal Data in the Query String
NGINX and NGINX Plus Configuration for Query String Masking
Like the default combined format, the masked log format defined above for IP address masking logs the $request variable, which captures three components of a request: HTTP method, URI (including query string), and HTTP version. We need to mask only the query string, so in the interests of code efficiency we use a separate variable for each of the three components, transforming only the request URI (second component) with the $request_uri_masked variable and using standard variables (such as $server_protocol) for the first and third components.
The server block requires another js_set directive to define how the $request_uri_masked variable is evaluated.
We add the maskRequestURI function to the JavaScript code. It depends on the fnv32a hashing function and so appears below it in the file.
The maskRequestURI function iterates through each key‑value pair in the query string, looking for specific keys that are known to contain personal data. For each of these keys, the value is transformed to a masked value.
Depending on the type of processing to be carried out on NGINX and NGINX Plus log files, the masked query string values may need to resemble genuine data. In the example above, for instance, the masked zip value is formatted as five digits.
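A query string masking function along these lines might look like the sketch below. It is an illustration under stated assumptions, not the post's original code: the MASKERS table, the email key, and the @example.com placeholder domain are hypothetical choices, the fakeRequest object stands in for the real request object, and r.uri and r.variables.query_string are assumed to be the request properties that hold the URI path and the raw query string.

```javascript
// FNV-1a, 32-bit (repeated here so the sketch is self-contained).
function fnv32a(str) {
  var hval = 2166136261; // FNV-1a 32-bit offset basis
  for (var i = 0; i < str.length; i++) {
    hval ^= str.charCodeAt(i);
    hval = Math.imul(hval, 16777619); // assumption: Math.imul is available
  }
  return hval >>> 0;
}

// Hypothetical table: keys assumed to carry personal data, each mapped to
// a formatter that makes the masked value resemble genuine data.
var MASKERS = {
  // Always five digits, zero-padded on the left.
  zip: function (value) {
    return ("00000" + (fnv32a(value) % 100000)).slice(-5);
  },
  // Keeps the shape of an email address; the domain is a placeholder.
  email: function (value) {
    return fnv32a(value) + "@example.com";
  }
};

// Handler for js_set: rebuild the URI, masking only the listed keys.
function maskRequestURI(r) {
  var query = r.variables.query_string;
  if (!query) {
    return r.uri; // no query string, nothing to mask
  }
  var pairs = query.split("&").map(function (pair) {
    var eq = pair.indexOf("=");
    if (eq === -1) {
      return pair; // bare key without a value
    }
    var key = pair.slice(0, eq);
    var value = pair.slice(eq + 1);
    var masker = Object.prototype.hasOwnProperty.call(MASKERS, key) ? MASKERS[key] : null;
    return masker ? key + "=" + masker(value) : pair;
  });
  return r.uri + "?" + pairs.join("&");
}

// Hypothetical stand-in for the request object, for local experimentation.
var fakeRequest = { uri: "/", variables: { query_string: "zip=90210&page=2" } };
console.log(maskRequestURI(fakeRequest));
```

Keeping the per‑key formatters in one table means that supporting another sensitive key only requires a new entry, and keys that are not listed pass through unchanged.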
Query String Masking in Action
With these additions to our configuration, we can see query string masking in action.
$ curl "http://email@example.com"
127.0.0.1 -> 22.214.171.124
$ sudo tail --lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <==
127.0.0.1 - - [16/Mar/2017:20:05:55 +0000] "GET /firstname.lastname@example.org HTTP/1.1" 200 26 "-" "curl/7.47.0"
==> /var/log/nginx/access_masked.log <==
126.96.36.199 - - [16/Mar/2017:20:05:55 +0000] "GET /email@example.com HTTP/1.1" 200 26 "-" "curl/7.47.0"
Install the prebuilt package.
For Ubuntu and Debian systems:
$ sudo apt-get install nginx-module-njs
For RedHat, CentOS, and Oracle Linux systems:
$ sudo yum install nginx-module-njs
Enable the module by including a load_module directive for it in the top‑level ("main") context of the nginx.conf configuration file:
load_module modules/ngx_http_js_module.so;
load_module modules/ngx_stream_js_module.so;
$ sudo nginx -s reload
If you prefer to compile an NGINX module from source:
- Copy the module binaries (ngx_http_js_module.so, ngx_stream_js_module.so) to the modules subdirectory of the NGINX root (usually /etc/nginx/modules).