Building Business Systems with Domain-Specific Languages for NGINX & OpenResty, Part 2

This post is adapted from a presentation at nginx.conf 2016 by Yichun Zhang, Founder and CEO of OpenResty, Inc. This is the second of two parts of the adaptation. In Part 1, Yichun described OpenResty’s capabilities and went over web application use cases built atop OpenResty. In this part, Yichun looks at what a domain-specific language is in more detail.

You can view the complete presentation on YouTube.

Table of Contents

26:17 A Web Platform As A Virtual Machine
30:10 Inventing the LZSQL Language
37:15 Writing Programs to Write Programs to Write Programs
37:51 Test Scaffold
38:26 How About Tests?
39:18 The OpenResty Model Language
39:55 The OpenResty View Language
40:35 The OpenResty Controller Language
44:03 WAF = Hot
44:20 ModSecurity = a Horrible DSL
45:55 Model, View, Controller
46:33 SportLang
47:34 The Y Language
47:44 CoffeeScript
48:04 A Meta DSL
48:35 Clean Separation Between Business Representation and Business Implementation
49:24 Compiling-Style Web Frameworks
49:52 The Best Language
50:28 The Machine Truly Understands Business Logic

26:17 A Web Platform As A Virtual Machine

Back to our main topic: we’re talking about DSLs built atop OpenResty. We’ve been talking about many OpenResty features, some of which are very new.

My point is that OpenResty can be used as a virtual machine, just like JVM. But it can be more powerful and more Web-oriented.

The experiment was actually done six to seven years ago. It’s nothing new, but I love experimenting.

I worked for Taobao, one of the largest C2C ecommerce websites in China, which is a subcompany of the Alibaba Group. I think most of you have heard of Alibaba. I worked on the data analytics product of Taobao whose customers are Taobao’s merchants, like eBay’s merchants are sellers.

They wanted a data analytics product to analyze the traffic to their shops, and also the effects of ad deployment as well as their sales. It was a big product.

This was the homepage of that product.

We have a lot of nifty charts, just like Google Analytics. But we have so many different reports, data reports.

The data volume is huge because of the size of Taobao.

I experimented on the client side, the rich Internet application approach. It was seven years ago. Basically, the whole application’s logic is reading each other’s script running in the web browser, like Gmail.

We also introduced a client-side template. We have a compiler that can generate JavaScript code from template files. We’ll talk about that later.

Also, we have web services to drive the client-side applications. Web services are the key. They’re the only things that run on the server side.

The web services emit JSON data to the client and JavaScript renders the page regions using the compiled version of the templates.

The whole server-side architecture of that data analytics product was like this: basically, OpenResty sits at the center between our backends and the web browsers, our customers.

The backends consist of a huge MySQL cluster because the data volume is so large. It’s almost 100 mils.

Also, there’s a real-time statistics cluster to provide real-time statistics.

And also, there’s Taobao’s open platform which is an open API that emits JSON. We also have a Memcached cluster and Tokyo Tyrant cluster for other metadata.

So, it’s much simpler than the previous version, which used PHP to run the whole product.

30:10 Inventing the LZSQL Language

I quickly realized the problem: I didn’t have enough men, I only had two intern students besides me. That was my whole team.

But we had to migrate the whole data analytics product, written in PHP, over to the new OpenResty platform. At the time OpenResty was still very young.

I forced my intern students to write in Lua directly. That’s a very complicated business log because of all the different dimensions of business logic in Taobao and very crazy business requirements.

I thought about it hard throughout the night and I invented a DSL (domain-specific language) for the whole system because I understood the core pattern or model of the data analytics product.

And to leverage that understanding, we can provide a natural way to express our business logic to the machines.

So think about it: what is programming? Programming is essentially an activity to communicate with machines, to make machines understand your intentions. And it runs your business fast, cheaply, and reliably.

The key is to increase the efficiency of communicating with machines.

If you can use two words or two sentences to convey an idea, then there’s no need to use ten solid lines of code. That’s crazy.

For example, I’m not a big fan of Java, obviously, because it’s so much code. And, for this case, Lua is not up to the task. Neither are similarly imperative languages.

So I invented my own language, my own way, to convey the idea to the machine. It’s based on SQL. Why? Because data analytics products are essentially based on the relational model whether or not you are using SQL databases.

We can define variables and user variables in SQL. They are first-class citizens. The SQL can be run remotely on some MySQL backends, or it can run in place inside NGINX. I implemented a SQL engine in-memory database with just a hundred lines of Lua. It works pretty well.

The complexity comes from the fact that data needs to be fetched from many different MySQL databases. Then we need to mash up the data relationally in memory and then assemble the final result set and emit it to the client.

There are many tricky parts because you have to decompose a SQL query and run it on many different nodes, and you also need to optimize the SQL queries automatically because the MySQL optimizer often can’t do a lot of of optimizations, especially for OLAP [online analytical processing] kinds of applications.

Essentially, we’re writing the business logic in LZSQL files. We use compilers to generate Lua code and, eventually, we publish our Lua code online without the compiler. That’s the beauty of compilation.

Basically, on the command line, we invoke the compiler to compile the LZSQL files and then we link it to the final Lua application.

The result is very impressive because the compiler can do a lot of optimizations that a human can’t normally do, or at least can’t do properly, correctly.

The old API was written in PHP and the new API was generated Lua code from my compilers. The latency dropped a lot – like 90% or something including MySQL latency here. It’s a whole API HTTP latency.

One caveat is that we’re using the standard Lua interpreter. We’ll see in the next chart.

Switching to LuaJIt gives a nice speed-up further. But still, in this chart, the LuaJIT is still using the interpreter only, not the JIT compiler.

The compiler was written in just 4000 lines of Perl. Not much, but it has very complicated optimizations and Type Check checking and everything and other context-sensitive analysis.

Basically, the compiler can pull itself a parser, an AST (Abstract Syntax Tree), and some optimizers, and a code emitter. Actually, it has several code emitters because it supports multiple backends.

Yeah, we can generate Lua code, but why not C code. We can also generate C code.

If you recall, one of our backends is actually a real-time database. It provides a very specific and complicated wire protocol which is very hard to manually craft a client for.

I got an idea, because this thing has very nice documentation, a wiki page: why not let the machine understand the wiki documentation page and generate an NGINX C module for us? That idea came into my mind and I quickly did an implementation.

Basically, I abstracted the wiki documentation with a very small DSL, because a wiki is not very precise. But it’s very similar to the original wiki page.

Then I wrote a very quick Perl script and a compiler to complete an NGINX C module that can talk to the real-time database nonblockingly. It’s an NGINX upstream module.

You can see that the documentation is just 300 lines of code, but the resulting C module code is more than 12,000 lines.

37:15 Writing Programs to Write Programs to Write Programs

From this example we can see that programming is about communication with machines. If your documentation is already concise and precise enough, then we can feed that to the machines directly with some guidance, maybe.

That can save all the headache of programming and implementations.

All this leads to the sentence: I’d rather write programs to write programs to write programs.

37:51 Test Scaffold

Our test scaffold is also based on DSLs. For example, we have a specification-like language syntax to describe test cases. So, this Test::Nginx::Socket is used by all OpenResty projects.

It doesn’t matter if you don’t know Perl because we are not asking you to use Perl. We provide a specification. Our services can also be tested in a similar way.

38:26 How About Tests?

Link from slide above: http://search.cpan.org/perldoc?Cheater

The next big thing: how about tests? For new products, there’s no data in the databases. But we have to test our SQL queries, our web pages, our services. So I wrote a DSL that looks very similar to SQL create table statements.

It’s called Cheater. Basically, you specify regular expressions for allowed value renders for table columns, and also specify the foreign key constraints between database tables. Then this tool will generate random data that satisfy all your constraints and requirements.

39:18 The OpenResty Model Language

Back to OpenResty: what can we learn from my experiment of 6 years ago? Basically, we can design and provide a model language that can do something similar; that you can use SQL in OpenResty and the compiler can figure out whether to make the running remotely or locally; to do the mash up, or discompose a SQL query because you’re running it up among many different databases, which are not necessarily relational databases.

39:55 The OpenResty View Language

Also, there’s an OpenResty View language, [View being part of] the MVC [model-view-controller] paradigm. And for View, we have template languages: for example, Perl’s TT2 template languages, which are my favorite. And Python has got Jinja2, for example.

Such DSLs can be compiled into client-side JavaScript or server-side Lua code. That’s the beauty of DSLs.

For example, we already have Jemplate to compile Perl’s TT2 template language into client-side JavaScript code and Lemplate for compiling down to OpenResty Lua code.

40:35 The OpenResty Controller Language

The Controller language is also a big thing.

It looks like this. Basically, it’s a rule-based language. You specify a bunch of rules.

The words on the left-hand side of the arrows are predicates, like conditions. The words on the right-hand side of the arrows are actions which perform actions like performing a redirect or returning an error page, right?

The predicates are side-effect free [do not cause an action to be performed], which is mandatory if the compiler is to be capable of doing some very aggressive optimizations like combining the predicates out of order, and for example, scanning, the URL just once and getting results from multiple regular expressions.

Most CDN business logic can be conveyed in this way. This is the essence of the CDN business model.

Different business models share some kind of intrinsic properties. This can be, maybe, formalized. For example, data analytics products share a common relational model, relational arithmetic. SQL happens to be a convenient way to convey that idea.

And for CDN or gateway-like or WAF-like business systems, I think the model is a rule-based model. And it’s theoretically a forward-chaining expert system model.

I’m a big fan of artificial intelligence. In high school I learned a lot about various branches of AI. Nowadays, machine learning is so hot, right? But actually, I think another branch, the expert system branch, can also be of great use in software engineering.

This is an example, I think: the syntax is based on the Prolog language, which is very popular in natural language processing, and the semantics is based on CLIPS. It’s a public demand system developed by NASA a few decades ago (in the 1970s, maybe).

We also can have response filters, which support multiple regular expressions, to do complicated substitutions, all in NGINX output filters. It’s also nonbuffered. So, constant buffer, infinite data stream processing: that’s pretty cool, right?

This example shows how to correctly remove all the comments from C++ source code. You can change this example to support, for example, CSS minification, and JavaScript minification arrays. So it will be based on sregex, for obvious reasons.

44:03 WAF = Hot

WAF is pretty hot. And yes, the company is launching a ModSecurity port for NGINX. In my opinion, WAF can be based on the controller language I just demonstrated.

44:20 ModSecurity = a Horrible DSL

ModSecurity is a horrible DSL. It’s a DSL, that’s the good thing. The bad thing is that it’s horrible. It’s so horrible that it makes my baby cry.

This is a rule from the core rule set of ModSecurity. It’s a single line – can you believe that? There’s a lot of line noise. And also, to be consistent with the Apache configuration syntax, they made the syntax of their WAF rule language very convoluted.

They also invented go-to actions like skip. They also have tricks to make it possible to express “if-else, if-else”, because the rule syntax is flat.

Why not create a new language like we did here?

The syntax is clean. For example, the done keyword can be used to “short circuit” processing: if the first rule matches, the second rule (and all subsequent rules) will be skipped.

So it’s essentially an “if-else”. It’s more natural and it doesn’t require crazily deep nesting of indentation.

If you look at VCL or some other CDN‘s business code, it’s all like a decision tree: “if-else, if-else, if-else”. Yes, it’s crazy.

45:55 Model, View, Controller

Speaking of the essential models, like theoretical models of different kinds of business systems, I can get a model language. It’s also quite general for data analytics businesses and many microservices.

And View: we already have many template languages. They’re DSLs.

And also, we have controller languages we’ve just shown that can be used to do very complicated dynamic routing and dispatching, and even WAF for validations.

46:33 SportLang

The sport-games industry could use their own sports language, their own language to describe the business systems. How could that be any cooler?

We, as professionals in the software industry, have no reason to force other industries – like physics, math, sports, or even philosophy – to use computer languages. That’s a shame, right?

Ideally, we should use domain-specific languages – their domains, their languages, their talks – and let the machine do the hard work.

Also, it opens a huge amount of possibilities for optimizations. Because not every programmer knows how to optimize things nicely.

47:34 The Y Language

We have the Y language for different debugging tools for different systems like GDB, like SystemTap, like LuaJIT.

47:44 CoffeeScript

Also, we can have CoffeeScript support, because CoffeeScript is a very popular maybe?) DSL overlay language that can generate client-side JavaScript. We can support a compiler that compiles CoffeeScript code into OpenResty Lua code.

48:04 A Meta DSL

We can also have a meta DSL that’s a DSL for creating all the DSLs including itself. We have a DSL specifically for creating compilers: DSL compilers, optimizing compilers. It would be great.

Perl is the best choice for me, but it’s not necessarily the best for everything. We can create a language specifically for compiler building.

48:35 Clean Separation Between Business Representation and Business Implementation

We can have clean separation between business representation and business implementation, which means that we can swap the underlying implementation overnight without touching even a single line of the business code. That would be crazy.

For example, suppose we are running business systems atop OpenResty. Maybe tomorrow we migrate the backend to the assembly code or C code to generate a pure C implementation, and without touching the business specification. That would be huge.

And migrating to a new technology stack in the future will not mean rewriting everything from scratch, but just writing a new backend optimizer and adding it to the existing DSL compiler.

49:24 Compiling-Style Web Frameworks

Also, we can have brand new kinds of web application frameworks. They should be compiling-style instead of adding up layers and layers and layers which makes the whole thing run very slowly and clumsily.

And we can achieve beauty and efficiency at the same time. That is the only way that I can see, after years of business product engineering.

49:52 The Best Language

And the best language is the business language, as we’ve already said.

One nice thing about business language is that once I create a DSL and I pour some business logic into that DSL, and I happen to find the original requirement document from the customer and I compare them side-by-side and find that they look so similar, then I know I’m on the right track.

I think there is a best way to describe a specific problem domain.

50:28 The Machine Truly Understands Business Logic

So the machine that truly understands your business logic can do more than ever for you, like generating test cases, doing context-sensitive analysis, or even generating a true, runnable implementation for you on the fly.

You are welcome to contact me online either by Twitter or via email.

Thank you so much.

This post is adapted from a presentation at nginx.conf 2016 by Yichun Zhang, Founder and CEO of OpenResty, Inc. This was the second of two parts of the adaptation. In Part 1, Yichun described OpenResty’s capabilities and went over web application use cases built atop OpenResty. In this part, Yichun looked at what a domain-specific language is in more detail.

You can view the complete presentation on YouTube.

Cover image
Free O'Reilly Ebook
Your guide to everything NGINX