A full-fledged HTTP/1.1 parser is just a function. I was bored on high-level web stuff and digged into one of the very basic of web technologies – HTTP parser. I looked into Node.js’s HTTP Parser which is based on NGINX’s parser and figured it out that HTTP Parser is basically one simple function with a huge switch-case. It’s written in C and I reconstructed it in OCaml. I’m going to explain what it does in a simplified HTTP method parser example which is a subset of the full HTTP Parser.
The first thing you need to know is that HTTP Parser is not in Yacc/Lex. When it comes to parsing structured text data, you may pick Yacc and Lex for the tools, or parser combinators maybe. Though, you can’t make use of those tools because an HTTP parser is allowed to see only a tiny fraction of data at a time. HTTP is based on TCP and data in TCP are transfered in several TCP packets.
Instead, HTTP Parser is a State Machine. It consumes one ASCII character at a time and changes its internal state. It also triggers callbacks when interesting events happen. In this way, it can phase parsing – process as many bytes as possible and wait until the next chunk arrives.
Moreover, a State Machine is a function. It takes an old state and a character then returns a new state. When we get another charater, use the new state as an old state, pass them to the function again, and repeat this iteration until it reaches a terminal state.
val f : state -> char -> state
Also, a state is an algebraic data type. It needs to keep track of parsing and remember a couple of things:
- Which syntactic element we’re at now
- In this syntactic element, how many characters we’re in
- In this syntactic element, how many characters we’re going to have
We can use a Variant to represent states. For example, If we’re at the very beginning of HTTP method, it’s
MethodStart. If we’re at the 3rd character of HTTP method, it’s
Method (3, m) where
m is the index of the first character of method. Well, I’ll explain the
m part later.
type state = | MethodStart | Method of int * int | SpacesBeforeUri
Nevertheless, callbacks make things a little bit complicated, though. Here’s a partial HTTP parser example that parses HTTP method:
exception InvalidMethod let parse_method on_method data state p = let ch = String.get data p in match state, ch with | MethodStart, ch when 'A' <= ch && ch <= 'Z' -> Method (1, p) | MethodStart, _ -> (* invalid token *) raise InvalidMethod | Method (n, m), _ when n > 255 -> (* too long *) raise InvalidMethod | Method (n, m), ' ' -> on_method data m p; SpacesBeforeUri | Method (n, m), ch when 'A' <= ch && ch <= 'Z' -> Method (n + 1, m) | Method _, _ -> (* invalid token *) raise InvalidMethod
This will parse “GET”, “PUT”, “POST”, “DELETE” or any other HTTP method-ish strings followed by a space. As you can see, it’s not
state -> char -> state. Its type is actually a bit more complicated.
type callback = string -> int -> int -> unit val parse_method : callback -> string -> state -> int -> state
on_method callback function. When the state machine finds (a part of) HTTP method, it triggers the callback so that we can extract a substring of
data. The next
data currently available for parsing. Last
state -> int -> state is a slight modification of
state -> char -> state. Since we want to get HTTP method when we reach the end of method, we need to keep track of index of the first character of the syntactic element (in this case, HTTP method). That’s the
Also, since we can see only a tiny fraction of
data at a time, we need to refill
data after we consume all of it. In that case, we need to reset
m which points to a character in the previous
m doesn’t mean anything in the new
data. Before we reset
m to 0, we need to trigger the callback for a substring of
m to the end of
data. So, we may have several
on_message callbacks for a single HTTP method.
By following this approach, a full-fledged HTTP/1.1 parser can be constructed with a single function. My implementation is on GitHub.