Kevin Hoffman's Blog

Blathering about development, dragons, and all that lies between

Creating a NATS protocol parser in Haskell

Jul 26, 2024 11 min read
fp haskell parsers

As an experiment (possibly in self-inflicted pain), I decided to write a NATS protocol parser in Haskell.

It’s hard to find excuses to build stuff in Haskell, so as I go through the process of trying to teach myself the basics, I’ve decided to build some things that I might actually use some day. In this case, it’s the beginning of a NATS Haskell client. There are several existing clients out there, including two that are more than 8 years old. There is no “official” NATS Haskell client at the moment.

In typical functional programming style, I decided to work on the “pure” stuff first - defining the wire protocol for communicating with NATS. This involves deciding on my data types, and then writing a parser (input) and transformer (output) for that data.

According to the NATS Client Protocol Spec, we can define the known message types as shown in this snippet:


data ProtocolMessage
  = InfoMessage
      { info :: Maybe ServerInfo,
        errorMessage :: Maybe String
      }
  | ConnectMessage
      { settings :: Maybe ConnectSettings,
        errorMessage :: Maybe String
      }
  | MsgMessage
      { subject :: String,
        reply :: Maybe String,
        sid :: String,
        len :: Integer,
        payload :: Maybe Text
      }
  | HeaderMsgMessage
      { subject :: String,
        reply :: Maybe String,
        sid :: String,
        headerBytes :: Integer,
        totalBytes :: Integer,
        messageHeaders :: [Header],
        payload :: Maybe Text
      }
  | PubMessage
      { subject :: String,
        reply :: Maybe String,
        len :: Integer,
        payload :: Maybe Text
      }
  | HeaderPubMessage
      { subject :: String,
        reply :: Maybe String,
        headerBytes :: Integer,
        totalBytes :: Integer,
        messageHeaders :: [Header],
        payload :: Maybe Text
      }
  | SubMessage
      { subject :: String,
        queueGroup :: Maybe String,
        sid :: String
      }
  | UnsubMessage
      { sid :: String,
        maxMessages :: Maybe Integer
      }
  | PingMessage
  | PongMessage
  | OkMessage
  | ErrMessage
      { reason :: String
      }
  deriving (Eq, Show)

data Header = Header {key :: String, value :: String} deriving (Eq, Show)

The ServerInfo and ConnectSettings types are just additional data that derive a JSON serializer and deserializer.

Being a filthy newb when it comes to Haskell, I followed some online recommendations that say for things like the NATS arbitrary payloads, one should use Text, and then if I need to convert into a ByteString later, it’s just a matter of calling encodeUtf8.

An alternative design to this one would be to create a separate type for each of the messages rather than grouping them together in a sum type like this. Personally, treating the protocol messages the way I would an enum in other languages just felt right.

Parsing in Anger

Okay, so now that I’ve got my data types defined, it should be relatively easy to parse them, right? Narrator: It was not easy . The first problem I had was the fact that there are so many parsing libraries available in Haskell. Writing a parser (or “parser combinator” for the purists) seems to be one of the first things people do in this language. After many hours of research, I narrowed my choices down to:

Use the parsec library
Use the megaparsec library
Roll my own (about 100 lines of code for the scaffolding)

I immediately ruled out the third option because it felt similar to the old addage “never write your own crypto”. There were as many websites and blog posts advocating for parsec as there were for megaparsec, and a neverending debate seemed constantly near boilover about which is best.

Ultimately I decided to use megaparsec because it’s mega (naturally). I felt like I made this decision by rolling a D20 for initiative.

Parsers of this kind work by evaluating the rules you supply as the parser designer. This converts the input into something useful, and it then recursively calls all the other parsers until the input is exhausted, gradually building up more and more meaningful context. Each time a new parser in the “chain” is called, it’s given both the output of the previous step and the remaining unparsed text.

When building parsers like this, I like to start at the smallest possible unit of input and then work my way out from there. This ends up being the ping and pong messages:

pPingMessage :: Parser ProtocolMessage
pPingMessage = do
  _ <- string "PING" <* crlf
  return PingMessage

pPongMessage :: Parser ProtocolMessage
pPongMessage = do
  _ <- string "PONG" <* crlf
  return PongMessage

In the case of these parsers, they “eat” the string literals PING or PONG followed by a crlf (which is then discarded as per the <* operator).

The next up in terms of complexity might be the UNSUB message:

pUnsubMessage :: Parser ProtocolMessage
pUnsubMessage = do
  _ <- string "UNSUB" <* hspace1
  sid <- many alphaNumChar
  maxMessages <- try (hspace1 *> fmap Just integer) <|> return Nothing
  _ <- crlf
  return UnsubMessage {..}

It took me for ever, plus annoying a bunch of folks online to figure out how to arrange the exact incantation to make this work. The first line should look familiar, as we “eat” the UNSUB string and then discard any whitespace following it. Next, we grab the subscription ID, which the spec defines as a string, so we use many alphaNumChar to capture that.

maxMessages is where I almost gave up. When parsers fail, they’ve already “eaten” their way down the failing path. So if we want to try a particular path that can fail, or use that same data on an alternative, we need to use some of the monadic functions that do this.

maxMessages <- try (hspace1 *> fmap Just integer) <|> return Nothing

Here we try to parse whitespace followed by something that can be parsed as an integer (note integer here is a function). If that fails, then we return Nothing, and then backtrack to where the try block started.

With all of that behind me and working, I moved on to messages that supported headers. In keeping with the inner-first building pattern, I defined how to parse a header:

pHeaders :: Parser [Header]
pHeaders = do
  _ <- string "NATS/1.0" <* crlf
  many pHeader <* crlf

pHeader :: Parser Header
pHeader = do
  key <- validSubject <* char ':'
  value <- validSubject <* crlf
  return Header {..}

Per the spec, a list of headers is always followed by the NATS/1.0 string literal and a cr/lf pair. Headers are relatively simple to parse because we can use the : as a separator token.

Now I can write a message parser for something that uses headers like HMSG:

pHeaderMsgMessage :: Parser ProtocolMessage
pHeaderMsgMessage = do
  _ <- string "HMSG" <* hspace1
  subject <- validSubject <* hspace1 <?> "subject"
  sid <- many alphaNumChar <* hspace1 <?> "sid"
  (reply, headerBytes, totalBytes) <- try pHeaderMsgTailNoReply <|> pHeaderMsgTailReply
  messageHeaders <- pHeaders <?> "headers"
  payload <- optional $ do
    guard (totalBytes /= 0)
    takeP (Just "payload byte") (fromIntegral totalBytes) <* crlf
  return $ HeaderMsgMessage {..}

This one is pretty complicated. After eating HMSG, the subject, and the sid, we come to a branching point. Here we have to parse out a couple of bits of information like the header bytes, total bytes, and optional reply-to. Lastly, if the total bytes number is greater than 0, we read exactly that number of bytes from input (followed by a discarded crlf).

Parsing can get very frustrating with “false positives”. The reply subject of a message can be any string, which technically could be a string like “123”. This means that “123” will satisfy both an integer parser and a string parser, and so my original naive implementations only worked for one branch, not both.

Instead of saying “if we have an integer, use it” (which can fail when someone supplies an integer-friendly reply-to subject!), we need to try and parse all the way to the end of the line. If we have a reply subject, header bytes, and total bytes, then that’s one path. The other path is just header bytes and total bytes. We define these two backtrack-aware parsing paths as:

pHeaderMsgTailNoReply :: Parser (Maybe string, Integer, Integer)
pHeaderMsgTailNoReply = do
  headerBytes <- integer <* hspace1 <?> "header bytes"
  totalBytes <- integer <* crlf <?> "total bytes"
  return (Nothing, headerBytes, totalBytes)

pHeaderMsgTailReply :: Parser (Maybe String, Integer, Integer)
pHeaderMsgTailReply = do
  reply <- fmap Just validSubject <* hspace1
  headerBytes <- integer <* hspace1 <?> "header bytes"
  totalBytes <- integer <* crlf <?> "total bytes"
  return (reply, headerBytes, totalBytes)

The <?> operator just lets me supply a label for that particular parse operation, which can help when parsing fails. When there’s no reply subject, we return (Nothing, headerBytes, totalBytes), and when there is one, we return (reply, headerBytes, totalBytes).

Both of these branches parse the last 2 values on the line, but we have to set the backtrack position to before we read the subject. This parser could be much simpler if the spec didn’t allow reply subjects to be pure numbers, but alas this is where we are.

Next, we can do something a little bit different for the CONNECT and INFO messages, which both have literals immediately before a JSON payload.

pConnectMessage :: Parser ProtocolMessage
pConnectMessage = do
  _ <- string "CONNECT" <* hspace1
  rest <- many printChar <* crlf
  case eitherDecode $ toByteString rest of
    Right a -> return ConnectMessage {settings = Just a, errorMessage = Nothing}
    Left e -> return ConnectMessage {settings = Nothing, errorMessage = Just e}

Here, eitherDecode is a function that we get from having derived a JSON decoder for the connection settings structure. If the decode fails, we return the Left side of either, and if it succeeds, we right the Right side. Technically the Either type lets you put whatever you want on whatever side, but the community convention seems to be that failures are Left while success is Right. This is obviously a conspiracy against left-handed people like me.

Now we’re finally ready for the payoff. After having defined parsers for all of the individual message types, we can create a root parser that uses alternate (<|>) syntax to make it obvious how message parsing works:

parseMessage :: Parser ProtocolMessage
parseMessage =
  pMsgMessage
    <|> pInfoMessage
    <|> pConnectMessage
    <|> pPubMessage
    <|> pSubMessage
    <|> pUnsubMessage
    <|> pHeaderPubMessage
    <|> pHeaderMsgMessage
    <|> pOkMessage
    <|> pErrMessage
    <|> pPingMessage
    <|> pPongMessage

The order matters here. We need to be very explicit about how we define the individual parsers so that we don’t accidentally try and parse one input as a different type. Fortunately, we don’t need to do any backtracking here because each NATS message has a unique prefix. This means our parsers will all fail when reading an incorrect first token (e.g. HMSG, MSG, INFO, etc). The monadic portion of this operator enables us to chain all these together, knowing that if one fails, we will attempt the next one in line. If all of them fail, then parseMessage will return a Left a where a is the parser error type.

I really disliked the process of learning how to build this parser. I personally am not a big fan of DSLs, and so the DSL-like nature of how all these parser combinators, er, combine, rubbed me the wrong way.

It feels like there’s “mainstream” Haskell, and then there’s parsing, which is a completely different animal. I will admit, however, that after all the suffering and smashing my head against the desk, I came to like how testable and verifiable these parsers are.

Here’s an example of test scaffolding for the PING message:


spec :: Spec
spec = do
  doParserCases
  doTransformerCases  

parserCases :: [(Text, ProtocolMessage)]
parserCases = [("PING\r\n", PingMessage)]

transformerCases :: [(ProtocolMessage, Text)]
transformerCases = map (\(a,b) -> (b,a)) parserCases

doParserCases :: SpecWith ()
doParserCases = parallel $ do
  describe "PING parser" $ do
    forM_ parserCases $ \(input, want) -> do
      it (printf "correctly parses explicit case %s" (show input)) $ do
        let output = parse parseMessage "" input
        output `shouldBe` Right want

doTransformerCases :: SpecWith ()
doTransformerCases = parallel $ do
  describe "PING transformer" $ do
    forM_ transformerCases $ \(input, want) -> do
      it (printf "correctly transforms %s" (show input)) $ do
        transform input `shouldBe` encodeUtf8 want

You can see how easy it is to reuse this format to define the expected inputs, what they should parse as, and to ensure that they are output (transformed) exactly as they were input.

In other words, I don’t think learning how to build parsers teaches you the fundamentals of Haskell. However, having a very good working knowledge of type classes, monads, and applicatives can help make sense of all this stuff that seemed like gibberish when I first started.

I am not claiming to have figured this all out on my own. As usual, learning is a community effort and I got a lot of help from Reddit, Discord, and got a lot of project layout inspiration from natskell, one of the other Haskell NATS clients.

Next up, I need to add some basic network connectivity to talk to the NATS server. I’m sure that’ll be easy, right?