Easy parsing with reasonable error messages in OCaml's Angstrom

#ocaml

Parser combinators are widely used in functional programming, and OCaml's Angstrom is one such library. It is used to implement many foundational parsers in the OCaml ecosystem, e.g. the HTTP parsers in the httpaf stack.

However, one of the bigger downsides of parser combinators is the lack of accurate parse error reporting. Let's take a look. Suppose you want to parse records of this format: 1 Bob, i.e. an ID number, followed by one or more spaces, followed by an alphabetic word (a name). Here's a basic Angstrom parser for this:

open Angstrom

type person = { id : int; name : string; }

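(* One or more spaces *)
let sp = skip_many1 (char ' ')
(* One or more ASCII letters *)
let word = take_while1 (function 'A' .. 'Z' | 'a' .. 'z' -> true | _ -> false)
(* One or more decimal digits, returned as a string *)
let num = take_while1 (function '0' .. '9' -> true | _ -> false)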

let person =
  let+ id = num
  and+ _ = sp
  and+ name = word
  and+ _ = end_of_input in
  { id = int_of_string id; name }

Let's try out various bad inputs and check the errors:

# parse_string ~consume:Consume.All person "";;
- : (person, string) result = Error ": count_while1"

# parse_string ~consume:Consume.All person "1";;
- : (person, string) result = Error ": not enough input"

# parse_string ~consume:Consume.All person "1 ";;
- : (person, string) result = Error ": count_while1"

# parse_string ~consume:Consume.All person "1 1";;
- : (person, string) result = Error ": count_while1"

The error messages are not great, unfortunately! It's hard to tell what went wrong. Of course, in this case we know what caused each error because we are feeding small inputs to the parser. But it's easy to imagine that for larger inputs it may be difficult to understand why a parse is failing.

Fortunately, parser combinator libraries usually provide a 'label' function to improve error messages a little. In Angstrom, a label is attached like this: parser <?> "label string". But the built-in operator accepts only a static string. Let's improve labelling even more! Using two lesser-known Angstrom parsers, available and peek_string, we can take a snapshot of the input still left to parse and include it in the error message if parsing fails.
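Before augmenting the operator, it helps to see what the built-in one produces on its own. This is a minimal sketch that reuses the num parser from above with a static label. The expected result in the comment assumes the same Angstrom version as the outputs shown earlier, where take_while1 reports its failure as count_while1.

```ocaml
open Angstrom

(* Same digit parser as above, with a static label attached
   via Angstrom's built-in <?> operator *)
let num =
  take_while1 (function '0' .. '9' -> true | _ -> false)
  <?> "expected a number"

(* On empty input, the label is prepended to Angstrom's own failure message *)
let result = parse_string ~consume:Consume.All num ""
(* result = Error "expected a number: count_while1" *)
```

The label tells us what was expected, but nothing about what the parser actually saw.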

Here we shadow the built-in label operator with a more powerful, snapshotting version. Because a plain let is not recursive, the <?> used inside the body still refers to Angstrom's original operator:

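let ( <?> ) p l =
  (* Number of bytes of unconsumed input currently available *)
  let* remaining = available in
  (* Cap the snapshot at 20 characters to keep messages short *)
  let remaining = min remaining 20 in
  (* Look ahead without consuming any input *)
  let* s = peek_string remaining in
  (* Delegate to Angstrom's original <?>, which is still in scope here *)
  p <?> Printf.sprintf "%s, got: [%s]" l s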
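As an aside, here is one possible variation on the same idea. It is only a sketch: label_with_snapshot and its max_len parameter are hypothetical names, not part of Angstrom. It makes the snapshot length configurable and escapes the snapshot with String.escaped, so that newlines or tabs in the input do not break the error message across lines.

```ocaml
open Angstrom

(* Hypothetical helper, not part of Angstrom: a named version of the
   snapshotting label with a configurable snapshot length. *)
let label_with_snapshot ?(max_len = 20) p label =
  (* Bytes of unconsumed input currently available *)
  let* remaining = available in
  (* Peek at the upcoming input without consuming it *)
  let* snapshot = peek_string (min remaining max_len) in
  (* Escape control characters so the message stays on one line *)
  p <?> Printf.sprintf "%s, got: [%s]" label (String.escaped snapshot)

let digits = take_while1 (function '0' .. '9' -> true | _ -> false)

(* Default snapshot length *)
let num = label_with_snapshot digits "expected a number"

(* Shorter snapshot, e.g. for deeply nested parsers *)
let short_num = label_with_snapshot ~max_len:2 digits "expected a number"
```

Note that in both versions the snapshot string is built every time the labelled parser runs, even when the parse succeeds, so it may be worth keeping the snapshot short on hot paths.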

Now let's redefine our parsers to use this augmented labelling operator:

let sp = skip_many1 (char ' ') <?> "expected one or more spaces"
let word = take_while1 (function 'A' .. 'Z' | 'a'..'z' -> true | _ -> false) <?> "expected a word"
let num = take_while1 (function '0'..'9' -> true | _ -> false) <?> "expected a number"

let person =
  (let+ id = num <?> "expected a numeric ID"
   and+ _ = sp
   and+ name = word <?> "expected a name"
   and+ _ = end_of_input <?> "expected end of input" in
   { id = int_of_string id; name }) <?> "expected a person"

Let's try the error scenarios again (the last input is now "1 b1", so that it fails at the end-of-input check):

# parse_string ~consume:Consume.All person "";;
- : (person, string) result =
Error
 "expected a person, got: [] > expected a numeric ID, got: [] > expected a number, got: []: count_while1"

# parse_string ~consume:Consume.All person "1";;
- : (person, string) result =
Error
 "expected a person, got: [1] > expected one or more spaces, got: []: not enough input"

# parse_string ~consume:Consume.All person "1 ";;
- : (person, string) result =
Error
 "expected a person, got: [1 ] > expected a name, got: [] > expected a word, got: []: count_while1"

# parse_string ~consume:Consume.All person "1 b1";;
- : (person, string) result =
Error
 "expected a person, got: [1 b1] > expected end of input, got: [1]: end_of_input"

The text in the brackets is a snapshot (up to 20 characters) of the input that remained when each labelled parser started, so it narrows down level by level to the exact spot where the parse failed! With this, we can easily figure out what went wrong, even for larger inputs.
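Once labels are nested a few levels deep, the one-line message can get long. The following is a small sketch, using only the standard library, that prints one label per line with increasing indentation. Both split_on_substring and format_error are hypothetical helpers, and the sketch assumes that labels are joined by " > " as in the outputs above. A snapshot that itself contains " > " would be split in the wrong place.

```ocaml
(* Split [s] on every occurrence of the non-empty substring [sep]. *)
let split_on_substring ~sep s =
  let sep_len = String.length sep and len = String.length s in
  let rec go start i acc =
    if i + sep_len > len then
      (* No room left for another separator: emit the final piece *)
      List.rev (String.sub s start (len - start) :: acc)
    else if String.sub s i sep_len = sep then
      (* Found a separator: emit the piece before it and skip past it *)
      go (i + sep_len) (i + sep_len) (String.sub s start (i - start) :: acc)
    else go start (i + 1) acc
  in
  go 0 0 []

(* Render an error such as "a > b: msg" with one label per line,
   indenting each nesting level by two more spaces. *)
let format_error err =
  split_on_substring ~sep:" > " err
  |> List.mapi (fun depth part -> String.make (2 * depth) ' ' ^ part)
  |> String.concat "\n"

let () =
  print_endline
    (format_error
       "expected a person, got: [1 b1] > expected end of input, got: [1]: end_of_input")
```

The outermost label comes first and the innermost last, so the final line is the one closest to the actual failure.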

Happy parsing!
