By Eric Burden | November 22, 2022
The season is nearly upon us! And by the season, I of course am referring to the
Advent of Code (what else?). This year I plan to take my first pass at all the puzzles
in Rust, and I’ve already set up my repository.
This past year, I went back and finished all the prior years of AoC in Julia
(repo here), which was really
a delightful experience. I considered on reflection, though, just how much I had
leaned on regular expressions for input parsing over the course of those many, many
puzzles. The regular expression support in Julia was just too good! However, given
that I like to use AoC as a learning and practice exercise, my thoughts drifted to
other options for input parsing. I’ve had some small experience with the nom
parser combinator crate, and
while the menu of sub-modules and functions spread throughout seems a bit daunting
at first, using nom
for text parsing seems like a handy skill to have. So, I
decided to practice a bit on some past AoC puzzles. The input from
2020, Day 4 was interesting enough that
I thought it deserved a blog post, so here it is!
The Goal
The puzzle itself probably has a better description of the input, but to save you from needing to navigate over, here’s a brief overview… You’re given a text file with a bunch of “passport” entries, each of which is separated by a blank line, like so:
ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm
iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929
hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm
hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in
Note that each entry may constitute one or more lines and on each line the individual
fields are separated by a single space. Each field is a string that contains a label
and a value, separated by a colon like <label>:<value>
. My goal is to parse each entry
into the most specific possible representation. You may think this is overkill, and
you may be right, but remember the goal here is to overuse nom
, so there. Here’s the
format I went with:
struct Passport<'a> {
byr: Option<Year<'a>>,
iyr: Option<Year<'a>>,
eyr: Option<Year<'a>>,
hgt: Option<Height<'a>>,
hcl: Option<HairColor<'a>>,
ecl: Option<EyeColor<'a>>,
pid: Option<IdNumber<'a>>,
cid: Option<IdNumber<'a>>,
}
enum Year<'a> {
Valid(u16),
Invalid(&'a str),
}
enum Height<'a> {
CM(u16),
IN(u16),
Invalid(&'a str),
}
enum HairColor<'a> {
Valid(u8, u8, u8),
Invalid(&'a str),
}
enum EyeColor<'a> {
Amber,
Blue,
Brown,
Gray,
Green,
Hazel,
Other,
Invalid(&'a str),
}
enum IdNumber<'a> {
Valid(u32),
Invalid(&'a str),
}
Why, oh, why!? Well, according to the puzzle instructions, each entry may or may
not contain each field, and each field may or may not actually be possible to
represent correctly, like "ecl:1984"
or some junk. So, what am I doing here? Why
not just store everything in a HashMap<&str,&str>
or something? Specificity. In
particular, it’s a good excuse to do lots of parsing. Each field in Passport
is
optional because the input may or may not contain an entry for that field, and each
of the enums has an Invalid
variant where I will store the string that couldn’t be
parsed into a Valid
value. This means that "ecl:1984"
would be represented as
a Some(EyeColor::Invalid("1984"))
. In a more “real-world” scenario, this might
be really useful, allowing the developer to easily pick out fields that couldn’t
be parsed and review their contents. For me, this was super handy for checking
examples to determine when my parsers weren’t doing what I expected them to.
Parser Combinators
If you’re not familiar with the term, a “combinator” is a function (typically a small, focused function) that takes in a function and returns a function. If that definition doesn’t make it clear (because of course it doesn’t), you can think of combinators as pluggable pieces of an assembly line, where can piece together a series of combinators in such a way as to build “pipelines” that take input at one end and produce different kinds of output out the other end, depending on the order and arrangement of the pieces. If it’s still a bit fuzzy, then perhaps the examples below will help.
The General Shape
Before we start writing any parsing code, let’s try to define something like a “grammar” for these passports, that is, some structure can we infer from the entries as they’re given. For example, we can see that each entry consists of a list of “field"s separated by either a single space or a newline character. In the image below, each Field is highlighted blue and the Separators (either a space or newline) are indicated by yellow rectangles.
Given that each Field represents a Label and a Value, our current structure looks like:
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label = "byr" || "iyr" || "eyr" || "hgt" || "hcl" || "ecl" || "pid" || "cid"
Value = ????
Let’s keep in mind that there are no promises that only valid label values will be given in the input, so we’ll need to account for that. We know that the form of Value will be determined by the content of Label, but we don’t have a good feel for what the options for Value look like just yet. Let’s fix that.
Field Types
Years
There are three different fields that should contain a “year” value, represented by a decimal number. These fields have the label “byr”, “iyr”, or “eyr”.
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label
= "byr" -> Value = Year
= "iyr" -> Value = Year
= "eyr" -> Value = Year
= "hgt" -> Value = ???
= "hcl" -> Value = ???
= "ecl" -> Value = ???
= "pid" -> Value = ???
= "cid" -> Value = ???
Year = [0-9]+
Yes, I am representing the definition of Year as a regular expression. Sue me.
Height
Height can either be given in inches or centimeters, which is given as a number followed by the suffix “in” or “cm”. This field has the label “hgt”.
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label
= "byr" -> Value = Year
= "iyr" -> Value = Year
= "eyr" -> Value = Year
= "hgt" -> Value = Height
= "hcl" -> Value = ???
= "ecl" -> Value = ???
= "pid" -> Value = ???
= "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
Hair Color
Hair color should be given as a hex color code, complete with '#'
. This field has
the label “hcl”.
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label
= "byr" -> Value = Year
= "iyr" -> Value = Year
= "eyr" -> Value = Year
= "hgt" -> Value = Height
= "hcl" -> Value = HairColor
= "ecl" -> Value = ???
= "pid" -> Value = ???
= "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+
Eye Color
Eye color can be one of “amb”, “blu”, “brn”, “gry”, “grn”, “hzl”, or “oth”, and that’s it. This field has the label “ecl”.
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label
= "byr" -> Value = Year
= "iyr" -> Value = Year
= "eyr" -> Value = Year
= "hgt" -> Value = Height
= "hcl" -> Value = HairColor
= "ecl" -> Value = EyeColor
= "pid" -> Value = ???
= "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+
EyeColor = "amb" || "blu" || "brn" || "gry" || "grn" || "hzl" || "oth"
ID Numbers
Both Passport ID and Country ID can be represented as numbers. These are a bit larger than a Year, so we’ll make a distinction between the two. These fields have the label “pid” or “cid”.
Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label
= "byr" -> Value = Year
= "iyr" -> Value = Year
= "eyr" -> Value = Year
= "hgt" -> Value = Height
= "hcl" -> Value = HairColor
= "ecl" -> Value = EyeColor
= "pid" -> Value = IdNumber
= "cid" -> Value = IdNumber
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+
EyeColor = "amb" || "blu" || "brn" || "gry" || "grn" || "hzl" || "oth"
IdNumber = [0123456789]+
With the completed specification, we can start writing functions! We’ll start at the bottom of our specification and build our way up.
Combinator Functions
IdNumber
use nom::{
bytes::complete::take,
character::complete::u32,
combinator::all_consuming,
IResult,
};
enum IdNumber<'a> {
Valid(u32),
Invalid(&'a str),
}
impl<'a> From<&'a str> for IdNumber<'a> {
fn from(value: &'a str) -> Self {
match id_number(value) {
Ok((_, year)) => Self::Valid(year),
Err(_) => Self::Invalid(value),
}
}
}
fn id_number(s: &str) -> IResult<&str, u32> {
let (_, nine_chars) = all_consuming(take(9usize))(s)?;
u32(nine_chars)
}
There are a few things to note here. One, the nom
functions can take a variety of
inputs, but in this case we’ll pass a string slice to all of them. The IResult<T, U>
from
nom
is the equivalent of Result<(T, U), E>
where E
is a nom::error::Error
, U
is the result of successfully parsing the input, and T
is the rest of the unparsed input.
It is generally a good idea to wrap the nom
functions in your own functions for the
function signature, otherwise you can end up needing some really gnarly type hints if
you try to use the parser combinator functions in-line. So, we can understand
id_number(&str) -> IResult<&str, u32>
to be a function that takes a string slice
and parses it into a (&str, u32)
, where the returned string slice is the rest of the
input and the u32
is the resulting number.
I’d recommend taking a look at the list of nom
combinators
on GitHub. This list makes it much easier to find the combinator you’re looking for.
Our id_number(&str) -> IResult<&str, u32>
function first takes 9 characters from
the input and returns an error if there are more than 9 characters there. It then
attempts to parse those 9 characters into a u32
and returns the result. We use
the nom::character::complete::u32
function instead of _.parse::<u32>()
to
keep the nom error types.
Beyond that, we’ve implemented From
such that IdNumber::from(&str)
tries to parse
the given string into a u32
containing exactly 9 digits including leading zeros
and, if successful, returns an IdNumber::Valid(u32)
. Otherwise, you get back an
IdNumber::Invalid(&str)
containing the original string slice.
There are a few different ways you could structure this, including by implementing
FromStr
or making id_number()
a member function of IdNumber
, but the heart of
it should remain the same.
EyeColor
enum EyeColor<'a> {
Amber,
Blue,
Brown,
Gray,
Green,
Hazel,
Other,
Invalid(&'a str),
}
impl<'a> From<&'a str> for EyeColor<'a> {
fn from(s: &'a str) -> Self {
match s {
"amb" => EyeColor::Amber,
"blu" => EyeColor::Blue,
"brn" => EyeColor::Brown,
"gry" => EyeColor::Gray,
"grn" => EyeColor::Green,
"hzl" => EyeColor::Hazel,
"oth" => EyeColor::Other,
_ => EyeColor::Invalid(s),
}
}
}
I honestly couldn’t come up with a good reason to use any parser functions here, we’re
just matching the input string to one of the EyeColor
variants or returning
EyeColor::Invalid(&str)
if we can’t.
HairColor
use nom::{
bytes::complete::take_while_m_n,
character::complete::char,
combinator::map_res,
sequence::{preceded, tuple},
IResult,
}
enum HairColor<'a> {
Valid(u8, u8, u8),
Invalid(&'a str),
}
impl<'a> From<&'a str> for HairColor<'a> {
fn from(value: &'a str) -> Self {
match hex_color_triplet(value) {
Ok((_, (r, g, b))) => HairColor::Valid(r, g, b),
Err(_) => HairColor::Invalid(value),
}
}
}
/// Use `nom` parser combinators to parse "#ffffff" into (255, 255, 255).
fn hex_color_triplet(s: &str) -> IResult<&str, (u8, u8, u8)> {
preceded(char('#'), tuple((hex_value, hex_value, hex_value)))(s)
}
/// Use `nom` to parse two hex characters into their u8 value.
fn hex_value(s: &str) -> IResult<&str, u8> {
let two_hex_bytes = take_while_m_n(2, 2, |c: char| c.is_ascii_hexdigit());
let convert_to_u8 = |v| u8::from_str_radix(v, 16);
map_res(two_hex_bytes, convert_to_u8)(s)
}
Now here’s some parsing! We can see that the From<&str>
trait implementation
is almost exactly the same as for IdNumber
, so that’s not too interesting, but
hex_value(&str)
and hex_color_triplet(&str)
really show off what the parser
combinator functions can do. hex_color_triplet(&str)
looks for a string that
starts with ‘#’ and is followed by three substrings that can be parsed by
hex_value(&str)
which are return in a three-tuple. hex_value(&str)
takes
the first two characters as long as they satisfy c.is_ascii_hexdigit()
and
attempts to convert them into a u8
using u8::from_str_radix(&str, 16)
. Recall
that we get a nom::error::Error
if any part of this process fails, which
ultimately means a HairColor::Invalid(&str)
.
Height
use nom::{
branch::alt,
bytes::complete::tag,
character::complete::u16,
sequence::pair,
IResult,
};
enum Height<'a> {
CM(u16),
IN(u16),
Invalid(&'a str),
}
impl<'a> From<&'a str> for Height<'a> {
fn from(value: &'a str) -> Self {
if let Ok((_, (number, unit))) = number_unit_pair(value) {
match unit {
"in" => Height::IN(number),
"cm" => Height::CM(number),
_ => unreachable!(),
}
} else {
Height::Invalid(value)
}
}
}
/// `nom` parser combinator to parse "173cm" into (173, "cm").
fn number_unit_pair(s: &str) -> IResult<&str, (u16, &str)> {
pair(u16, alt((tag("in"), tag("cm"))))(s)
}
Height
is interesting because we could be looking at a Height::CM(u16)
or a
Height::IN(u16)
depending on the last two letters of the input. number_unit_pair(&str)
is looking for a number that can be parsed to a u16
followed by either “in” or
“cm” and returns as the output a tuple of the number and the suffix as a string.
With that in hand, it’s straightforward to wrap the number in the appropriate
variant of Height
.
Year
use nom::{
bytes::complete::take,
character::complete::u16,
combinator::all_consuming,
IResult,
};
enum Year<'a> {
Valid(u16),
Invalid(&'a str),
}
impl<'a> From<&'a str> for Year<'a> {
fn from(value: &'a str) -> Self {
match year(value) {
Ok((_, year)) => Self::Valid(year),
Err(_) => Self::Invalid(value),
}
}
}
fn year(s: &str) -> IResult<&str, u16> {
let (_, four_chars) = all_consuming(take(4usize))(s)?;
u16(four_chars)
}
This is almost exactly the same as IdNumber
, just with a u16
instead of a u32
and four digits instead of nine. Specificity!
Passport
With all the pieces in place, we’re ready to parse an entire Passport!
use nom::{
branch::alt,
bytes::complete::{is_not, tag, take},
multi::separated_list1,
sequence::separated_pair,
IResult,
};
/// Parses a line-separated string into a list of the results of attempting
/// to parse each entry into a `Passport`.
fn parse_file(s: &str) -> Vec<Result<Passport, ParsingError>> {
s.split("\n\n").map(Passport::try_from).collect()
}
enum ParsingError<'a> {
InvalidPassport(&'a str),
InvalidField(&'a str),
}
#[derive(Debug, Default)]
struct Passport<'a> {
byr: Option<Year<'a>>,
iyr: Option<Year<'a>>,
eyr: Option<Year<'a>>,
hgt: Option<Height<'a>>,
hcl: Option<HairColor<'a>>,
ecl: Option<EyeColor<'a>>,
pid: Option<IdNumber<'a>>,
cid: Option<IdNumber<'a>>,
}
/// Create a `Passport` based on a chunk of the input string.
impl<'a> TryFrom<&'a str> for Passport<'a> {
type Error = ParsingError<'a>;
fn try_from(s: &'a str) -> Result<Self, Self::Error> {
// If we don't get back a list of (<field name>, <field value>) from the parsing
// function, then it's an invalid passport entry.
let (_, fields) = field_list(s).map_err(|_| ParsingError::InvalidPassport(s))?;
// Initialize a new passport with all the optional fields set to `None`.
let mut passport = Passport::default();
// Check each field label, parse its value, and put the value in the right
// `Passport` field. If the label is unrecognizable, then it's an invalid
// field and gets reported. In hindsight, instead of returning an error
// here, you could just store the un-parseable field entries in the `Passport`
// struct for later review.
for (label, value) in fields {
match label {
"byr" => passport.byr = Some(Year::from(value)),
"iyr" => passport.iyr = Some(Year::from(value)),
"eyr" => passport.eyr = Some(Year::from(value)),
"hgt" => passport.hgt = Some(Height::from(value)),
"hcl" => passport.hcl = Some(HairColor::from(value)),
"ecl" => passport.ecl = Some(EyeColor::from(value)),
"pid" => passport.pid = Some(IdNumber::from(value)),
"cid" => passport.cid = Some(IdNumber::from(value)),
_ => return Err(ParsingError::InvalidField(s)),
};
}
Ok(passport)
}
}
/// Parses "byr:1985 ecl:blu hgt:173cm" into [("byr", "1985"), ("ecl", "blu"),
/// ("hgt", "173cm")]
fn field_list(s: &str) -> IResult<&str, Vec<(&str, &str)>> {
separated_list1(separator, field_value_pair)(s)
}
/// All field/value pairs in a passport entry are separated by either a space or
/// a newline. If that were to change, this function would be the one to be updated.
fn separator(s: &str) -> IResult<&str, &str> {
alt((tag(" "), tag("\n")))(s)
}
/// Parses "byr:1985" into ("byr", "1985").
fn field_value_pair(s: &str) -> IResult<&str, (&str, &str)> {
separated_pair(take(3usize), tag(":"), is_not(" \n"))(s)
}
I’ve opted to have a couple of different possible error types just to cover my bases,
cases where there are field labels that aren’t in Label get reported as a
ParsingError::InvalidField(&str)
and cases where the parser can’t identify
the basic Passport structure (fields and separators) get reported as a
ParsingError::InvalidPassport(&str)
. Note the heavy use of our previously
implemented From<&str>
traits within the From<&str>
implementation for
Passport
.
Close-Out
This was a good exercise for me. Honestly, nom
was a bit intimidating to get started
with and the documentation could use a bit more work in terms of examples. That’s one
of the reasons I decided to write this post, to get more nom
examples out into the
world. As I started to get more familiar with usage patterns and where different things
were in the library, I was better able to chain and nest combinators to do what I
wanted. The code above is by no means representative of my first approach, initially
I found I was creating overly complicated combinators because I just didn’t know about
other options within the library. I will say, though, that taking this approach was
much more “de-buggable” than a complex enough regular expression to handle this
situation. I suspect that, in the long run, this is a more maintainable approach
as well. All in all, worth the time investment to get familiar, and I’m looking forward
to practicing more during this year’s Advent of Code.