Recently, I’ve been learning parser combinators in Haskell, using the included Text.ParserCombinators.ReadP library. For validating small bits of text, such as checking something like a phone number for proper formatting, they’re a departure from regular expressions that I and many other programmers are already accustomed to. However, they’re more easily readable maintainable than the keysmash that regular expressions tend to become when they get sufficiently complex. This post details a few small exercises for ReadP, which should (hopefully) be applicable to other Haskell parser combinator libraries such as Parsec, Attoparsec and Megaparsec as well.

This post won’t attempt to teach parser combinators from scratch. For that, I recommend Two Wrongs’ introduction to parser combinators, which I followed when I first started. Past that, the docs on Hackage are a good reference for what each ReadP operation/included parser does.

Canadian postal codes

Input: a Canadian postal code, with or without a space in the middle, in uppercase or lowercase. (eg: “h0h0h0”, “H0H 0H0”)

Output: a newtyped String containing the postal code in uppercase and with a space. (eg: “H0H 0H0”)

Canadian postal codes are six-character strings that alternate between letters and digits, usually but not always written with a space after the first three. In addition, the first letter is limited to some, but not all, letters of the Latin alphabet (A through Z). This means we’ll need to construct a parser out of four smaller parsers: one for a single starting letter, one for a single Latin letter in general, one for a single digit, and one for an optional space.

For a single digit, we can use satisfy isDigit. satisfy takes a Char -> Bool, and we can use isDigit (from Data.Char) to get a parser.

For a single Latin letter, however, isLetter returns True not just for Latin characters, but for anything defined in Unicode to be a letter. Therefore, we’ll have to make our own function, which we can then give to satisfy to make a parser that returns a single uppercase or lowercase Latin chatacter:

isPostalCodeLetter :: Char -> Bool
isPostalCodeLetter = (flip elem ['A'..'Z']) . toUpper

The first letter in a postal code is limited to a certain subset of the Latin alphabet as well. In the same manner as the above, we can use satisfy to get a parser for that, too:

isFsaDistrictLetter :: Char -> Bool
isFsaDistrictLetter = (flip elem "ABCEGHJKLMNPRSTVXY") . toUpper

Finally, for our optional space, we’ll need to again make a parser that takes a single space. satisfy isSpace, again leaning on Data.Char, should do. To make it take zero or one space, however, we’ll need to put the optional parser in front of it:

optional $ satisfy isSpace

Finally, when put together, our postal code parser looks like this, including what we need to return a properly formatted postal code:

newtype PostalCode = PostalCode String

postalCode :: ReadP PostalCode
postalCode = do
    fsaDistrict <- satisfy isFsaDistrictLetter
    fsaDigit <- satisfy isDigit
    fsaLetter <- satisfy isPostalCodeLetter
    optional $ satisfy isSpace
    lduDigit1 <- satisfy isDigit
    lduLetter <- satisfy isPostalCodeLetter
    lduDigit2 <- satisfy isDigit
    eof
    return $ PostalCode
        [ toUpper fsaDistrict
        , fsaDigit
        , toUpper fsaLetter
        , lduDigit1
        , toUpper lduLetter
        , lduDigit2
        ]

We use eof at the end so that we fail to match on any strings that have more to then than what we’ve matched on so far. (eg: “H0H 0H0Hello, world!”)

North American phone numbers

Input: a North American phone number with the following criteria:

  • beginning either with “+1”, “1”, or nothing, with an optional hyphen or space to the right
  • a three-digit area code, optionally surrounded by parentheses, optionally separated by a hyphen or space
  • three digits, optionally separated by a hyphen or space
  • four digits, optionally separated by a hyphen or space

Output: a newtyped String containing the phone number in E.164 format

First of all, there’s a few places where we expect either a space, a hyphen, or nothing. We can use the optional parser to get part of the way there, but then we need a parser that reads either a single hyphen or space. There are a few different ways of doing this, but for the sake of using the choice parser which we haven’t seen yet, here’s what I used:

spaceOrHyphen :: ReadP Char
spaceOrHyphen = choice [char ' ', char '-']

The char parser reads one of the character that is given to it, and the choice parser goes through the list of parsers it is given until it finds one that succeeds. We can also use char ' ' <|> char '-', with <|> coming from Control.Applicative. This is what choice uses behind the scenes, since ReadP is a Monad, and therefore, an Applicative. We could also use satisfy with flip elem "- " if we wanted, depending on what you find more readable.

Next, let’s turn our attention to specific parts of the phone number. Firstly, it starts with either a “+1”, “1” or nothing at all, which is the international calling code used for the US, Canada, and other countries that are part of the North American Numbering Plan. Again, there are a few different ways of reading this, but to make use of choice again, we’ll can use this:

optional $ choice [string "`", string "+1"]

For groups of numbers, such as the groups of three and four digits after the area code, the simplest method is to employ the count parser. For brevity, we’ll define a parser digit that parses a single digit:

digit :: ReadP Char
digit = satisfy isDigit

Then, to get a group of digits, we can use count n digit, where n is however many digits we want.

For the area code, we need to deal with the optional surrounding parentheses. To show off another built-in parser we haven’t seen yet, we’ll use between. For an area code, we can do this:

between
    (optional (char '('))
    (optional (char ')'))
    (count 3 digit)

The first argument is a parser for what’s on the left of what we want to parse (the optional “(“), the second for the right side (the optional “)”), and the third argument is for the part in the middle (the three digits of the area code), which between will pass on to us. between is more or less there for convenience, so you can just as easily make a parser that combines the first parser, then the third, then the second if desired.

Finally, we can put our complete phone number parser together like this:

newtype PhoneNumber = PhoneNumber String

nanpPhoneNumber :: ReadP PhoneNumber
nanpPhoneNumber = do
    optional $ choice [string "1", string "+1"]
    optional spaceOrHyphen

    nanpPhoneNumber = do
    optional $ choice [string "1", string "+1"]
    optional spaceOrHyphen
    
    areaCode <- between
        (optional (char '('))
        (optional (char ')'))
        (count 3 digit)
    optional spaceOrHyphen
    
    phoneNumber1 <- count 3 digit
    optional spaceOrHyphen
    
    phoneNumber2 <- count 4 digit
    eof
    let str = "+1" ++ areaCode ++ phoneNumber1 ++ phoneNumber2
    return $ PhoneNumber str

Passwords

Input: a password containing at least:

  • Eight characters
  • One uppercase letter
  • One lowercase letter
  • One digit
  • One symbol (any character that isn’t a letter, number or whitespace)

Output: a newtyped String containing the unchanged password

Doing this as a parser combinator might be unnecessary and a little convoluted, but as an exercise, we get to make use of ReadP’s lookahead capabilities, which we’ll see in a second. For now, though, we need functions that check if a single character is any of the four above things. isUpper, isLower and isDigit check all the boxes for the first three (unlike for postal codes before, we’ll let non-Latin characters slide), but for the last, we’ll need a new function:

isPwSymbol :: Char -> Bool
isPwSymbol c = not $ or $ map ($ c) [isUpper, isLower, isDigit, isSpace]

isSpace also comes from Data.Char. With that, we can put our complete password parser together like this (with some explanation below):

newtype Password = Password String

password :: ReadP String
password = do
    let criteria = [isUpper, isLower, isDigit, isPwSymbol]
    pw <- look
    if and (map (\f -> or (map f pw)) criteria)
        then return $ Password pw
        else pfail

We start out by putting our criteria functions from above in a list (criteria), then using look to get the rest of the password (since we haven’t parsed anything yet, the full password) without parsing it.

This sets up the one-liner we use for our if-statement. map f pw applies a function (one of our criteria functions) to each character of the password. Wrapping that in f -> or (...) creates an anonymous function that returns True if f returns True for any one of the characters in the password. Wrapping that in map (...) criteria applies that function to criteria, which gives us a list of Bools representing which criteria are met by the password. Finally, calling and on all of that tells us if our password matches all the criteria.

Finally, we use that one-liner in our if statement. If the password matches all the criteria, we return the password wrapped in a newtype. Else, we call pfail, which always fails and returns nothing.