1. Overview

WIP, a tutorial which demonstrates the basics of Parsec and goes on to build a SQL query parser.

You can view this tutorial as HTML online here:

and you can view the files directly in the github repository here:

1.1. Summary of sections

Introduction to parsing with Parsec, including a review of Text.Parsec.Char functions.

Creating a very simple expression language parser, and introducing some functions from Text.Parsec.Combinator.

Rewriting the simple expression parser code in a more succinct style.

Review and examples of all functions from Text.Parsec.Combinator, and some from Control.Applicative and Control.Monad.

The utility functions used in the previous tutorials, plus some notes on types in Parsec.

This covers using the Text.Parsec.Expr for expression parsing with prefix, postfix and infix operators with fixity.

Looks at an issue we have with the way the symbol parser in the Text.Parsec.Expr tutorial was used, and some possible fixes.

This covers the Text.Parsec.Perm module which is used for parsing different things in flexible order.

This covers Text.Parsec.Token which can be used to create token parsers easily.

This covers building a parser a subset of value expressions from SQL, which are an extension of the simple expression types and parsers covered in previous tutorials.

This covers building a parser to parse query expressions with select lists, simple from, where, group by, having and order by.

This extend the parser for query expressions to support a from clause with much more features including joins.

Here is the code from ValueExpressions, QueryExpressions and FromClause plus tests put together and rearranged as a coherent standalone module.

This quick module covers a simple pretty printer for our SQL ast.

In this document, we will explore error messages with parsec and how restructuring parser code can lead to better or worse error messages.

1.2. Going further

If you are interesting in SQL parsing, check out the project to build a complete SQL parser here: http://jakewheat.github.io/simple-sql-parser/latest. The parsing code in the simple-sql-parser project is based on this tutorial code.

1.3. Extras

an executable which contains the boilerplate to run a parsec parser on a string passed as an argument

an executable which contains the boilerplate to run a parsec parser on a file passed as an argument

Contact: jakewheatmail@gmail.com

License: BSD3

2. Getting started

This is an introduction to parsing with Haskell and Parsec.

Prerequisites: you should know some basic Haskell and have GHC and cabal-install installed (installing the Haskell Platform will give you this).

This tutorial was originally written using GHC 7.6.3 and Parsec 3.1.3, which are the versions which come with the Haskell Platform 2013.2.0.0. It should also work fine with GHC 7.8.4 and GHC 7.10.2 and through to at least the latest release of Parsec 3.1.x.

This tutorial was written using Literate Haskell files available here: https://github.com/JakeWheat/intro_to_parsing.

I recommend you download them all, and follow along in your favourite editor, and use GHCi to experiment. To download the intro_to_parsing files, use:

git clone https://github.com/JakeWheat/intro_to_parsing.git

Here are the imports.

import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (anyChar)
import Text.Parsec.String.Char
import FunctionsAndTypesForParsing (regularParse, parseWithEof, parseWithLeftOver)
import Data.Char
import Text.Parsec.String.Combinator (many1)

2.1. First parser

The first parser:

anyChar :: Parser Char

This parser is in the module Text.Parsec.Char. There is a wrapper in this tutorial’s project, Text.Parsec.String.Char, which gives this function a simplified type.

Whenever we write a parser which parses to a value of type a, we give it the return type of Parser a. In this case, we parse a character so the return type is Parser Char. The Parser type itself is in the module Text.Parsec.String. We will cover this in more detail later.

Let’s use this parser. I will assume you have GHC and cabal-install installed (which provides the 'cabal' executable) and both are in your PATH. The Haskell Platform is one way that provides this.

Change to the directory where you downloaded the intro_to_parsing source files (which will contain the GettingStarted.lhs file). Then you can set up a cabal sandbox and be ready to work with the code by running the following commands in that directory:

cabal update
cabal sandbox init
cabal install parsec
cabal repl

Now you will get the ghci prompt. Type in ':l GettingStarted.lhs'. You can run the parser using a wrapper, enter the following at the ghci prompt: regularParse anyChar "a".

Here is a transcript of running ghci via 'cabal repl':

$ cabal repl
GHCi, version 7.6.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> :l GettingStarted.lhs
[1 of 5] Compiling Text.Parsec.String.Parsec ( Text/Parsec/String/Parsec.hs, interpreted )
[2 of 5] Compiling Text.Parsec.String.Combinator ( Text/Parsec/String/Combinator.hs, interpreted )
[3 of 5] Compiling Text.Parsec.String.Char ( Text/Parsec/String/Char.hs, interpreted )
[4 of 5] Compiling FunctionsAndTypesForParsing ( FunctionsAndTypesForParsing.lhs, interpreted )
[5 of 5] Compiling Main             ( GettingStarted.lhs, interpreted )
Ok, modules loaded: FunctionsAndTypesForParsing, Text.Parsec.String.Char, Text.Parsec.String.Combinator, Main, Text.Parsec.String.Parsec.
*Main> regularParse anyChar "a"
Loading package array-0.4.0.1 ... linking ... done.
Loading package deepseq-1.3.0.1 ... linking ... done.
Loading package bytestring-0.10.0.2 ... linking ... done.
Loading package containers-0.5.0.0 ... linking ... done.
Loading package binary-0.5.1.1 ... linking ... done.
Loading package transformers-0.4.3.0 ... linking ... done.
Loading package mtl-2.2.1 ... linking ... done.
Loading package text-1.2.1.3 ... linking ... done.
Loading package parsec-3.1.9 ... linking ... done.
Right 'a'
*Main>

You can exit ghci by entering ':quit' or using Ctrl-d. From now on, to start ghci again, you can just change to the directory with GettingStarted.lhs and run 'cabal repl'. ghci should have readline support so you can browse through your command history using up and down arrow, etc.

This is the type of regularParse. It is wrapper which takes a parser function such as anyChar, and wraps it so you can parse a string to either a parse error, or the return value from your parser function:

regularParse :: Parser a -> String -> Either ParseError a

Here are some examples of running this parser on various input:

*Main> regularParse anyChar "a"
Right 'a'

*Main> regularParse anyChar "b"
Right 'b'

*Main> regularParse anyChar "0"
Right '0'

*Main> regularParse anyChar " "
Right ' '

*Main> regularParse anyChar "\n"
Right '\n'

*Main> regularParse anyChar "aa"
Right 'a'

*Main> regularParse anyChar ""
Left (line 1, column 1):
unexpected end of input

*Main> regularParse anyChar " a"
Right ' '

You can see that if there are no characters, we get an error. Otherwise, it takes the first character and returns it, and throws away any trailing characters. The details of the helper function regularParse will come later.

Here are two alternatives to regularParse you can also use for experimenting for the time being:

parseWithEof :: Parser a -> String -> Either ParseError a
parseWithLeftOver :: Parser a -> String -> Either ParseError (a,String)

These can be useful when you are not sure if your parser is consuming all your input string or not. The eof parser will error if you haven’t consumed all the input, and the leftover parser can instead tell you what was not consumed from the input.

*Main> regularParse anyChar "a"
Right 'a'

*Main> parseWithEof anyChar "a"
Right 'a'

*Main> parseWithLeftOver anyChar "a"
Right ('a',"")

*Main> *Main> regularParse anyChar ""
Left (line 1, column 1):
unexpected end of input

*Main> parseWithEof anyChar ""
Left (line 1, column 1):
unexpected end of input

*Main> parseWithLeftOver anyChar ""
Left (line 1, column 1):
unexpected end of input

*Main> regularParse anyChar "aa"
Right 'a'

*Main> parseWithEof anyChar "aa"
Left (line 1, column 2):
unexpected 'a'
expecting end of input

*Main> parseWithLeftOver anyChar "aa"
Right ('a',"a")

*Main> parseWithLeftOver anyChar "abc"
Right ('a',"bc")

You can use these functions and ghci to experiment. Try running all the parsers in ghci on various input strings as you work through the document to get a good feel for all the different features. Tip: you can also write the parsers inline in the function call, for example:

*Main> regularParse (many1 digit) "1"
Right "1"

*Main> regularParse (many1 digit) "122"
Right "122"

This can be used to quickly try out new ad hoc parsing code.

2.2. Type signatures

The real Parsec functions have quite complex type signatures. This makes a lot of things very tricky before you understand them, and can make some of the error messages you’ll see really difficult to understand. I’ve created some wrapper modules, which set the types of all the functions from Parsec we use to be much more restricted. This will make the types easy to understand, and reduce the amount of tricky to understand compiler errors you get. You can use this approach when writing your own parser code with Parsec. These wrapper modules are created with the following name pattern: Text.Parsec.CharText.Parsec.String.Char.

Later on, we will look at the general types in more detail.

2.3. Text.Parsec.Char

Let’s go through some of the functions in Text.Parsec.Char module from the Parsec package. The haddock is available here: http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html.

Here is the satisfy function, with its full type signature.

satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char

This is one of the main primitive functions in Parsec. This looks at the next character from the current input, and if the function (Char → Bool) returns true for this character, it 'pops' it from the input and returns it. In this way, the current position in the input string is tracked behind the scenes.

In the simplified type wrappers, the satisfy function’s type is this:

satisfy :: (Char -> Bool) -> Parser Char

This makes it a bit clearer what it is doing. All the functions in Text.Parsec.Char are reproduced in the local Text.Parsec.String.Char module with simplified types (<https://github.com/JakeWheat/intro_to_parsing/blob/master/Text/Parsec/String/Char.hs>).

Here are some examples of satisfy in action.

*Main> parseWithEof (satisfy (=='a')) "a"
Right 'a'

*Main> parseWithEof (satisfy (=='b')) "a"
Left (line 1, column 1):
unexpected "a"

*Main> parseWithEof (satisfy (`elem` "abc")) "a"
Right 'a'

*Main> parseWithEof (satisfy (`elem` "abc")) "d"
Left (line 1, column 1):
unexpected "d"

*Main> parseWithEof (satisfy isDigit) "d"
Left (line 1, column 1):
unexpected "d"

*Main> parseWithEof (satisfy isDigit) "1"
Right '1'

You can see that it is easy to use ==, or elem or one of the functions from the Data.Char module.

If you look at the docs on hackage http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html, you can view the source. The implementations of most of the functions in Text.Parsec.Char are straightforward. I recommend you look at the source for all of these functions.

You can see in the source that the satisfy function is a little more primitive than the other functions.

Here is the parser we used above in the anyChar parser:

anyChar :: Parser Char

If you look at the source via the haddock link above, you can see it uses satisfy.

Here are some other simple wrappers of satisfy from Text.Parsec.Char which use different validation functions.

The char parser parses a specific character which you supply:

char :: Char -> Parser Char
*Main> regularParse (char 'a') "a"
Right 'a'

*Main> regularParse (char 'a') "b"
Left (line 1, column 1):
unexpected "b"
expecting "a"

These parsers all parse single hardcoded characters

space :: Parser Char
newline :: Parser Char
tab :: Parser Char

They all return a Char. You might be able to guess what Char each of them returns, you can double check your intuition using ghci.

These parser all parse one character from a hardcoded set of characters:

upper :: Parser Char
lower :: Parser Char
alphaNum :: Parser Char
letter :: Parser Char
digit :: Parser Char
hexDigit :: Parser Char
octDigit :: Parser Char

In these cases, the return value is less redundant.

oneOf and noneOf parse any of the characters in the given list

oneOf :: [Char] -> Parser Char
noneOf :: [Char] -> Parser Char

These are just simple wrappers of satisfy using elem.

You should try all these parsers out in ghci, e.g.:

regularParse space " "

regularParse upper "A"

regularParse (char 'b') "B"

regularParse (oneOf "abc") "c"

Here are the final functions in Text.Parsec.Char:

string matches a complete string, one character at a time. I think the implementation of this function is like it is for efficiency when parsing from, e.g., Data.Text.Text, instead of String, but I’m not sure. We will skip the detailed explanation of the implementation for now.

string :: String -> Parser String
*Main> regularParse (string "one") "one"
Right "one"

*Main> regularParse (string "one") "two"
Left (line 1, column 1):
unexpected "t"
expecting "one"

Here is the spaces parser, which, if you look at the source, you can see uses a combinator (skipMany). We will cover this combinator shortly.

spaces :: Parser ()
*Main> regularParse spaces ""
Right ()

*Main> regularParse spaces " "
Right ()

*Main> regularParse spaces "   "
Right ()

*Main> regularParse spaces " a  "
Right ()

*Main> regularParse spaces "a a  "
Right ()

It always succeeds.

2.4. A couple of helper executables

Here are two exes which you can use to parse either a string or a file to help you experiment. This will save you having to figure out how to write this boilerplate until later.

Now you can easily experiment using ghci, or with a string on the command line, or by putting the text to parse into a file and parsing that.

3. Very simple expression parsing

In this tutorial we will develop a parser for a very simple expression language, and start learning about the set of combinators which comes with Parsec.

import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char (oneOf, char, digit, satisfy)
import Text.Parsec.String.Combinator (many1, choice, chainl1)
import Control.Applicative ((<|>), many)
import Control.Monad (void)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing

3.1. num

The first element we will have in this expression language is positive integral numbers:

numberExamples :: [(String,Integer)]
numberExamples = [("1", 1)
                 ,("23", 23)]

TODO: make examples with parsing failures for all of the example scripts below?

To parse a number, we need to parse one or more digits, and then read the resulting string. We can use the combinator many1 to help with this. We will also use do notation.

num :: Parser Integer
num = do
    n <- many1 digit
    return (read n)

Let’s try it out.

*Main> regularParse num "1"
Right 1

*Main> regularParse num "123456"
Right 123456

*Main> regularParse num "aa"
Left (line 1, column 1):
unexpected "a"
expecting digit

How does it work? First, we parse one or more (many1) digits (digit), and give the result the name 'n'. Then we convert the string to an integer using read.

The many1 function’s type looks like this:

many1 :: Parser a -> Parser [a]

It applies the parser given one or more times, returning the result.

Let’s see what happens when we use the many combinator which parses zero or more items instead of one or more.

num1 :: Parser Integer
num1 = do
    n <- many digit
    return (read n)
*Main> regularParse num1 "1"
Right 1

*Main> regularParse num1 "123456"
Right 123456

*Main> regularParse num1 "aa"
Right *** Exception: Prelude.read: no parse

3.2. var

For var, we have to decide on a syntax for the identifiers. Let’s go for a common choice: identifiers must start with a letter or underscore, and then they can be followed by zero or more letters, underscores or digits in any combination.

varExamples :: [(String,String)]
varExamples = [("test", "test")
              ,("_stuff", "_stuff")
              ,("_1234", "_1234")]
var :: Parser String
var = do
    fc <- firstChar
    rest <- many nonFirstChar
    return (fc:rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')

This time, we create two helper parsers: firstChar, which parses a letter or underscore, and nonFirstChar which parses a digit, letter or underscore. This time, we use the many function instead of many1.

Try it out in ghci. I like to try things which you expect to work, and also to try things which you expect to not work and make sure you get an error.

3.3. parens

The parens parser will eventually parse any expression inside parentheses. First it will just parse integers inside parentheses.

data Parentheses = Parentheses Integer
                   deriving (Eq,Show)

parensExamples :: [(String, Parentheses)]
parensExamples = [("(1)", Parentheses 1)
                 ,("(17)", Parentheses 17)]
parens :: Parser Parentheses
parens = do
    void $ char '('
    e <- many1 digit
    void $ char ')'
    return (Parentheses (read e))

There is a new function: void. This might be familiar to you already. This is used to ignore the result of the char parser, since we are not interested in this value. You can also write the function without void, but ghc will give you a warning if you have warnings turned on.

One way of turning warnings on in ghci is to enter :set -Wall at the ghci prompt.

parens' :: Parser Parentheses
parens' = do
    char '('
    e <- many1 digit
    char ')'
    return (Parentheses (read e))
*Main> :set -Wall
*Main> :l "VerySimpleExpressions.lhs"

...

FirstRealParsing.lhs:140:7: Warning:
    A do-notation statement discarded a result of type Char.
    Suppress this warning by saying "_ <- char '('",
    or by using the flag -fno-warn-unused-do-bind

FirstRealParsing.lhs:142:7: Warning:
    A do-notation statement discarded a result of type Char.
    Suppress this warning by saying "_ <- char ')'",
    or by using the flag -fno-warn-unused-do-bind

...

As you can see, another way to suppress the warning is to use _ ← char '('.

One issue with this parser is that it doesn’t handle whitespace:

*Main> regularParse parens "(1)"
Right (Parentheses 1)

*Main> regularParse parens "( 1)"
Left (line 1, column 2):
unexpected " "
expecting digit

*Main> regularParse parens "(1 )"
Left (line 1, column 3):
unexpected " "
expecting digit or ")"

We will look at this issue below.

3.4. add

Now we will write a little parser to parse strings like 'a+b' where a and b are numbers.

data SingleAdd = SingleAdd Integer Integer
                 deriving (Eq,Show)

singleAddExamples :: [(String, SingleAdd)]
singleAddExamples = [("1+2", SingleAdd 1 2)
                    ,("101+202", SingleAdd 101 202)]
add :: Parser SingleAdd
add = do
    e0 <- many1 digit
    void $ char '+'
    e1 <- many1 digit
    return (SingleAdd (read e0) (read e1))

It has the same whitespace issues as the parens parser.

*Main> regularParse add "1+2"
Right (SingleAdd 1 2)

*Main> regularParse add "1 +2"
Left (line 1, column 2):
unexpected " "
expecting digit or "+"

3.5. whitespace

Here is a parser which will skip zero or more whitespace characters.

whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"

We can use this to make our parsers handle whitespace better.

*Main> regularParse whitespace " "
Right ()
*Main> regularParse whitespace "  "
Right ()
*Main> regularParse whitespace "\t"
Right ()
*Main> regularParse whitespace " \n "
Right ()
*Main> regularParse whitespace ""
Right ()

Notice that it always succeeds.

Here is the parens parser rewritten with a common approach to whitespace handling:

parensW :: Parser Parentheses
parensW = do
    whitespace
    void $ char '('
    whitespace
    e <- many1 digit
    whitespace
    void $ char ')'
    whitespace
    return (Parentheses (read e))
*Main> regularParse parensW "(1)"
Right (Parentheses 1)

*Main> regularParse parensW " (1)"
Right (Parentheses 1)

*Main> regularParse parensW " (1 )"
Right (Parentheses 1)

*Main> regularParse parensW " ( 1 ) "
Right (Parentheses 1)

Looks good.

In the original parsec documentation, one of the concepts mentioned is the idea of 'lexeme' parsing. This is a style in which every token parser should also consume and ignore any trailing whitespace.

This is a simple convention which with a bit of care allows skipping whitespace exactly once wherever it needs to be skipped. To complete the lexeme style, we should also always skip leading whitespace at the top level only. This feels more elegant than spamming all the parsing code with many calls to whitespace.

lexeme :: Parser a -> Parser a
lexeme p = do
           x <- p
           whitespace
           return x
parseWithWhitespace :: Parser a -> String -> Either ParseError a
parseWithWhitespace p = parseWithEof wrapper
  where
    wrapper = do
        whitespace
        p

Here is the parens parser rewritten to use lexeme:

parensL :: Parser Parentheses
parensL = do
    void $ lexeme $ char '('
    e <- lexeme $ many1 digit
    void $ lexeme $ char ')'
    return (Parentheses (read e))
*Main> parseWithWhitespace parensL "(1)"
Right (Parentheses 1)

*Main> parseWithWhitespace parensL " (1)"
Right (Parentheses 1)

*Main> parseWithWhitespace parensL " ( 1)"
Right (Parentheses 1)

*Main> parseWithWhitespace parensL " ( 1 ) "
Right (Parentheses 1)

The parseWithWhitespace function can also use (>>) to make it a bit shorter, wrapper = whiteSpace >> p.

Here is the shorter version of this function using (>>):

parseWithWhitespace' :: Parser a -> String -> Either ParseError a
parseWithWhitespace' p = parseWithEof (whitespace >> p)

Try rewriting the SingleAdd parser to use lexeme, and test it out to convince yourself that it skips whitespace correctly.

3.6. simple expr

Now we are ready to write a parser which parses simple expressions made from these components. Here is the data type for these expressions.

data SimpleExpr = Num Integer
                | Var String
                | Add SimpleExpr SimpleExpr
                | Parens SimpleExpr
                  deriving (Eq,Show)

It’s so simple that it is almost useless at the moment.

simpleExprExamples :: [(String,SimpleExpr)]
simpleExprExamples =
    [("a", Var "a")
    ,("1", Num 1)
    ,("2 + 3", Add (Num 2) (Num 3))
    ,("(42)", Parens (Num 42))]

TODO: some more complex examples

Here are all our component parsers with lexeme, and with the SimpleExpr constructors:

numE :: Parser SimpleExpr
numE = do
    n <- lexeme $ many1 digit
    return $ Num $ read n

There doesn’t seem to be a unique obviously correct place to put the lexeme call in the var parser:

varE :: Parser SimpleExpr
varE = lexeme $ do
    fc <- firstChar
    rest <- many nonFirstChar
    return $ Var (fc:rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')

Here is an alternative, with the call to lexeme in a different place, but gives effectively the same function.

varE' :: Parser SimpleExpr
varE' = do
    fc <- firstChar
    rest <- lexeme $ many nonFirstChar
    return $ Var (fc:rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
parensE :: Parser SimpleExpr
parensE = do
    void $ lexeme $ char '('
    e <- lexeme $ many1 digit
    void $ lexeme $ char ')'
    return $ Parens $ Num $ read e

In the parens parser, we can reuse the numE parser like this:

parensE' :: Parser SimpleExpr
parensE' = do
    void $ lexeme $ char '('
    e <- numE
    void $ lexeme $ char ')'
    return $ Parens e

Here is the add parser using numE also.

addE :: Parser SimpleExpr
addE = do
    e0 <- numE
    void $ lexeme $ char '+'
    e1 <- numE
    return $ Add e0 e1

3.6.1. choice

To combine these, we can use an operator called (<|>):

numOrVar :: Parser SimpleExpr
numOrVar = numE <|> varE

It tries the first parser, and it if fails (without consuming any input), it tries the second parser. More about the 'consuming input' concept later.

Here is another way to write the numOrVar parser:

numOrVar' :: Parser SimpleExpr
numOrVar' = choice [numE,varE]

choice is just wrapper around (<|>). You can choose which one to use based on which is more readable in each particular case.

*Main> parseWithWhitespace numOrVar "a"
Right (Var "a")

*Main> parseWithWhitespace numOrVar "1"
Right (Num 1)

*Main> parseWithWhitespace numOrVar "!"
Left (line 1, column 1):
unexpected "!"
expecting digit

Here is the first version of the simpleExpr parser:

simpleExpr :: Parser SimpleExpr
simpleExpr = numE <|> varE <|> addE <|> parensE
*Main> parseWithWhitespace simpleExpr "12"
Right (Num 12)

*Main> parseWithWhitespace simpleExpr "aa"
Right (Var "aa")

*Main> parseWithWhitespace simpleExpr "1+2"
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input

*Main> parseWithWhitespace simpleExpr "(1)"
Right (Parens (Num 1))

*Main> parseWithWhitespace simpleExpr "(aa)"
Left (line 1, column 2):
unexpected "a"
expecting digit

It works well for some of the parsers. One problem is that the addE and parensE parsers don’t parse general expressions as the components, but just numE. Another problem is that the addE doesn’t work at all: the numE parser parses the first number, and the addE parser is never tried. This is an example of (<|>) not trying the second parser if the first parser succeeds, even if a later alternative would consume more input or successfully parse the whole input.

Let’s try and rearrange the order:

simpleExpr1 :: Parser SimpleExpr
simpleExpr1 = addE <|> numE <|> varE <|> parensE
*Main> parseWithWhitespace simpleExpr1 "12"
Left (line 1, column 3):
unexpected end of input
expecting digit or "+"

*Main> parseWithWhitespace simpleExpr1 "aa"
Right (Var "aa")

*Main> parseWithWhitespace simpleExpr1 "1+2"
Right (Add (Num 1) (Num 2))

*Main> parseWithWhitespace simpleExpr1 "(1)"
Right (Parens (Num 1))

We swapped one problem for another. Let’s fix this using the try function.

simpleExpr2 :: Parser SimpleExpr
simpleExpr2 = try addE <|> numE <|> varE <|> parensE
*Main> parseWithWhitespace simpleExpr2 "12"
Right (Num 12)

*Main> parseWithWhitespace simpleExpr2 "aa"
Right (Var "aa")

*Main> parseWithWhitespace simpleExpr2 "1+2"
Right (Add (Num 1) (Num 2))

*Main> parseWithWhitespace simpleExpr2 "(1)"
Right (Parens (Num 1))

Now everything seems to work fine. The try function is very powerful and easy to use, and can be used where in a more traditional parsing approach you would have to use left factoring or something else.

The try function implements backtracking. When this is used with (<|>), it means that if the first parser fails, it will undo the consumed input and carry on with the next option, instead of failing completely. This works even if the try is nested deeply within the first parser given to (<|>).

try has its downsides (some of which we will see later), and I usually try to minimise its use or eliminate it completely. I found I often got into a complete mess when I used try too much when writing parsers for something a little tricky like SQL, and that although doing some left-factoring appeared at first to be tedious and appeared to make the code less readable, I eventually decided that for me it made the code more readable since what was happening was more transparent.

Now we are going to fix this parser to parse arbitrarily nested expressions. In a way, the method used will roughly mean we are left factoring the numE and addE common prefix.

Here is the naive implementation:

parensE3 :: Parser SimpleExpr
parensE3 = do
    void $ lexeme $ char '('
    e <- simpleExpr3
    void $ lexeme $ char ')'
    return $ Parens e
addE3 :: Parser SimpleExpr
addE3 = do
    e0 <- simpleExpr3
    void $ lexeme $ char '+'
    e1 <- simpleExpr3
    return $ Add e0 e1
simpleExpr3 :: Parser SimpleExpr
simpleExpr3 = try addE3 <|> numE <|> varE <|> parensE3

If you run this parser, it will enter an infinite loop, since simpleExpr3 and addE3 will keep calling each other recursively without making any progress.

*Main> parseWithWhitespace simpleExpr3 "a+b"
  C-c Interrupted.

Let’s try without add.

parensE4 :: Parser SimpleExpr
parensE4 = do
    void $ lexeme $ char '('
    e <- simpleExpr4
    void $ lexeme $ char ')'
    return $ Parens e
simpleExpr4 :: Parser SimpleExpr
simpleExpr4 = numE <|> varE <|> parensE4
*Main> parseWithWhitespace simpleExpr4 "a"
Right (Var "a")

*Main> parseWithWhitespace simpleExpr4 "1"
Right (Num 1)

*Main> parseWithWhitespace simpleExpr4 "(1)"
Right (Parens (Num 1))

*Main> parseWithWhitespace simpleExpr4 "((a))"
Right (Parens (Parens (Var "a")))

At least this part seems to work OK.

Let’s try to stop the add parser from calling itself indirectly:

Here is a parameterized parens parser where we supply the nested expression parser as an argument. This is used here to try to make the code easier to follow and avoid rewriting this parser out again and again in full.

parensEN :: Parser SimpleExpr -> Parser SimpleExpr
parensEN simpleExprImpl = do
    void $ lexeme $ char '('
    e <- simpleExprImpl
    void $ lexeme $ char ')'
    return $ Parens e

Here is a new parser, which parses expressions except add.

term :: Parser SimpleExpr -> Parser SimpleExpr
term simpleExprImpl = numE <|> varE <|> parensEN simpleExprImpl
term5 :: Parser SimpleExpr
term5 = term term5
addE5 :: Parser SimpleExpr
addE5 = do
    e0 <- term5
    void $ lexeme $ char '+'
    e1 <- term5
    return $ Add e0 e1
simpleExpr5 :: Parser SimpleExpr
simpleExpr5 = try addE5 <|> term5
*Main> parseWithWhitespace simpleExpr5 "1"
Right (Num 1)

*Main> parseWithWhitespace simpleExpr5 "a"
Right (Var "a")

*Main> parseWithWhitespace simpleExpr5 "(a)"
Right (Parens (Var "a"))

*Main> parseWithWhitespace simpleExpr5 "1+2"
Right (Add (Num 1) (Num 2))

*Main> parseWithWhitespace simpleExpr5 "1+a"
Right (Add (Num 1) (Var "a"))

*Main> parseWithWhitespace simpleExpr5 "(1+a)"
Right (Parens (Add (Num 1) (Var "a")))

*Main> parseWithWhitespace simpleExpr5 "1+a+b"
Left (line 1, column 4):
unexpected '+'
expecting end of input

Almost. Let’s see what happens when the second term in add is changed to the general expression parser.

term6 :: Parser SimpleExpr
term6 = term simpleExpr6
addE6 :: Parser SimpleExpr
addE6 = do
    e0 <- term6
    void $ lexeme $ char '+'
    e1 <- simpleExpr6
    return $ Add e0 e1
simpleExpr6 :: Parser SimpleExpr
simpleExpr6 = try addE6 <|> term6
*Main> parseWithWhitespace simpleExpr6 "a + b + c"
Right (Add (Var "a") (Add (Var "b") (Var "c")))

Maybe it looks like we’ve made it. But there is a problem. We’ve parsed the + operator as if it has right associativity:

a + b + c -> a + (b + c)

But it should be left associative:

a + b + c -> (a + b) + c

Let’s left factor the parsing and fix this:

term7 :: Parser SimpleExpr
term7 = term simpleExpr7
simpleExpr7 :: Parser SimpleExpr
simpleExpr7 = do
    -- first parse a term
    e <- term7
    -- then see if it is followed by an '+ expr' suffix
    maybeAddSuffix e
  where
    -- this function takes an expression, and parses a
    -- '+ expr' suffix, returning an Add expression
    -- it recursively calls itself via the maybeAddSuffix function
    addSuffix e0 = do
        void $ lexeme $ char '+'
        e1 <- term7
        maybeAddSuffix (Add e0 e1)
    -- this is the wrapper for addSuffix, which adapts it so that if
    -- addSuffix fails, it returns just the original expression
    maybeAddSuffix e = addSuffix e <|> return e
*Main> parseWithWhitespace simpleExpr7 "a + b + c"
Right (Add (Add (Var "a") (Var "b")) (Var "c"))

Now the parser seems to work for everything it should.

There is a combinator function in Parsec we can use which abstracts this sort of pattern, chainl1.

simpleExpr8 :: Parser SimpleExpr
simpleExpr8 = chainl1 term8 op
  where
    op = do
        void $ lexeme $ char '+'
        return Add
    term8 = term simpleExpr8

How does this work? Here is the type of chainl1:

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 term op = ...

The type of the Add constructor in pseudocode is:

Add :: SimpleExpr -> SimpleExpr -> SimpleExpr

The op parser here now just parses the operator itself, i.e. '+' (and not the second expression like our simpleExpr7 parser). The return from the op function is a function which accepts two elements and combines them using the appropriate operator representation. In this case, the represenation is the Add constructor.

You can look at the source http://hackage.haskell.org/package/parsec-3.1.3/docs/src/Text-Parsec-Combinator.html#chainl1 and see if you can understand how it works. If you can’t work it out, you could come back to it later when you have more experience writing parsing code.

3.7. Testing with the examples

TODO: write a little manual tester that accepts a parser and a list of examples, and checks they all parse correctly.

3.8. Testing with quickcheck

Let’s see if we can check with quickcheck. It’s a bit tricky testing parsers in this way, but one way to do something useful is to generate random asts, convert them to concrete syntax, parse them, and check the result. We can write a simple 'pretty printer' to convert an ast to concrete syntax.

3.8.1. a pretty printer

TODO: a really simple pretty printer just pasting strings together, no layout.

3.8.2. the quick check code

TODO: write a quickcheck property and arbitary instance and show running it at the ghci prompt

4. Applicative style parsing code

Now we can go back over the expression parsing code written in the last tutorial, and make it much more concise, and also make it more readable. We are going to do this mainly by using functions from the typeclass Applicative.

Remember you can (and should) use the functions regularParse and its variations (TODO list them here) to try out the all these parsers in ghci, and you can write your own variations to experiment with if you are unsure about anything.

import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (oneOf, char, digit, letter, satisfy)
import Text.Parsec.String.Combinator (many1, chainl1)
import Control.Applicative ((<$>), (<*>), (<*), (*>), (<|>), many, (<$))
import Control.Monad (void, ap)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing

Here is the SimpleExpr type again:

data SimpleExpr = Num Integer
                | Var String
                | Add SimpleExpr SimpleExpr
                | Parens SimpleExpr
                  deriving (Eq,Show)

Here is the basic pattern behind most of the rewrites we are going to cover. Here is a function which takes a constructor and two parsers for the two arguments for the constructor. It parses the two arguments, then applies the constructor to the results:

myParser1 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser1 ctor pa pb = do
    a <- pa
    b <- pb
    return $ ctor a b

TODO: concrete example, plus examples at the bottom of this section (for ctor <$> a, ctor <$> a <*> b, ctor <$> a <*> b <*> c).

This can be rewritten without the do syntactic sugar like this:

myParser2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser2 ctor pa pb =
    pa >>= \a -> pb >>= \b -> return $ ctor a b

And can also be rewritten like this:

myParser3 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser3 ctor pa pb = ctor `fmap` pa `ap` pb

(This uses functions from Applicative instead of Monad.) We replace the use of >>= with fmap and ap. This isn’t always possible, but it often is.

Here is the version using the operators for fmap and ap (fmap changed to <$>, and ap changed to <*>).

myParser4 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser4 ctor pa pb = ctor <$> pa <*> pb

This style takes less typing, and is often much simpler to write and read.

This pattern 'scales', you can use:

Ctor <$> pa

for a single argument constructor. This might also be familiar to you as

fmap Ctor pa

or

Ctor `fmap` pa

All of which mean the same thing, just slightly different spellings.

This can also be written using Monad operators:

pa >>= liftM ctor

or

liftM ctor =<< pa

(liftM is in the module Control.Monad)

These liftM versions effectively mean the same thing as the previous versions with fmap and <$>.

You can use

Ctor <$> pa <*> pb <*> pc

for three args, and so on. So you use <$> between the pure constructor and the first arg, then <*> between each subsequent arg.

Let’s go over the simple expression parsers and try to rewrite them using this style. We will see a few other new functions. I will break things down into a lot of steps.

4.1. lexeme

Here is the old lexeme parser, 'D' suffix for 'do notation'.

lexemeD :: Parser a -> Parser a
lexemeD p = do
           x <- p
           whitespace
           return x
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"

First, we can move the whitespace from its own separate line.

lexemeA0 :: Parser a -> Parser a
lexemeA0 p = do
           x <- p <* whitespace
           return x

The expression pa <* pb means run pa, then run pb, and return the result of pa. It is sort of equivalent to this code:

(<*) :: Parser a -> Parser b -> Parser a
(<*) pa pb = do
    a <- pa
    void pb
    return a

(It isn’t implemented this way, since (<*) only needs Applicative and not Monad.)

Now we can use the usual monad syntax rewrites, first eliminate the name x.

lexemeA1 :: Parser a -> Parser a
lexemeA1 p = do
             p <* whitespace

Now remove the redundant do:

lexemeA :: Parser a -> Parser a
lexemeA p = p <* whitespace

4.2. num

Now let’s tackle the num parser.

numD :: Parser SimpleExpr
numD = do
    n <- lexemeD $ many1 digit
    return $ Num $ read n

Let’s move 'read' to the first line.

numA0 :: Parser SimpleExpr
numA0 = do
    n <- read <$> lexemeA (many1 digit)
    return $ Num n

This uses (<$>) which we saw above. You may have done code rewrites like this using fmap with IO in other Haskell code.

Now let’s move the Num ctor as well:

numA1 :: Parser SimpleExpr
numA1 = do
    n <- (Num . read) <$> lexemeA (many1 digit)
    return n

You can also write it in this way:

numA2 :: Parser SimpleExpr
numA2 = do
    n <- Num <$> read <$> lexemeA (many1 digit)
    return n

Why does this work? It it equivalent to the previous version partly because of the applicative laws.

Let’s break it down:

numA2'' :: Parser SimpleExpr
numA2'' = do
    n <- numb
    return n
  where
    numb :: Parser SimpleExpr
    numb = Num <$> int
    int :: Parser Integer
    int = read <$> lexemeA (many1 digit)

In terms of style, which do you think looks better: (a . b) <$> p or a <$> b <$> p.

The next step for num, we can eliminate the temporary name n and the do:

numA3 :: Parser SimpleExpr
numA3 = (Num . read) <$> lexemeA (many1 digit)

In more 'industrial' parser code, I would usually write some tokenization parsers separately like this:

integerA4 :: Parser Integer
integerA4 = read <$> lexemeA (many1 digit)

Then the num expression parser looks like this:

numA4 :: Parser SimpleExpr
numA4 = Num <$> integerA4

and we also get a integer parser which we can reuse if we need to parse an integer in another context.

4.3. var

Here is the previous var parser:

varD :: Parser SimpleExpr
varD = lexemeA $ do
    fc <- firstChar
    rest <- many nonFirstChar
    return $ Var (fc:rest)
  where
    firstChar = satisfy (\a -> isLetter a || a == '_')
    nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')

The first thing we can do is to make the firstChar and nonFirstChar a little easier to read, using (<|>), char, letter and digit:

varA0 :: Parser SimpleExpr
varA0 = lexemeA $ do
    fl <- firstChar
    rest <- many nonFirstChar
    return $ Var (fl:rest)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar

Here is another way of making the function a little better:

varA0' :: Parser SimpleExpr
varA0' = lexemeA $ do
    fl <- satisfy validFirstChar
    rest <- many (satisfy validNonFirstChar)
    return $ Var (fl:rest)
  where
    validFirstChar a = isLetter a || a == '_'
    validNonFirstChar a = validFirstChar a || isDigit a

We can lift the (:) using the Applicative operators.

varA1 :: Parser SimpleExpr
varA1 = do
    i <- iden
    return $ Var i
  where
    iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar

We used the prefix version of (:) to use it with (<$>) and (<*>). The lexemeA call was moved to the iden helper function.

Now tidy it up using (<$>) with the Var constructor:

varA2 :: Parser SimpleExpr
varA2 = Var <$> iden
  where
    iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar

We could also split the iden into a separate top level function, with the same idea as with splitting the integer parser.

4.4. parens

Here is the starting point:

parensD :: Parser SimpleExpr
parensD = do
    void $ lexemeA $ char '('
    e <- simpleExprD
    void $ lexemeA $ char ')'
    return $ Parens e

Here is the rewrite in one step:

parensA0 :: Parser SimpleExpr
parensA0 =
    Parens <$> (lexemeA (char '(')
                *> simpleExprD
                <* lexemeA (char ')'))

Here you can see that there is a (*>) which works in the opposite direction to (<*). The precendence of these operators means that we have to use some extra parentheses (!) here.

TODO: lost the chained <*. Put something below about this so there is a concrete example.

4.5. simple expr

Here is the old version:

termD :: Parser SimpleExpr
termD = numD <|> varD <|> parensD
simpleExprD :: Parser SimpleExpr
simpleExprD = chainl1 termD op
  where
    op = do
         void $ lexemeA $ char '+'
         return Add

We can simplify the op function using the techniques we’ve already seen:

simpleExprA0 :: Parser SimpleExpr
simpleExprA0 = chainl1 termD op
  where op = lexemeA (char '+') *> return Add

The pattern p *> return f can use a different operator like this: f <$ p. Here it is in the expression parser:

simpleExprA1 :: Parser SimpleExpr
simpleExprA1 = chainl1 termD op
  where op = Add <$ lexemeA (char '+')

You could also write the op parser inline:

simpleExprA2 :: Parser SimpleExpr
simpleExprA2 = chainl1 termD (Add <$ lexemeA (char '+'))

Maybe this last step makes it less readable?

4.6. summary

Here is the finished job for all the simple expression code without separate token parsers:

num' :: Parser SimpleExpr
num' = (Num . read) <$> lexemeA (many1 digit)
var' :: Parser SimpleExpr
var' = Var <$> iden
  where
    iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
parens' :: Parser SimpleExpr
parens' =
    Parens <$> (lexemeA (char '(')
                *> simpleExpr'
                <* lexemeA (char ')'))
term' :: Parser SimpleExpr
term' = num' <|> var' <|> parens'
simpleExpr' :: Parser SimpleExpr
simpleExpr' = chainl1 term' op
  where op = Add <$ lexemeA (char '+')

Here they are with separate token parsers and a helper function:

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)

Here is a lexeme wrapper for parsing single character symbols.

symbol :: Char -> Parser ()
symbol c = void $ lexeme $ char c

Here is another little helper function. It barely pays its way in this short example, but even though it is only used once, I think it is worth it to make the code clearer.

betweenParens :: Parser a -> Parser a
betweenParens p = symbol '(' *> p <* symbol ')'

Now the expression parsers:

num :: Parser SimpleExpr
num = Num <$> integer
var :: Parser SimpleExpr
var = Var <$> identifier
parens :: Parser SimpleExpr
parens = Parens <$> betweenParens simpleExpr
term :: Parser SimpleExpr
term = num <|> var <|> parens
simpleExpr :: Parser SimpleExpr
simpleExpr = chainl1 term' op
  where op = Add <$ lexemeA (char '+')

Splitting the lexer parser layer out means that we have one place where we have to remember to add lexeme wrappers, and also I think makes the code easier to follow.

5. Combinator review

In this tutorial we will go through all the functions in Text.Parsec.Combinator, and some useful ones in Control.Applicative and Control.Monad as well.

import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char (oneOf, char, digit
                               ,string, letter, satisfy)
import Text.Parsec.String.Combinator (many1, choice, chainl1, between
                                     ,count, option, optionMaybe, optional)
import Control.Applicative ((<$>), (<*>), (<$), (<*), (*>), (<|>), many)
import Control.Monad (void, ap, mzero)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing

5.1. Text.Parsec.Combinator

You should look at the source for these functions and try to understand how they are implemented.

The style of the source code in the Parsec library sources is a little different to what we used at the end of the last tutorial. You can try reimplementing each of the Text.Parsec.Combinator module functions using the Applicative style. See if you can find a way to reassure yourself that the rewritten versions you make are correct, perhaps via writing automated tests, or perhaps some other method.

You should be able to easily understand the implementation of all the functions in Text.Parsec.Combinator except possibly anyToken and notFollowedBy.

5.1.1. choice

choice :: [Parser a] -> Parser a

choice ps tries to apply the parsers in the list ps in order, until one of them succeeds. It returns the value of the succeeding parser.

a :: Parser Char
a = char 'a'

b :: Parser Char
b = char 'b'

aOrB :: Parser Char
aOrB = choice [a,b]
*Main> regularParse aOrB "a"
Right 'a'

*Main> regularParse aOrB "b"
Right 'b'

*Main> regularParse aOrB "c"
Left (line 1, column 1):
unexpected "c"
expecting "a" or "b"
using with try

If a parser fails with (<|>) or choice, then it will only try the next parser if the last parser consumed no input.

TODO: make the parsers return the keyword and update the examples

byKeyword :: Parser ()
byKeyword = void $ string "by"

betweenKeyword :: Parser ()
betweenKeyword = void $ string "between"

Since both of these have the same prefix - b - if we combine them using choice then it doesn’t work correctly:

*Main> regularParse byKeyword "by"
Right ()

*Main> regularParse byKeyword "between"
Left (line 1, column 1):
unexpected "e"
expecting "by"

*Main> regularParse betweenKeyword "between"
Right ()

*Main> regularParse betweenKeyword "by"
Left (line 1, column 1):
unexpected "y"
expecting "between"

*Main> regularParse (choice [betweenKeyword,byKeyword]) "between"
Right ()

*Main> regularParse (choice [betweenKeyword,byKeyword]) "by"
Left (line 1, column 1):
unexpected "y"
expecting "between"

*Main> regularParse (choice [byKeyword,betweenKeyword]) "between"
Left (line 1, column 1):
unexpected "e"
expecting "by"

*Main> regularParse (choice [byKeyword,betweenKeyword]) "by"
Right ()

If we use try on the first option, then it all works fine.

*Main> regularParse (choice [try byKeyword,betweenKeyword]) "by"
Right ()

*Main> regularParse (choice [try byKeyword,betweenKeyword]) "between"
Right ()

5.1.2. count

count :: Int -> Parser a -> Parser [a]

count n p parses n occurrences of p. If n is smaller or equal to zero, the parser is equivalent to return []. It returns a list of the n values returned by p.

*Main> regularParse (count 5 a) "aaaaa"
Right "aaaaa"

*Main> regularParse (count 5 a) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a"

*Main> regularParse (count 5 a) "aaaab"
Left (line 1, column 5):
unexpected "b"
expecting "a"

*Main> regularParse (count 5 aOrB) "aabaa"
Right "aabaa"

5.1.3. between

between :: Parser open -> Parser close -> Parser a -> Parser a

between open close p parses open, followed by p and close. It returns the value returned by p.

We can replace the betweenParens from the previous tutorial using this:

betweenParens :: Parser a -> Parser a
betweenParens p = between (symbol '(') (symbol ')') p

It hardly seems worth it to make this change, but it might be slightly quicker to read and understand if you aren’t already familiar with some code or haven’t viewed it for a while. This is good for 'code maintenance', where we need to fix bugs or add new features quickly to code we haven’t looked at for two years or something.

Here are the support functions for this parser.

symbol :: Char -> Parser Char
symbol c = lexeme $ char c
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
whitespace :: Parser ()
whitespace = void $ oneOf " \n\t"

5.1.4. option

option :: a -> Parser a -> Parser a

option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.

*Main> regularParse (option "" (count 5 aOrB)) "aaaaa"
Right "aaaaa"

*Main> regularParse (option "" (count 5 aOrB)) "caaaa"
Right ""

*Main> regularParse (option "" (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"

*Main> regularParse (option "" (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"

*Main> regularParse (option "" (try (count 5 aOrB))) "aaaa"
Right ""

5.1.5. optionMaybe

optionMaybe :: Parser a -> Parser (Maybe a)

optionMaybe p tries to apply parser p. If p fails without consuming input, it returns Nothing, otherwise it returns Just the value returned by p.

*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaaa"
Right (Just "aaaaa")

*Main> regularParse (optionMaybe (count 5 aOrB)) "caaaa"
Right Nothing

*Main> regularParse (optionMaybe (count 5 aOrB)) "caaa"
Right Nothing

*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"

*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"

*Main> regularParse (optionMaybe (try $ count 5 aOrB)) "aaaac"
Right Nothing

5.1.6. optional

optional :: Parser a -> Parser ()

optional p tries to apply parser p. It will parse p or nothing. It only fails if p fails after consuming input. It discards the result of p.

*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaaa"
Right ((),"")

*Main> parseWithLeftOver (optional (count 5 aOrB)) "caaaa"
Right ((),"caaaa")

*Main> parseWithLeftOver (optional (count 5 aOrB)) "caaa"
Right ((),"caaa")

*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"

*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"

*Main> parseWithLeftOver (optional (try $ count 5 aOrB)) "aaaac"
Right ((),"aaaac")

5.1.7. skipMany1

skipMany1 :: Parser a -> Parser ()

skipMany1 p applies the parser p one or more times, skipping its result.

5.1.8. many1

many1 :: Parser a -> Parser [a]

many1 p applies the parser p one or more times. Returns a list of the returned values of p.

  word  = many1 letter

5.1.9. sepBy

sepBy :: Parser a -> Parser sep -> Parser [a]

sepBy p sep parses zero or more occurrences of p, separated by sep. Returns a list of values returned by p.

  commaSep p  = p `sepBy` (symbol ",")

5.1.10. sepBy1

sepBy1 :: Parser a -> Parser sep -> Parser [a]

sepBy1 p sep parses one or more occurrences of p, separated by sep. Returns a list of values returned by p.

5.1.11. endBy

endBy :: Parser a -> Parser sep -> Parser [a]

endBy p sep parses zero or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p.

   cStatements  = cStatement `endBy` semi

5.1.12. endBy1

endBy1 :: Parser a -> Parser sep -> Parser [a]

endBy1 p sep parses one or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p.

5.1.13. sepEndBy

sepEndBy :: Parser a -> Parser sep -> Parser [a]

sepEndBy p sep parses zero or more occurrences of p, separated and optionally ended by sep, ie. haskell style statements. Returns a list of values returned by p.

  haskellStatements  = haskellStatement `sepEndBy` semi

5.1.14. sepEndBy1

sepEndBy1 :: Parser a -> Parser sep -> Parser [a]

sepEndBy1 p sep parses one or more occurrences of p, separated and optionally ended by sep. Returns a list of values returned by p.

5.1.15. chainl

chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a

chainl p op x parser zero or more occurrences of p, separated by op. Returns a value obtained by a left associative application of all functions returned by op to the values returned by p. If there are zero occurrences of p, the value x is returned.

5.1.16. chainl1

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a

chainl1 p op x parser one or more occurrences of p, separated by op Returns a value obtained by a left associative application of all functions returned by op to the values returned by p. . This parser can for example be used to eliminate left recursion which typically occurs in expression grammars.

5.1.17. chainr

chainr :: Parser a -> Parser (a -> a -> a) -> a -> Parser a

chainr p op x parser zero or more occurrences of p, separated by op Returns a value obtained by a right associative application of all functions returned by op to the values returned by p. If there are no occurrences of p, the value x is returned.

5.1.18. chainr1

chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a

chainr1 p op x parser one or more occurrences of p, separated by op Returns a value obtained by a right associative application of all functions returned by op to the values returned by p.

5.1.19. eof

eof :: Parser ()

This parser only succeeds at the end of the input. This is not a primitive parser but it is defined using notFollowedBy.

  eof  = notFollowedBy anyToken <?> "end of input"

The (<?>) operator is used for error messages. We will come back to error messages after writing the basic SQL parser.

5.1.20. notFollowedBy

notFollowedBy :: Show a => Parser a -> Parser ()

notFollowedBy p only succeeds when parser p fails. This parser does not consume any input. This parser can be used to implement the 'longest match' rule. For example, when recognizing keywords (for example let), we want to make sure that a keyword is not followed by a legal identifier character, in which case the keyword is actually an identifier (for example lets). We can program this behaviour as follows:

  keywordLet  = try (do{ string "let"
                       ; notFollowedBy alphaNum
                       })

5.1.21. manyTill

manyTill :: Parser a -> Parser end -> Parser [a]

manyTill p end applies parser p zero or more times until parser end succeeds. Returns the list of values returned by p. This parser can be used to scan comments:

  simpleComment   = do{ string "<!--"
                      ; manyTill anyChar (try (string "-->"))
                      }

Note the overlapping parsers anyChar and string "-->", and therefore the use of the try combinator.

5.1.22. lookAhead

lookAhead :: Parser a -> Parser a

lookAhead p parses p without consuming any input.

If p fails and consumes some input, so does lookAhead. Combine with try if this is undesirable.

5.1.23. anyToken

anyToken :: Parser Char

The parser anyToken accepts any kind of token. It is for example used to implement eof. Returns the accepted token.

5.2. Control.Applicative

Here are the functions from Applicative that are used:

(<$>), (<*>), (<$), (<*), (*>), (<|>), many

TODO: examples for all of these

We’ve already seen all of these, except (<$). This is often used to parse a keyword and return a no argument constructor:

data Something = Type1 | Type2 | Type3
something :: Parser Something
something = choice [Type1 <$ string "type1"
                   ,Type2 <$ string "type2"
                   ,Type3 <$ string "type3"]

There is also (<**>) which is (<*>) with the arguments flipped.

TODO: double check using these from Parsec instead of Control.Applicative: possible performance implictions?

5.3. Control.Monad

5.3.1. return

One use of return is to always succeed, and return a value:

alwaysX :: Parser Char
alwaysX = return 'x'
*Main> parseWithLeftOver (a <|> alwaysX) "a"
Right ('a',"")

*Main> parseWithLeftOver (a <|> alwaysX) "b"
Right ('x',"b")

5.3.2. mzero

This function is used in the implementation of choice:

choice' :: [Parser a] -> Parser a
choice' ps = foldr (<|>) mzero ps

TODO: go through a bunch of functions + do notation examples

 >>=
 =<<
 >>
 void
 mapM, mapM_
 sequence,sequence_
 guard
 return
mzero
mplus
when, unless
liftMN
ap
quick note about fail, will return to this in the error messages stage

TODO: using trace

6. Functions and types for parsing

In this file is the source and explanation for the parsing functions which we’ve been using, and some limited notes about the wrappers and full types in Parsec.

module FunctionsAndTypesForParsing where

import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (parse)
import Text.Parsec.String.Char (oneOf)
import Text.Parsec.String.Combinator (eof,manyTill,anyToken)
import Control.Applicative ((<$>), (<*>), (<*), (*>), many)
import Control.Monad (void)

6.1. Functions for parsing

Here are the testing functions which were used earlier:

The basic parse function: this is a pretty simple wrapper. The parse function from parsec just adds a filename to use in parse errors, which is set as the empty string here.

regularParse :: Parser a -> String -> Either ParseError a
regularParse p = parse p ""

'parse' is a basic function in the family of functions for running parsers in Parsec. You can compose the parser functions in the Parser monad, then run the top level function using 'parse' and get back an 'Either ParserError a' as the result. There are a few alternatives to 'parse' in Parsec, mostly when you are using a more general parser type instead of 'Parser a' (which is an alias for 'ParsecT String () Identity a'). Have a look in the Text.Parsec.Prim module for these http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Prim.html.

This function will run the parser, but additionally fail if it doesn’t consume all the input.

parseWithEof :: Parser a -> String -> Either ParseError a
parseWithEof p = parse (p <* eof) ""

This function will apply the parser, then also return any left over input which wasn’t parsed.

parseWithLeftOver :: Parser a -> String -> Either ParseError (a,String)
parseWithLeftOver p = parse ((,) <$> p <*> leftOver) ""
  where leftOver = manyTill anyToken eof

TODO: what happens when you use 'many anyToken <* eof' variations instead? Maybe should talk about greediness? Or talk about it in a better place in the tutorial.

parseWithWSEof :: Parser a -> String -> Either ParseError a
parseWithWSEof p = parseWithEof (whiteSpace *> p)
  where whiteSpace = void $ many $ oneOf " \n\t"

You should have a look at the two helper executables, and see if you can understand the code now. You can see them online here:

6.2. type signatures revisited

todo: update this to refer to real parsec instead of the string wrappers here.

I think you should always use type signatures with Parsec. Because the Parsec code is really generalized, without the type GHC will refuse to compile this code. Try commenting out the type signature above and loading into ghci to see the error message.

There is an alternative: you can get this code to compile without a type signature by using the NoMonomorphismRestriction language pragma. You can also see the type signature that GHC will choose for this function by commenting the type signature and using -Wall and -XNoMonomorphismRestriction together. Using NoMonomorphismRestriction is a popular solution to these sorts of problems in haskell.

It’s up to you whether you prefer to always write type signatures when you are developing parsing code, or use the NoMonomorphismRestriction pragma. Even if you can use NoMonomorphismRestriction, when using explicit type signatures you usually get much simpler compiler error messages.

6.3. Parser

The definition of Parser and a partial explanation of the full type signature.

type Parser = Parsec String ()

This means that a function returning Parser a parses from a String with () as the initial state.

The Parsec type is defined like this:

type Parsec s u = ParsecT s u Identity

ParsecT is a monad transformer, I think it is the primitive one in the Parsec library, and the 'Parsec' type is a type alias which sets the base monad to be Identity.

Here is the haddock for the ParsecT type:

ParsecT s u m a is a parser with stream type s, user state type u, underlying monad m and return type a.

The full types that you see like this:

satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char

refer to the same things (stream type s, user state type u, underlying monad m).

We are using String as the stream type (i.e. the input type), () as the user state type (this effectively means no user state, since () only has one value), and the underlying monad is Identity: we are using no other underlying monad, so Parser a expands to ParsecT String () Identity a.

I.e. the source is String, the user state is (), and the underlying monad is Identity.

6.4. Other information

TODO: Here is some other information on Parsec and Haskell: links, tutorials on fp, section in rwh, lyah?, old parsec docs, parsec docs on hackage, other parser combinator libs (uu, polyparse, trifecta?)

7. Parsing expressions with fixity

Text.Parsec.Expr allows building expression parsers with a range of operators with different precedences and associativities easily. Fixity is the (not completely standard) term for precendence and associativity together.

Text.Parsec.Expr can be great to quickly get a parser for simple expressions or a simple programming language with simple expressions up and running.

import Text.Parsec.String (Parser)
import Text.Parsec.String.Combinator (many1, between)
import Text.Parsec.String.Char (letter, char, digit, string, oneOf)

import Control.Applicative ((<$>), (<*>), (<*), (<|>), many, (<$))
import Control.Monad (void)

import qualified Text.Parsec.String.Expr as E
import FunctionsAndTypesForParsing

Let’s extend the SimpleExpression type and parsers to cover a range of operators with different precedences and associativity.

7.1. expressions with plus and times

Let’s start with a simple case: + and * with the usual fixity. Here is the abstract syntax:

data PlusTimesExpr = PteVar String
                   | PteNum Integer
                   | PteParens PlusTimesExpr
                   | Plus PlusTimesExpr PlusTimesExpr
                   | Times PlusTimesExpr PlusTimesExpr
                         deriving (Eq,Show)
plusTimesExpr :: Parser PlusTimesExpr
plusTimesExpr = E.buildExpressionParser pteTable pteTerm
pteTable :: [[E.Operator PlusTimesExpr]]
pteTable = [[E.Infix (Times <$ symbol "*") E.AssocLeft]
           ,[E.Infix (Plus <$ symbol "+") E.AssocLeft]
           ]

Here you can see the operator parsers are the same as the previous SimpleExpr parser which used chainl1: Times <$ symbol "*" and Plus <$ symbol "+". We just wrapped these up in the E.Infix constructor with the associativity, and put them in a list of lists which represents the precendence classes.

Here is the term parser and components. All this is just the same as the SimpleExpr parser a previous tutorial.

pteTerm :: Parser PlusTimesExpr
pteTerm = pteVar <|> pteNum <|> pteParens
pteNum :: Parser PlusTimesExpr
pteNum = PteNum <$> integer
pteVar :: Parser PlusTimesExpr
pteVar = PteVar <$> identifier
pteParens :: Parser PlusTimesExpr
pteParens = PteParens <$> between (symbol "(") (symbol ")") plusTimesExpr

support functions:

whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = lexeme $ string s

Here you can see the precendence in action:

*Main> regularParse plusTimesExpr "a + b * c"
Right (Plus (PteVar "a") (Times (PteVar "b") (PteVar "c")))

*Main> regularParse plusTimesExpr "a * b + c"
Right (Plus (Times (PteVar "a") (PteVar "b")) (PteVar "c"))

Now let’s try a much bigger example with lots more operators. Now we are thinking ahead to the first version of the SQL query parser, and preparing for this.

Here are our new operators in precedence order:

7.2.1. unary + -

+a
-3

7.2.2. exponentiation

a ^ 3

associativity: left

7.2.3. multiplication, division, modulo

a * 3
3 / b
a % 5

associativity: left

7.2.4. addition, subtraction

a + b
a - b

associativity: left

7.2.5. less than, greater than

a < b
a > b

associativity: none

7.2.6. equals

a = 3

associativity: right

7.2.7. not

not a

7.2.8. and

a and b

associativity: left

7.2.9. or

a or b

associativity: left

Here is the abstract syntax type:

data SimpleExpr = Num Integer
                | Var String
                | Parens SimpleExpr
                | PrefixOp String SimpleExpr
                | BinaryOp SimpleExpr String SimpleExpr
                   deriving (Eq,Show)

Here is the new expression parser:

simpleExpr :: Parser SimpleExpr
simpleExpr = E.buildExpressionParser table term
table :: [[E.Operator SimpleExpr]]
table = [[prefix "-", prefix "+"]
        ,[binary "^" E.AssocLeft]
        ,[binary "*" E.AssocLeft
         ,binary "/" E.AssocLeft
         ,binary "%" E.AssocLeft]
        ,[binary "+" E.AssocLeft
         ,binary "-" E.AssocLeft]
        ,[binary "<" E.AssocNone
         ,binary ">" E.AssocNone]
        ,[binary "=" E.AssocRight]
        ,[prefix "not"]
        ,[binary "and" E.AssocLeft]
        ,[binary "or" E.AssocLeft]
        ]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ symbol name) assoc
    mkBinOp nm a b = BinaryOp a nm b
    prefix name = E.Prefix (PrefixOp name <$ symbol name)

TODO: expand and explain the bits.

Here is the term parser.

term :: Parser SimpleExpr
term = var <|> num <|> parens
num :: Parser SimpleExpr
num = Num <$> integer
var :: Parser SimpleExpr
var = Var <$> identifier
parens :: Parser SimpleExpr
parens = between (symbol "(") (symbol ")") simpleExpr

TODO: write lots of parsing examples, including parse failures with ambiguity.

issue: double prefix op, link to bug on parsec bug tracker.

The source in Text.Parsec.Expr is not too big. You can have a look and try to understand it. There are several standard approaches in parsing theory to parse expressions with data driven precendences and associativity. I don’t know which one Text.Parsec.Expr uses, but if you find these and read about them, then the source of Text.Parsec.Expr might be a bit more understandable.

8. An issue with token parsers

This is a tutorial about an issue with the token parsing we have so far.

import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Combinator (many1, notFollowedBy)
import Text.Parsec.String.Char (digit, string, oneOf, satisfy, char, letter)

import Control.Applicative ((<$>), (<*>), (<*), many, (<$), (<|>))
import Control.Monad (void, guard)

import qualified Text.Parsec.String.Expr as E
import FunctionsAndTypesForParsing

Here is a simplified expression type and parser:

data SimpleExpr = Num Integer
                | BinaryOp SimpleExpr String SimpleExpr
                   deriving (Eq,Show)

simpleExpr :: Parser SimpleExpr
simpleExpr = E.buildExpressionParser table num

table :: [[E.Operator SimpleExpr]]
table = [[binary "<=" E.AssocNone
         ,binary ">=" E.AssocNone]
        ,[binary "<" E.AssocNone
         ,binary ">" E.AssocNone]
        ]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ symbol name) assoc
    mkBinOp nm a b = BinaryOp a nm b

num :: Parser SimpleExpr
num = Num <$> integer

whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace

integer :: Parser Integer
integer = read <$> lexeme (many1 digit)

symbol :: String -> Parser String
symbol s = lexeme $ string s

Let’s try it out:

*Main> regularParse simpleExpr "1=2"
Right (Num 1)

*Main> regularParse simpleExpr "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))

*Main> regularParse simpleExpr "1>2"
Left (line 1, column 2):
unexpected "2"
expecting ">="

What happened? The parser tried to parse > as >=, failed, and since the failure consumed some input (the first >), it failed completely.

We are going to change the symbol parser to fix this. Here is a parameterized version of the simpleExpr parser so we can try a few variations out.

simpleExprP :: (String -> Parser String) -> Parser SimpleExpr
simpleExprP sym = E.buildExpressionParser (tableP sym) num

tableP :: (String -> Parser String) -> [[E.Operator SimpleExpr]]
tableP sym = [[binary "<=" E.AssocNone
              ,binary ">=" E.AssocNone]
             ,[binary "<" E.AssocNone
              ,binary ">" E.AssocNone]]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ sym name) assoc
    mkBinOp nm a b = BinaryOp a nm b

Let’s reproduce the failure:

*Main> regularParse (simpleExprP symbol) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))

*Main> regularParse (simpleExprP symbol) "1>2"
Left (line 1, column 2):
unexpected "2"
expecting ">="

We are going to look at two possible solutions.

  1. Let’s use try:

*Main> regularParse (simpleExprP (try . symbol)) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))

*Main> regularParse (simpleExprP (try . symbol)) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))

This seems to have done the job. There is still a problem though. Consider a case when the precedence is the other way round - the < and > are higher precedence than and >=,

simpleExprP1 :: (String -> Parser String) -> Parser SimpleExpr
simpleExprP1 sym = E.buildExpressionParser (tableP1 sym) num

tableP1 :: (String -> Parser String) -> [[E.Operator SimpleExpr]]
tableP1 sym = [[binary "<" E.AssocNone
              ,binary ">" E.AssocNone]
             ,[binary "<=" E.AssocNone
              ,binary ">=" E.AssocNone]]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ sym name) assoc
    mkBinOp nm a b = BinaryOp a nm b
*Main> regularParse (simpleExprP1 (try . symbol)) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))

*Main> regularParse (simpleExprP1 (try . symbol)) "1>=2"
Left (line 1, column 3):
unexpected "="
expecting digit

Although the precendence order is a little contrived in this case, this issue could easily crop up for real when we start adding more operators. Let’s fix it now.

This could be solved by adding a try at a earlier place in the parsing. Because of how the buildExpressionParser function works, it’s not obvious where the try could go.

Let’s try tackling the problem in a different way. One way of looking at this is to consider that the symbol parser stops parsing too soon:

*Main> parseWithLeftOver (symbol ">") ">="
Right (">","=")

What it should do is keep parsing symbol characters until it gets a result string which can’t be a symbol, and stop one character before this..

Here is a slightly naive way of doing it, which will be good enough for quite a while:

symbol1 :: String -> Parser String
symbol1 s = try $ lexeme $ do
    u <- many1 (oneOf "<>=+-^%/*")
    guard (s == u)
    return s

Here is a similar alternative:

symbol2 :: String -> Parser String
symbol2 s = try $ lexeme $ do
    void $ string s
    notFollowedBy (oneOf "<>=+-^%/*")
    return s

Let’s try them out:

*Main> parseWithLeftOver (symbol1 ">") ">="
Left (line 1, column 3):
unexpected end of input

*Main> parseWithLeftOver (symbol1 ">") ">"
Right (">","")

*Main> parseWithLeftOver (symbol1 ">") ">= 3"
Left (line 1, column 3):
unexpected " "

*Main> parseWithLeftOver (symbol1 ">=") ">= 3"
Right (">="," 3")

The error messages don’t seem very good, but it parses and fails to parse correctly.

*Main> parseWithLeftOver (symbol2 ">") ">="
Left (line 1, column 3):
unexpected '='

*Main> parseWithLeftOver (symbol2 ">") ">"
Right (">","")

*Main> parseWithLeftOver (symbol2 ">") ">= 3"
Left (line 1, column 3):
unexpected '='

*Main> parseWithLeftOver (symbol2 ">=") ">= 3"
Right (">="," 3")

This one appears to give better error messages in this limited scenario, apart from that they both work the same.

Let’s try them out in the full expression parser:

*Main> regularParse (simpleExprP symbol1) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))

*Main> regularParse (simpleExprP symbol1) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))

*Main> regularParse (simpleExprP symbol2) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))

*Main> regularParse (simpleExprP symbol2) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))

They both work fine here. Let’s see some error messages in this context.

*Main> parseWithEof (simpleExprP symbol1) "1>*2"
Left (line 1, column 4):
unexpected "2"
expecting operator

*Main> parseWithEof (simpleExprP symbol2) "1>*2"
Left (line 1, column 4):
unexpected '*'
expecting operator

Both error messages are a bit crap. So much for the second variation producing better error messages.

Let’s look at the equivalent issue with respect to keyword parsing. We can get a similar problem here.

keyword :: String -> Parser String
keyword s = try $ string s
*Main> parseWithEof (keyword "not") "not"
Right "not"

*Main> parseWithEof (keyword "not") "nothing"
Left (line 1, column 4):
unexpected 'h'
expecting end of input

*Main> parseWithEof (keyword "not" <|> keyword "nothing") "nothing"
Left (line 1, column 4):
unexpected 'h'
expecting end of input

*Main> parseWithEof (keyword "nothing" <|> keyword "not") "nothing"
Right "nothing"

We can fix this overlapping prefix issue by reordering the choices. But let’s fix the keyword parser in a similar way to the symbol parser.

TODO: I don’t know if symbol is the right name, I don’t think Parsec usually uses symbol in this way. Maybe it should be called operator.

identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
keyword1 :: String -> Parser String
keyword1 k = try $ do
    i <- identifier
    guard (i == k)
    return k

TODO: later note in error messages about choosing identifier here instead of e.g. many1 letter.

*Main> parseWithEof (keyword1 "not") "not"
Right "not"

*Main> parseWithEof (keyword1 "not") "nothing"
Left (line 1, column 8):
unexpected end of input
expecting digit, letter or "_"

*Main> parseWithEof (keyword1 "not" <|> keyword1 "nothing") "nothing"
Right "nothing"

*Main> parseWithEof (keyword1 "nothing" <|> keyword1 "not") "nothing"
Right "nothing"

*Main> parseWithEof (keyword1 "not" <|> keyword1 "nothing") "not"
Right "not"

*Main> parseWithEof (keyword1 "nothing" <|> keyword1 "not") "not"
Right "not"

Try implementing the keyword2 parser which uses notFollowedBy instead of guard, using something analogous to the change from symbol1 to symbol2 above.

After this, you can try reimplementing the expression parser from the Text.Parsec.Expr tutorial using the new symbol and keyword parsers.

9. Permutation parsing

This can parse a bunch of different things in any order. TODO: examples.

10. Token parsing

This can be used quickly create a set of token parsers handling lots of little issues which you otherwise have to deal with manually.

11. Value expressions

In this tutorial, we will build a parser for a subset of SQL value expressions. These are roughly the same as the expressions used in languages like Haskell or C. This will follow on from the work on expressions in previous tutorials.

Our value expressions will support literals, identifiers, asterisk, some simple operators, case expression and parentheses.

The phrase 'value expression' is from the ANSI SQL standards. What we will develop here isn’t exactly ANSI SQL value expressions, and we won’t use them exactly how the standards do, but the differences really aren’t important right now. I will come back to this in a later tutorial.

TODO: not so sure anymore what is a 'value expression' in the standard, and what is a 'scalar expression'. Find out and document it.

{-# LANGUAGE TupleSections #-}
module ValueExpressions (ValueExpr(..)
                        ,valueExpr
                        ,keyword
                        ,symbol
                        ,identifier
                        ,makeTest
                        ,blackListValueExpr
                        ,comma
                        ,parens
                        ) where

import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (oneOf, digit, string, anyChar, char, letter)
import Text.Parsec.String.Combinator (many1, manyTill, eof, choice, between
                                     ,sepBy, optionMaybe)
import Text.Parsec.String.Parsec (try,parse)

import Control.Applicative (many, (<*),(<$>), (*>), (<|>),(<$),(<*>))
import Control.Monad (void,guard)
import qualified Text.Parsec.String.Expr as E
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing

import Debug.Trace (trace)

11.1. Lineup

11.1.1. comments

We will start supporting comments. It will support the two standard comment syntaxes from the standard:

-- single line comment
/*
multiline
comment
*/

The /* */ comments do not nest.

11.1.2. literals

It will just support positive integral and string literals at this time. Proper SQL supports more literal types including some quite weird syntax which we will skip for now.

1
500
'string literal'

11.1.3. identifiers

We will use simple identifiers: an identifier may start with a letter or underscore, and contain letters, underscores and numbers. Full SQL identifiers are more complicated to support so we will skip this for now also.

a
something
_test_
a123

11.1.4. 'dotted identifiers'

We will do some limited support for identifiers with two parts separated by a dot. I don’t want to get into the exact meaning or the various names used to describe these since it is a bit confusing, especially in SQL. We can just stick to the syntax. Both parts must parse according to the identifier rules above.

t.a
something.something_else

11.1.5. star

We will support the star as special expression which can be used at the top level of select lists (and a few other places in SQL). We will also support a 'dotted star'.

*
t.*

11.1.6. function application

This represents any syntax which looks like the normal function application used in languages like C. The function name must parse as a valid identifier according to the rules above.

f()
g(1)
h(2,'something')

11.1.7. operators

We will only support a small range of binary operators for now plus a single prefix unary operator (not). We will attempt to support correct precedence and associativity for these via the Text.Parsec.Expr module. Here is a complete list of all the supported operators.

a = b
a > b
a < b
a >= b
a <= b
a != b
a <> b (two spellings of the same thing, we will parse them as
        separate operators though)
a and b
a or b
1 + 2
1 - 2
1 * 2
1 / 2
'some' || 'thing'
a like b
not a

11.1.8. case expression

There are two standard variations of case expressions in SQL. One is more like a switch statement in C (but is an expression, not a statement):

case a
when 3 then 'got three'
when 5 then 'got five'
else 'neither'
end

The other has a boolean expression in each branch:

case
when a = 3 then 'a is three'
when b = 4 then 'b is four'
else 'neither'
end

The else branch is optional (if it is missing, it implicitly means 'else null').

11.1.9. parentheses

It will parse and represent parentheses explicitly in the abstract syntax, like we did with the previous expression parsers.

(1 + 2) * 3

11.2. abstract syntax for value expressions

Here is a type to represent value expressions:

data ValueExpr = StringLit String
               | NumLit Integer
               | Iden String
               | DIden String String -- a.b
               | Star
               | DStar String -- t.*
               | App String [ValueExpr]
               | PrefOp String ValueExpr
               | BinOp ValueExpr String ValueExpr
               | Case (Maybe ValueExpr) -- test value
                      [(ValueExpr,ValueExpr)] -- when branches
                      (Maybe ValueExpr) -- else value
               | Parens ValueExpr
                 deriving (Eq,Show)

Here is the plan for tackling this:

Let’s write some simple automated tests to check our progress and check for regressions.

We can start by using the code already written to produce a partial expression parser, then add parsing for each new constructor one at a time.

11.3. automated testing framework

Let’s start with some examples we can turn into automated tests. Here are the constructors above which I think we’ve already more or less implemented the parsers for in previous tutorials: NumLit, Iden, PrefixOp, BinaryOp and Parens.

numLitTests :: [(String,ValueExpr)]
numLitTests =
    [("1", NumLit 1)
    ,("54321", NumLit 54321)]

idenTests :: [(String,ValueExpr)]
idenTests =
    [("test", Iden "test")
    ,("_something3", Iden "_something3")]

operatorTests :: [(String,ValueExpr)]
operatorTests =
   map (\o -> (o ++ " a", PrefOp o (Iden "a"))) ["not", "+", "-"]
   ++ map (\o -> ("a " ++ o ++ " b", BinOp (Iden "a") o (Iden "b")))
          ["=",">","<", ">=", "<=", "!=", "<>"
          ,"and", "or", "+", "-", "*", "/", "||", "like"]
*ValueExpressions> import Data.List
*ValueExpressions Data.List> putStrLn $ intercalate "\n" $ map show operatorTests
("not a",PrefOp "not" (Iden "a"))
("+ a",PrefOp "+" (Iden "a"))
("- a",PrefOp "-" (Iden "a"))
("a = b",BinOp (Iden "a") "=" (Iden "b"))
("a > b",BinOp (Iden "a") ">" (Iden "b"))
("a < b",BinOp (Iden "a") "<" (Iden "b"))
("a >= b",BinOp (Iden "a") ">=" (Iden "b"))
("a <= b",BinOp (Iden "a") "<=" (Iden "b"))
("a != b",BinOp (Iden "a") "!=" (Iden "b"))
("a <> b",BinOp (Iden "a") "<>" (Iden "b"))
("a and b",BinOp (Iden "a") "and" (Iden "b"))
("a or b",BinOp (Iden "a") "or" (Iden "b"))
("a + b",BinOp (Iden "a") "+" (Iden "b"))
("a - b",BinOp (Iden "a") "-" (Iden "b"))
("a * b",BinOp (Iden "a") "*" (Iden "b"))
("a / b",BinOp (Iden "a") "/" (Iden "b"))
("a || b",BinOp (Iden "a") "||" (Iden "b"))
("a like b",BinOp (Iden "a") "like" (Iden "b"))
parensTests :: [(String,ValueExpr)]
parensTests = [("(1)", Parens (NumLit 1))]
basicTests :: [(String,ValueExpr)]
basicTests = numLitTests ++ idenTests ++ operatorTests ++ parensTests

Here is a test runner which uses HUnit:

makeTest :: (Eq a, Show a) => Parser a -> (String,a) -> H.Test
makeTest parser (src,expected) = H.TestLabel src $ H.TestCase $ do
    let gote = parse (whitespace *> parser <* eof) "" src
    case gote of
      Left e -> H.assertFailure $ show e
      Right got -> H.assertEqual src expected got

11.4. the parsing we already have

11.4.1. tokens

lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = try $ lexeme $ do
    u <- many1 (oneOf "<>=+-^%/*!|")
    guard (s == u)
    return s
openParen :: Parser Char
openParen = lexeme $ char '('
closeParen :: Parser Char
closeParen = lexeme $ char ')'

The whitespace parser is below, since it includes some new code to deal with comments.

11.4.2. helper functions

keyword :: String -> Parser String
keyword k = try $ do
    i <- identifier
    guard (i == k)
    return k

TODO: find a place to discuss putting try in the keyword and symbol parsers. I think maybe this should come in the error message tutorial?

parens :: Parser a -> Parser a
parens = between openParen closeParen

11.4.3. terms

num :: Parser ValueExpr
num = NumLit <$> integer
iden :: Parser ValueExpr
iden = Iden <$> identifier

I’m going to parameterize the parens parser again to avoid rewriting lots of very similar code in this tutorial.

parensValue :: Parser ValueExpr -> Parser ValueExpr
parensValue val = Parens <$> parens val
term0 :: Parser ValueExpr
term0 = iden <|> num <|> parensValue valueExpr0

11.4.4. operator table and the first value expression parser

I’ve added all the new operators in this table.

table :: [[E.Operator ValueExpr]]
table = [[prefix "-", prefix "+"]
         ,[binary "^" E.AssocLeft]
         ,[binary "*" E.AssocLeft
          ,binary "/" E.AssocLeft
          ,binary "%" E.AssocLeft]
         ,[binary "+" E.AssocLeft
          ,binary "-" E.AssocLeft]
         ,[binary "<=" E.AssocRight
          ,binary ">=" E.AssocRight
          ,binaryK "like" E.AssocNone
          ,binary "!=" E.AssocRight
          ,binary "<>" E.AssocRight
          ,binary "||" E.AssocRight]
         ,[binary "<" E.AssocNone
          ,binary ">" E.AssocNone]
         ,[binary "=" E.AssocRight]
         ,[prefixK "not"]
         ,[binaryK "and" E.AssocLeft]
         ,[binaryK "or" E.AssocLeft]]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ symbol name) assoc
    mkBinOp nm a b = BinOp a nm b
    prefix name = E.Prefix (PrefOp name <$ symbol name)
    binaryK name assoc =
        E.Infix (mkBinOp name <$ keyword name) assoc
    prefixK name = E.Prefix (PrefOp name <$ keyword name)
valueExpr0 :: Parser ValueExpr
valueExpr0 = E.buildExpressionParser table term0

Now we can run the tests:

*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr0) basicTests
Cases: 23  Tried: 23  Errors: 0  Failures: 0
Counts {cases = 23, tried = 23, errors = 0, failures = 0}

11.5. new parsers

Let’s start extending things one constructor at a time, but first we will do comments.

11.5.1. whitespace and comments

Here is our old whitespace parser:

whitespace0 :: Parser ()
whitespace0 = void $ many $ oneOf " \n\t"

We are going to change this to do comments as well.

TODO: build these two parsers up in stages, show why each bit is there

lineComment :: Parser ()
lineComment = void (try (string "--") *>
                    manyTill anyChar (void (char '\n') <|> eof))
blockComment :: Parser ()
blockComment = void (-- no nesting of block comments in SQL
                     try (string "/*")
                     -- TODO: why is try used here
                     *> manyTill anyChar (try $ string "*/"))

todo: discuss how to compose to create the whitespace parser todo: create the whitespace parser without many1 and return () and show the problem.

Here is the final parser for whitespace:

whitespace :: Parser ()
whitespace =
    choice [simpleWhitespace *> whitespace
           ,lineComment *> whitespace
           ,blockComment *> whitespace
           ,return ()]
  where
    lineComment = try (string "--")
                  *> manyTill anyChar (void (char '\n') <|> eof)
    blockComment = try (string "/*")
                   *> manyTill anyChar (try $ string "*/")
    simpleWhitespace = void $ many1 (oneOf " \t\n")

11.5.2. string literal

Our string literal is any characters except single quote enclosed in single quotes. We aren’t going to support other string literal syntaxes or escaping single quotes within a string right now.

Here are some examples:

stringLiteralTests :: [(String,ValueExpr)]
stringLiteralTests =
    [("''", StringLit "")
    ,("'test'", StringLit "test")]

We need a new token parser for string literals:

stringToken :: Parser String
stringToken = lexeme (char '\'' *> manyTill anyChar (char '\''))

TODO: more explanation

And the string literal expression parser:

stringLit :: Parser ValueExpr
stringLit = StringLit <$> stringToken

Here is the new value expression parser:

term1 :: Parser ValueExpr
term1 = iden <|> num <|> parensValue valueExpr1 <|> stringLit
valueExpr1 :: Parser ValueExpr
valueExpr1 = E.buildExpressionParser table term1
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr1) (basicTests ++ stringLiteralTests)
Cases: 25  Tried: 25  Errors: 0  Failures: 0
Counts {cases = 25, tried = 25, errors = 0, failures = 0}

So far, so good.

11.5.3. dotted identifier

dIdenTests :: [(String,ValueExpr)]
dIdenTests =
    [("t.a", DIden "t" "a")]

Here is a new token parser to use:

dot :: Parser Char
dot = lexeme $ char '.'
dIden :: Parser ValueExpr
dIden = DIden <$> identifier <*> (dot *> identifier)
term2 :: Parser ValueExpr
term2 = iden <|> num <|> parensValue valueExpr2
        <|> stringLit <|> dIden
valueExpr2 :: Parser ValueExpr
valueExpr2 = E.buildExpressionParser table term2
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr2) (basicTests ++ stringLiteralTests ++ dIdenTests)
 ### Failure in: 25:t.a
(line 1, column 2):
unexpected '.'
expecting digit, letter, "_", "--", "/*", operator or end of input
Cases: 26  Tried: 26  Errors: 0  Failures: 1
Counts {cases = 26, tried = 26, errors = 0, failures = 1}

Do you know why this happened? Can you think of a way to solve this?

Here is the usual blunt hammer technique: reorder the choices to put the longest choice first and use try when there is a common prefix.

term3 :: Parser ValueExpr
term3 = try dIden <|> iden <|> num <|> parensValue valueExpr3
        <|> stringLit
valueExpr3 :: Parser ValueExpr
valueExpr3 = E.buildExpressionParser table term3
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr3) (basicTests ++ stringLiteralTests ++ dIdenTests)
Cases: 26  Tried: 26  Errors: 0  Failures: 0
Counts {cases = 26, tried = 26, errors = 0, failures = 0}

11.5.4. star

star - no surprises

starTests :: [(String,ValueExpr)]
starTests = [("*", Star)]
star :: Parser ValueExpr
star = Star <$ symbol "*"
term4 :: Parser ValueExpr
term4 = try dIden <|> iden <|> num <|> parensValue valueExpr4
        <|> stringLit <|> star
valueExpr4 :: Parser ValueExpr
valueExpr4 = E.buildExpressionParser table term4
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr4) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests)
Cases: 27  Tried: 27  Errors: 0  Failures: 0
Counts {cases = 27, tried = 27, errors = 0, failures = 0}

11.5.5. dotted star

Here is dotted star.

dStarTests :: [(String,ValueExpr)]
dStarTests = [("t.*", DStar "t")]
dstar :: Parser ValueExpr
dstar = DStar <$> (identifier <* dot <* symbol "*")

We’ll have the same issue we’ve seen before, so let’s go straight to the solution.

term5 :: Parser ValueExpr
term5 = try dstar <|> try dIden <|> iden <|> num
        <|> parensValue valueExpr5 <|> stringLit <|> star
valueExpr5 :: Parser ValueExpr
valueExpr5 = E.buildExpressionParser table term5
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr5) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests)
Cases: 28  Tried: 28  Errors: 0  Failures: 0
Counts {cases = 28, tried = 28, errors = 0, failures = 0}

11.5.6. app

The App constructor is used for syntax which looks like regular function application: f(), f(a), f(a,b), etc.

appTests :: [(String,ValueExpr)]
appTests = [("f()", App "f" [])
           ,("f(1)", App "f" [NumLit 1])
           ,("f(1,a)", App "f" [NumLit 1, Iden "a"])]

Here is the parser, parameterized like the parensValue parser for the same reason, so we can reuse the code for different versions of the value expression parser:

app :: Parser ValueExpr -> Parser ValueExpr
app val = App <$> identifier <*> parens (commaSep val)

And here is the commaSep helper:

commaSep :: Parser a -> Parser [a]
commaSep = (`sepBy` comma)
comma :: Parser Char
comma = lexeme $ char ','

It is another parser with the identifier prefix, so another try.

term6 :: Parser ValueExpr
term6 = try (app valueExpr6) <|> try dstar <|> try dIden <|> iden
        <|> num <|> parensValue valueExpr6 <|> stringLit <|> star
valueExpr6 :: Parser ValueExpr
valueExpr6 = E.buildExpressionParser table term6
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr6) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests ++ appTests)
Cases: 31  Tried: 31  Errors: 0  Failures: 0
Counts {cases = 31, tried = 31, errors = 0, failures = 0}

Everything looks good.

11.5.7. case

Here is something a little more interesting.

caseTests :: [(String,ValueExpr)]
caseTests =
    [("case a when 1 then 2 end"
     ,Case (Just $ Iden "a") [(NumLit 1,NumLit 2)] Nothing)

    ,("case a when 1 then 2 when 3 then 4 end"
     ,Case (Just $ Iden "a")
           [(NumLit 1, NumLit 2)
           ,(NumLit 3, NumLit 4)]
           Nothing)

    ,("case a when 1 then 2 when 3 then 4 else 5 end"
     ,Case (Just $ Iden "a")
           [(NumLit 1, NumLit 2)
           ,(NumLit 3, NumLit 4)]
           (Just $ NumLit 5))

    ,("case when a=1 then 2 when a=3 then 4 else 5 end"
     ,Case Nothing
           [(BinOp (Iden "a") "=" (NumLit 1), NumLit 2)
           ,(BinOp (Iden "a") "=" (NumLit 3), NumLit 4)]
           (Just $ NumLit 5))
    ]

How can we approach this? We know that there will always be a case keyword at the start, and an end keyword at the end. Each when branch seems to be self contained.

Here is a rough pseudo-code sketch:

keyword "case"
optional test_expression
many1 when_clause
optional else clause
keyword "end"

Here is the case parser based simply on this pseudo-code.

caseValue0 :: Parser ValueExpr -> Parser ValueExpr
caseValue0 val = do
    void $ keyword "case"
    testExp <- optionMaybe val
    whens <- many1 whenClause
    els <- optionMaybe elseClause
    void $ keyword "end"
    return $ Case testExp whens els
 where
   whenClause = (,) <$> (keyword "when" *> val)
                    <*> (keyword "then" *> val)
   elseClause = keyword "else" *> val

Let’s try it on its own first and see if we have any problems:

*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (caseValue0 valueExpr6)) caseTests in: 3:case when a=1 then 2 when a=3 then 4 else 5 end
(line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"
Cases: 4  Tried: 4  Errors: 0  Failures: 1
Counts {cases = 4, tried = 4, errors = 0, failures = 1}

Three work, and one fails. What happened here? Try to run the parsing code in your head to see if you can work out what the problem is.

Here is a method we can use when you can’t manage to locate a problem in this way.

Let’s single out the failure first:

*ValueExpressions> parseWithEof (caseValue0 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
Left (line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"

Here is the caseValue parser rewritten with trace interleaved:

caseValue1 :: Parser ValueExpr -> Parser ValueExpr
caseValue1 val = do
    trace "start case" $ return ()
    void $ keyword "case"
    trace "read case keyword" $ return ()
    testExp <- optionMaybe val
    trace ("read testExpr: " ++ show testExp) $ return ()
    whens <- many1 whenClause
    trace ("read whens: " ++ show (length whens)) $ return ()
    els <- optionMaybe elseClause
    trace ("read else: " ++ show els) $ return ()
    void $ keyword "end"
    trace "read end keyword" $ return ()
    return $ Case testExp whens els
 where
   whenClause = do
       trace "start when clause" $ return ()
       void $ keyword "when"
       trace "read when keyword" $ return ()
       w <- val
       trace ("read when exp: " ++ show w) $ return ()
       void $ keyword "then"
       trace "read then keyword" $ return ()
       t <- val
       trace ("read then exp: " ++ show t) $ return ()
       return (w,t)
   elseClause = do
       trace "start else clause" $ return ()
       void $ keyword "else"
       trace "read else keyword" $ return ()
       v <- val
       trace ("read else exp: " ++ show v) $ return ()
       return v

Now when we run the test, we can see a trace of what happens:

*ValueExpressions> parseWithEof (caseValue1 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
start case
read case keyword
read testExpr: Just (Iden "when")
start when clause
Left (line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"

Look at the trace and see if you spot the problem.

We’ve parsed the when keyword as an identifier for the optional first expression. The way we deal with this is to use a new identifier parser which has a blacklist of keywords which can’t be identifiers.

caseValue2 :: Parser ValueExpr -> Parser ValueExpr
caseValue2 val = do
    trace "start case" $ return ()
    void $ keyword "case"
    trace "read case keyword" $ return ()
    testExp <- optionMaybe caseVal
    trace ("read testExpr: " ++ show testExp) $ return ()
    whens <- many1 whenClause
    trace ("read whens: " ++ show (length whens)) $ return ()
    els <- optionMaybe elseClause
    trace ("read else: " ++ show els) $ return ()
    void $ keyword "end"
    trace "read end keyword" $ return ()
    return $ Case testExp whens els
 where
   whenClause = do
       trace "start when clause" $ return ()
       void $ keyword "when"
       trace "read when keyword" $ return ()
       w <- caseVal
       trace ("read when exp: " ++ show w) $ return ()
       void $ keyword "then"
       trace "read then keyword" $ return ()
       t <- caseVal
       trace ("read then exp: " ++ show t) $ return ()
       return (w,t)
   elseClause = do
       trace "start else clause" $ return ()
       void $ keyword "else"
       trace "read else keyword" $ return ()
       v <- caseVal
       trace ("read else exp: " ++ show v) $ return ()
       return v
   -- here is the fix, we replace calls to `val` with calls to
   -- this function which prevents any of the case keywords from
   -- being parsed as identifiers
   caseVal = try $ do
       v <- val
       guard $ case v of
           Iden i | i `elem` ["case", "when", "then", "else", "end"] -> False
           _ -> True
       return v

Maybe this blacklist used in this way isn’t permissive enough, or is too permissive, but let’s not get lost in these details right now, and come back to it later.

*ValueExpressions> parseWithEof (caseValue2 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
start case
read case keyword
read testExpr: Nothing
start when clause
read when keyword
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
start else clause
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Right (Case Nothing [(BinOp (Iden "a") "=" (NumLit 1),NumLit 2),(BinOp (Iden "a") "=" (NumLit 3),NumLit 4)] (Just (NumLit 5)))

Looks good. Have a close look through the trace to see if you can follow it all and it makes sense. TODO: There appear to be some messages missing - I think it is some sort of memoization effect.

Let’s run all the tests with the tracing still in so if there are any failures we can zero in on them.

*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (caseValue2 valueExpr6)) caseTests
Cases: 4  Tried: 0  Errors: 0  Failures: 0start case
read case keyword
read testExpr: Just (Iden "a")
start when clause
read when keyword
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read whens: 1
start else clause
read else: Nothing
read end keyword
Cases: 4  Tried: 1  Errors: 0  Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else: Nothing
read end keyword
Cases: 4  Tried: 2  Errors: 0  Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 4  Tried: 3  Errors: 0  Failures: 0read testExpr: Nothing
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 4  Tried: 4  Errors: 0  Failures: 0
Counts {cases = 4, tried = 4, errors = 0, failures = 0}

Now the full expression parser:

term7 :: Parser ValueExpr
term7 = caseValue2 valueExpr7
        <|> try (app valueExpr7) <|> try dstar <|> try dIden <|> iden
        <|> num <|> parensValue valueExpr7 <|> stringLit <|> star
valueExpr7 :: Parser ValueExpr
valueExpr7 = E.buildExpressionParser table term7
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr7) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests ++ appTests ++ caseTests)
Cases: 35  Tried: 0  Errors: 0  Failures: 0start case
Cases: 35  Tried: 31  Errors: 0  Failures: 0read case keyword
read testExpr: Just (Iden "a")
start when clause
read when keyword
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read whens: 1
start else clause
read else: Nothing
read end keyword
Cases: 35  Tried: 32  Errors: 0  Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else: Nothing
read end keyword
Cases: 35  Tried: 33  Errors: 0  Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 35  Tried: 34  Errors: 0  Failures: 0read testExpr: Nothing
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 35  Tried: 35  Errors: 0  Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}

Everything looks good. Let’s refactor the case parser and tidy everything up.

blackListValueExpr :: [String] -> Parser ValueExpr -> Parser ValueExpr
blackListValueExpr blackList val = try $ do
    v <- val
    guard $ case v of
        Iden i | i `elem` blackList -> False
        _ -> True
    return v

Here is the parser with the trace`s removed, and using the new `blackListValueExpr parser.

caseValue3 :: Parser ValueExpr -> Parser ValueExpr
caseValue3 val = do
    void $ keyword "case"
    testExp <- optionMaybe caseVal
    whens <- many1 whenClause
    els <- optionMaybe elseClause
    void $ keyword "end"
    return $ Case testExp whens els
 where
   whenClause = do
       void $ keyword "when"
       w <- caseVal
       void $ keyword "then"
       t <- caseVal
       return (w,t)
   elseClause = do
       void $ keyword "else"
       v <- caseVal
       return v
   caseVal = blackListValueExpr blackList val
   blackList = ["case", "when", "then", "else", "end"]

And here it is after some shortening:

caseValue4 :: Parser ValueExpr -> Parser ValueExpr
caseValue4 val =
    Case
    <$> (keyword "case" *> optionMaybe caseVal)
    <*> many1 whenClause
    <*> optionMaybe elseClause
    <* keyword "end"
 where
   whenClause = (,) <$> (keyword "when" *> caseVal)
                    <*> (keyword "then" *> caseVal)
   elseClause = keyword "else" *> caseVal
   caseVal = blackListValueExpr blackList val
   blackList = ["case", "when", "then", "else", "end"]

I think we passed the point where (<|>) is more readable than choice:

term8 :: Parser ValueExpr
term8 = choice [caseValue4 valueExpr8
               ,try $ app valueExpr8
               ,try dstar
               ,try dIden
               ,iden
               ,num
               ,parensValue valueExpr8
               ,stringLit
               ,star]
valueExpr8 :: Parser ValueExpr
valueExpr8 = E.buildExpressionParser table term8
allExpressionTests :: [(String,ValueExpr)]
allExpressionTests = concat [basicTests
                            ,stringLiteralTests
                            ,dIdenTests
                            ,starTests
                            ,dStarTests
                            ,appTests
                            ,caseTests]

Let’s double check:

*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr8) allExpressionTests
Cases: 35  Tried: 35  Errors: 0  Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}

TODO: review code above for new syntax patterns to talk about. Or maybe don’t talk about them?

Let’s predict that we will need more blacklisting when we work on the query expression parsing, and create a parser which can be used in this way:

term :: [String] -> Parser ValueExpr
term blackList = choice [caseValue4 (valueExpr blackList)
                        ,try (app (valueExpr blackList))
                        ,try dstar
                        ,try dIden
                        ,blackListValueExpr blackList iden
                        ,num
                        ,parensValue (valueExpr blackList)
                        ,stringLit
                        ,star]

TODO: need to change app, dstar and diden for blacklisting?

valueExpr :: [String] -> Parser ValueExpr
valueExpr blackList = E.buildExpressionParser table (term blackList)

Final sanity check:

*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (valueExpr [])) allExpressionTests
Cases: 35  Tried: 35  Errors: 0  Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}

12. Query expressions

We can now start on the 'select' parser. In the SQL standard, it refers to these things as 'query expressions' to distinguish then from 'value expressions', so we will reuse this language here.

The subset of SQL we will support is this:

TODO: write lots of examples here

select queries only, no union, intersect or except. No common table expressions.

we will support all the value expressions that the value expression parser above supports

we will support select lists with optional aliases, with the 'as' optional in the alias

we will support the * as in 'select * from t', and the variation 'select t.* from t', but not the alias version select * as (a,b,c) from t.

we support two part dotted identifiers in value expressions, but no other sort (such as 3-part dotted value expression identifiers, or schema qualified table names or function names).

for the from clause, we will only support optional 'from table_name'.

supports where

we will support regular group by lists, but not the new group by options in SQL2003 (group by (), grouping sets, cube, rollup).

we will support having

we support order by, with multiple columns, but not explicit asc or desc , and no 'nulls first' or 'nulls last' syntax.

No support for offset and fetch first, or variations.

{-# LANGUAGE TupleSections #-}
module QueryExpressions where

--import Text.Groom (groom)
--import qualified Text.Parsec as P
import Text.Parsec.String (Parser)
import Text.Parsec (try,optionMaybe, optional, sepBy1,option)
import Control.Applicative ((<$>),(*>),(<*>))
import Control.Monad (void,guard)
--import Debug.Trace
--import Data.List (intercalate)
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import ValueExpressions (ValueExpr(..), valueExpr, identifier, symbol, keyword, comma)

Here is the datatype for query expressions to get started with. In this tutorial, we will only support an optional single table in the from clause, and this will be expanded in the next tutorial.

TODO: rearrange the from/where/groupby/having/orderby to a separate datatype which is optional in a select, and the from part is mandatory in this new type. This follows the standard and is more accurate syntax.

data QueryExpr
    = Select
      {qeSelectList :: [(ValueExpr,Maybe String)]
      ,qeFrom :: Maybe String
      ,qeWhere :: Maybe ValueExpr
      ,qeGroupBy :: [ValueExpr]
      ,qeHaving :: Maybe ValueExpr
      ,qeOrderBy :: [ValueExpr]
      } deriving (Eq,Show)

Here is a default value which can be used to easily construct query expression values.

makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
                    ,qeFrom = Nothing
                    ,qeWhere = Nothing
                    ,qeGroupBy = []
                    ,qeHaving = Nothing
                    ,qeOrderBy = []}

12.1. select lists

Let’s start with something simple:

select [value expr]

TODO: shorten the names of these examples

singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
    [("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]

Here are a couple of wrappers for symbol and keyword which wrap them with void.

keyword_ :: String -> Parser ()
keyword_ = void . keyword
symbol_ :: String -> Parser ()
symbol_ = void . symbol
singleSelectItem :: Parser QueryExpr
singleSelectItem = do
    keyword_ "select"
    e <- valueExpr []
    return $ makeSelect {qeSelectList = [(e,Nothing)]}

You can use the old test runner to check these:

*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest singleSelectItem) parseSingleSelectItemTestData
Cases: 1  Tried: 1  Errors: 0  Failures: 0
Counts {cases = 1, tried = 1, errors = 0, failures = 0}

Let’s rewrite it in the Applicative and mostly point free style.

singleSelectItemApplicative :: Parser QueryExpr
singleSelectItemApplicative =
    (\sl -> makeSelect {qeSelectList = sl})
    <$> (keyword_ "select" *> (((:[]) . (,Nothing)) <$> valueExpr []))

That didn’t go so well. Using a wrapper:

singleSelectItemApplicative' :: Parser QueryExpr
singleSelectItemApplicative' =
    ms <$> (keyword_ "select" *> valueExpr [])
  where
    ms e = makeSelect {qeSelectList = [(e,Nothing)]}

Now let’s write something that supports multiple value expressions, e.g.

select 1+2, 3+4;
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
    [("select a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
    ,("select a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)]})
    ,("select 1+2,3+4"
     ,makeSelect {qeSelectList =
                     [(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
                     ,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
    ]
selectMultipleItems :: Parser QueryExpr
selectMultipleItems = do
    keyword_ "select"
    es <- commaSep1 (valueExpr [])
    return $ makeSelect {qeSelectList = map (,Nothing) es}
commaSep1 :: Parser a -> Parser [a]
commaSep1 = (`sepBy1` comma)

12.1.1. aliases

We can write names for the columns produced from a select list using the keyword as, and we can miss out the as:

select a as a1, b as b1, f(c) as c1;

-- no as
select a a1, b b1;
selectListTests :: [(String,QueryExpr)]
selectListTests =
    [("select a as a1, b as b1"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a1")
                                 ,(Iden "b", Just "b1")]})
    ,("select a a1, b b1"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a1")
                                 ,(Iden "b", Just "b1")]})
    ] ++ multipleSelectItemsTests
      ++ singleSelectItemTests

Finally, here is the select list parser and the helper for select items:

selectItem0 :: Parser (ValueExpr, Maybe String)
selectItem0 = (,) <$> valueExpr [] <*> optionMaybe (try alias)
  where alias = optional (keyword_ "as") *> identifier
selectList0 :: Parser [(ValueExpr, Maybe String)]
selectList0 = keyword_ "select" *> commaSep1 selectItem0
queryExpr0 :: Parser QueryExpr
queryExpr0 = mkSelect <$> selectList0
  where mkSelect sl = makeSelect {qeSelectList = sl}

12.2. from clause

from :: Parser String
from = keyword_ "from" *> identifier
fromTests :: [(String,QueryExpr)]
fromTests =
    [("select a from t"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = (Just "t")})]
queryExpr1 :: Parser QueryExpr
queryExpr1 = mkSelect
             <$> selectList0
             <*> optionMaybe from
  where mkSelect sl fr = makeSelect {qeSelectList = sl
                                    ,qeFrom = fr}
*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest queryExpr1) (selectListTests ++ fromTests)
 ### Failure in: 6:select a from t
(line 1, column 16):
unexpected end of input
expecting digit, letter, "_", "--" or "/*"
Cases: 7  Tried: 7  Errors: 0  Failures: 1
Counts {cases = 7, tried = 7, errors = 0, failures = 1}

This is a keyword issue again. We are parsing the from as if it was a column alias and then getting stuck.

blackListIdentifier :: [String] -> Parser String
blackListIdentifier bl = do
    i <- identifier
    guard (i `notElem` bl)
    return i
selectItem :: Parser (ValueExpr, Maybe String)
selectItem = (,) <$> valueExpr [] <*> optionMaybe (try alias)
  where alias = optional (keyword_ "as") *> blackListIdentifier ["from"]
selectList :: Parser [(ValueExpr, Maybe String)]
selectList = keyword_ "select" *> commaSep1 selectItem
queryExpr2 :: Parser QueryExpr
queryExpr2 = mkSelect
             <$> selectList
             <*> optionMaybe from
  where mkSelect sl fr = makeSelect {qeSelectList = sl
                                    ,qeFrom = fr}

That did the job for now.

12.3. where

The where, group by, having, and order by parsers are simple.

whereTests :: [(String,QueryExpr)]
whereTests =
    [("select a from t where a = 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = Just "t"
                 ,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
    ]
whereClause :: Parser ValueExpr
whereClause = keyword_ "where" *> valueExpr []
queryExpr3 :: Parser QueryExpr
queryExpr3 = mkSelect
             <$> selectList
             <*> optionMaybe from
             <*> optionMaybe whereClause
  where mkSelect sl fr wh =
            makeSelect {qeSelectList = sl
                       ,qeFrom = fr
                       ,qeWhere = wh}

12.4. group by

groupByTests :: [(String,QueryExpr)]
groupByTests =
    [("select a,sum(b) from t group by a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = Just "t"
                 ,qeGroupBy = [Iden "a"]
                 })
    ,("select a,b,sum(c) from t group by a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)
                                 ,(App "sum" [Iden "c"],Nothing)]
                 ,qeFrom = Just "t"
                 ,qeGroupBy = [Iden "a",Iden "b"]
                 })
    ]
groupByClause :: Parser [ValueExpr]
groupByClause = keyword_ "group" *> keyword_ "by"
                *> commaSep1 (valueExpr [])
queryExpr4 :: Parser QueryExpr
queryExpr4 = mkSelect
             <$> selectList
             <*> optionMaybe from
             <*> optionMaybe whereClause
             <*> option [] groupByClause
  where mkSelect sl fr wh gr =
            makeSelect {qeSelectList = sl
                       ,qeFrom = fr
                       ,qeWhere = wh
                       ,qeGroupBy = gr}

12.5. having

havingTests :: [(String,QueryExpr)]
havingTests =
  [("select a,sum(b) from t group by a having sum(b) > 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = (Just "t")
                 ,qeGroupBy = [Iden "a"]
                 ,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
                 })
  ]
having :: Parser ValueExpr
having = keyword_ "having" *> (valueExpr [])
queryExpr5 :: Parser QueryExpr
queryExpr5 = mkSelect
             <$> selectList
             <*> optionMaybe from
             <*> optionMaybe whereClause
             <*> option [] groupByClause
             <*> optionMaybe having
  where mkSelect sl fr wh gr hv =
            makeSelect {qeSelectList = sl
                       ,qeFrom = fr
                       ,qeWhere = wh
                       ,qeGroupBy = gr
                       ,qeHaving = hv}

Looking nice so far. Did you run the tests for each stage?

12.6. order by

orderByTests :: [(String,QueryExpr)]
orderByTests =
    [("select a from t order by a"
     ,ms [Iden "a"])
    ,("select a from t order by a, b"
     ,ms [Iden "a", Iden "b"])
    ]
  where
    ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
                      ,qeFrom = (Just "t")
                      ,qeOrderBy = o}
orderBy :: Parser [ValueExpr]
orderBy = keyword_ "order" *> keyword_ "by"
          *> commaSep1 (valueExpr [])
queryExpr6 :: Parser QueryExpr
queryExpr6 = Select
             <$> selectList
             <*> optionMaybe from
             <*> optionMaybe whereClause
             <*> option [] groupByClause
             <*> optionMaybe having
             <*> option [] orderBy
*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest queryExpr6) (selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests)
Cases: 13  Tried: 13  Errors: 0  Failures: 0
Counts {cases = 13, tried = 13, errors = 0, failures = 0}

TODO: talk about putting the maybes/default values inside from, where, etc??

13. From clause

In this tutorial, we extend the from clause support to the following: we will support implicit and explicit joins, including keywords natural, inner, outer, left, right, full, cross, on and using, plus parens and simple aliases (e.g. select a from t u, but not select a from t(a,b)). We don’t support oracle outer join syntax (+) or the other 'pre-ANSI' variations on this theme. No lateral keyword or apply or pivot.

{-# LANGUAGE TupleSections #-}

--import Text.Groom (groom)
--import qualified Text.Parsec as P
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Combinator
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>))
import Control.Monad
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import Debug.Trace

TODO: should explicitly import from these two modules (and same in QueryExpressions.lhs)

import ValueExpressions (ValueExpr(..), valueExpr, identifier, makeTest, parens)
import QueryExpressions (selectList,whereClause,groupByClause,having,orderBy
                        ,commaSep1, keyword_,blackListIdentifier)

13.1. Abstract syntax

Here are is the updated QueryExpr and the new TableRef abstract syntax types.

data QueryExpr
    = Select
      {qeSelectList :: [(ValueExpr,Maybe String)]
      ,qeFrom :: [TableRef]
      ,qeWhere :: Maybe ValueExpr
      ,qeGroupBy :: [ValueExpr]
      ,qeHaving :: Maybe ValueExpr
      ,qeOrderBy :: [ValueExpr]
      } deriving (Eq,Show)

makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
                    ,qeFrom = []
                    ,qeWhere = Nothing
                    ,qeGroupBy = []
                    ,qeHaving = Nothing
                    ,qeOrderBy = []}

data TableRef = TRSimple String
              | TRJoin TableRef JoinType TableRef (Maybe JoinCondition)
              | TRParens TableRef
              | TRAlias TableRef String
              | TRQueryExpr QueryExpr
                deriving (Eq,Show)

This syntax for table references can represent invalid syntax, for instance two nested aliases. The justification for this is that sometimes trying to accurately represent only exactly what is valid creates something much more complex. Maybe this is a good tradeoff in this situation, and maybe not.

data JoinType = JoinInner | JoinLeft | JoinRight | JoinFull | JoinCross
                deriving (Eq,Show)

data JoinCondition = JoinOn ValueExpr
                   | JoinUsing [String]
                   | JoinNatural
                   deriving (Eq,Show)

With the join condition, we’ve done the opposite to TableRef - we’ve combined natural and on/using, since only one of these can be present, even though this departs a little from the concrete syntax.

First we will develop the standalone from clause parser, then we will update the query expression syntax and parsing to incorporate our new from clause parser.

13.2. simple table name

Let’s start with something simple: a from clause can be multiple comma separated tablerefs, aka an implicit join.

multipleTRSimpleTests :: [(String, [TableRef])]
multipleTRSimpleTests = [("from a,b", [TRSimple "a", TRSimple "b"])]

from0 :: Parser [TableRef]
from0 = keyword_ "from" *> commaSep1 (TRSimple <$> identifier)
*Main> H.runTestTT $ H.TestList $ map (makeTest from0) multipleTRSimpleTests
Cases: 1  Tried: 1  Errors: 0  Failures: 0
Counts {cases = 1, tried = 1, errors = 0, failures = 0}

Let’s do the query expression, parens and alias first, before tackling joins.

13.3. subquery

Here is the example:

trQueryExprTests :: [(String, [TableRef])]
trQueryExprTests =
    [("from (select a from t)"
    ,[TRQueryExpr $ makeSelect {qeSelectList = [(Iden "a", Nothing)]
                               ,qeFrom = [TRSimple "t"]}])]

Here is the query expression parser we can use:

queryExpr1 :: Parser [TableRef] -> Parser QueryExpr
queryExpr1 from' = Select
                   <$> selectList
                   <*> option [] from'
                   <*> optionMaybe whereClause
                   <*> option [] groupByClause
                   <*> optionMaybe having
                   <*> option [] orderBy

from1 :: Parser [TableRef]
from1 =
    keyword_ "from" *> commaSep1 trefTerm
  where
    trefTerm = choice [TRSimple <$> identifier
                      ,TRQueryExpr <$> parens (queryExpr1 from1)]

13.4. parens

We can’t do a sensible example for these right now - we need explicit joins and then the parens can be used to override the associativity of a three way join, or to specify over what part of the expression to apply an alias.

trParensTests :: [(String, [TableRef])]
trParensTests = [("from (a)", [TRParens $ TRSimple "a"])]

We can write some more tests for parens after we’ve done the explicit joins.

from2 :: Parser [TableRef]
from2 =
    keyword_ "from" *> commaSep1 trefTerm
  where
    trefTerm = choice [TRSimple <$> identifier
                      ,try (TRQueryExpr <$> parens (queryExpr1 from2))
                      ,TRParens <$> parens trefTerm]

13.5. alias

trAliasTests :: [(String,[TableRef])]
trAliasTests = [("from a as b", [TRAlias (TRSimple "a") "b"])
               ,("from a b", [TRAlias (TRSimple "a") "b"])]

The alias can be treated like a postfix operator.

suffixWrapper :: (a -> Parser a) -> a -> Parser a
suffixWrapper p a = p a <|> return a

TODO: ?? not sure about this

from3 :: Parser [TableRef]
from3 =
    keyword_ "from" *> commaSep1 trefTerm
  where
    trefTerm = choice [TRSimple <$> identifier
                      ,try (TRQueryExpr <$> parens (queryExpr1 from3))
                      ,TRParens <$> parens trefTerm]
               >>= suffixWrapper alias
    alias tr = TRAlias tr <$> (optional (keyword_ "as") *> identifier)

How to make it keep nesting?

13.6. joins

Here is a casual sketch of the target grammar:

tref
(cross | [natural]
         ([inner]
          | left [outer]
          | right [outer]
          | full [outer]
         )
join tref
[on expr | using (...)]

Let’s start with parsers for the 'join operator' in the middle and for the join condition:

13.6.1. join type

joinType :: Parser JoinType
joinType = choice
    [JoinCross <$ keyword_ "cross" <* keyword_ "join"
    ,JoinInner <$ keyword_ "inner" <* keyword_ "join"
    ,JoinLeft <$ keyword_ "left"
           <* optional (keyword_ "outer")
           <* keyword_ "join"
    ,JoinRight <$ keyword_ "right"
            <* optional (keyword_ "outer")
            <* keyword_ "join"
    ,JoinFull <$ keyword_ "full"
           <* optional (keyword_ "outer")
           <* keyword_ "join"
    ,JoinInner <$ keyword_ "join"]
*Main> parseWithEof joinType "cross join"
Right JoinCross

*Main> parseWithEof joinType "inner join"
Right JoinInner

*Main> parseWithEof joinType "left outer join"
Right JoinLeft

*Main> parseWithEof joinType "left join"
Right JoinLeft

*Main> parseWithEof joinType "right outer join"
Right JoinRight

*Main> parseWithEof joinType "right join"
Right JoinRight

*Main> parseWithEof joinType "full outer join"
Right JFull

*Main> parseWithEof joinType "full join"
Right JoinFull

*Main> parseWithEof joinType "join"
Right JoinInner

I thought about factoring out the common bits with the joinType parser:

joinType0 :: Parser JoinType
joinType0 = choice
    [choice
     [JoinCross <$ try (keyword_ "cross")
     ,JoinInner <$ try (keyword_ "inner")
     ,choice [JoinLeft <$ try (keyword_ "left")
             ,JoinRight <$ try (keyword_ "right")
             ,JoinFull <$ try (keyword_ "full")]
      <* optional (try $ keyword_ "outer")]
     <* keyword_ "join"
    ,JoinInner <$ keyword_ "join"]

But I think the longer version is much easier to follow, even if it is a little more boring.

13.6.2. join condition

The idea with the join condition is that we pass a bool to say whether we’ve already seen the 'natural' keyword. If so, then we don’t try to parse 'on' or 'using'.

joinCondition :: Bool -> Parser JoinCondition
joinCondition nat =
    choice [guard nat >> return JoinNatural
           ,keyword_ "on" >> JoinOn <$> valueExpr []
           ,keyword_ "using" >> JoinUsing <$> parens (commaSep1 identifier)
           ]
*Main> parseWithEof (joinCondition False) "on a"
Right (JoinOn (Iden "a"))

*Main> parseWithEof (joinCondition False) "on a + b"
Right (JoinOn (BinOp (Iden "a") "+" (Iden "b")))

*Main> parseWithEof (joinCondition False) "using (a,b)"
Right (JoinUsing ["a","b"])

*Main> parseWithEof (joinCondition True) "using (a,b)"
Left (line 1, column 1):
unexpected 'u'
expecting end of input

*Main> parseWithEof (joinCondition True) ""
Right JoinNatural

13.6.3. simple binary join

Let’s try some simple binary joins:

simpleBinaryJoinTests :: [(String,[TableRef])]
simpleBinaryJoinTests =
    [("from a join b"
     ,[TRJoin (TRSimple "a") JoinInner (TRSimple "b") Nothing])

    ,("from a natural join b"
     ,[TRJoin (TRSimple "a") JoinInner (TRSimple "b") (Just JoinNatural)])

    ,("from a join b on a.x = b.y"
     ,[TRJoin (TRSimple "a") JoinInner (TRSimple "b")
       (Just $ JoinOn $ BinOp (DIden "a" "x") "="(DIden "b" "y"))])

    ,("from a join b using(x,y)"
     ,[TRJoin (TRSimple "a") JoinInner (TRSimple "b")
       (Just $ JoinUsing ["x","y"])])

    ,("from a cross join b"
     ,[TRJoin (TRSimple "a") JoinCross (TRSimple "b") Nothing])

    ]

We want to parse the first table, then optionally parse the 'natural' keyword, then the join type, then the second table, then optionally parse the join condition.

from4 :: Parser [TableRef]
from4 = keyword_ "from" >> (:[]) <$> do
     t0 <- simpleTref
     nat <- option False (True <$ keyword_ "natural")
     jt <- joinType
     t1 <- simpleTref
     jc <- optionMaybe (joinCondition nat)
     return $ TRJoin t0 jt t1 jc
  where
    simpleTref = TRSimple <$> identifier

Let’s start extending this into the full target parser. In this next version, I’ve tried to combine all the versions we’ve seen so far.

from5 :: Parser [TableRef]
from5 = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = do
        nat <- option False (True <$ keyword_ "natural")
        jt <- joinType
        t1 <- nonJoinTref
        jc <- optionMaybe (joinCondition nat)
        return $ TRJoin t0 jt t1 jc
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens (queryExpr1 from5))
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper alias
    alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> identifier))
*Main> H.runTestTT $ H.TestList $ map (makeTest from5) (multipleTRSimpleTests ++ trQueryExprTests ++ trParensTests ++ trAliasTests ++ simpleBinaryJoinTests)
 ### Failure in: 5:from a join b
(line 1, column 14):
unexpected end of input
expecting digit, letter, "_", "--" or "/*"
 ### Failure in: 6:from a natural join b
from a natural join b
expected: [TRJoin (TRSimple "a") JoinInner (TRSimple "b") (Just JoinNatural)]
 but got: [TRJoin (TRAlias (TRSimple "a") "natural") JoinInner (TRSimple "b") Nothing]
 ### Failure in: 7:from a join b on a.x = b.y
(line 1, column 15):
unexpected "o"
expecting "--" or "/*"
 ### Failure in: 8:from a join b using(x,y)
(line 1, column 15):
unexpected "u"
expecting "--" or "/*"
 ### Failure in: 9:from a cross join b
from a cross join b
expected: [TRJoin (TRSimple "a") JoinCross (TRSimple "b") Nothing]
 but got: [TRJoin (TRAlias (TRSimple "a") "cross") JoinInner (TRSimple "b") Nothing]
Cases: 10  Tried: 10  Errors: 0  Failures: 5
Counts {cases = 10, tried = 10, errors = 0, failures = 5}

What’s going wrong? If you look at some of the issues, it looks like we are getting keywords parsed as aliases. Let’s fix that first:

from6 :: Parser [TableRef]
from6 = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = do
        nat <- option False (True <$ keyword_ "natural")
        jt <- joinType
        t1 <- nonJoinTref
        jc <- optionMaybe (joinCondition nat)
        return $ TRJoin t0 jt t1 jc
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens (queryExpr1 from6))
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper alias
    alias tr = TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier)
    aliasIdentifier = blackListIdentifier
                      ["natural"
                      ,"inner"
                      ,"outer"
                      ,"cross"
                      ,"left"
                      ,"right"
                      ,"full"
                      ,"join"
                      ,"on"
                      ,"using"]

That didn’t solve the problem. I think we also have a problem since the alias can now fail after consuming input, we need to use try.

from7 :: Parser [TableRef]
from7 = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = do
        nat <- option False (True <$ keyword_ "natural")
        jt <- joinType
        t1 <- nonJoinTref
        jc <- optionMaybe (joinCondition nat)
        return $ TRJoin t0 jt t1 jc
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens (queryExpr1 from7))
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper (try . alias)
    alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
    aliasIdentifier = blackListIdentifier
                      ["natural"
                      ,"inner"
                      ,"outer"
                      ,"cross"
                      ,"left"
                      ,"right"
                      ,"full"
                      ,"join"
                      ,"on"
                      ,"using"]

The final step is to make it parse n-way explicit joins.

threeWayJoinTests :: [(String,[TableRef])]
threeWayJoinTests =
    [("from a join b join c"
     ,[TRJoin
       (TRJoin (TRSimple "a") JoinInner (TRSimple "b") Nothing)
       JoinInner (TRSimple "c") Nothing])]
from8 :: Parser [TableRef]
from8 = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = (do
         nat <- option False (True <$ keyword_ "natural")
         TRJoin t0 <$> joinType
                   <*> nonJoinTref
                   <*> optionMaybe (joinCondition nat))
        >>= suffixWrapper joinTrefSuffix
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens (queryExpr1 from8))
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper (try . alias)
    alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
    aliasIdentifier = blackListIdentifier
                      ["natural"
                      ,"inner"
                      ,"outer"
                      ,"cross"
                      ,"left"
                      ,"right"
                      ,"full"
                      ,"join"
                      ,"on"
                      ,"using"]

We get left associative with this code. I don’t know if this is correct.

We should do some more testing to make sure this code is good. TODO

13.7. query expressions

Let’s create the full query expression parser now:

queryExprJoinTests :: [(String,QueryExpr)]
queryExprJoinTests =
    [("select a from t"
     ,ms [TRSimple "t"])

    ,("select a from t,u"
     ,ms [TRSimple "t", TRSimple "u"])

    ,("select a from t inner join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t left join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinLeft (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t right join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinRight (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t full join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinFull (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t cross join u"
     ,ms [TRJoin (TRSimple "t") JoinCross (TRSimple "u") Nothing])

    ,("select a from t natural inner join u"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just JoinNatural)])

    ,("select a from t inner join u using(a,b)"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just $ JoinUsing ["a", "b"])])

    ,("select a from (select a from t)"
     ,ms [TRQueryExpr $ ms [TRSimple "t"]])

    ,("select a from t as u"
     ,ms [TRAlias (TRSimple "t") "u"])

    ,("select a from t u"
     ,ms [TRAlias (TRSimple "t") "u"])

    ,("select a from (t cross join u) as u"
     ,ms [TRAlias (TRParens $ TRJoin (TRSimple "t") JoinCross
                                     (TRSimple "u") Nothing) "u"])
    ]
  where
    ms f = makeSelect {qeSelectList = [(Iden "a",Nothing)]
                      ,qeFrom = f}

Here are all the other query expression tests updated with the new QueryExpr type.

singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
    [("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
    [("select a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
    ,("select a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)]})
    ,("select 1+2,3+4"
     ,makeSelect {qeSelectList =
                     [(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
                     ,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
    ]
selectListTests :: [(String,QueryExpr)]
selectListTests =
    [("select a as a, b as b"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a")
                                 ,(Iden "b", Just "b")]})

    ,("select a a, b b"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a")
                                 ,(Iden "b", Just "b")]})
    ] ++ multipleSelectItemsTests
      ++ singleSelectItemTests
fromTests :: [(String,QueryExpr)]
fromTests =
    [("select a from t"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = [TRSimple "t"]})]
whereTests :: [(String,QueryExpr)]
whereTests =
    [("select a from t where a = 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
    ]
groupByTests :: [(String,QueryExpr)]
groupByTests =
    [("select a,sum(b) from t group by a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a"]
                 })

    ,("select a,b,sum(c) from t group by a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)
                                 ,(App "sum" [Iden "c"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a",Iden "b"]
                 })
    ]
havingTests :: [(String,QueryExpr)]
havingTests =
  [("select a,sum(b) from t group by a having sum(b) > 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a"]
                 ,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
                 })
  ]
orderByTests :: [(String,QueryExpr)]
orderByTests =
    [("select a from t order by a"
     ,ms [Iden "a"])

    ,("select a from t order by a, b"
     ,ms [Iden "a", Iden "b"])
    ]
  where
    ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
                      ,qeFrom = [TRSimple "t"]
                      ,qeOrderBy = o}
*Main> H.runTestTT $ H.TestList $ map (makeTest (queryExpr1 from8)) (selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests ++ queryExprJoinTests)
 ### Failure in: 7:select a from t where a = 5
(line 1, column 25):
unexpected "="
expecting "--" or "/*"
 ### Failure in: 8:select a,sum(b) from t group by a
(line 1, column 33):
unexpected "a"
expecting "--" or "/*"
 ### Failure in: 9:select a,b,sum(c) from t group by a,b
(line 1, column 35):
unexpected "a"
expecting "--" or "/*"
 ### Failure in: 10:select a,sum(b) from t group by a having sum(b) > 5
(line 1, column 33):
unexpected "a"
expecting "--" or "/*"
 ### Failure in: 11:select a from t order by a
(line 1, column 26):
unexpected "a"
expecting "--" or "/*"
 ### Failure in: 12:select a from t order by a, b
(line 1, column 26):
unexpected "a"
expecting "--" or "/*"
Cases: 26  Tried: 26  Errors: 0  Failures: 6
Counts {cases = 26, tried = 26, errors = 0, failures = 6}

The problem is the table alias parser is trying to parse keywords again. Here is the from parser with the alias name blacklist expanded.

from :: Parser [TableRef]
from = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = (do
         nat <- option False (True <$ keyword_ "natural")
         TRJoin t0 <$> joinType
                   <*> nonJoinTref
                   <*> optionMaybe (joinCondition nat))
        >>= suffixWrapper joinTrefSuffix
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens queryExpr)
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper (try . alias)
    alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
    aliasIdentifier = blackListIdentifier
                      [-- join keywords
                       "natural"
                      ,"inner"
                      ,"outer"
                      ,"cross"
                      ,"left"
                      ,"right"
                      ,"full"
                      ,"join"
                      ,"on"
                      ,"using"
                       -- subsequent clause keywords
                      ,"where"
                      ,"group"
                      ,"having"
                      ,"order"
                      ]

Here is the final query expression parser:

queryExpr :: Parser QueryExpr
queryExpr = Select
            <$> selectList
            <*> option [] from
            <*> optionMaybe whereClause
            <*> option [] groupByClause
            <*> optionMaybe having
            <*> option [] orderBy

14. Simple SQL query parser

Here is the complete syntax, parser and tests for the value and query expressions so far as a self contained module

module SimpleSQLQueryParser0 where

import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char
import Text.Parsec.String.Combinator
import Text.Parsec (parse,ParseError)
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>), many)
import qualified Text.Parsec.String.Expr as E
import Control.Monad
--import Data.List (intercalate)
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import Debug.Trace

14.1. Supported SQL

14.1.1. comments

-- single line comment
/*
multiline
comment
*/

The /* */ comments do not nest.

14.1.2. value expressions

literals

postive integral literals and string literals with single quote, without escaping of single quote within string (so there is no way to create a string literal with a single quote in it).

1
500
'string literal'
identifiers

Unquoted identifiers only, an identifier may start with a letter or underscore, and contain letters, underscores and digits.

a
something
_test_
a123
dotted identifiers

Supports two part dotted identifiers only. Both parts must parse according to the rules for regular identifiers.

t.a
something.something_else
star

Star, plus dotted star using the identifier rules for the first part.

*
t.*
function application

The function name must parse as a valid identifier.

f()
g(1)
h(2,'something')
operators

Here is the range of operators supported.

a = b
a > b
a < b
a >= b
a <= b
a != b
a <> b
a and b
a or b
1 + 2
1 - 2
1 * 2
1 / 2
'some' || 'thing'
a like b
not a
case expression
case a
when 3 then 'got three'
when 5 then 'got five'
else 'neither'
end
case
when a = 3 then 'a is three'
when b = 4 then 'b is four'
else 'neither'
end

The else branch is optional in both cases.

parentheses
(1 + 2) * 3

Parentheses are explicit in the abstract syntax.

14.1.3. query expressions

TODO: examples

select queries only, no union, intersect or except. No common table expressions.

select list aliases, with the 'as' optional in the alias

'select * from t' 'select t.* from t' but not the alias version 'select * as (a,b,c) from t'.

from clause

implicit and explicit joins, including keywords natural, inner, outer, left, right, full, cross, on and using, plus parens and simple aliases (e.g. select a from t u, but not select a from t(a,b)).

where

group by lists but not the new group by options in SQL2003 (group by (), grouping sets, cube, rollup).

having

order by, with multiple columns, but not explicit asc or desc , and no 'nulls first' or 'nulls last' syntax.

No support for offset and fetch first, or variations.

14.2. Abstract syntax

data ValueExpr = StringLit String
               | NumLit Integer
               | Iden String
               | DIden String String -- a.b
               | Star
               | DStar String -- t.*
               | App String [ValueExpr]
               | PrefOp String ValueExpr
               | BinOp ValueExpr String ValueExpr
               | Case (Maybe ValueExpr) -- test value
                      [(ValueExpr,ValueExpr)] -- when branches
                      (Maybe ValueExpr) -- else value
               | Parens ValueExpr
                 deriving (Eq,Show)
data QueryExpr
    = Select
      {qeSelectList :: [(ValueExpr,Maybe String)]
      ,qeFrom :: [TableRef]
      ,qeWhere :: Maybe ValueExpr
      ,qeGroupBy :: [ValueExpr]
      ,qeHaving :: Maybe ValueExpr
      ,qeOrderBy :: [ValueExpr]
      } deriving (Eq,Show)
makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
                    ,qeFrom = []
                    ,qeWhere = Nothing
                    ,qeGroupBy = []
                    ,qeHaving = Nothing
                    ,qeOrderBy = []}
data TableRef = TRSimple String
              | TRJoin TableRef JoinType TableRef (Maybe JoinCondition)
              | TRParens TableRef
              | TRAlias TableRef String
              | TRQueryExpr QueryExpr
                deriving (Eq,Show)
data JoinType = JoinInner | JoinLeft | JoinRight | JoinFull | JoinCross
                deriving (Eq,Show)
data JoinCondition = JoinOn ValueExpr
                   | JoinUsing [String]
                   | JoinNatural
                   deriving (Eq,Show)

14.3. Value expression parsing

14.3.1. term components

num :: Parser ValueExpr
num = NumLit <$> integer
iden :: [String] -> Parser ValueExpr
iden blacklist = Iden <$> identifierBlacklist blacklist
parensValue :: Parser ValueExpr
parensValue = Parens <$> parens (valueExpr [])
stringLit :: Parser ValueExpr
stringLit = StringLit <$> stringToken
dIden :: Parser ValueExpr
dIden = DIden <$> identifier <*> (dot *> identifier)
star :: Parser ValueExpr
star = Star <$ symbol "*"
dstar :: Parser ValueExpr
dstar = DStar <$> (identifier <* dot <* symbol "*")
app :: Parser ValueExpr
app = App <$> identifier <*> parens (commaSep $ valueExpr [])

14.3.2. case

caseValue :: Parser ValueExpr
caseValue =
    Case
    <$> (keyword "case" *> optionMaybe caseVal)
    <*> many1 whenClause
    <*> optionMaybe elseClause
    <* keyword "end"
 where
   whenClause = (,) <$> (keyword "when" *> caseVal)
                    <*> (keyword "then" *> caseVal)
   elseClause = keyword "else" *> caseVal
   caseVal = valueExpr blackList
   blackList = ["case", "when", "then", "else", "end"]

14.3.3. term

term :: [String] -> Parser ValueExpr
term blackList = choice [caseValue
                        ,try app
                        ,try dstar
                        ,try dIden
                        ,iden blackList
                        ,num
                        ,parensValue
                        ,stringLit
                        ,star]

14.3.4. operators

table :: [[E.Operator ValueExpr]]
table = [[prefix "-", prefix "+"]
         ,[binary "^" E.AssocLeft]
         ,[binary "*" E.AssocLeft
          ,binary "/" E.AssocLeft
          ,binary "%" E.AssocLeft]
         ,[binary "+" E.AssocLeft
          ,binary "-" E.AssocLeft]
         ,[binary "<=" E.AssocRight
          ,binary ">=" E.AssocRight
          ,binaryK "like" E.AssocNone
          ,binary "!=" E.AssocRight
          ,binary "<>" E.AssocRight
          ,binary "||" E.AssocRight]
         ,[binary "<" E.AssocNone
          ,binary ">" E.AssocNone]
         ,[binary "=" E.AssocRight]
         ,[prefixK "not"]
         ,[binaryK "and" E.AssocLeft]
         ,[binaryK "or" E.AssocLeft]]
  where
    binary name assoc =
        E.Infix (mkBinOp name <$ symbol name) assoc
    mkBinOp nm a b = BinOp a nm b
    prefix name = E.Prefix (PrefOp name <$ symbol name)
    binaryK name assoc =
        E.Infix (mkBinOp name <$ keyword name) assoc
    prefixK name = E.Prefix (PrefOp name <$ keyword name)

14.3.5. valueExpr

valueExpr :: [String] -> Parser ValueExpr
valueExpr blackList = E.buildExpressionParser table (term blackList)

14.4. Query expression parsing

selectList :: Parser [(ValueExpr, Maybe String)]
selectList = keyword_ "select" *> commaSep1 selectItem
selectItem :: Parser (ValueExpr, Maybe String)
selectItem = (,) <$> valueExpr [] <*> optionMaybe (try alias)
  where alias = optional (keyword_ "as") *> identifierBlacklist ["from"]
whereClause :: Parser ValueExpr
whereClause = keyword_ "where" *> valueExpr []
groupByClause :: Parser [ValueExpr]
groupByClause = keyword_ "group" *> keyword_ "by"
                *> commaSep1 (valueExpr [])
having :: Parser ValueExpr
having = keyword_ "having" *> (valueExpr [])
orderBy :: Parser [ValueExpr]
orderBy = keyword_ "order" *> keyword_ "by"
          *> commaSep1 (valueExpr [])

14.4.1. from clause

from :: Parser [TableRef]
from = keyword_ "from" >> commaSep1 tref
  where
    tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
    joinTrefSuffix t0 = (do
         nat <- option False (True <$ keyword_ "natural")
         TRJoin t0 <$> joinType
                   <*> nonJoinTref
                   <*> optionMaybe (joinCondition nat))
        >>= suffixWrapper joinTrefSuffix
    nonJoinTref = choice [TRSimple <$> identifier
                         ,try (TRQueryExpr <$> parens queryExpr)
                         ,TRParens <$> parens tref]
                  >>= suffixWrapper (try . alias)
    alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
    aliasIdentifier = identifierBlacklist
                      [-- join keywords
                       "natural"
                      ,"inner"
                      ,"outer"
                      ,"cross"
                      ,"left"
                      ,"right"
                      ,"full"
                      ,"join"
                      ,"on"
                      ,"using"
                       -- subsequent clause keywords
                      ,"where"
                      ,"group"
                      ,"having"
                      ,"order"
                      ]
joinType :: Parser JoinType
joinType = choice
    [JoinCross <$ keyword_ "cross" <* keyword_ "join"
    ,JoinInner <$ keyword_ "inner" <* keyword_ "join"
    ,JoinLeft <$ keyword_ "left"
           <* optional (keyword_ "outer")
           <* keyword_ "join"
    ,JoinRight <$ keyword_ "right"
            <* optional (keyword_ "outer")
            <* keyword_ "join"
    ,JoinFull <$ keyword_ "full"
           <* optional (keyword_ "outer")
           <* keyword_ "join"
    ,JoinInner <$ keyword_ "join"]
joinCondition :: Bool -> Parser JoinCondition
joinCondition nat =
    choice [guard nat >> return JoinNatural
           ,keyword_ "on" >> JoinOn <$> valueExpr []
           ,keyword_ "using" >> JoinUsing <$> parens (commaSep1 identifier)
           ]

14.4.2. queryExpr

queryExpr :: Parser QueryExpr
queryExpr = Select
            <$> selectList
            <*> option [] from
            <*> optionMaybe whereClause
            <*> option [] groupByClause
            <*> optionMaybe having
            <*> option [] orderBy

14.5. tokens

whitespace :: Parser ()
whitespace =
    choice [simpleWhitespace *> whitespace
           ,lineComment *> whitespace
           ,blockComment *> whitespace
           ,return ()]
  where
    lineComment = try (string "--")
                  *> manyTill anyChar (void (char '\n') <|> eof)
    blockComment = try (string "/*")
                   *> manyTill anyChar (try $ string "*/")
    simpleWhitespace = void $ many1 (oneOf " \t\n")
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
  where
    firstChar = letter <|> char '_'
    nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = try $ lexeme $ do
    u <- many1 (oneOf "<>=+-^%/*!|")
    guard (s == u)
    return s
openParen :: Parser Char
openParen = lexeme $ char '('
closeParen :: Parser Char
closeParen = lexeme $ char ')'
stringToken :: Parser String
stringToken = lexeme (char '\'' *> manyTill anyChar (char '\''))
dot :: Parser Char
dot = lexeme $ char '.'
comma :: Parser Char
comma = lexeme $ char ','

14.6. helper functions

keyword :: String -> Parser String
keyword k = try $ do
    i <- identifier
    guard (i == k)
    return k
parens :: Parser a -> Parser a
parens = between openParen closeParen
commaSep :: Parser a -> Parser [a]
commaSep = (`sepBy` comma)
keyword_ :: String -> Parser ()
keyword_ = void . keyword
symbol_ :: String -> Parser ()
symbol_ = void . symbol
commaSep1 :: Parser a -> Parser [a]
commaSep1 = (`sepBy1` comma)
identifierBlacklist :: [String] -> Parser String
identifierBlacklist bl = do
    i <- identifier
    guard (i `notElem` bl)
    return i
suffixWrapper :: (a -> Parser a) -> a -> Parser a
suffixWrapper p a = p a <|> return a

14.7. the parser api

parseQueryExpr :: String -> Either ParseError QueryExpr
parseQueryExpr = parse (whitespace *> queryExpr <* eof) ""
parseValueExpr :: String -> Either ParseError ValueExpr
parseValueExpr = parse (whitespace *> valueExpr [] <* eof) ""

14.8. tests

data TestItem = Group String [TestItem]
              | ValueExpressionTest String ValueExpr
              | QueryExpressionTest String QueryExpr
numLitTests :: [(String,ValueExpr)]
numLitTests =
    [("1", NumLit 1)
    ,("54321", NumLit 54321)]

idenTests :: [(String,ValueExpr)]
idenTests =
    [("test", Iden "test")
    ,("_something3", Iden "_something3")]

operatorTests :: [(String,ValueExpr)]
operatorTests =
   map (\o -> (o ++ " a", PrefOp o (Iden "a"))) ["not", "+", "-"]
   ++ map (\o -> ("a " ++ o ++ " b", BinOp (Iden "a") o (Iden "b")))
          ["=",">","<", ">=", "<=", "!=", "<>"
          ,"and", "or", "+", "-", "*", "/", "||", "like"]

parensTests :: [(String,ValueExpr)]
parensTests = [("(1)", Parens (NumLit 1))]
basicTests :: [(String,ValueExpr)]
basicTests = numLitTests ++ idenTests ++ operatorTests ++ parensTests
stringLiteralTests :: [(String,ValueExpr)]
stringLiteralTests =
    [("''", StringLit "")
    ,("'test'", StringLit "test")]
dIdenTests :: [(String,ValueExpr)]
dIdenTests =
    [("t.a", DIden "t" "a")]
starTests :: [(String,ValueExpr)]
starTests = [("*", Star)]
dStarTests :: [(String,ValueExpr)]
dStarTests = [("t.*", DStar "t")]
appTests :: [(String,ValueExpr)]
appTests = [("f()", App "f" [])
           ,("f(1)", App "f" [NumLit 1])
           ,("f(1,a)", App "f" [NumLit 1, Iden "a"])]
caseTests :: [(String,ValueExpr)]
caseTests =
    [("case a when 1 then 2 end"
     ,Case (Just $ Iden "a") [(NumLit 1,NumLit 2)] Nothing)

    ,("case a when 1 then 2 when 3 then 4 end"
     ,Case (Just $ Iden "a")
           [(NumLit 1, NumLit 2)
           ,(NumLit 3, NumLit 4)]
           Nothing)

    ,("case a when 1 then 2 when 3 then 4 else 5 end"
     ,Case (Just $ Iden "a")
           [(NumLit 1, NumLit 2)
           ,(NumLit 3, NumLit 4)]
           (Just $ NumLit 5))

    ,("case when a=1 then 2 when a=3 then 4 else 5 end"
     ,Case Nothing
           [(BinOp (Iden "a") "=" (NumLit 1), NumLit 2)
           ,(BinOp (Iden "a") "=" (NumLit 3), NumLit 4)]
           (Just $ NumLit 5))
    ]
allValueExprTests :: [(String,ValueExpr)]
allValueExprTests = concat [basicTests
                            ,stringLiteralTests
                            ,dIdenTests
                            ,starTests
                            ,dStarTests
                            ,appTests
                            ,caseTests]
singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
    [("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
    [("select a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
    ,("select a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)]})
    ,("select 1+2,3+4"
     ,makeSelect {qeSelectList =
                     [(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
                     ,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
    ]
selectListTests :: [(String,QueryExpr)]
selectListTests =
    [("select a as a, b as b"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a")
                                 ,(Iden "b", Just "b")]})
    ,("select a a, b b"
     ,makeSelect {qeSelectList = [(Iden "a", Just "a")
                                 ,(Iden "b", Just "b")]})
    ] ++ multipleSelectItemsTests
      ++ singleSelectItemTests
fromTests :: [(String,QueryExpr)]
fromTests =
    [("select a from t"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = [TRSimple "t"]})]
whereTests :: [(String,QueryExpr)]
whereTests =
    [("select a from t where a = 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
    ]
groupByTests :: [(String,QueryExpr)]
groupByTests =
    [("select a,sum(b) from t group by a"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a"]
                 })
    ,("select a,b,sum(c) from t group by a,b"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(Iden "b",Nothing)
                                 ,(App "sum" [Iden "c"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a",Iden "b"]
                 })
    ]
havingTests :: [(String,QueryExpr)]
havingTests =
  [("select a,sum(b) from t group by a having sum(b) > 5"
     ,makeSelect {qeSelectList = [(Iden "a",Nothing)
                                 ,(App "sum" [Iden "b"],Nothing)]
                 ,qeFrom = [TRSimple "t"]
                 ,qeGroupBy = [Iden "a"]
                 ,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
                 })
  ]
orderByTests :: [(String,QueryExpr)]
orderByTests =
    [("select a from t order by a"
     ,ms [Iden "a"])
    ,("select a from t order by a, b"
     ,ms [Iden "a", Iden "b"])
    ]
  where
    ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
                      ,qeFrom = [TRSimple "t"]
                      ,qeOrderBy = o}
queryExprJoinTests :: [(String,QueryExpr)]
queryExprJoinTests =
    [("select a from t"
     ,ms [TRSimple "t"])
    ,("select a from t,u"
     ,ms [TRSimple "t", TRSimple "u"])

    ,("select a from t inner join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t left join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinLeft (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t right join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinRight (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t full join u on expr"
     ,ms [TRJoin (TRSimple "t") JoinFull (TRSimple "u")
          (Just $ JoinOn $ Iden "expr")])

    ,("select a from t cross join u"
     ,ms [TRJoin (TRSimple "t") JoinCross (TRSimple "u") Nothing])

    ,("select a from t natural inner join u"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just JoinNatural)])

    ,("select a from t inner join u using(a,b)"
     ,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
          (Just $ JoinUsing ["a", "b"])])

    ,("select a from (select a from t)"
     ,ms [TRQueryExpr $ ms [TRSimple "t"]])

    ,("select a from t as u"
     ,ms [TRAlias (TRSimple "t") "u"])

    ,("select a from t u"
     ,ms [TRAlias (TRSimple "t") "u"])

    ,("select a from (t cross join u) as u"
     ,ms [TRAlias (TRParens $ TRJoin (TRSimple "t") JoinCross
                                     (TRSimple "u") Nothing) "u"])
    ]
  where
    ms f = makeSelect {qeSelectList = [(Iden "a",Nothing)]
                      ,qeFrom = f}
allQueryExprTests :: [(String,QueryExpr)]
allQueryExprTests = concat [selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests ++ queryExprJoinTests]

todo: use external api parsing code

makeTest :: (Eq a, Show a) => Parser a -> (String,a) -> H.Test
makeTest parser (src,expected) = H.TestLabel src $ H.TestCase $ do
    let gote = parse (whitespace *> parser <* eof) "" src
    case gote of
      Left e -> H.assertFailure $ show e
      Right got -> H.assertEqual src expected got

15. Pretty printing

Here is a pretty printer for the parser version 0.

It uses Text.PrettyPrint module. I haven’t put much comments/explanation here.

module PrettyPrinting0 where
import SimpleSQLQueryParser0 (ValueExpr(..), QueryExpr(..), TableRef(..)
                             ,JoinType(..), JoinCondition(..))
import qualified SimpleSQLQueryParser0 as S
import Text.PrettyPrint (render, vcat, text, (<>), (<+>), empty, parens,
                         nest, Doc, punctuate, comma, sep, quotes,
                         doubleQuotes, hsep)
import Data.Maybe (maybeToList, catMaybes)
import FunctionsAndTypesForParsing
--import Text.Parsec.String.Combinator (eof)
import qualified Test.HUnit as H
import Text.Parsec (parse, ParseError)
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>), many)
--import Text.Parsec.String (Parser)

The basic concept in this pretty printer is we convert the ast into Doc values, then convert these into a string. We have a bunch of functions to convert String to `Doc`s, and combine these docs with different layouts.

By using a pretty printer library, we can get human readable source very easily compared with trying to convert directly to strings ourselves.

15.1. api

prettyQueryExpr :: QueryExpr -> String
prettyQueryExpr = render . queryExpr

15.2. value expressions

valueExpr :: ValueExpr -> Doc
valueExpr (StringLit s) = quotes $ text s
valueExpr (NumLit i) = text $ show i
valueExpr (Iden s) = text s
valueExpr (DIden q i) = text q <> text "." <> text i
valueExpr Star = text "*"
valueExpr (DStar q) = text q <> text "." <> text "*"
valueExpr (App f es) = text f <> parens (commaSep $ map valueExpr es)
valueExpr (PrefOp op e) = sep[text op, valueExpr e]
valueExpr (BinOp e0 op e1) = sep [valueExpr e0, text op, valueExpr e1]
valueExpr (Case test whens els) =
    sep [text "case" <+> maybe empty valueExpr test
        ,nest 5 $ sep (map wh whens
                       ++ [maybe empty
                           (\e -> text "else" <+> valueExpr e)
                           els])
        ,text "end"]
  where wh (w,t) = sep [text "when" <+> valueExpr w
                       ,text "then" <+> valueExpr t]
valueExpr (Parens e) = parens $ valueExpr e

15.3. query expressions

queryExpr :: QueryExpr -> Doc
queryExpr (Select sl fr wh gb hv ob) = sep
    [text "select" <+> nest 7 (commaSep $ map selectItem sl)
     -- from
    ,ml fr $ \f -> text "from" <+> nest 7 (commaSep $ map tref f)
     -- where
    ,me wh $ \w -> text "where" <+> nest 6 (valueExpr w)
     -- group by
    ,ml gb $ \g -> text "group by" <+> nest 9 (commaSep $ map valueExpr g)
     -- having
    ,me hv $ \h -> text "having" <+> nest 6 (valueExpr h)
     -- order by
    ,ml ob $ \o -> text "order by" <+> nest 9 (commaSep $ map valueExpr o)
    ]
  where
    selectItem (e,a) = valueExpr e <+> me a (\a' -> text "as" <+> text a')

    tref (TRSimple t) = text t
    tref (TRParens t) = parens $ tref t
    tref (TRAlias t a) = tref t <+> text "as" <+> text a
    tref (TRQueryExpr q) = parens $ queryExpr q
    tref (TRJoin t0 jt t1 jc) = sep
        [tref t0
        ,joinName jt jc <+> tref t1
        ,case jc of
             Just (JoinOn e) -> text "on" <+> valueExpr e
             Just (JoinUsing is) -> text "using" <+> parens (commaSep $ map text is)
             Just JoinNatural -> empty
             Nothing -> empty]
    joinName jt jc =
      hsep [case jc of
               Just JoinNatural -> text "natural"
               _ -> empty
           ,case jt of
               JoinInner -> text "inner join"
               JoinCross -> text "cross join"
               JoinLeft -> text "left join"
               JoinRight -> text "right join"
               JoinFull -> text "full join"]
    me e r = maybe empty r e
    ml [] _ = empty
    ml l r = r l

15.4. helpers

commaSep :: [Doc] -> Doc
commaSep = sep . punctuate comma

Have a look at the haddock for this module and see if you can work out how the code above works.

*PrettyPrinting0> either (error . show) valueExpr (parseWithEof (S.valueExpr []) "a and b")
a and b

*PrettyPrinting0> either (error . show) queryExpr (parseWithEof S.queryExpr "select a from t inner join u using(a,b)")
select a from t inner join u using (a, b)

15.5. tests

Now we can do some tests: we take the previous test data, and for each test add an additional test which pretty prints then parses the results to see that it is unchanged.

makeTest :: (Eq a, Show a) =>
            (String -> Either ParseError a)
         -> (a -> String)
         -> (String,a)
         -> H.Test
makeTest parser pretty (src,expected) = H.TestLabel src $ H.TestCase $ do
    let gote = parser src
    case gote of
      Left e -> H.assertFailure $ show e
      Right got -> do
        H.assertEqual src expected got
        let prsql = pretty got
            gotpretty = parser prsql
        case gotpretty of
          Left e -> H.assertFailure $ "pretty: " ++ prsql ++ "\n" ++ show e
          Right gotp -> H.assertEqual ("pretty: " ++ prsql) expected gotp

TODO: fix parsing issue

*PrettyPrinting0> H.runTestTT $ H.TestList $ map (makeTest S.parseQueryExpr (render . queryExpr)) S.allQueryExprTests
Cases: 26  Tried: 26  Errors: 0  Failures: 0
Counts {cases = 26, tried = 26, errors = 0, failures = 0}
*PrettyPrinting0> H.runTestTT $ H.TestList $ map (makeTest S.parseValueExpr (render . valueExpr)) S.allValueExprTests
 ### Failure in: 34:case when a=1 then 2 when a=3 then 4 else 5 end
(line 1, column 11):
unexpected "a"
expecting "--" or "/*"
Cases: 35  Tried: 35  Errors: 0  Failures: 1
Counts {cases = 35, tried = 35, errors = 0, failures = 1}

16. Error messages

In this tutorial, we will start looking at the error messages generated by Parsec and how we can influence to improve them.

TODO:

review the different methods of generating failure in parsec and what they look like, including choice.

examine how rearranging parsers can change the error messages, committing to a parsing branch to improve the messages

also possibly look at adding additional parsing to detect common errors to provide a nicer error messages

maybe talk about the idea of being permissive in the parsing proper stage, then doing additional syntax-type checks on the ast after

16.1. ways of generating errors in parsec

16.2. how parsec combines errors or drops them into the void

16.3. Parsing tpch

Let’s try the parser out on the TPC-H queries.

Summary of errors so far: q1: typed literal: type_name 'literal value' q2: scalar subquery q3: typed literal q4: typed literal q5: typed literal q6: typed literal q7: ?? q8: extract?? q9: extract ?? q10: typed literal q11: scalar sub query q12: ?? q13: not like?? q14: decimal literal q15: cte q16: count distinct q17: decimal literal q18: in subquery q19: in literal list q20: in subquery q21: exists subquery q22: substring

TODO: try to figure out each issue. Not very easy: the error messages are not very good.

Do some ad hoc stuff to try to improve each error message? Keep simplifying expressions then try to understand specific principles. Then make some generalizations that can be made across the parser

16.4. value expressions in isolation

We can start with the real error code now by considering the valueexpression parser in isolation.

ideas: error inside function parameter list bad identifier unrecognised operator unrecognised chars unmatching parens keyword errors in case unterminated string lit 3 dot components binary op with missing second arg associativity errors with binary ops? blacklist errors

16.5. query expressions

mispelling keywords at this level clauses in wrong order adding extra tokens or removing them at clause/value expression boundaries

16.6. summarizing?

16.7. updated parsing code with improved error message behaviour