1. Overview
WIP, a tutorial which demonstrates the basics of Parsec and goes on to build a SQL query parser.
You can view this tutorial as HTML online here:
and you can view the files directly in the github repository here:
1.1. Summary of sections
Introduction to parsing with Parsec, including a review of Text.Parsec.Char functions.
Creating a very simple expression language parser, and introducing some functions from Text.Parsec.Combinator.
Rewriting the simple expression parser code in a more succinct style.
Review and examples of all functions from Text.Parsec.Combinator, and some from Control.Applicative and Control.Monad.
The utility functions used in the previous tutorials, plus some notes on types in Parsec.
This covers using the Text.Parsec.Expr for expression parsing with prefix, postfix and infix operators with fixity.
Looks at an issue we have with the way the symbol parser in the Text.Parsec.Expr tutorial was used, and some possible fixes.
This covers the Text.Parsec.Perm module which is used for parsing different things in flexible order.
This covers Text.Parsec.Token which can be used to create token parsers easily.
This covers building a parser a subset of value expressions from SQL, which are an extension of the simple expression types and parsers covered in previous tutorials.
This covers building a parser to parse query expressions with select lists, simple from, where, group by, having and order by.
This extend the parser for query expressions to support a from clause with much more features including joins.
Here is the code from ValueExpressions, QueryExpressions and FromClause plus tests put together and rearranged as a coherent standalone module.
This quick module covers a simple pretty printer for our SQL ast.
In this document, we will explore error messages with parsec and how restructuring parser code can lead to better or worse error messages.
1.2. Going further
If you are interested in SQL parsing, check out the project to build a complete SQL parser here: http://jakewheat.github.io/simple-sql-parser/latest. The parsing code in the simple-sql-parser project is based on this tutorial code.
1.3. Extras
an executable which contains the boilerplate to run a parsec parser on a string passed as an argument
an executable which contains the boilerplate to run a parsec parser on a file passed as an argument
Contact: jakewheatmail@gmail.com
License: BSD3
2. Getting started
This is an introduction to parsing with Haskell and Parsec.
Prerequisites: you should know some basic Haskell and have GHC and cabal-install installed (installing the Haskell Platform will give you this).
This tutorial was originally written using GHC 7.6.3 and Parsec 3.1.3, which are the versions which come with the Haskell Platform 2013.2.0.0. It should also work fine with GHC 7.8.4 and GHC 7.10.2 and through to at least the latest release of Parsec 3.1.x.
This tutorial was written using Literate Haskell files available here: https://github.com/JakeWheat/intro_to_parsing.
I recommend you download them all, and follow along in your favourite editor, and use GHCi to experiment. To download the intro_to_parsing files, use:
git clone https://github.com/JakeWheat/intro_to_parsing.git
Here are the imports.
import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (anyChar)
import Text.Parsec.String.Char
import FunctionsAndTypesForParsing (regularParse, parseWithEof, parseWithLeftOver)
import Data.Char
import Text.Parsec.String.Combinator (many1)
2.1. First parser
The first parser:
anyChar :: Parser Char
This parser is in the module Text.Parsec.Char
. There is a wrapper in
this tutorial’s project, Text.Parsec.String.Char
, which gives this
function a simplified type.
Whenever we write a parser which parses to a value of type a
, we give
it the return type of Parser a
. In this case, we parse a character
so the return type is Parser Char
. The Parser
type itself is in
the module Text.Parsec.String
. We will cover this in more detail
later.
Let’s use this parser. I will assume you have GHC and cabal-install installed (which provides the 'cabal' executable) and both are in your PATH. The Haskell Platform is one way that provides this.
Change to the directory where you downloaded the intro_to_parsing source files (which will contain the GettingStarted.lhs file). Then you can set up a cabal sandbox and be ready to work with the code by running the following commands in that directory:
cabal v1-update
cabal v1-sandbox init
cabal v1-install parsec HUnit
cabal v1-repl
Now you will get the ghci prompt. Type in ':l GettingStarted.lhs'. You
can run the parser using a wrapper, enter the following at the ghci
prompt: regularParse anyChar "a"
.
Here is a transcript of running ghci via 'cabal repl':
$ cabal repl
Warning: The repl command is a part of the legacy v1 style of cabal usage.
Please switch to using either the new project style and the new-repl command
or the legacy v1-repl alias as new-style projects will become the default in
the next version of cabal-install. Please file a bug if you cannot replicate a
working v1- use case with the new-style commands.
For more information, see: https://wiki.haskell.org/Cabal/NewBuild
GHCi, version 8.6.5: http://www.haskell.org/ghc/ :? for help
Prelude> :l GettingStarted.lhs
[1 of 5] Compiling Text.Parsec.String.Char ( Text/Parsec/String/Char.hs, interpreted )
[2 of 5] Compiling Text.Parsec.String.Combinator ( Text/Parsec/String/Combinator.hs, interpreted )
[3 of 5] Compiling Text.Parsec.String.Parsec ( Text/Parsec/String/Parsec.hs, interpreted )
[4 of 5] Compiling FunctionsAndTypesForParsing ( FunctionsAndTypesForParsing.lhs, interpreted )
[5 of 5] Compiling Main ( GettingStarted.lhs, interpreted )
Ok, five modules loaded.
*Main> regularParse anyChar "a"
Right 'a'
*Main>
You can exit ghci by entering ':quit' or using Ctrl-d. From now on, to start ghci again, you can just change to the directory with GettingStarted.lhs and run 'cabal repl'. ghci should have readline support so you can browse through your command history using up and down arrow, etc.
This is the type of regularParse
. It is wrapper which takes a parser
function such as anyChar, and wraps it so you can parse a string to
either a parse error, or the return value from your parser function:
regularParse :: Parser a -> String -> Either ParseError a
Here are some examples of running this parser on various input:
*Main> regularParse anyChar "a"
Right 'a'
*Main> regularParse anyChar "b"
Right 'b'
*Main> regularParse anyChar "0"
Right '0'
*Main> regularParse anyChar " "
Right ' '
*Main> regularParse anyChar "\n"
Right '\n'
*Main> regularParse anyChar "aa"
Right 'a'
*Main> regularParse anyChar ""
Left (line 1, column 1):
unexpected end of input
*Main> regularParse anyChar " a"
Right ' '
You can see that if there are no characters, we get an error.
Otherwise, it takes the first character and returns it, and throws
away any trailing characters. The details of the helper function
regularParse
will come later.
Here are two alternatives to regularParse
you can also use for
experimenting for the time being:
parseWithEof :: Parser a -> String -> Either ParseError a
parseWithLeftOver :: Parser a -> String -> Either ParseError (a,String)
These can be useful when you are not sure if your parser is consuming all your input string or not. The eof parser will error if you haven’t consumed all the input, and the leftover parser can instead tell you what was not consumed from the input.
*Main> regularParse anyChar "a"
Right 'a'
*Main> parseWithEof anyChar "a"
Right 'a'
*Main> parseWithLeftOver anyChar "a"
Right ('a',"")
*Main> *Main> regularParse anyChar ""
Left (line 1, column 1):
unexpected end of input
*Main> parseWithEof anyChar ""
Left (line 1, column 1):
unexpected end of input
*Main> parseWithLeftOver anyChar ""
Left (line 1, column 1):
unexpected end of input
*Main> regularParse anyChar "aa"
Right 'a'
*Main> parseWithEof anyChar "aa"
Left (line 1, column 2):
unexpected 'a'
expecting end of input
*Main> parseWithLeftOver anyChar "aa"
Right ('a',"a")
*Main> parseWithLeftOver anyChar "abc"
Right ('a',"bc")
You can use these functions and ghci to experiment. Try running all the parsers in ghci on various input strings as you work through the document to get a good feel for all the different features. Tip: you can also write the parsers inline in the function call, for example:
*Main> regularParse (many1 digit) "1"
Right "1"
*Main> regularParse (many1 digit) "122"
Right "122"
This can be used to quickly try out new ad hoc parsing code.
2.2. Type signatures
The real Parsec functions have quite complex type signatures. This
makes a lot of things very tricky before you understand them, and can
make some of the error messages you’ll see really difficult to
understand. I’ve created some wrapper modules, which set the types of
all the functions from Parsec we use to be much more restricted. This
will make the types easy to understand, and reduce the amount of
tricky to understand compiler errors you get. You can use this
approach when writing your own parser code with Parsec. These wrapper
modules are created with the following name pattern:
Text.Parsec.Char
→ Text.Parsec.String.Char
.
Later on, we will look at the general types in more detail.
2.3. Text.Parsec.Char
Let’s go through some of the functions in Text.Parsec.Char
module from
the Parsec package. The haddock is available here:
http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html.
Here is the satisfy
function, with its full type signature.
satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char
This is one of the main primitive functions in Parsec. This looks at
the next character from the current input, and if the function (Char
→ Bool
) returns true for this character, it 'pops' it from the input
and returns it. In this way, the current position in the input string
is tracked behind the scenes.
In the simplified type wrappers, the satisfy
function’s type is this:
satisfy :: (Char -> Bool) -> Parser Char
This makes it a bit clearer what it is doing. All the functions in
Text.Parsec.Char
are reproduced in the local Text.Parsec.String.Char
module with simplified types
(<https://github.com/JakeWheat/intro_to_parsing/blob/master/Text/Parsec/String/Char.hs>).
Here are some examples of satisfy in action.
*Main> parseWithEof (satisfy (=='a')) "a"
Right 'a'
*Main> parseWithEof (satisfy (=='b')) "a"
Left (line 1, column 1):
unexpected "a"
*Main> parseWithEof (satisfy (`elem` "abc")) "a"
Right 'a'
*Main> parseWithEof (satisfy (`elem` "abc")) "d"
Left (line 1, column 1):
unexpected "d"
*Main> parseWithEof (satisfy isDigit) "d"
Left (line 1, column 1):
unexpected "d"
*Main> parseWithEof (satisfy isDigit) "1"
Right '1'
You can see that it is easy to use ==
, or elem
or one of the
functions from the Data.Char module.
If you look at the docs on hackage
http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Char.html,
you can view the source. The implementations of most of the functions
in Text.Parsec.Char
are straightforward. I recommend you look at the
source for all of these functions.
You can see in the source that the satisfy
function is a little more
primitive than the other functions.
Here is the parser we used above in the anyChar
parser:
anyChar :: Parser Char
If you look at the source via the haddock link above, you can see it
uses satisfy
.
Here are some other simple wrappers of satisfy
from
Text.Parsec.Char
which use different validation functions.
The char
parser parses a specific character which you supply:
char :: Char -> Parser Char
*Main> regularParse (char 'a') "a"
Right 'a'
*Main> regularParse (char 'a') "b"
Left (line 1, column 1):
unexpected "b"
expecting "a"
These parsers all parse single hardcoded characters
space :: Parser Char
newline :: Parser Char
tab :: Parser Char
They all return a Char
. You might be able to guess what Char
each
of them returns, you can double check your intuition using ghci.
These parser all parse one character from a hardcoded set of characters:
upper :: Parser Char
lower :: Parser Char
alphaNum :: Parser Char
letter :: Parser Char
digit :: Parser Char
hexDigit :: Parser Char
octDigit :: Parser Char
In these cases, the return value is less redundant.
oneOf
and noneOf
parse any of the characters in the given list
oneOf :: [Char] -> Parser Char
noneOf :: [Char] -> Parser Char
These are just simple wrappers of satisfy using elem
.
You should try all these parsers out in ghci, e.g.:
regularParse space " "
regularParse upper "A"
regularParse (char 'b') "B"
regularParse (oneOf "abc") "c"
Here are the final functions in Text.Parsec.Char
:
string
matches a complete string, one character at a time. I think
the implementation of this function is like it is for efficiency when
parsing from, e.g., Data.Text.Text
, instead of String
, but I’m not
sure. We will skip the detailed explanation of the implementation for
now.
string :: String -> Parser String
*Main> regularParse (string "one") "one"
Right "one"
*Main> regularParse (string "one") "two"
Left (line 1, column 1):
unexpected "t"
expecting "one"
Here is the spaces
parser, which, if you look at the source, you can
see uses a combinator (skipMany
). We will cover this combinator
shortly.
spaces :: Parser ()
*Main> regularParse spaces ""
Right ()
*Main> regularParse spaces " "
Right ()
*Main> regularParse spaces " "
Right ()
*Main> regularParse spaces " a "
Right ()
*Main> regularParse spaces "a a "
Right ()
It always succeeds.
2.4. A couple of helper executables
Here are two exes which you can use to parse either a string or a file to help you experiment. This will save you having to figure out how to write this boilerplate until later.
Now you can easily experiment using ghci, or with a string on the command line, or by putting the text to parse into a file and parsing that.
3. Very simple expression parsing
In this tutorial we will develop a parser for a very simple expression language, and start learning about the set of combinators which comes with Parsec.
import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char (oneOf, char, digit, satisfy)
import Text.Parsec.String.Combinator (many1, choice, chainl1)
import Control.Applicative ((<|>), many)
import Control.Monad (void)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing
3.1. num
The first element we will have in this expression language is positive integral numbers:
numberExamples :: [(String,Integer)]
numberExamples = [("1", 1)
,("23", 23)]
TODO: make examples with parsing failures for all of the example scripts below?
To parse a number, we need to parse one or more digits, and then read
the resulting string. We can use the combinator many1
to help with
this. We will also use do notation.
num :: Parser Integer
num = do
n <- many1 digit
return (read n)
Let’s try it out.
*Main> regularParse num "1"
Right 1
*Main> regularParse num "123456"
Right 123456
*Main> regularParse num "aa"
Left (line 1, column 1):
unexpected "a"
expecting digit
How does it work? First, we parse one or more (many1
) digits (digit
),
and give the result the name 'n'. Then we convert the string to an
integer using read
.
The many1
function’s type looks like this:
many1 :: Parser a -> Parser [a]
It applies the parser given one or more times, returning the result.
Let’s see what happens when we use the many
combinator which parses
zero or more items instead of one or more.
num1 :: Parser Integer
num1 = do
n <- many digit
return (read n)
*Main> regularParse num1 "1"
Right 1
*Main> regularParse num1 "123456"
Right 123456
*Main> regularParse num1 "aa"
Right *** Exception: Prelude.read: no parse
3.2. var
For var, we have to decide on a syntax for the identifiers. Let’s go for a common choice: identifiers must start with a letter or underscore, and then they can be followed by zero or more letters, underscores or digits in any combination.
varExamples :: [(String,String)]
varExamples = [("test", "test")
,("_stuff", "_stuff")
,("_1234", "_1234")]
var :: Parser String
var = do
fc <- firstChar
rest <- many nonFirstChar
return (fc:rest)
where
firstChar = satisfy (\a -> isLetter a || a == '_')
nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
This time, we create two helper parsers: firstChar
, which parses a
letter or underscore, and nonFirstChar
which parses a digit, letter
or underscore. This time, we use the many
function instead of
many1
.
Try it out in ghci. I like to try things which you expect to work, and also to try things which you expect to not work and make sure you get an error.
3.3. parens
The parens parser will eventually parse any expression inside parentheses. First it will just parse integers inside parentheses.
data Parentheses = Parentheses Integer
deriving (Eq,Show)
parensExamples :: [(String, Parentheses)]
parensExamples = [("(1)", Parentheses 1)
,("(17)", Parentheses 17)]
parens :: Parser Parentheses
parens = do
void $ char '('
e <- many1 digit
void $ char ')'
return (Parentheses (read e))
There is a new function: void
. This might be familiar to you
already. This is used to ignore the result of the char
parser, since
we are not interested in this value. You can also write the function
without void
, but ghc will give you a warning if you have warnings
turned on.
One way of turning warnings on in ghci is to enter :set -Wall
at the
ghci prompt.
parens' :: Parser Parentheses
parens' = do
char '('
e <- many1 digit
char ')'
return (Parentheses (read e))
*Main> :set -Wall
*Main> :l "VerySimpleExpressions.lhs"
...
FirstRealParsing.lhs:140:7: Warning:
A do-notation statement discarded a result of type Char.
Suppress this warning by saying "_ <- char '('",
or by using the flag -fno-warn-unused-do-bind
FirstRealParsing.lhs:142:7: Warning:
A do-notation statement discarded a result of type Char.
Suppress this warning by saying "_ <- char ')'",
or by using the flag -fno-warn-unused-do-bind
...
As you can see, another way to suppress the warning is to use
_ ← char '('
.
One issue with this parser is that it doesn’t handle whitespace:
*Main> regularParse parens "(1)"
Right (Parentheses 1)
*Main> regularParse parens "( 1)"
Left (line 1, column 2):
unexpected " "
expecting digit
*Main> regularParse parens "(1 )"
Left (line 1, column 3):
unexpected " "
expecting digit or ")"
We will look at this issue below.
3.4. add
Now we will write a little parser to parse strings like 'a+b' where a and b are numbers.
data SingleAdd = SingleAdd Integer Integer
deriving (Eq,Show)
singleAddExamples :: [(String, SingleAdd)]
singleAddExamples = [("1+2", SingleAdd 1 2)
,("101+202", SingleAdd 101 202)]
add :: Parser SingleAdd
add = do
e0 <- many1 digit
void $ char '+'
e1 <- many1 digit
return (SingleAdd (read e0) (read e1))
It has the same whitespace issues as the parens parser.
*Main> regularParse add "1+2"
Right (SingleAdd 1 2)
*Main> regularParse add "1 +2"
Left (line 1, column 2):
unexpected " "
expecting digit or "+"
3.5. whitespace
Here is a parser which will skip zero or more whitespace characters.
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
We can use this to make our parsers handle whitespace better.
*Main> regularParse whitespace " "
Right ()
*Main> regularParse whitespace " "
Right ()
*Main> regularParse whitespace "\t"
Right ()
*Main> regularParse whitespace " \n "
Right ()
*Main> regularParse whitespace ""
Right ()
Notice that it always succeeds.
Here is the parens parser rewritten with a common approach to whitespace handling:
parensW :: Parser Parentheses
parensW = do
whitespace
void $ char '('
whitespace
e <- many1 digit
whitespace
void $ char ')'
whitespace
return (Parentheses (read e))
*Main> regularParse parensW "(1)"
Right (Parentheses 1)
*Main> regularParse parensW " (1)"
Right (Parentheses 1)
*Main> regularParse parensW " (1 )"
Right (Parentheses 1)
*Main> regularParse parensW " ( 1 ) "
Right (Parentheses 1)
Looks good.
In the original parsec documentation, one of the concepts mentioned is the idea of 'lexeme' parsing. This is a style in which every token parser should also consume and ignore any trailing whitespace.
This is a simple convention which with a bit of care allows skipping
whitespace exactly once wherever it needs to be skipped. To complete
the lexeme style, we should also always skip leading whitespace at the
top level only. This feels more elegant than spamming all the parsing
code with many calls to whitespace
.
lexeme :: Parser a -> Parser a
lexeme p = do
x <- p
whitespace
return x
parseWithWhitespace :: Parser a -> String -> Either ParseError a
parseWithWhitespace p = parseWithEof wrapper
where
wrapper = do
whitespace
p
Here is the parens parser rewritten to use lexeme:
parensL :: Parser Parentheses
parensL = do
void $ lexeme $ char '('
e <- lexeme $ many1 digit
void $ lexeme $ char ')'
return (Parentheses (read e))
*Main> parseWithWhitespace parensL "(1)"
Right (Parentheses 1)
*Main> parseWithWhitespace parensL " (1)"
Right (Parentheses 1)
*Main> parseWithWhitespace parensL " ( 1)"
Right (Parentheses 1)
*Main> parseWithWhitespace parensL " ( 1 ) "
Right (Parentheses 1)
The parseWithWhitespace
function can also use (>>)
to make it a
bit shorter, wrapper = whiteSpace >> p
.
Here is the shorter version of this function using (>>)
:
parseWithWhitespace' :: Parser a -> String -> Either ParseError a
parseWithWhitespace' p = parseWithEof (whitespace >> p)
Try rewriting the SingleAdd parser to use lexeme
, and test it out to
convince yourself that it skips whitespace correctly.
3.6. simple expr
Now we are ready to write a parser which parses simple expressions made from these components. Here is the data type for these expressions.
data SimpleExpr = Num Integer
| Var String
| Add SimpleExpr SimpleExpr
| Parens SimpleExpr
deriving (Eq,Show)
It’s so simple that it is almost useless at the moment.
simpleExprExamples :: [(String,SimpleExpr)]
simpleExprExamples =
[("a", Var "a")
,("1", Num 1)
,("2 + 3", Add (Num 2) (Num 3))
,("(42)", Parens (Num 42))]
TODO: some more complex examples
Here are all our component parsers with lexeme
, and with the
SimpleExpr
constructors:
numE :: Parser SimpleExpr
numE = do
n <- lexeme $ many1 digit
return $ Num $ read n
There doesn’t seem to be a unique obviously correct place to put the lexeme call in the var parser:
varE :: Parser SimpleExpr
varE = lexeme $ do
fc <- firstChar
rest <- many nonFirstChar
return $ Var (fc:rest)
where
firstChar = satisfy (\a -> isLetter a || a == '_')
nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
Here is an alternative, with the call to lexeme in a different place, but gives effectively the same function.
varE' :: Parser SimpleExpr
varE' = do
fc <- firstChar
rest <- lexeme $ many nonFirstChar
return $ Var (fc:rest)
where
firstChar = satisfy (\a -> isLetter a || a == '_')
nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
parensE :: Parser SimpleExpr
parensE = do
void $ lexeme $ char '('
e <- lexeme $ many1 digit
void $ lexeme $ char ')'
return $ Parens $ Num $ read e
In the parens parser, we can reuse the numE
parser like this:
parensE' :: Parser SimpleExpr
parensE' = do
void $ lexeme $ char '('
e <- numE
void $ lexeme $ char ')'
return $ Parens e
Here is the add parser using numE
also.
addE :: Parser SimpleExpr
addE = do
e0 <- numE
void $ lexeme $ char '+'
e1 <- numE
return $ Add e0 e1
3.6.1. choice
To combine these, we can use an operator called (<|>)
:
numOrVar :: Parser SimpleExpr
numOrVar = numE <|> varE
It tries the first parser, and it if fails (without consuming any input), it tries the second parser. More about the 'consuming input' concept later.
Here is another way to write the numOrVar parser:
numOrVar' :: Parser SimpleExpr
numOrVar' = choice [numE,varE]
choice
is just wrapper around (<|>)
. You can choose which one to
use based on which is more readable in each particular case.
*Main> parseWithWhitespace numOrVar "a"
Right (Var "a")
*Main> parseWithWhitespace numOrVar "1"
Right (Num 1)
*Main> parseWithWhitespace numOrVar "!"
Left (line 1, column 1):
unexpected "!"
expecting digit
Here is the first version of the simpleExpr parser:
simpleExpr :: Parser SimpleExpr
simpleExpr = numE <|> varE <|> addE <|> parensE
*Main> parseWithWhitespace simpleExpr "12"
Right (Num 12)
*Main> parseWithWhitespace simpleExpr "aa"
Right (Var "aa")
*Main> parseWithWhitespace simpleExpr "1+2"
Left (line 1, column 2):
unexpected '+'
expecting digit or end of input
*Main> parseWithWhitespace simpleExpr "(1)"
Right (Parens (Num 1))
*Main> parseWithWhitespace simpleExpr "(aa)"
Left (line 1, column 2):
unexpected "a"
expecting digit
It works well for some of the parsers. One problem is that the addE
and parensE
parsers don’t parse general expressions as the
components, but just numE
. Another problem is that the addE
doesn’t work at all: the numE
parser parses the first number, and
the addE
parser is never tried. This is an example of (<|>)
not
trying the second parser if the first parser succeeds, even if a later
alternative would consume more input or successfully parse the whole
input.
Let’s try and rearrange the order:
simpleExpr1 :: Parser SimpleExpr
simpleExpr1 = addE <|> numE <|> varE <|> parensE
*Main> parseWithWhitespace simpleExpr1 "12"
Left (line 1, column 3):
unexpected end of input
expecting digit or "+"
*Main> parseWithWhitespace simpleExpr1 "aa"
Right (Var "aa")
*Main> parseWithWhitespace simpleExpr1 "1+2"
Right (Add (Num 1) (Num 2))
*Main> parseWithWhitespace simpleExpr1 "(1)"
Right (Parens (Num 1))
We swapped one problem for another. Let’s fix this using the try
function.
simpleExpr2 :: Parser SimpleExpr
simpleExpr2 = try addE <|> numE <|> varE <|> parensE
*Main> parseWithWhitespace simpleExpr2 "12"
Right (Num 12)
*Main> parseWithWhitespace simpleExpr2 "aa"
Right (Var "aa")
*Main> parseWithWhitespace simpleExpr2 "1+2"
Right (Add (Num 1) (Num 2))
*Main> parseWithWhitespace simpleExpr2 "(1)"
Right (Parens (Num 1))
Now everything seems to work fine. The try
function is very powerful
and easy to use, and can be used where in a more traditional parsing
approach you would have to use left factoring or something else.
The try
function implements backtracking. When this is used with
(<|>)
, it means that if the first parser fails, it will undo the
consumed input and carry on with the next option, instead of failing
completely. This works even if the try
is nested deeply within the
first parser given to (<|>)
.
try
has its downsides (some of which we will see later), and I
usually try to minimise its use or eliminate it completely. I found I
often got into a complete mess when I used try
too much when writing
parsers for something a little tricky like SQL, and that although
doing some left-factoring appeared at first to be tedious and appeared
to make the code less readable, I eventually decided that for me it
made the code more readable since what was happening was more
transparent.
Now we are going to fix this parser to parse arbitrarily nested
expressions. In a way, the method used will roughly mean we are left
factoring the numE
and addE
common prefix.
Here is the naive implementation:
parensE3 :: Parser SimpleExpr
parensE3 = do
void $ lexeme $ char '('
e <- simpleExpr3
void $ lexeme $ char ')'
return $ Parens e
addE3 :: Parser SimpleExpr
addE3 = do
e0 <- simpleExpr3
void $ lexeme $ char '+'
e1 <- simpleExpr3
return $ Add e0 e1
simpleExpr3 :: Parser SimpleExpr
simpleExpr3 = try addE3 <|> numE <|> varE <|> parensE3
If you run this parser, it will enter an infinite loop, since
simpleExpr3
and addE3
will keep calling each other recursively
without making any progress.
*Main> parseWithWhitespace simpleExpr3 "a+b"
C-c Interrupted.
Let’s try without add
.
parensE4 :: Parser SimpleExpr
parensE4 = do
void $ lexeme $ char '('
e <- simpleExpr4
void $ lexeme $ char ')'
return $ Parens e
simpleExpr4 :: Parser SimpleExpr
simpleExpr4 = numE <|> varE <|> parensE4
*Main> parseWithWhitespace simpleExpr4 "a"
Right (Var "a")
*Main> parseWithWhitespace simpleExpr4 "1"
Right (Num 1)
*Main> parseWithWhitespace simpleExpr4 "(1)"
Right (Parens (Num 1))
*Main> parseWithWhitespace simpleExpr4 "((a))"
Right (Parens (Parens (Var "a")))
At least this part seems to work OK.
Let’s try to stop the add parser from calling itself indirectly:
Here is a parameterized parens parser where we supply the nested expression parser as an argument. This is used here to try to make the code easier to follow and avoid rewriting this parser out again and again in full.
parensEN :: Parser SimpleExpr -> Parser SimpleExpr
parensEN simpleExprImpl = do
void $ lexeme $ char '('
e <- simpleExprImpl
void $ lexeme $ char ')'
return $ Parens e
Here is a new parser, which parses expressions except add.
term :: Parser SimpleExpr -> Parser SimpleExpr
term simpleExprImpl = numE <|> varE <|> parensEN simpleExprImpl
term5 :: Parser SimpleExpr
term5 = term simpleExpr5
addE5 :: Parser SimpleExpr
addE5 = do
e0 <- term5
void $ lexeme $ char '+'
e1 <- term5
return $ Add e0 e1
simpleExpr5 :: Parser SimpleExpr
simpleExpr5 = try addE5 <|> term5
*Main> parseWithWhitespace simpleExpr5 "1"
Right (Num 1)
*Main> parseWithWhitespace simpleExpr5 "a"
Right (Var "a")
*Main> parseWithWhitespace simpleExpr5 "(a)"
Right (Parens (Var "a"))
*Main> parseWithWhitespace simpleExpr5 "1+2"
Right (Add (Num 1) (Num 2))
*Main> parseWithWhitespace simpleExpr5 "1+a"
Right (Add (Num 1) (Var "a"))
*Main> parseWithWhitespace simpleExpr5 "(1+a)"
Right (Parens (Add (Num 1) (Var "a")))
*Main> parseWithWhitespace simpleExpr5 "1+a+b"
Left (line 1, column 4):
unexpected '+'
expecting end of input
Almost. Let’s see what happens when the second term
in add
is
changed to the general expression parser.
term6 :: Parser SimpleExpr
term6 = term simpleExpr6
addE6 :: Parser SimpleExpr
addE6 = do
e0 <- term6
void $ lexeme $ char '+'
e1 <- simpleExpr6
return $ Add e0 e1
simpleExpr6 :: Parser SimpleExpr
simpleExpr6 = try addE6 <|> term6
*Main> parseWithWhitespace simpleExpr6 "a + b + c"
Right (Add (Var "a") (Add (Var "b") (Var "c")))
Maybe it looks like we’ve made it. But there is a problem. We’ve parsed the + operator as if it has right associativity:
a + b + c -> a + (b + c)
But it should be left associative:
a + b + c -> (a + b) + c
Let’s left factor the parsing and fix this:
term7 :: Parser SimpleExpr
term7 = term simpleExpr7
simpleExpr7 :: Parser SimpleExpr
simpleExpr7 = do
-- first parse a term
e <- term7
-- then see if it is followed by an '+ expr' suffix
maybeAddSuffix e
where
-- this function takes an expression, and parses a
-- '+ expr' suffix, returning an Add expression
-- it recursively calls itself via the maybeAddSuffix function
addSuffix e0 = do
void $ lexeme $ char '+'
e1 <- term7
maybeAddSuffix (Add e0 e1)
-- this is the wrapper for addSuffix, which adapts it so that if
-- addSuffix fails, it returns just the original expression
maybeAddSuffix e = addSuffix e <|> return e
*Main> parseWithWhitespace simpleExpr7 "a + b + c"
Right (Add (Add (Var "a") (Var "b")) (Var "c"))
Now the parser seems to work for everything it should.
There is a combinator function in Parsec we can use which abstracts
this sort of pattern, chainl1
.
simpleExpr8 :: Parser SimpleExpr
simpleExpr8 = chainl1 term8 op
where
op = do
void $ lexeme $ char '+'
return Add
term8 = term simpleExpr8
How does this work? Here is the type of chainl1
:
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 term op = ...
The type of the Add constructor in pseudocode is:
Add :: SimpleExpr -> SimpleExpr -> SimpleExpr
The op
parser here now just parses the operator itself, i.e. '+'
(and not the second expression like our simpleExpr7 parser). The
return from the op
function is a function which accepts two elements
and combines them using the appropriate operator representation. In
this case, the represenation is the Add
constructor.
You can look at the source http://hackage.haskell.org/package/parsec-3.1.3/docs/src/Text-Parsec-Combinator.html#chainl1 and see if you can understand how it works. If you can’t work it out, you could come back to it later when you have more experience writing parsing code.
3.7. Testing with the examples
TODO: write a little manual tester that accepts a parser and a list of examples, and checks they all parse correctly.
3.8. Testing with quickcheck
Let’s see if we can check with quickcheck. It’s a bit tricky testing parsers in this way, but one way to do something useful is to generate random asts, convert them to concrete syntax, parse them, and check the result. We can write a simple 'pretty printer' to convert an ast to concrete syntax.
3.8.1. a pretty printer
TODO: a really simple pretty printer just pasting strings together, no layout.
3.8.2. the quick check code
TODO: write a quickcheck property and arbitary instance and show running it at the ghci prompt
4. Applicative style parsing code
Now we can go back over the expression parsing code written in the last tutorial, and make it much more concise, and also make it more readable. We are going to do this mainly by using functions from the typeclass Applicative.
Remember you can (and should) use the functions regularParse
and its
variations (TODO list them here) to try out the all these parsers in
ghci, and you can write your own variations to experiment with if you
are unsure about anything.
import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (oneOf, char, digit, letter, satisfy)
import Text.Parsec.String.Combinator (many1, chainl1)
import Control.Applicative ((<$>), (<*>), (<*), (*>), (<|>), many, (<$))
import Control.Monad (void, ap)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing
Here is the SimpleExpr type again:
data SimpleExpr = Num Integer
| Var String
| Add SimpleExpr SimpleExpr
| Parens SimpleExpr
deriving (Eq,Show)
Here is the basic pattern behind most of the rewrites we are going to cover. Here is a function which takes a constructor and two parsers for the two arguments for the constructor. It parses the two arguments, then applies the constructor to the results:
myParser1 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser1 ctor pa pb = do
a <- pa
b <- pb
return $ ctor a b
TODO: concrete example, plus examples at the bottom of this section (for ctor <$> a, ctor <$> a <*> b, ctor <$> a <*> b <*> c).
This can be rewritten without the do syntactic sugar like this:
myParser2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser2 ctor pa pb =
pa >>= \a -> pb >>= \b -> return $ ctor a b
And can also be rewritten like this:
myParser3 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser3 ctor pa pb = ctor `fmap` pa `ap` pb
(This uses functions from Applicative instead of Monad.) We replace
the use of >>=
with fmap
and ap
. This isn’t always possible, but
it often is.
Here is the version using the operators for fmap
and ap
(fmap
changed to <$>
, and ap
changed to <*>
).
myParser4 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
myParser4 ctor pa pb = ctor <$> pa <*> pb
This style takes less typing, and is often much simpler to write and read.
This pattern 'scales', you can use:
Ctor <$> pa
for a single argument constructor. This might also be familiar to you as
fmap Ctor pa
or
Ctor `fmap` pa
All of which mean the same thing, just slightly different spellings.
This can also be written using Monad operators:
pa >>= liftM ctor
or
liftM ctor =<< pa
(liftM
is in the module Control.Monad
)
These liftM
versions effectively mean the same thing as the previous
versions with fmap
and <$>
.
You can use
Ctor <$> pa <*> pb <*> pc
for three args, and so on. So you use <$>
between the pure
constructor and the first arg, then <*>
between each subsequent arg.
Let’s go over the simple expression parsers and try to rewrite them using this style. We will see a few other new functions. I will break things down into a lot of steps.
4.1. lexeme
Here is the old lexeme parser, 'D' suffix for 'do notation'.
lexemeD :: Parser a -> Parser a
lexemeD p = do
x <- p
whitespace
return x
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
First, we can move the whitespace
from its own separate line.
lexemeA0 :: Parser a -> Parser a
lexemeA0 p = do
x <- p <* whitespace
return x
The expression pa <* pb
means run pa
, then run pb
, and return
the result of pa
. It is sort of equivalent to this code:
(<*) :: Parser a -> Parser b -> Parser a
(<*) pa pb = do
a <- pa
void pb
return a
(It isn’t implemented this way, since (<*)
only needs Applicative
and not Monad.)
Now we can use the usual monad syntax rewrites, first eliminate the
name x
.
lexemeA1 :: Parser a -> Parser a
lexemeA1 p = do
p <* whitespace
Now remove the redundant do:
lexemeA :: Parser a -> Parser a
lexemeA p = p <* whitespace
4.2. num
Now let’s tackle the num
parser.
numD :: Parser SimpleExpr
numD = do
n <- lexemeD $ many1 digit
return $ Num $ read n
Let’s move 'read' to the first line.
numA0 :: Parser SimpleExpr
numA0 = do
n <- read <$> lexemeA (many1 digit)
return $ Num n
This uses (<$>)
which we saw above. You may have done code rewrites
like this using fmap
with IO in other Haskell code.
Now let’s move the Num
ctor as well:
numA1 :: Parser SimpleExpr
numA1 = do
n <- (Num . read) <$> lexemeA (many1 digit)
return n
You can also write it in this way:
numA2 :: Parser SimpleExpr
numA2 = do
n <- Num <$> read <$> lexemeA (many1 digit)
return n
Why does this work? It it equivalent to the previous version partly because of the applicative laws.
Let’s break it down:
numA2'' :: Parser SimpleExpr
numA2'' = do
n <- numb
return n
where
numb :: Parser SimpleExpr
numb = Num <$> int
int :: Parser Integer
int = read <$> lexemeA (many1 digit)
In terms of style, which do you think looks better:
(a . b) <$> p
or a <$> b <$> p
.
The next step for num, we can eliminate the temporary name n
and the
do
:
numA3 :: Parser SimpleExpr
numA3 = (Num . read) <$> lexemeA (many1 digit)
In more 'industrial' parser code, I would usually write some tokenization parsers separately like this:
integerA4 :: Parser Integer
integerA4 = read <$> lexemeA (many1 digit)
Then the num expression parser looks like this:
numA4 :: Parser SimpleExpr
numA4 = Num <$> integerA4
and we also get a integer parser which we can reuse if we need to parse an integer in another context.
4.3. var
Here is the previous var parser:
varD :: Parser SimpleExpr
varD = lexemeA $ do
fc <- firstChar
rest <- many nonFirstChar
return $ Var (fc:rest)
where
firstChar = satisfy (\a -> isLetter a || a == '_')
nonFirstChar = satisfy (\a -> isDigit a || isLetter a || a == '_')
The first thing we can do is to make the firstChar
and
nonFirstChar
a little easier to read, using (<|>)
, char
,
letter
and digit
:
varA0 :: Parser SimpleExpr
varA0 = lexemeA $ do
fl <- firstChar
rest <- many nonFirstChar
return $ Var (fl:rest)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
Here is another way of making the function a little better:
varA0' :: Parser SimpleExpr
varA0' = lexemeA $ do
fl <- satisfy validFirstChar
rest <- many (satisfy validNonFirstChar)
return $ Var (fl:rest)
where
validFirstChar a = isLetter a || a == '_'
validNonFirstChar a = validFirstChar a || isDigit a
We can lift the (:)
using the Applicative operators.
varA1 :: Parser SimpleExpr
varA1 = do
i <- iden
return $ Var i
where
iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
We used the prefix version of (:)
to use it with (<$>)
and
(<*>)
. The lexemeA call was moved to the iden
helper function.
Now tidy it up using (<$>)
with the Var
constructor:
varA2 :: Parser SimpleExpr
varA2 = Var <$> iden
where
iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
We could also split the iden
into a separate top level function,
with the same idea as with splitting the integer
parser.
4.4. parens
Here is the starting point:
parensD :: Parser SimpleExpr
parensD = do
void $ lexemeA $ char '('
e <- simpleExprD
void $ lexemeA $ char ')'
return $ Parens e
Here is the rewrite in one step:
parensA0 :: Parser SimpleExpr
parensA0 =
Parens <$> (lexemeA (char '(')
*> simpleExprD
<* lexemeA (char ')'))
Here you can see that there is a (*>)
which works in the opposite
direction to (<*)
. The precendence of these operators means that we
have to use some extra parentheses (!) here.
TODO: lost the chained <*. Put something below about this so there is a concrete example.
4.5. simple expr
Here is the old version:
termD :: Parser SimpleExpr
termD = numD <|> varD <|> parensD
simpleExprD :: Parser SimpleExpr
simpleExprD = chainl1 termD op
where
op = do
void $ lexemeA $ char '+'
return Add
We can simplify the op
function using the techniques we’ve already
seen:
simpleExprA0 :: Parser SimpleExpr
simpleExprA0 = chainl1 termD op
where op = lexemeA (char '+') *> return Add
The pattern p *> return f
can use a different operator like this:
f <$ p
. Here it is in the expression parser:
simpleExprA1 :: Parser SimpleExpr
simpleExprA1 = chainl1 termD op
where op = Add <$ lexemeA (char '+')
You could also write the op
parser inline:
simpleExprA2 :: Parser SimpleExpr
simpleExprA2 = chainl1 termD (Add <$ lexemeA (char '+'))
Maybe this last step makes it less readable?
4.6. summary
Here is the finished job for all the simple expression code without separate token parsers:
num' :: Parser SimpleExpr
num' = (Num . read) <$> lexemeA (many1 digit)
var' :: Parser SimpleExpr
var' = Var <$> iden
where
iden = lexemeA ((:) <$> firstChar <*> many nonFirstChar)
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
parens' :: Parser SimpleExpr
parens' =
Parens <$> (lexemeA (char '(')
*> simpleExpr'
<* lexemeA (char ')'))
term' :: Parser SimpleExpr
term' = num' <|> var' <|> parens'
simpleExpr' :: Parser SimpleExpr
simpleExpr' = chainl1 term' op
where op = Add <$ lexemeA (char '+')
Here they are with separate token parsers and a helper function:
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
Here is a lexeme wrapper for parsing single character symbols.
symbol :: Char -> Parser ()
symbol c = void $ lexeme $ char c
Here is another little helper function. It barely pays its way in this short example, but even though it is only used once, I think it is worth it to make the code clearer.
betweenParens :: Parser a -> Parser a
betweenParens p = symbol '(' *> p <* symbol ')'
Now the expression parsers:
num :: Parser SimpleExpr
num = Num <$> integer
var :: Parser SimpleExpr
var = Var <$> identifier
parens :: Parser SimpleExpr
parens = Parens <$> betweenParens simpleExpr
term :: Parser SimpleExpr
term = num <|> var <|> parens
simpleExpr :: Parser SimpleExpr
simpleExpr = chainl1 term' op
where op = Add <$ lexemeA (char '+')
Splitting the lexer parser layer out means that we have one place
where we have to remember to add lexeme
wrappers, and also I think
makes the code easier to follow.
5. Combinator review
In this tutorial we will go through all the functions in Text.Parsec.Combinator, and some useful ones in Control.Applicative and Control.Monad as well.
import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char (oneOf, char, digit
,string, letter, satisfy)
import Text.Parsec.String.Combinator (many1, choice, chainl1, between
,count, option, optionMaybe, optional)
import Control.Applicative ((<$>), (<*>), (<$), (<*), (*>), (<|>), many)
import Control.Monad (void, ap, mzero)
import Data.Char (isLetter, isDigit)
import FunctionsAndTypesForParsing
5.1. Text.Parsec.Combinator
You should look at the source for these functions and try to understand how they are implemented.
The style of the source code in the Parsec library sources is a little different to what we used at the end of the last tutorial. You can try reimplementing each of the Text.Parsec.Combinator module functions using the Applicative style. See if you can find a way to reassure yourself that the rewritten versions you make are correct, perhaps via writing automated tests, or perhaps some other method.
You should be able to easily understand the implementation of all the
functions in Text.Parsec.Combinator except possibly anyToken
and
notFollowedBy
.
5.1.1. choice
choice :: [Parser a] -> Parser a
choice ps
tries to apply the parsers in the list ps
in order,
until one of them succeeds. It returns the value of the succeeding
parser.
a :: Parser Char
a = char 'a'
b :: Parser Char
b = char 'b'
aOrB :: Parser Char
aOrB = choice [a,b]
*Main> regularParse aOrB "a"
Right 'a'
*Main> regularParse aOrB "b"
Right 'b'
*Main> regularParse aOrB "c"
Left (line 1, column 1):
unexpected "c"
expecting "a" or "b"
using with try
If a parser fails with (<|>)
or choice
, then it will only try the
next parser if the last parser consumed no input.
TODO: make the parsers return the keyword and update the examples
byKeyword :: Parser ()
byKeyword = void $ string "by"
betweenKeyword :: Parser ()
betweenKeyword = void $ string "between"
Since both of these have the same prefix - b - if we combine them
using choice
then it doesn’t work correctly:
*Main> regularParse byKeyword "by"
Right ()
*Main> regularParse byKeyword "between"
Left (line 1, column 1):
unexpected "e"
expecting "by"
*Main> regularParse betweenKeyword "between"
Right ()
*Main> regularParse betweenKeyword "by"
Left (line 1, column 1):
unexpected "y"
expecting "between"
*Main> regularParse (choice [betweenKeyword,byKeyword]) "between"
Right ()
*Main> regularParse (choice [betweenKeyword,byKeyword]) "by"
Left (line 1, column 1):
unexpected "y"
expecting "between"
*Main> regularParse (choice [byKeyword,betweenKeyword]) "between"
Left (line 1, column 1):
unexpected "e"
expecting "by"
*Main> regularParse (choice [byKeyword,betweenKeyword]) "by"
Right ()
If we use try
on the first option, then it all works fine.
*Main> regularParse (choice [try byKeyword,betweenKeyword]) "by"
Right ()
*Main> regularParse (choice [try byKeyword,betweenKeyword]) "between"
Right ()
5.1.2. count
count :: Int -> Parser a -> Parser [a]
count n p
parses n
occurrences of p
. If n
is smaller or equal
to zero, the parser is equivalent to return []
. It returns a list of
the n values returned by p
.
*Main> regularParse (count 5 a) "aaaaa"
Right "aaaaa"
*Main> regularParse (count 5 a) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a"
*Main> regularParse (count 5 a) "aaaab"
Left (line 1, column 5):
unexpected "b"
expecting "a"
*Main> regularParse (count 5 aOrB) "aabaa"
Right "aabaa"
5.1.3. between
between :: Parser open -> Parser close -> Parser a -> Parser a
between open close p
parses open
, followed by p
and
close
. It returns the value returned by p
.
We can replace the betweenParens from the previous tutorial using this:
betweenParens :: Parser a -> Parser a
betweenParens p = between (symbol '(') (symbol ')') p
It hardly seems worth it to make this change, but it might be slightly quicker to read and understand if you aren’t already familiar with some code or haven’t viewed it for a while. This is good for 'code maintenance', where we need to fix bugs or add new features quickly to code we haven’t looked at for two years or something.
Here are the support functions for this parser.
symbol :: Char -> Parser Char
symbol c = lexeme $ char c
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
whitespace :: Parser ()
whitespace = void $ oneOf " \n\t"
5.1.4. option
option :: a -> Parser a -> Parser a
option x p
tries to apply parser p
. If p
fails without
consuming input, it returns the value x
, otherwise the value returned
by p
.
*Main> regularParse (option "" (count 5 aOrB)) "aaaaa"
Right "aaaaa"
*Main> regularParse (option "" (count 5 aOrB)) "caaaa"
Right ""
*Main> regularParse (option "" (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"
*Main> regularParse (option "" (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"
*Main> regularParse (option "" (try (count 5 aOrB))) "aaaa"
Right ""
5.1.5. optionMaybe
optionMaybe :: Parser a -> Parser (Maybe a)
optionMaybe p
tries to apply parser p
. If p
fails without consuming
input, it returns Nothing
, otherwise it returns Just
the value returned
by p
.
*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaaa"
Right (Just "aaaaa")
*Main> regularParse (optionMaybe (count 5 aOrB)) "caaaa"
Right Nothing
*Main> regularParse (optionMaybe (count 5 aOrB)) "caaa"
Right Nothing
*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"
*Main> regularParse (optionMaybe (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"
*Main> regularParse (optionMaybe (try $ count 5 aOrB)) "aaaac"
Right Nothing
5.1.6. optional
optional :: Parser a -> Parser ()
optional p
tries to apply parser p
. It will parse p
or
nothing. It only fails if p
fails after consuming input. It discards
the result of p
.
*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaaa"
Right ((),"")
*Main> parseWithLeftOver (optional (count 5 aOrB)) "caaaa"
Right ((),"caaaa")
*Main> parseWithLeftOver (optional (count 5 aOrB)) "caaa"
Right ((),"caaa")
*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaa"
Left (line 1, column 5):
unexpected end of input
expecting "a" or "b"
*Main> parseWithLeftOver (optional (count 5 aOrB)) "aaaac"
Left (line 1, column 5):
unexpected "c"
expecting "a" or "b"
*Main> parseWithLeftOver (optional (try $ count 5 aOrB)) "aaaac"
Right ((),"aaaac")
5.1.7. skipMany1
skipMany1 :: Parser a -> Parser ()
skipMany1 p
applies the parser p
one or more times, skipping its result.
5.1.8. many1
many1 :: Parser a -> Parser [a]
many1 p applies the parser p one or more times. Returns a list of the returned values of p.
word = many1 letter
5.1.9. sepBy
sepBy :: Parser a -> Parser sep -> Parser [a]
sepBy p sep parses zero or more occurrences of p, separated by sep. Returns a list of values returned by p.
commaSep p = p `sepBy` (symbol ",")
5.1.10. sepBy1
sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep parses one or more occurrences of p, separated by sep. Returns a list of values returned by p.
5.1.11. endBy
endBy :: Parser a -> Parser sep -> Parser [a]
endBy p sep parses zero or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p.
cStatements = cStatement `endBy` semi
5.1.12. endBy1
endBy1 :: Parser a -> Parser sep -> Parser [a]
endBy1 p sep parses one or more occurrences of p, seperated and ended by sep. Returns a list of values returned by p.
5.1.13. sepEndBy
sepEndBy :: Parser a -> Parser sep -> Parser [a]
sepEndBy p sep parses zero or more occurrences of p, separated and optionally ended by sep, ie. haskell style statements. Returns a list of values returned by p.
haskellStatements = haskellStatement `sepEndBy` semi
5.1.14. sepEndBy1
sepEndBy1 :: Parser a -> Parser sep -> Parser [a]
sepEndBy1 p sep parses one or more occurrences of p, separated and optionally ended by sep. Returns a list of values returned by p.
5.1.15. chainl
chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainl p op x parser zero or more occurrences of p, separated by op. Returns a value obtained by a left associative application of all functions returned by op to the values returned by p. If there are zero occurrences of p, the value x is returned.
5.1.16. chainl1
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op x parser one or more occurrences of p, separated by op Returns a value obtained by a left associative application of all functions returned by op to the values returned by p. . This parser can for example be used to eliminate left recursion which typically occurs in expression grammars.
5.1.17. chainr
chainr :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
chainr p op x parser zero or more occurrences of p, separated by op Returns a value obtained by a right associative application of all functions returned by op to the values returned by p. If there are no occurrences of p, the value x is returned.
5.1.18. chainr1
chainr1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainr1 p op x parser one or more occurrences of p
, separated by op
Returns a value obtained by a right associative application of all
functions returned by op to the values returned by p.
5.1.19. eof
eof :: Parser ()
This parser only succeeds at the end of the input. This is not a primitive parser but it is defined using notFollowedBy.
eof = notFollowedBy anyToken <?> "end of input"
The (<?>) operator is used for error messages. We will come back to error messages after writing the basic SQL parser.
5.1.20. notFollowedBy
notFollowedBy :: Show a => Parser a -> Parser ()
notFollowedBy p only succeeds when parser p fails. This parser does not consume any input. This parser can be used to implement the 'longest match' rule. For example, when recognizing keywords (for example let), we want to make sure that a keyword is not followed by a legal identifier character, in which case the keyword is actually an identifier (for example lets). We can program this behaviour as follows:
keywordLet = try (do{ string "let"
; notFollowedBy alphaNum
})
5.1.21. manyTill
manyTill :: Parser a -> Parser end -> Parser [a]
manyTill p end applies parser p zero or more times until parser end succeeds. Returns the list of values returned by p. This parser can be used to scan comments:
simpleComment = do{ string "<!--"
; manyTill anyChar (try (string "-->"))
}
Note the overlapping parsers anyChar and string "-->", and therefore the use of the try combinator.
5.1.22. lookAhead
lookAhead :: Parser a -> Parser a
lookAhead p parses p without consuming any input.
If p fails and consumes some input, so does lookAhead. Combine with try if this is undesirable.
5.1.23. anyToken
anyToken :: Parser Char
The parser anyToken accepts any kind of token. It is for example used to implement eof. Returns the accepted token.
5.2. Control.Applicative
Here are the functions from Applicative that are used:
(<$>)
, (<*>)
, (<$)
, (<*)
, (*>)
, (<|>)
, many
TODO: examples for all of these
We’ve already seen all of these, except (<$)
. This is often used to
parse a keyword and return a no argument constructor:
data Something = Type1 | Type2 | Type3
something :: Parser Something
something = choice [Type1 <$ string "type1"
,Type2 <$ string "type2"
,Type3 <$ string "type3"]
There is also (<**>)
which is (<*>)
with the arguments flipped.
TODO: double check using these from Parsec instead of Control.Applicative: possible performance implictions?
5.3. Control.Monad
5.3.1. return
One use of return is to always succeed, and return a value:
alwaysX :: Parser Char
alwaysX = return 'x'
*Main> parseWithLeftOver (a <|> alwaysX) "a"
Right ('a',"")
*Main> parseWithLeftOver (a <|> alwaysX) "b"
Right ('x',"b")
5.3.2. mzero
This function is used in the implementation of choice
:
choice' :: [Parser a] -> Parser a
choice' ps = foldr (<|>) mzero ps
TODO: go through a bunch of functions + do notation examples
>>= =<< >> void mapM, mapM_ sequence,sequence_ guard return mzero mplus when, unless liftMN ap quick note about fail, will return to this in the error messages stage
TODO: using trace
6. Functions and types for parsing
In this file is the source and explanation for the parsing functions which we’ve been using, and some limited notes about the wrappers and full types in Parsec.
module FunctionsAndTypesForParsing where
import Text.Parsec (ParseError)
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (parse)
import Text.Parsec.String.Char (oneOf)
import Text.Parsec.String.Combinator (eof,manyTill,anyToken)
import Control.Applicative ((<$>), (<*>), (<*), (*>), many)
import Control.Monad (void)
6.1. Functions for parsing
Here are the testing functions which were used earlier:
The basic parse function: this is a pretty simple wrapper. The parse function from parsec just adds a filename to use in parse errors, which is set as the empty string here.
regularParse :: Parser a -> String -> Either ParseError a
regularParse p = parse p ""
'parse' is a basic function in the family of functions for running parsers in Parsec. You can compose the parser functions in the Parser monad, then run the top level function using 'parse' and get back an 'Either ParserError a' as the result. There are a few alternatives to 'parse' in Parsec, mostly when you are using a more general parser type instead of 'Parser a' (which is an alias for 'ParsecT String () Identity a'). Have a look in the Text.Parsec.Prim module for these http://hackage.haskell.org/package/parsec-3.1.3/docs/Text-Parsec-Prim.html.
This function will run the parser, but additionally fail if it doesn’t consume all the input.
parseWithEof :: Parser a -> String -> Either ParseError a
parseWithEof p = parse (p <* eof) ""
This function will apply the parser, then also return any left over input which wasn’t parsed.
parseWithLeftOver :: Parser a -> String -> Either ParseError (a,String)
parseWithLeftOver p = parse ((,) <$> p <*> leftOver) ""
where leftOver = manyTill anyToken eof
TODO: what happens when you use 'many anyToken <* eof' variations instead? Maybe should talk about greediness? Or talk about it in a better place in the tutorial.
parseWithWSEof :: Parser a -> String -> Either ParseError a
parseWithWSEof p = parseWithEof (whiteSpace *> p)
where whiteSpace = void $ many $ oneOf " \n\t"
You should have a look at the two helper executables, and see if you can understand the code now. You can see them online here:
6.2. type signatures revisited
todo: update this to refer to real parsec instead of the string wrappers here.
I think you should always use type signatures with Parsec. Because the Parsec code is really generalized, without the type GHC will refuse to compile this code. Try commenting out the type signature above and loading into ghci to see the error message.
There is an alternative: you can get this code to compile without a type signature by using the NoMonomorphismRestriction language pragma. You can also see the type signature that GHC will choose for this function by commenting the type signature and using -Wall and -XNoMonomorphismRestriction together. Using NoMonomorphismRestriction is a popular solution to these sorts of problems in haskell.
It’s up to you whether you prefer to always write type signatures when you are developing parsing code, or use the NoMonomorphismRestriction pragma. Even if you can use NoMonomorphismRestriction, when using explicit type signatures you usually get much simpler compiler error messages.
6.3. Parser
The definition of Parser and a partial explanation of the full type signature.
type Parser = Parsec String ()
This means that a function returning Parser a parses from a String with () as the initial state.
The Parsec type is defined like this:
type Parsec s u = ParsecT s u Identity
ParsecT is a monad transformer, I think it is the primitive one in the Parsec library, and the 'Parsec' type is a type alias which sets the base monad to be Identity.
Here is the haddock for the ParsecT type:
ParsecT s u m a
is a parser with stream type s
, user state type u
,
underlying monad m
and return type a
.
The full types that you see like this:
satisfy :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char
refer to the same things (stream type s, user state type u, underlying monad m).
We are using String as the stream type (i.e. the input type), () as
the user state type (this effectively means no user state, since ()
only has one value), and the underlying monad is Identity: we are
using no other underlying monad, so Parser a
expands to ParsecT
String () Identity a
.
I.e. the source is String, the user state is (), and the underlying monad is Identity.
6.4. Other information
TODO: Here is some other information on Parsec and Haskell: links, tutorials on fp, section in rwh, lyah?, old parsec docs, parsec docs on hackage, other parser combinator libs (uu, polyparse, trifecta?)
7. Parsing expressions with fixity
Text.Parsec.Expr allows building expression parsers with a range of operators with different precedences and associativities easily. Fixity is the (not completely standard) term for precendence and associativity together.
Text.Parsec.Expr
can be great to quickly get a parser for simple
expressions or a simple programming language with simple expressions
up and running.
import Text.Parsec.String (Parser)
import Text.Parsec.String.Combinator (many1, between)
import Text.Parsec.String.Char (letter, char, digit, string, oneOf)
import Control.Applicative ((<$>), (<*>), (<*), (<|>), many, (<$))
import Control.Monad (void)
import qualified Text.Parsec.String.Expr as E
import FunctionsAndTypesForParsing
Let’s extend the SimpleExpression type and parsers to cover a range of operators with different precedences and associativity.
7.1. expressions with plus and times
Let’s start with a simple case: + and * with the usual fixity. Here is the abstract syntax:
data PlusTimesExpr = PteVar String
| PteNum Integer
| PteParens PlusTimesExpr
| Plus PlusTimesExpr PlusTimesExpr
| Times PlusTimesExpr PlusTimesExpr
deriving (Eq,Show)
plusTimesExpr :: Parser PlusTimesExpr
plusTimesExpr = E.buildExpressionParser pteTable pteTerm
pteTable :: [[E.Operator PlusTimesExpr]]
pteTable = [[E.Infix (Times <$ symbol "*") E.AssocLeft]
,[E.Infix (Plus <$ symbol "+") E.AssocLeft]
]
Here you can see the operator parsers are the same as the previous
SimpleExpr
parser which used chainl1
: Times <$ symbol "*"
and
Plus <$ symbol "+"
. We just wrapped these up in the E.Infix
constructor with the associativity, and put them in a list of lists
which represents the precendence classes.
Here is the term parser and components. All this is just the same as
the SimpleExpr
parser a previous tutorial.
pteTerm :: Parser PlusTimesExpr
pteTerm = pteVar <|> pteNum <|> pteParens
pteNum :: Parser PlusTimesExpr
pteNum = PteNum <$> integer
pteVar :: Parser PlusTimesExpr
pteVar = PteVar <$> identifier
pteParens :: Parser PlusTimesExpr
pteParens = PteParens <$> between (symbol "(") (symbol ")") plusTimesExpr
support functions:
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = lexeme $ string s
Here you can see the precendence in action:
*Main> regularParse plusTimesExpr "a + b * c"
Right (Plus (PteVar "a") (Times (PteVar "b") (PteVar "c")))
*Main> regularParse plusTimesExpr "a * b + c"
Right (Plus (Times (PteVar "a") (PteVar "b")) (PteVar "c"))
7.2. a full featured expression type
Now let’s try a much bigger example with lots more operators. Now we are thinking ahead to the first version of the SQL query parser, and preparing for this.
Here are our new operators in precedence order:
7.2.1. unary + -
+a
-3
7.2.2. exponentiation
a ^ 3
associativity: left
7.2.3. multiplication, division, modulo
a * 3
3 / b
a % 5
associativity: left
7.2.4. addition, subtraction
a + b
a - b
associativity: left
7.2.5. less than, greater than
a < b
a > b
associativity: none
7.2.6. equals
a = 3
associativity: right
7.2.7. not
not a
7.2.8. and
a and b
associativity: left
7.2.9. or
a or b
associativity: left
Here is the abstract syntax type:
data SimpleExpr = Num Integer
| Var String
| Parens SimpleExpr
| PrefixOp String SimpleExpr
| BinaryOp SimpleExpr String SimpleExpr
deriving (Eq,Show)
Here is the new expression parser:
simpleExpr :: Parser SimpleExpr
simpleExpr = E.buildExpressionParser table term
table :: [[E.Operator SimpleExpr]]
table = [[prefix "-", prefix "+"]
,[binary "^" E.AssocLeft]
,[binary "*" E.AssocLeft
,binary "/" E.AssocLeft
,binary "%" E.AssocLeft]
,[binary "+" E.AssocLeft
,binary "-" E.AssocLeft]
,[binary "<" E.AssocNone
,binary ">" E.AssocNone]
,[binary "=" E.AssocRight]
,[prefix "not"]
,[binary "and" E.AssocLeft]
,[binary "or" E.AssocLeft]
]
where
binary name assoc =
E.Infix (mkBinOp name <$ symbol name) assoc
mkBinOp nm a b = BinaryOp a nm b
prefix name = E.Prefix (PrefixOp name <$ symbol name)
TODO: expand and explain the bits.
Here is the term parser.
term :: Parser SimpleExpr
term = var <|> num <|> parens
num :: Parser SimpleExpr
num = Num <$> integer
var :: Parser SimpleExpr
var = Var <$> identifier
parens :: Parser SimpleExpr
parens = between (symbol "(") (symbol ")") simpleExpr
TODO: write lots of parsing examples, including parse failures with ambiguity.
issue: double prefix op, link to bug on parsec bug tracker.
The source in Text.Parsec.Expr is not too big. You can have a look and try to understand it. There are several standard approaches in parsing theory to parse expressions with data driven precendences and associativity. I don’t know which one Text.Parsec.Expr uses, but if you find these and read about them, then the source of Text.Parsec.Expr might be a bit more understandable.
8. An issue with token parsers
This is a tutorial about an issue with the token parsing we have so far.
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Combinator (many1, notFollowedBy)
import Text.Parsec.String.Char (digit, string, oneOf, satisfy, char, letter)
import Control.Applicative ((<$>), (<*>), (<*), many, (<$), (<|>))
import Control.Monad (void, guard)
import qualified Text.Parsec.String.Expr as E
import FunctionsAndTypesForParsing
Here is a simplified expression type and parser:
data SimpleExpr = Num Integer
| BinaryOp SimpleExpr String SimpleExpr
deriving (Eq,Show)
simpleExpr :: Parser SimpleExpr
simpleExpr = E.buildExpressionParser table num
table :: [[E.Operator SimpleExpr]]
table = [[binary "<=" E.AssocNone
,binary ">=" E.AssocNone]
,[binary "<" E.AssocNone
,binary ">" E.AssocNone]
]
where
binary name assoc =
E.Infix (mkBinOp name <$ symbol name) assoc
mkBinOp nm a b = BinaryOp a nm b
num :: Parser SimpleExpr
num = Num <$> integer
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
symbol :: String -> Parser String
symbol s = lexeme $ string s
Let’s try it out:
*Main> regularParse simpleExpr "1=2"
Right (Num 1)
*Main> regularParse simpleExpr "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))
*Main> regularParse simpleExpr "1>2"
Left (line 1, column 2):
unexpected "2"
expecting ">="
What happened? The parser tried to parse > as >=, failed, and since the failure consumed some input (the first >), it failed completely.
We are going to change the symbol parser to fix this. Here is a parameterized version of the simpleExpr parser so we can try a few variations out.
simpleExprP :: (String -> Parser String) -> Parser SimpleExpr
simpleExprP sym = E.buildExpressionParser (tableP sym) num
tableP :: (String -> Parser String) -> [[E.Operator SimpleExpr]]
tableP sym = [[binary "<=" E.AssocNone
,binary ">=" E.AssocNone]
,[binary "<" E.AssocNone
,binary ">" E.AssocNone]]
where
binary name assoc =
E.Infix (mkBinOp name <$ sym name) assoc
mkBinOp nm a b = BinaryOp a nm b
Let’s reproduce the failure:
*Main> regularParse (simpleExprP symbol) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))
*Main> regularParse (simpleExprP symbol) "1>2"
Left (line 1, column 2):
unexpected "2"
expecting ">="
We are going to look at two possible solutions.
-
Let’s use
try
:
*Main> regularParse (simpleExprP (try . symbol)) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))
*Main> regularParse (simpleExprP (try . symbol)) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))
This seems to have done the job. There is still a problem
though. Consider a case when the precedence is the other way round -
the <
and >
are higher precedence than ⇐
and >=
,
simpleExprP1 :: (String -> Parser String) -> Parser SimpleExpr
simpleExprP1 sym = E.buildExpressionParser (tableP1 sym) num
tableP1 :: (String -> Parser String) -> [[E.Operator SimpleExpr]]
tableP1 sym = [[binary "<" E.AssocNone
,binary ">" E.AssocNone]
,[binary "<=" E.AssocNone
,binary ">=" E.AssocNone]]
where
binary name assoc =
E.Infix (mkBinOp name <$ sym name) assoc
mkBinOp nm a b = BinaryOp a nm b
*Main> regularParse (simpleExprP1 (try . symbol)) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))
*Main> regularParse (simpleExprP1 (try . symbol)) "1>=2"
Left (line 1, column 3):
unexpected "="
expecting digit
Although the precendence order is a little contrived in this case, this issue could easily crop up for real when we start adding more operators. Let’s fix it now.
This could be solved by adding a try
at a earlier place in the
parsing. Because of how the buildExpressionParser
function works,
it’s not obvious where the try
could go.
Let’s try tackling the problem in a different way. One way of looking at this is to consider that the symbol parser stops parsing too soon:
*Main> parseWithLeftOver (symbol ">") ">="
Right (">","=")
What it should do is keep parsing symbol characters until it gets a result string which can’t be a symbol, and stop one character before this..
Here is a slightly naive way of doing it, which will be good enough for quite a while:
symbol1 :: String -> Parser String
symbol1 s = try $ lexeme $ do
u <- many1 (oneOf "<>=+-^%/*")
guard (s == u)
return s
Here is a similar alternative:
symbol2 :: String -> Parser String
symbol2 s = try $ lexeme $ do
void $ string s
notFollowedBy (oneOf "<>=+-^%/*")
return s
Let’s try them out:
*Main> parseWithLeftOver (symbol1 ">") ">="
Left (line 1, column 3):
unexpected end of input
*Main> parseWithLeftOver (symbol1 ">") ">"
Right (">","")
*Main> parseWithLeftOver (symbol1 ">") ">= 3"
Left (line 1, column 3):
unexpected " "
*Main> parseWithLeftOver (symbol1 ">=") ">= 3"
Right (">="," 3")
The error messages don’t seem very good, but it parses and fails to parse correctly.
*Main> parseWithLeftOver (symbol2 ">") ">="
Left (line 1, column 3):
unexpected '='
*Main> parseWithLeftOver (symbol2 ">") ">"
Right (">","")
*Main> parseWithLeftOver (symbol2 ">") ">= 3"
Left (line 1, column 3):
unexpected '='
*Main> parseWithLeftOver (symbol2 ">=") ">= 3"
Right (">="," 3")
This one appears to give better error messages in this limited scenario, apart from that they both work the same.
Let’s try them out in the full expression parser:
*Main> regularParse (simpleExprP symbol1) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))
*Main> regularParse (simpleExprP symbol1) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))
*Main> regularParse (simpleExprP symbol2) "1>=2"
Right (BinaryOp (Num 1) ">=" (Num 2))
*Main> regularParse (simpleExprP symbol2) "1>2"
Right (BinaryOp (Num 1) ">" (Num 2))
They both work fine here. Let’s see some error messages in this context.
*Main> parseWithEof (simpleExprP symbol1) "1>*2"
Left (line 1, column 4):
unexpected "2"
expecting operator
*Main> parseWithEof (simpleExprP symbol2) "1>*2"
Left (line 1, column 4):
unexpected '*'
expecting operator
Both error messages are a bit crap. So much for the second variation producing better error messages.
Let’s look at the equivalent issue with respect to keyword parsing. We can get a similar problem here.
keyword :: String -> Parser String
keyword s = try $ string s
*Main> parseWithEof (keyword "not") "not"
Right "not"
*Main> parseWithEof (keyword "not") "nothing"
Left (line 1, column 4):
unexpected 'h'
expecting end of input
*Main> parseWithEof (keyword "not" <|> keyword "nothing") "nothing"
Left (line 1, column 4):
unexpected 'h'
expecting end of input
*Main> parseWithEof (keyword "nothing" <|> keyword "not") "nothing"
Right "nothing"
We can fix this overlapping prefix issue by reordering the
choices. But let’s fix the keyword
parser in a similar way to the
symbol parser.
TODO: I don’t know if symbol is the right name, I don’t think Parsec usually uses symbol in this way. Maybe it should be called operator.
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
keyword1 :: String -> Parser String
keyword1 k = try $ do
i <- identifier
guard (i == k)
return k
TODO: later note in error messages about choosing identifier here instead of e.g. many1 letter.
*Main> parseWithEof (keyword1 "not") "not"
Right "not"
*Main> parseWithEof (keyword1 "not") "nothing"
Left (line 1, column 8):
unexpected end of input
expecting digit, letter or "_"
*Main> parseWithEof (keyword1 "not" <|> keyword1 "nothing") "nothing"
Right "nothing"
*Main> parseWithEof (keyword1 "nothing" <|> keyword1 "not") "nothing"
Right "nothing"
*Main> parseWithEof (keyword1 "not" <|> keyword1 "nothing") "not"
Right "not"
*Main> parseWithEof (keyword1 "nothing" <|> keyword1 "not") "not"
Right "not"
Try implementing the keyword2
parser which uses notFollowedBy
instead of guard
, using something analogous to the change from
symbol1
to symbol2
above.
After this, you can try reimplementing the expression parser from the Text.Parsec.Expr tutorial using the new symbol and keyword parsers.
9. Permutation parsing
This can parse a bunch of different things in any order. TODO: examples.
10. Token parsing
This can be used quickly create a set of token parsers handling lots of little issues which you otherwise have to deal with manually.
11. Value expressions
In this tutorial, we will build a parser for a subset of SQL value expressions. These are roughly the same as the expressions used in languages like Haskell or C. This will follow on from the work on expressions in previous tutorials.
Our value expressions will support literals, identifiers, asterisk, some simple operators, case expression and parentheses.
The phrase 'value expression' is from the ANSI SQL standards. What we will develop here isn’t exactly ANSI SQL value expressions, and we won’t use them exactly how the standards do, but the differences really aren’t important right now. I will come back to this in a later tutorial.
TODO: not so sure anymore what is a 'value expression' in the standard, and what is a 'scalar expression'. Find out and document it.
{-# LANGUAGE TupleSections #-}
module ValueExpressions (ValueExpr(..)
,valueExpr
,keyword
,symbol
,identifier
,makeTest
,blackListValueExpr
,comma
,parens
) where
import Text.Parsec.String (Parser)
import Text.Parsec.String.Char (oneOf, digit, string, anyChar, char, letter)
import Text.Parsec.String.Combinator (many1, manyTill, eof, choice, between
,sepBy, optionMaybe)
import Text.Parsec.String.Parsec (try,parse)
import Control.Applicative (many, (<*),(<$>), (*>), (<|>),(<$),(<*>))
import Control.Monad (void,guard)
import qualified Text.Parsec.String.Expr as E
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import Debug.Trace (trace)
11.1. Lineup
11.1.1. comments
We will start supporting comments. It will support the two standard comment syntaxes from the standard:
-- single line comment
/*
multiline
comment
*/
The /* */
comments do not nest.
11.1.2. literals
It will just support positive integral and string literals at this time. Proper SQL supports more literal types including some quite weird syntax which we will skip for now.
1
500
'string literal'
11.1.3. identifiers
We will use simple identifiers: an identifier may start with a letter or underscore, and contain letters, underscores and numbers. Full SQL identifiers are more complicated to support so we will skip this for now also.
a
something
_test_
a123
11.1.4. 'dotted identifiers'
We will do some limited support for identifiers with two parts separated by a dot. I don’t want to get into the exact meaning or the various names used to describe these since it is a bit confusing, especially in SQL. We can just stick to the syntax. Both parts must parse according to the identifier rules above.
t.a
something.something_else
11.1.5. star
We will support the star as special expression which can be used at the top level of select lists (and a few other places in SQL). We will also support a 'dotted star'.
*
t.*
11.1.6. function application
This represents any syntax which looks like the normal function application used in languages like C. The function name must parse as a valid identifier according to the rules above.
f()
g(1)
h(2,'something')
11.1.7. operators
We will only support a small range of binary operators for now plus a
single prefix unary operator (not
). We will attempt to support
correct precedence and associativity for these via the
Text.Parsec.Expr module. Here is a complete list of all the supported
operators.
a = b
a > b
a < b
a >= b
a <= b
a != b
a <> b (two spellings of the same thing, we will parse them as
separate operators though)
a and b
a or b
1 + 2
1 - 2
1 * 2
1 / 2
'some' || 'thing'
a like b
not a
11.1.8. case expression
There are two standard variations of case expressions in SQL. One is more like a switch statement in C (but is an expression, not a statement):
case a
when 3 then 'got three'
when 5 then 'got five'
else 'neither'
end
The other has a boolean expression in each branch:
case
when a = 3 then 'a is three'
when b = 4 then 'b is four'
else 'neither'
end
The else branch is optional (if it is missing, it implicitly means 'else null').
11.1.9. parentheses
It will parse and represent parentheses explicitly in the abstract syntax, like we did with the previous expression parsers.
(1 + 2) * 3
11.2. abstract syntax for value expressions
Here is a type to represent value expressions:
data ValueExpr = StringLit String
| NumLit Integer
| Iden String
| DIden String String -- a.b
| Star
| DStar String -- t.*
| App String [ValueExpr]
| PrefOp String ValueExpr
| BinOp ValueExpr String ValueExpr
| Case (Maybe ValueExpr) -- test value
[(ValueExpr,ValueExpr)] -- when branches
(Maybe ValueExpr) -- else value
| Parens ValueExpr
deriving (Eq,Show)
Here is the plan for tackling this:
Let’s write some simple automated tests to check our progress and check for regressions.
We can start by using the code already written to produce a partial expression parser, then add parsing for each new constructor one at a time.
11.3. automated testing framework
Let’s start with some examples we can turn into automated tests. Here
are the constructors above which I think we’ve already more or less
implemented the parsers for in previous tutorials: NumLit
, Iden
,
PrefixOp
, BinaryOp
and Parens
.
numLitTests :: [(String,ValueExpr)]
numLitTests =
[("1", NumLit 1)
,("54321", NumLit 54321)]
idenTests :: [(String,ValueExpr)]
idenTests =
[("test", Iden "test")
,("_something3", Iden "_something3")]
operatorTests :: [(String,ValueExpr)]
operatorTests =
map (\o -> (o ++ " a", PrefOp o (Iden "a"))) ["not", "+", "-"]
++ map (\o -> ("a " ++ o ++ " b", BinOp (Iden "a") o (Iden "b")))
["=",">","<", ">=", "<=", "!=", "<>"
,"and", "or", "+", "-", "*", "/", "||", "like"]
*ValueExpressions> import Data.List
*ValueExpressions Data.List> putStrLn $ intercalate "\n" $ map show operatorTests
("not a",PrefOp "not" (Iden "a"))
("+ a",PrefOp "+" (Iden "a"))
("- a",PrefOp "-" (Iden "a"))
("a = b",BinOp (Iden "a") "=" (Iden "b"))
("a > b",BinOp (Iden "a") ">" (Iden "b"))
("a < b",BinOp (Iden "a") "<" (Iden "b"))
("a >= b",BinOp (Iden "a") ">=" (Iden "b"))
("a <= b",BinOp (Iden "a") "<=" (Iden "b"))
("a != b",BinOp (Iden "a") "!=" (Iden "b"))
("a <> b",BinOp (Iden "a") "<>" (Iden "b"))
("a and b",BinOp (Iden "a") "and" (Iden "b"))
("a or b",BinOp (Iden "a") "or" (Iden "b"))
("a + b",BinOp (Iden "a") "+" (Iden "b"))
("a - b",BinOp (Iden "a") "-" (Iden "b"))
("a * b",BinOp (Iden "a") "*" (Iden "b"))
("a / b",BinOp (Iden "a") "/" (Iden "b"))
("a || b",BinOp (Iden "a") "||" (Iden "b"))
("a like b",BinOp (Iden "a") "like" (Iden "b"))
parensTests :: [(String,ValueExpr)]
parensTests = [("(1)", Parens (NumLit 1))]
basicTests :: [(String,ValueExpr)]
basicTests = numLitTests ++ idenTests ++ operatorTests ++ parensTests
Here is a test runner which uses HUnit:
makeTest :: (Eq a, Show a) => Parser a -> (String,a) -> H.Test
makeTest parser (src,expected) = H.TestLabel src $ H.TestCase $ do
let gote = parse (whitespace *> parser <* eof) "" src
case gote of
Left e -> H.assertFailure $ show e
Right got -> H.assertEqual src expected got
11.4. the parsing we already have
11.4.1. tokens
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = try $ lexeme $ do
u <- many1 (oneOf "<>=+-^%/*!|")
guard (s == u)
return s
openParen :: Parser Char
openParen = lexeme $ char '('
closeParen :: Parser Char
closeParen = lexeme $ char ')'
The whitespace parser is below, since it includes some new code to deal with comments.
11.4.2. helper functions
keyword :: String -> Parser String
keyword k = try $ do
i <- identifier
guard (i == k)
return k
TODO: find a place to discuss putting try in the keyword and symbol parsers. I think maybe this should come in the error message tutorial?
parens :: Parser a -> Parser a
parens = between openParen closeParen
11.4.3. terms
num :: Parser ValueExpr
num = NumLit <$> integer
iden :: Parser ValueExpr
iden = Iden <$> identifier
I’m going to parameterize the parens parser again to avoid rewriting lots of very similar code in this tutorial.
parensValue :: Parser ValueExpr -> Parser ValueExpr
parensValue val = Parens <$> parens val
term0 :: Parser ValueExpr
term0 = iden <|> num <|> parensValue valueExpr0
11.4.4. operator table and the first value expression parser
I’ve added all the new operators in this table.
table :: [[E.Operator ValueExpr]]
table = [[prefix "-", prefix "+"]
,[binary "^" E.AssocLeft]
,[binary "*" E.AssocLeft
,binary "/" E.AssocLeft
,binary "%" E.AssocLeft]
,[binary "+" E.AssocLeft
,binary "-" E.AssocLeft]
,[binary "<=" E.AssocRight
,binary ">=" E.AssocRight
,binaryK "like" E.AssocNone
,binary "!=" E.AssocRight
,binary "<>" E.AssocRight
,binary "||" E.AssocRight]
,[binary "<" E.AssocNone
,binary ">" E.AssocNone]
,[binary "=" E.AssocRight]
,[prefixK "not"]
,[binaryK "and" E.AssocLeft]
,[binaryK "or" E.AssocLeft]]
where
binary name assoc =
E.Infix (mkBinOp name <$ symbol name) assoc
mkBinOp nm a b = BinOp a nm b
prefix name = E.Prefix (PrefOp name <$ symbol name)
binaryK name assoc =
E.Infix (mkBinOp name <$ keyword name) assoc
prefixK name = E.Prefix (PrefOp name <$ keyword name)
valueExpr0 :: Parser ValueExpr
valueExpr0 = E.buildExpressionParser table term0
Now we can run the tests:
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr0) basicTests
Cases: 23 Tried: 23 Errors: 0 Failures: 0
Counts {cases = 23, tried = 23, errors = 0, failures = 0}
11.5. new parsers
Let’s start extending things one constructor at a time, but first we will do comments.
11.5.1. whitespace and comments
Here is our old whitespace parser:
whitespace0 :: Parser ()
whitespace0 = void $ many $ oneOf " \n\t"
We are going to change this to do comments as well.
TODO: build these two parsers up in stages, show why each bit is there
lineComment :: Parser ()
lineComment = void (try (string "--") *>
manyTill anyChar (void (char '\n') <|> eof))
blockComment :: Parser ()
blockComment = void (-- no nesting of block comments in SQL
try (string "/*")
-- TODO: why is try used here
*> manyTill anyChar (try $ string "*/"))
todo: discuss how to compose to create the whitespace parser todo: create the whitespace parser without many1 and return () and show the problem.
Here is the final parser for whitespace:
whitespace :: Parser ()
whitespace =
choice [simpleWhitespace *> whitespace
,lineComment *> whitespace
,blockComment *> whitespace
,return ()]
where
lineComment = try (string "--")
*> manyTill anyChar (void (char '\n') <|> eof)
blockComment = try (string "/*")
*> manyTill anyChar (try $ string "*/")
simpleWhitespace = void $ many1 (oneOf " \t\n")
11.5.2. string literal
Our string literal is any characters except single quote enclosed in single quotes. We aren’t going to support other string literal syntaxes or escaping single quotes within a string right now.
Here are some examples:
stringLiteralTests :: [(String,ValueExpr)]
stringLiteralTests =
[("''", StringLit "")
,("'test'", StringLit "test")]
We need a new token parser for string literals:
stringToken :: Parser String
stringToken = lexeme (char '\'' *> manyTill anyChar (char '\''))
TODO: more explanation
And the string literal expression parser:
stringLit :: Parser ValueExpr
stringLit = StringLit <$> stringToken
Here is the new value expression parser:
term1 :: Parser ValueExpr
term1 = iden <|> num <|> parensValue valueExpr1 <|> stringLit
valueExpr1 :: Parser ValueExpr
valueExpr1 = E.buildExpressionParser table term1
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr1) (basicTests ++ stringLiteralTests)
Cases: 25 Tried: 25 Errors: 0 Failures: 0
Counts {cases = 25, tried = 25, errors = 0, failures = 0}
So far, so good.
11.5.3. dotted identifier
dIdenTests :: [(String,ValueExpr)]
dIdenTests =
[("t.a", DIden "t" "a")]
Here is a new token parser to use:
dot :: Parser Char
dot = lexeme $ char '.'
dIden :: Parser ValueExpr
dIden = DIden <$> identifier <*> (dot *> identifier)
term2 :: Parser ValueExpr
term2 = iden <|> num <|> parensValue valueExpr2
<|> stringLit <|> dIden
valueExpr2 :: Parser ValueExpr
valueExpr2 = E.buildExpressionParser table term2
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr2) (basicTests ++ stringLiteralTests ++ dIdenTests)
### Failure in: 25:t.a
(line 1, column 2):
unexpected '.'
expecting digit, letter, "_", "--", "/*", operator or end of input
Cases: 26 Tried: 26 Errors: 0 Failures: 1
Counts {cases = 26, tried = 26, errors = 0, failures = 1}
Do you know why this happened? Can you think of a way to solve this?
Here is the usual blunt hammer technique: reorder the choices to put
the longest choice first and use try
when there is a common prefix.
term3 :: Parser ValueExpr
term3 = try dIden <|> iden <|> num <|> parensValue valueExpr3
<|> stringLit
valueExpr3 :: Parser ValueExpr
valueExpr3 = E.buildExpressionParser table term3
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr3) (basicTests ++ stringLiteralTests ++ dIdenTests)
Cases: 26 Tried: 26 Errors: 0 Failures: 0
Counts {cases = 26, tried = 26, errors = 0, failures = 0}
11.5.4. star
star - no surprises
starTests :: [(String,ValueExpr)]
starTests = [("*", Star)]
star :: Parser ValueExpr
star = Star <$ symbol "*"
term4 :: Parser ValueExpr
term4 = try dIden <|> iden <|> num <|> parensValue valueExpr4
<|> stringLit <|> star
valueExpr4 :: Parser ValueExpr
valueExpr4 = E.buildExpressionParser table term4
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr4) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests)
Cases: 27 Tried: 27 Errors: 0 Failures: 0
Counts {cases = 27, tried = 27, errors = 0, failures = 0}
11.5.5. dotted star
Here is dotted star.
dStarTests :: [(String,ValueExpr)]
dStarTests = [("t.*", DStar "t")]
dstar :: Parser ValueExpr
dstar = DStar <$> (identifier <* dot <* symbol "*")
We’ll have the same issue we’ve seen before, so let’s go straight to the solution.
term5 :: Parser ValueExpr
term5 = try dstar <|> try dIden <|> iden <|> num
<|> parensValue valueExpr5 <|> stringLit <|> star
valueExpr5 :: Parser ValueExpr
valueExpr5 = E.buildExpressionParser table term5
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr5) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests)
Cases: 28 Tried: 28 Errors: 0 Failures: 0
Counts {cases = 28, tried = 28, errors = 0, failures = 0}
11.5.6. app
The App constructor is used for syntax which looks like regular function application: f(), f(a), f(a,b), etc.
appTests :: [(String,ValueExpr)]
appTests = [("f()", App "f" [])
,("f(1)", App "f" [NumLit 1])
,("f(1,a)", App "f" [NumLit 1, Iden "a"])]
Here is the parser, parameterized like the parensValue
parser for
the same reason, so we can reuse the code for different versions of
the value expression parser:
app :: Parser ValueExpr -> Parser ValueExpr
app val = App <$> identifier <*> parens (commaSep val)
And here is the commaSep
helper:
commaSep :: Parser a -> Parser [a]
commaSep = (`sepBy` comma)
comma :: Parser Char
comma = lexeme $ char ','
It is another parser with the identifier
prefix, so another try
.
term6 :: Parser ValueExpr
term6 = try (app valueExpr6) <|> try dstar <|> try dIden <|> iden
<|> num <|> parensValue valueExpr6 <|> stringLit <|> star
valueExpr6 :: Parser ValueExpr
valueExpr6 = E.buildExpressionParser table term6
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr6) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests ++ appTests)
Cases: 31 Tried: 31 Errors: 0 Failures: 0
Counts {cases = 31, tried = 31, errors = 0, failures = 0}
Everything looks good.
11.5.7. case
Here is something a little more interesting.
caseTests :: [(String,ValueExpr)]
caseTests =
[("case a when 1 then 2 end"
,Case (Just $ Iden "a") [(NumLit 1,NumLit 2)] Nothing)
,("case a when 1 then 2 when 3 then 4 end"
,Case (Just $ Iden "a")
[(NumLit 1, NumLit 2)
,(NumLit 3, NumLit 4)]
Nothing)
,("case a when 1 then 2 when 3 then 4 else 5 end"
,Case (Just $ Iden "a")
[(NumLit 1, NumLit 2)
,(NumLit 3, NumLit 4)]
(Just $ NumLit 5))
,("case when a=1 then 2 when a=3 then 4 else 5 end"
,Case Nothing
[(BinOp (Iden "a") "=" (NumLit 1), NumLit 2)
,(BinOp (Iden "a") "=" (NumLit 3), NumLit 4)]
(Just $ NumLit 5))
]
How can we approach this? We know that there will always be a case keyword at the start, and an end keyword at the end. Each when branch seems to be self contained.
Here is a rough pseudo-code sketch:
keyword "case"
optional test_expression
many1 when_clause
optional else clause
keyword "end"
Here is the case parser based simply on this pseudo-code.
caseValue0 :: Parser ValueExpr -> Parser ValueExpr
caseValue0 val = do
void $ keyword "case"
testExp <- optionMaybe val
whens <- many1 whenClause
els <- optionMaybe elseClause
void $ keyword "end"
return $ Case testExp whens els
where
whenClause = (,) <$> (keyword "when" *> val)
<*> (keyword "then" *> val)
elseClause = keyword "else" *> val
Let’s try it on its own first and see if we have any problems:
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (caseValue0 valueExpr6)) caseTests in: 3:case when a=1 then 2 when a=3 then 4 else 5 end
(line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"
Cases: 4 Tried: 4 Errors: 0 Failures: 1
Counts {cases = 4, tried = 4, errors = 0, failures = 1}
Three work, and one fails. What happened here? Try to run the parsing code in your head to see if you can work out what the problem is.
Here is a method we can use when you can’t manage to locate a problem in this way.
Let’s single out the failure first:
*ValueExpressions> parseWithEof (caseValue0 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
Left (line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"
Here is the caseValue
parser rewritten with trace
interleaved:
caseValue1 :: Parser ValueExpr -> Parser ValueExpr
caseValue1 val = do
trace "start case" $ return ()
void $ keyword "case"
trace "read case keyword" $ return ()
testExp <- optionMaybe val
trace ("read testExpr: " ++ show testExp) $ return ()
whens <- many1 whenClause
trace ("read whens: " ++ show (length whens)) $ return ()
els <- optionMaybe elseClause
trace ("read else: " ++ show els) $ return ()
void $ keyword "end"
trace "read end keyword" $ return ()
return $ Case testExp whens els
where
whenClause = do
trace "start when clause" $ return ()
void $ keyword "when"
trace "read when keyword" $ return ()
w <- val
trace ("read when exp: " ++ show w) $ return ()
void $ keyword "then"
trace "read then keyword" $ return ()
t <- val
trace ("read then exp: " ++ show t) $ return ()
return (w,t)
elseClause = do
trace "start else clause" $ return ()
void $ keyword "else"
trace "read else keyword" $ return ()
v <- val
trace ("read else exp: " ++ show v) $ return ()
return v
Now when we run the test, we can see a trace of what happens:
*ValueExpressions> parseWithEof (caseValue1 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
start case
read case keyword
read testExpr: Just (Iden "when")
start when clause
Left (line 1, column 12):
unexpected "="
expecting operator, digit, letter, "_", "--" or "/*"
Look at the trace and see if you spot the problem.
We’ve parsed the when
keyword as an identifier for the optional
first expression. The way we deal with this is to use a new identifier
parser which has a blacklist of keywords which can’t be identifiers.
caseValue2 :: Parser ValueExpr -> Parser ValueExpr
caseValue2 val = do
trace "start case" $ return ()
void $ keyword "case"
trace "read case keyword" $ return ()
testExp <- optionMaybe caseVal
trace ("read testExpr: " ++ show testExp) $ return ()
whens <- many1 whenClause
trace ("read whens: " ++ show (length whens)) $ return ()
els <- optionMaybe elseClause
trace ("read else: " ++ show els) $ return ()
void $ keyword "end"
trace "read end keyword" $ return ()
return $ Case testExp whens els
where
whenClause = do
trace "start when clause" $ return ()
void $ keyword "when"
trace "read when keyword" $ return ()
w <- caseVal
trace ("read when exp: " ++ show w) $ return ()
void $ keyword "then"
trace "read then keyword" $ return ()
t <- caseVal
trace ("read then exp: " ++ show t) $ return ()
return (w,t)
elseClause = do
trace "start else clause" $ return ()
void $ keyword "else"
trace "read else keyword" $ return ()
v <- caseVal
trace ("read else exp: " ++ show v) $ return ()
return v
-- here is the fix, we replace calls to `val` with calls to
-- this function which prevents any of the case keywords from
-- being parsed as identifiers
caseVal = try $ do
v <- val
guard $ case v of
Iden i | i `elem` ["case", "when", "then", "else", "end"] -> False
_ -> True
return v
Maybe this blacklist used in this way isn’t permissive enough, or is too permissive, but let’s not get lost in these details right now, and come back to it later.
*ValueExpressions> parseWithEof (caseValue2 valueExpr6) "case when a=1 then 2 when a=3 then 4 else 5 end"
start case
read case keyword
read testExpr: Nothing
start when clause
read when keyword
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
start else clause
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Right (Case Nothing [(BinOp (Iden "a") "=" (NumLit 1),NumLit 2),(BinOp (Iden "a") "=" (NumLit 3),NumLit 4)] (Just (NumLit 5)))
Looks good. Have a close look through the trace to see if you can follow it all and it makes sense. TODO: There appear to be some messages missing - I think it is some sort of memoization effect.
Let’s run all the tests with the tracing still in so if there are any failures we can zero in on them.
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (caseValue2 valueExpr6)) caseTests
Cases: 4 Tried: 0 Errors: 0 Failures: 0start case
read case keyword
read testExpr: Just (Iden "a")
start when clause
read when keyword
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read whens: 1
start else clause
read else: Nothing
read end keyword
Cases: 4 Tried: 1 Errors: 0 Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else: Nothing
read end keyword
Cases: 4 Tried: 2 Errors: 0 Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 4 Tried: 3 Errors: 0 Failures: 0read testExpr: Nothing
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 4 Tried: 4 Errors: 0 Failures: 0
Counts {cases = 4, tried = 4, errors = 0, failures = 0}
Now the full expression parser:
term7 :: Parser ValueExpr
term7 = caseValue2 valueExpr7
<|> try (app valueExpr7) <|> try dstar <|> try dIden <|> iden
<|> num <|> parensValue valueExpr7 <|> stringLit <|> star
valueExpr7 :: Parser ValueExpr
valueExpr7 = E.buildExpressionParser table term7
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr7) (basicTests ++ stringLiteralTests ++ dIdenTests ++ starTests ++ dStarTests ++ appTests ++ caseTests)
Cases: 35 Tried: 0 Errors: 0 Failures: 0start case
Cases: 35 Tried: 31 Errors: 0 Failures: 0read case keyword
read testExpr: Just (Iden "a")
start when clause
read when keyword
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read whens: 1
start else clause
read else: Nothing
read end keyword
Cases: 35 Tried: 32 Errors: 0 Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else: Nothing
read end keyword
Cases: 35 Tried: 33 Errors: 0 Failures: 0read testExpr: Just (Iden "a")
read when exp: NumLit 1
read then keyword
read then exp: NumLit 2
read when exp: NumLit 3
read then keyword
read then exp: NumLit 4
read whens: 2
read else keyword
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 35 Tried: 34 Errors: 0 Failures: 0read testExpr: Nothing
read when exp: BinOp (Iden "a") "=" (NumLit 1)
read then keyword
read then exp: NumLit 2
read when exp: BinOp (Iden "a") "=" (NumLit 3)
read then keyword
read then exp: NumLit 4
read whens: 2
read else exp: NumLit 5
read else: Just (NumLit 5)
read end keyword
Cases: 35 Tried: 35 Errors: 0 Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}
Everything looks good. Let’s refactor the case parser and tidy everything up.
blackListValueExpr :: [String] -> Parser ValueExpr -> Parser ValueExpr
blackListValueExpr blackList val = try $ do
v <- val
guard $ case v of
Iden i | i `elem` blackList -> False
_ -> True
return v
Here is the parser with the trace`s removed, and using the new
`blackListValueExpr
parser.
caseValue3 :: Parser ValueExpr -> Parser ValueExpr
caseValue3 val = do
void $ keyword "case"
testExp <- optionMaybe caseVal
whens <- many1 whenClause
els <- optionMaybe elseClause
void $ keyword "end"
return $ Case testExp whens els
where
whenClause = do
void $ keyword "when"
w <- caseVal
void $ keyword "then"
t <- caseVal
return (w,t)
elseClause = do
void $ keyword "else"
v <- caseVal
return v
caseVal = blackListValueExpr blackList val
blackList = ["case", "when", "then", "else", "end"]
And here it is after some shortening:
caseValue4 :: Parser ValueExpr -> Parser ValueExpr
caseValue4 val =
Case
<$> (keyword "case" *> optionMaybe caseVal)
<*> many1 whenClause
<*> optionMaybe elseClause
<* keyword "end"
where
whenClause = (,) <$> (keyword "when" *> caseVal)
<*> (keyword "then" *> caseVal)
elseClause = keyword "else" *> caseVal
caseVal = blackListValueExpr blackList val
blackList = ["case", "when", "then", "else", "end"]
I think we passed the point where (<|>)
is more readable than
choice
:
term8 :: Parser ValueExpr
term8 = choice [caseValue4 valueExpr8
,try $ app valueExpr8
,try dstar
,try dIden
,iden
,num
,parensValue valueExpr8
,stringLit
,star]
valueExpr8 :: Parser ValueExpr
valueExpr8 = E.buildExpressionParser table term8
allExpressionTests :: [(String,ValueExpr)]
allExpressionTests = concat [basicTests
,stringLiteralTests
,dIdenTests
,starTests
,dStarTests
,appTests
,caseTests]
Let’s double check:
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest valueExpr8) allExpressionTests
Cases: 35 Tried: 35 Errors: 0 Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}
TODO: review code above for new syntax patterns to talk about. Or maybe don’t talk about them?
Let’s predict that we will need more blacklisting when we work on the query expression parsing, and create a parser which can be used in this way:
term :: [String] -> Parser ValueExpr
term blackList = choice [caseValue4 (valueExpr blackList)
,try (app (valueExpr blackList))
,try dstar
,try dIden
,blackListValueExpr blackList iden
,num
,parensValue (valueExpr blackList)
,stringLit
,star]
TODO: need to change app, dstar and diden for blacklisting?
valueExpr :: [String] -> Parser ValueExpr
valueExpr blackList = E.buildExpressionParser table (term blackList)
Final sanity check:
*ValueExpressions> H.runTestTT $ H.TestList $ map (makeTest (valueExpr [])) allExpressionTests
Cases: 35 Tried: 35 Errors: 0 Failures: 0
Counts {cases = 35, tried = 35, errors = 0, failures = 0}
12. Query expressions
We can now start on the 'select' parser. In the SQL standard, it refers to these things as 'query expressions' to distinguish then from 'value expressions', so we will reuse this language here.
The subset of SQL we will support is this:
TODO: write lots of examples here
select queries only, no union, intersect or except. No common table expressions.
we will support all the value expressions that the value expression parser above supports
we will support select lists with optional aliases, with the 'as' optional in the alias
we will support the * as in 'select * from t', and the variation 'select t.* from t', but not the alias version select * as (a,b,c) from t.
we support two part dotted identifiers in value expressions, but no other sort (such as 3-part dotted value expression identifiers, or schema qualified table names or function names).
for the from clause, we will only support optional 'from table_name'.
supports where
we will support regular group by lists, but not the new group by options in SQL2003 (group by (), grouping sets, cube, rollup).
we will support having
we support order by, with multiple columns, but not explicit asc or desc , and no 'nulls first' or 'nulls last' syntax.
No support for offset and fetch first, or variations.
{-# LANGUAGE TupleSections #-}
module QueryExpressions where
--import Text.Groom (groom)
--import qualified Text.Parsec as P
import Text.Parsec.String (Parser)
import Text.Parsec (try,optionMaybe, optional, sepBy1,option)
import Control.Applicative ((<$>),(*>),(<*>))
import Control.Monad (void,guard)
--import Debug.Trace
--import Data.List (intercalate)
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import ValueExpressions (ValueExpr(..), valueExpr, identifier, symbol, keyword, comma)
Here is the datatype for query expressions to get started with. In this tutorial, we will only support an optional single table in the from clause, and this will be expanded in the next tutorial.
TODO: rearrange the from/where/groupby/having/orderby to a separate datatype which is optional in a select, and the from part is mandatory in this new type. This follows the standard and is more accurate syntax.
data QueryExpr
= Select
{qeSelectList :: [(ValueExpr,Maybe String)]
,qeFrom :: Maybe String
,qeWhere :: Maybe ValueExpr
,qeGroupBy :: [ValueExpr]
,qeHaving :: Maybe ValueExpr
,qeOrderBy :: [ValueExpr]
} deriving (Eq,Show)
Here is a default value which can be used to easily construct query expression values.
makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
,qeFrom = Nothing
,qeWhere = Nothing
,qeGroupBy = []
,qeHaving = Nothing
,qeOrderBy = []}
12.1. select lists
Let’s start with something simple:
select [value expr]
TODO: shorten the names of these examples
singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
[("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]
Here are a couple of wrappers for symbol
and keyword
which wrap
them with void.
keyword_ :: String -> Parser ()
keyword_ = void . keyword
symbol_ :: String -> Parser ()
symbol_ = void . symbol
singleSelectItem :: Parser QueryExpr
singleSelectItem = do
keyword_ "select"
e <- valueExpr []
return $ makeSelect {qeSelectList = [(e,Nothing)]}
You can use the old test runner to check these:
*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest singleSelectItem) parseSingleSelectItemTestData
Cases: 1 Tried: 1 Errors: 0 Failures: 0
Counts {cases = 1, tried = 1, errors = 0, failures = 0}
Let’s rewrite it in the Applicative and mostly point free style.
singleSelectItemApplicative :: Parser QueryExpr
singleSelectItemApplicative =
(\sl -> makeSelect {qeSelectList = sl})
<$> (keyword_ "select" *> (((:[]) . (,Nothing)) <$> valueExpr []))
That didn’t go so well. Using a wrapper:
singleSelectItemApplicative' :: Parser QueryExpr
singleSelectItemApplicative' =
ms <$> (keyword_ "select" *> valueExpr [])
where
ms e = makeSelect {qeSelectList = [(e,Nothing)]}
Now let’s write something that supports multiple value expressions, e.g.
select 1+2, 3+4;
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
[("select a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
,("select a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)]})
,("select 1+2,3+4"
,makeSelect {qeSelectList =
[(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
]
selectMultipleItems :: Parser QueryExpr
selectMultipleItems = do
keyword_ "select"
es <- commaSep1 (valueExpr [])
return $ makeSelect {qeSelectList = map (,Nothing) es}
commaSep1 :: Parser a -> Parser [a]
commaSep1 = (`sepBy1` comma)
12.1.1. aliases
We can write names for the columns produced from a select list using
the keyword as
, and we can miss out the as
:
select a as a1, b as b1, f(c) as c1;
-- no as
select a a1, b b1;
selectListTests :: [(String,QueryExpr)]
selectListTests =
[("select a as a1, b as b1"
,makeSelect {qeSelectList = [(Iden "a", Just "a1")
,(Iden "b", Just "b1")]})
,("select a a1, b b1"
,makeSelect {qeSelectList = [(Iden "a", Just "a1")
,(Iden "b", Just "b1")]})
] ++ multipleSelectItemsTests
++ singleSelectItemTests
Finally, here is the select list parser and the helper for select items:
selectItem0 :: Parser (ValueExpr, Maybe String)
selectItem0 = (,) <$> valueExpr [] <*> optionMaybe (try alias)
where alias = optional (keyword_ "as") *> identifier
selectList0 :: Parser [(ValueExpr, Maybe String)]
selectList0 = keyword_ "select" *> commaSep1 selectItem0
queryExpr0 :: Parser QueryExpr
queryExpr0 = mkSelect <$> selectList0
where mkSelect sl = makeSelect {qeSelectList = sl}
12.2. from clause
from :: Parser String
from = keyword_ "from" *> identifier
fromTests :: [(String,QueryExpr)]
fromTests =
[("select a from t"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = (Just "t")})]
queryExpr1 :: Parser QueryExpr
queryExpr1 = mkSelect
<$> selectList0
<*> optionMaybe from
where mkSelect sl fr = makeSelect {qeSelectList = sl
,qeFrom = fr}
*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest queryExpr1) (selectListTests ++ fromTests)
### Failure in: 6:select a from t
(line 1, column 16):
unexpected end of input
expecting digit, letter, "_", "--" or "/*"
Cases: 7 Tried: 7 Errors: 0 Failures: 1
Counts {cases = 7, tried = 7, errors = 0, failures = 1}
This is a keyword issue again. We are parsing the from
as if it was
a column alias and then getting stuck.
blackListIdentifier :: [String] -> Parser String
blackListIdentifier bl = do
i <- identifier
guard (i `notElem` bl)
return i
selectItem :: Parser (ValueExpr, Maybe String)
selectItem = (,) <$> valueExpr [] <*> optionMaybe (try alias)
where alias = optional (keyword_ "as") *> blackListIdentifier ["from"]
selectList :: Parser [(ValueExpr, Maybe String)]
selectList = keyword_ "select" *> commaSep1 selectItem
queryExpr2 :: Parser QueryExpr
queryExpr2 = mkSelect
<$> selectList
<*> optionMaybe from
where mkSelect sl fr = makeSelect {qeSelectList = sl
,qeFrom = fr}
That did the job for now.
12.3. where
The where, group by, having, and order by parsers are simple.
whereTests :: [(String,QueryExpr)]
whereTests =
[("select a from t where a = 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = Just "t"
,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
]
whereClause :: Parser ValueExpr
whereClause = keyword_ "where" *> valueExpr []
queryExpr3 :: Parser QueryExpr
queryExpr3 = mkSelect
<$> selectList
<*> optionMaybe from
<*> optionMaybe whereClause
where mkSelect sl fr wh =
makeSelect {qeSelectList = sl
,qeFrom = fr
,qeWhere = wh}
12.4. group by
groupByTests :: [(String,QueryExpr)]
groupByTests =
[("select a,sum(b) from t group by a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = Just "t"
,qeGroupBy = [Iden "a"]
})
,("select a,b,sum(c) from t group by a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)
,(App "sum" [Iden "c"],Nothing)]
,qeFrom = Just "t"
,qeGroupBy = [Iden "a",Iden "b"]
})
]
groupByClause :: Parser [ValueExpr]
groupByClause = keyword_ "group" *> keyword_ "by"
*> commaSep1 (valueExpr [])
queryExpr4 :: Parser QueryExpr
queryExpr4 = mkSelect
<$> selectList
<*> optionMaybe from
<*> optionMaybe whereClause
<*> option [] groupByClause
where mkSelect sl fr wh gr =
makeSelect {qeSelectList = sl
,qeFrom = fr
,qeWhere = wh
,qeGroupBy = gr}
12.5. having
havingTests :: [(String,QueryExpr)]
havingTests =
[("select a,sum(b) from t group by a having sum(b) > 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = (Just "t")
,qeGroupBy = [Iden "a"]
,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
})
]
having :: Parser ValueExpr
having = keyword_ "having" *> (valueExpr [])
queryExpr5 :: Parser QueryExpr
queryExpr5 = mkSelect
<$> selectList
<*> optionMaybe from
<*> optionMaybe whereClause
<*> option [] groupByClause
<*> optionMaybe having
where mkSelect sl fr wh gr hv =
makeSelect {qeSelectList = sl
,qeFrom = fr
,qeWhere = wh
,qeGroupBy = gr
,qeHaving = hv}
Looking nice so far. Did you run the tests for each stage?
12.6. order by
orderByTests :: [(String,QueryExpr)]
orderByTests =
[("select a from t order by a"
,ms [Iden "a"])
,("select a from t order by a, b"
,ms [Iden "a", Iden "b"])
]
where
ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = (Just "t")
,qeOrderBy = o}
orderBy :: Parser [ValueExpr]
orderBy = keyword_ "order" *> keyword_ "by"
*> commaSep1 (valueExpr [])
queryExpr6 :: Parser QueryExpr
queryExpr6 = Select
<$> selectList
<*> optionMaybe from
<*> optionMaybe whereClause
<*> option [] groupByClause
<*> optionMaybe having
<*> option [] orderBy
*QueryExpressions> H.runTestTT $ H.TestList $ map (makeTest queryExpr6) (selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests)
Cases: 13 Tried: 13 Errors: 0 Failures: 0
Counts {cases = 13, tried = 13, errors = 0, failures = 0}
TODO: talk about putting the maybes/default values inside from, where, etc??
13. From clause
In this tutorial, we extend the from clause support to the following: we will support implicit and explicit joins, including keywords natural, inner, outer, left, right, full, cross, on and using, plus parens and simple aliases (e.g. select a from t u, but not select a from t(a,b)). We don’t support oracle outer join syntax (+) or the other 'pre-ANSI' variations on this theme. No lateral keyword or apply or pivot.
{-# LANGUAGE TupleSections #-}
--import Text.Groom (groom)
--import qualified Text.Parsec as P
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Combinator
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>))
import Control.Monad
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import Debug.Trace
TODO: should explicitly import from these two modules (and same in QueryExpressions.lhs)
import ValueExpressions (ValueExpr(..), valueExpr, identifier, makeTest, parens)
import QueryExpressions (selectList,whereClause,groupByClause,having,orderBy
,commaSep1, keyword_,blackListIdentifier)
13.1. Abstract syntax
Here are is the updated QueryExpr
and the new TableRef
abstract
syntax types.
data QueryExpr
= Select
{qeSelectList :: [(ValueExpr,Maybe String)]
,qeFrom :: [TableRef]
,qeWhere :: Maybe ValueExpr
,qeGroupBy :: [ValueExpr]
,qeHaving :: Maybe ValueExpr
,qeOrderBy :: [ValueExpr]
} deriving (Eq,Show)
makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
,qeFrom = []
,qeWhere = Nothing
,qeGroupBy = []
,qeHaving = Nothing
,qeOrderBy = []}
data TableRef = TRSimple String
| TRJoin TableRef JoinType TableRef (Maybe JoinCondition)
| TRParens TableRef
| TRAlias TableRef String
| TRQueryExpr QueryExpr
deriving (Eq,Show)
This syntax for table references can represent invalid syntax, for instance two nested aliases. The justification for this is that sometimes trying to accurately represent only exactly what is valid creates something much more complex. Maybe this is a good tradeoff in this situation, and maybe not.
data JoinType = JoinInner | JoinLeft | JoinRight | JoinFull | JoinCross
deriving (Eq,Show)
data JoinCondition = JoinOn ValueExpr
| JoinUsing [String]
| JoinNatural
deriving (Eq,Show)
With the join condition, we’ve done the opposite to TableRef - we’ve
combined natural
and on
/using
, since only one of these can be
present, even though this departs a little from the concrete
syntax.
First we will develop the standalone from clause parser, then we will update the query expression syntax and parsing to incorporate our new from clause parser.
13.2. simple table name
Let’s start with something simple: a from clause can be multiple comma separated tablerefs, aka an implicit join.
multipleTRSimpleTests :: [(String, [TableRef])]
multipleTRSimpleTests = [("from a,b", [TRSimple "a", TRSimple "b"])]
from0 :: Parser [TableRef]
from0 = keyword_ "from" *> commaSep1 (TRSimple <$> identifier)
*Main> H.runTestTT $ H.TestList $ map (makeTest from0) multipleTRSimpleTests
Cases: 1 Tried: 1 Errors: 0 Failures: 0
Counts {cases = 1, tried = 1, errors = 0, failures = 0}
Let’s do the query expression, parens and alias first, before tackling joins.
13.3. subquery
Here is the example:
trQueryExprTests :: [(String, [TableRef])]
trQueryExprTests =
[("from (select a from t)"
,[TRQueryExpr $ makeSelect {qeSelectList = [(Iden "a", Nothing)]
,qeFrom = [TRSimple "t"]}])]
Here is the query expression parser we can use:
queryExpr1 :: Parser [TableRef] -> Parser QueryExpr
queryExpr1 from' = Select
<$> selectList
<*> option [] from'
<*> optionMaybe whereClause
<*> option [] groupByClause
<*> optionMaybe having
<*> option [] orderBy
from1 :: Parser [TableRef]
from1 =
keyword_ "from" *> commaSep1 trefTerm
where
trefTerm = choice [TRSimple <$> identifier
,TRQueryExpr <$> parens (queryExpr1 from1)]
13.4. parens
We can’t do a sensible example for these right now - we need explicit joins and then the parens can be used to override the associativity of a three way join, or to specify over what part of the expression to apply an alias.
trParensTests :: [(String, [TableRef])]
trParensTests = [("from (a)", [TRParens $ TRSimple "a"])]
We can write some more tests for parens after we’ve done the explicit joins.
from2 :: Parser [TableRef]
from2 =
keyword_ "from" *> commaSep1 trefTerm
where
trefTerm = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from2))
,TRParens <$> parens trefTerm]
13.5. alias
trAliasTests :: [(String,[TableRef])]
trAliasTests = [("from a as b", [TRAlias (TRSimple "a") "b"])
,("from a b", [TRAlias (TRSimple "a") "b"])]
The alias can be treated like a postfix operator.
suffixWrapper :: (a -> Parser a) -> a -> Parser a
suffixWrapper p a = p a <|> return a
TODO: ?? not sure about this
from3 :: Parser [TableRef]
from3 =
keyword_ "from" *> commaSep1 trefTerm
where
trefTerm = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from3))
,TRParens <$> parens trefTerm]
>>= suffixWrapper alias
alias tr = TRAlias tr <$> (optional (keyword_ "as") *> identifier)
How to make it keep nesting?
13.6. joins
Here is a casual sketch of the target grammar:
tref
(cross | [natural]
([inner]
| left [outer]
| right [outer]
| full [outer]
)
join tref
[on expr | using (...)]
Let’s start with parsers for the 'join operator' in the middle and for the join condition:
13.6.1. join type
joinType :: Parser JoinType
joinType = choice
[JoinCross <$ keyword_ "cross" <* keyword_ "join"
,JoinInner <$ keyword_ "inner" <* keyword_ "join"
,JoinLeft <$ keyword_ "left"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinRight <$ keyword_ "right"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinFull <$ keyword_ "full"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinInner <$ keyword_ "join"]
*Main> parseWithEof joinType "cross join"
Right JoinCross
*Main> parseWithEof joinType "inner join"
Right JoinInner
*Main> parseWithEof joinType "left outer join"
Right JoinLeft
*Main> parseWithEof joinType "left join"
Right JoinLeft
*Main> parseWithEof joinType "right outer join"
Right JoinRight
*Main> parseWithEof joinType "right join"
Right JoinRight
*Main> parseWithEof joinType "full outer join"
Right JFull
*Main> parseWithEof joinType "full join"
Right JoinFull
*Main> parseWithEof joinType "join"
Right JoinInner
I thought about factoring out the common bits with the joinType parser:
joinType0 :: Parser JoinType
joinType0 = choice
[choice
[JoinCross <$ try (keyword_ "cross")
,JoinInner <$ try (keyword_ "inner")
,choice [JoinLeft <$ try (keyword_ "left")
,JoinRight <$ try (keyword_ "right")
,JoinFull <$ try (keyword_ "full")]
<* optional (try $ keyword_ "outer")]
<* keyword_ "join"
,JoinInner <$ keyword_ "join"]
But I think the longer version is much easier to follow, even if it is a little more boring.
13.6.2. join condition
The idea with the join condition is that we pass a bool to say whether we’ve already seen the 'natural' keyword. If so, then we don’t try to parse 'on' or 'using'.
joinCondition :: Bool -> Parser JoinCondition
joinCondition nat =
choice [guard nat >> return JoinNatural
,keyword_ "on" >> JoinOn <$> valueExpr []
,keyword_ "using" >> JoinUsing <$> parens (commaSep1 identifier)
]
*Main> parseWithEof (joinCondition False) "on a"
Right (JoinOn (Iden "a"))
*Main> parseWithEof (joinCondition False) "on a + b"
Right (JoinOn (BinOp (Iden "a") "+" (Iden "b")))
*Main> parseWithEof (joinCondition False) "using (a,b)"
Right (JoinUsing ["a","b"])
*Main> parseWithEof (joinCondition True) "using (a,b)"
Left (line 1, column 1):
unexpected 'u'
expecting end of input
*Main> parseWithEof (joinCondition True) ""
Right JoinNatural
13.6.3. simple binary join
Let’s try some simple binary joins:
simpleBinaryJoinTests :: [(String,[TableRef])]
simpleBinaryJoinTests =
[("from a join b"
,[TRJoin (TRSimple "a") JoinInner (TRSimple "b") Nothing])
,("from a natural join b"
,[TRJoin (TRSimple "a") JoinInner (TRSimple "b") (Just JoinNatural)])
,("from a join b on a.x = b.y"
,[TRJoin (TRSimple "a") JoinInner (TRSimple "b")
(Just $ JoinOn $ BinOp (DIden "a" "x") "="(DIden "b" "y"))])
,("from a join b using(x,y)"
,[TRJoin (TRSimple "a") JoinInner (TRSimple "b")
(Just $ JoinUsing ["x","y"])])
,("from a cross join b"
,[TRJoin (TRSimple "a") JoinCross (TRSimple "b") Nothing])
]
We want to parse the first table, then optionally parse the 'natural' keyword, then the join type, then the second table, then optionally parse the join condition.
from4 :: Parser [TableRef]
from4 = keyword_ "from" >> (:[]) <$> do
t0 <- simpleTref
nat <- option False (True <$ keyword_ "natural")
jt <- joinType
t1 <- simpleTref
jc <- optionMaybe (joinCondition nat)
return $ TRJoin t0 jt t1 jc
where
simpleTref = TRSimple <$> identifier
Let’s start extending this into the full target parser. In this next version, I’ve tried to combine all the versions we’ve seen so far.
from5 :: Parser [TableRef]
from5 = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = do
nat <- option False (True <$ keyword_ "natural")
jt <- joinType
t1 <- nonJoinTref
jc <- optionMaybe (joinCondition nat)
return $ TRJoin t0 jt t1 jc
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from5))
,TRParens <$> parens tref]
>>= suffixWrapper alias
alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> identifier))
*Main> H.runTestTT $ H.TestList $ map (makeTest from5) (multipleTRSimpleTests ++ trQueryExprTests ++ trParensTests ++ trAliasTests ++ simpleBinaryJoinTests)
### Failure in: 5:from a join b
(line 1, column 14):
unexpected end of input
expecting digit, letter, "_", "--" or "/*"
### Failure in: 6:from a natural join b
from a natural join b
expected: [TRJoin (TRSimple "a") JoinInner (TRSimple "b") (Just JoinNatural)]
but got: [TRJoin (TRAlias (TRSimple "a") "natural") JoinInner (TRSimple "b") Nothing]
### Failure in: 7:from a join b on a.x = b.y
(line 1, column 15):
unexpected "o"
expecting "--" or "/*"
### Failure in: 8:from a join b using(x,y)
(line 1, column 15):
unexpected "u"
expecting "--" or "/*"
### Failure in: 9:from a cross join b
from a cross join b
expected: [TRJoin (TRSimple "a") JoinCross (TRSimple "b") Nothing]
but got: [TRJoin (TRAlias (TRSimple "a") "cross") JoinInner (TRSimple "b") Nothing]
Cases: 10 Tried: 10 Errors: 0 Failures: 5
Counts {cases = 10, tried = 10, errors = 0, failures = 5}
What’s going wrong? If you look at some of the issues, it looks like we are getting keywords parsed as aliases. Let’s fix that first:
from6 :: Parser [TableRef]
from6 = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = do
nat <- option False (True <$ keyword_ "natural")
jt <- joinType
t1 <- nonJoinTref
jc <- optionMaybe (joinCondition nat)
return $ TRJoin t0 jt t1 jc
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from6))
,TRParens <$> parens tref]
>>= suffixWrapper alias
alias tr = TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier)
aliasIdentifier = blackListIdentifier
["natural"
,"inner"
,"outer"
,"cross"
,"left"
,"right"
,"full"
,"join"
,"on"
,"using"]
That didn’t solve the problem. I think we also have a problem since
the alias
can now fail after consuming input, we need to use try
.
from7 :: Parser [TableRef]
from7 = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = do
nat <- option False (True <$ keyword_ "natural")
jt <- joinType
t1 <- nonJoinTref
jc <- optionMaybe (joinCondition nat)
return $ TRJoin t0 jt t1 jc
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from7))
,TRParens <$> parens tref]
>>= suffixWrapper (try . alias)
alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
aliasIdentifier = blackListIdentifier
["natural"
,"inner"
,"outer"
,"cross"
,"left"
,"right"
,"full"
,"join"
,"on"
,"using"]
The final step is to make it parse n-way explicit joins.
threeWayJoinTests :: [(String,[TableRef])]
threeWayJoinTests =
[("from a join b join c"
,[TRJoin
(TRJoin (TRSimple "a") JoinInner (TRSimple "b") Nothing)
JoinInner (TRSimple "c") Nothing])]
from8 :: Parser [TableRef]
from8 = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = (do
nat <- option False (True <$ keyword_ "natural")
TRJoin t0 <$> joinType
<*> nonJoinTref
<*> optionMaybe (joinCondition nat))
>>= suffixWrapper joinTrefSuffix
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens (queryExpr1 from8))
,TRParens <$> parens tref]
>>= suffixWrapper (try . alias)
alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
aliasIdentifier = blackListIdentifier
["natural"
,"inner"
,"outer"
,"cross"
,"left"
,"right"
,"full"
,"join"
,"on"
,"using"]
We get left associative with this code. I don’t know if this is correct.
We should do some more testing to make sure this code is good. TODO
13.7. query expressions
Let’s create the full query expression parser now:
queryExprJoinTests :: [(String,QueryExpr)]
queryExprJoinTests =
[("select a from t"
,ms [TRSimple "t"])
,("select a from t,u"
,ms [TRSimple "t", TRSimple "u"])
,("select a from t inner join u on expr"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t left join u on expr"
,ms [TRJoin (TRSimple "t") JoinLeft (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t right join u on expr"
,ms [TRJoin (TRSimple "t") JoinRight (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t full join u on expr"
,ms [TRJoin (TRSimple "t") JoinFull (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t cross join u"
,ms [TRJoin (TRSimple "t") JoinCross (TRSimple "u") Nothing])
,("select a from t natural inner join u"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just JoinNatural)])
,("select a from t inner join u using(a,b)"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just $ JoinUsing ["a", "b"])])
,("select a from (select a from t)"
,ms [TRQueryExpr $ ms [TRSimple "t"]])
,("select a from t as u"
,ms [TRAlias (TRSimple "t") "u"])
,("select a from t u"
,ms [TRAlias (TRSimple "t") "u"])
,("select a from (t cross join u) as u"
,ms [TRAlias (TRParens $ TRJoin (TRSimple "t") JoinCross
(TRSimple "u") Nothing) "u"])
]
where
ms f = makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = f}
Here are all the other query expression tests updated with the new QueryExpr type.
singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
[("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
[("select a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
,("select a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)]})
,("select 1+2,3+4"
,makeSelect {qeSelectList =
[(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
]
selectListTests :: [(String,QueryExpr)]
selectListTests =
[("select a as a, b as b"
,makeSelect {qeSelectList = [(Iden "a", Just "a")
,(Iden "b", Just "b")]})
,("select a a, b b"
,makeSelect {qeSelectList = [(Iden "a", Just "a")
,(Iden "b", Just "b")]})
] ++ multipleSelectItemsTests
++ singleSelectItemTests
fromTests :: [(String,QueryExpr)]
fromTests =
[("select a from t"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]})]
whereTests :: [(String,QueryExpr)]
whereTests =
[("select a from t where a = 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]
,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
]
groupByTests :: [(String,QueryExpr)]
groupByTests =
[("select a,sum(b) from t group by a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a"]
})
,("select a,b,sum(c) from t group by a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)
,(App "sum" [Iden "c"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a",Iden "b"]
})
]
havingTests :: [(String,QueryExpr)]
havingTests =
[("select a,sum(b) from t group by a having sum(b) > 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a"]
,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
})
]
orderByTests :: [(String,QueryExpr)]
orderByTests =
[("select a from t order by a"
,ms [Iden "a"])
,("select a from t order by a, b"
,ms [Iden "a", Iden "b"])
]
where
ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]
,qeOrderBy = o}
*Main> H.runTestTT $ H.TestList $ map (makeTest (queryExpr1 from8)) (selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests ++ queryExprJoinTests)
### Failure in: 7:select a from t where a = 5
(line 1, column 25):
unexpected "="
expecting "--" or "/*"
### Failure in: 8:select a,sum(b) from t group by a
(line 1, column 33):
unexpected "a"
expecting "--" or "/*"
### Failure in: 9:select a,b,sum(c) from t group by a,b
(line 1, column 35):
unexpected "a"
expecting "--" or "/*"
### Failure in: 10:select a,sum(b) from t group by a having sum(b) > 5
(line 1, column 33):
unexpected "a"
expecting "--" or "/*"
### Failure in: 11:select a from t order by a
(line 1, column 26):
unexpected "a"
expecting "--" or "/*"
### Failure in: 12:select a from t order by a, b
(line 1, column 26):
unexpected "a"
expecting "--" or "/*"
Cases: 26 Tried: 26 Errors: 0 Failures: 6
Counts {cases = 26, tried = 26, errors = 0, failures = 6}
The problem is the table alias parser is trying to parse keywords
again. Here is the from
parser with the alias name blacklist
expanded.
from :: Parser [TableRef]
from = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = (do
nat <- option False (True <$ keyword_ "natural")
TRJoin t0 <$> joinType
<*> nonJoinTref
<*> optionMaybe (joinCondition nat))
>>= suffixWrapper joinTrefSuffix
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens queryExpr)
,TRParens <$> parens tref]
>>= suffixWrapper (try . alias)
alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
aliasIdentifier = blackListIdentifier
[-- join keywords
"natural"
,"inner"
,"outer"
,"cross"
,"left"
,"right"
,"full"
,"join"
,"on"
,"using"
-- subsequent clause keywords
,"where"
,"group"
,"having"
,"order"
]
Here is the final query expression parser:
queryExpr :: Parser QueryExpr
queryExpr = Select
<$> selectList
<*> option [] from
<*> optionMaybe whereClause
<*> option [] groupByClause
<*> optionMaybe having
<*> option [] orderBy
14. Simple SQL query parser
Here is the complete syntax, parser and tests for the value and query expressions so far as a self contained module
module SimpleSQLQueryParser0 where
import Text.Parsec.String (Parser)
import Text.Parsec.String.Parsec (try)
import Text.Parsec.String.Char
import Text.Parsec.String.Combinator
import Text.Parsec (parse,ParseError)
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>), many)
import qualified Text.Parsec.String.Expr as E
import Control.Monad
--import Data.List (intercalate)
import Data.Maybe ()
import qualified Test.HUnit as H
import FunctionsAndTypesForParsing
import Debug.Trace
14.1. Supported SQL
14.1.1. comments
-- single line comment
/*
multiline
comment
*/
The /* */
comments do not nest.
14.1.2. value expressions
literals
postive integral literals and string literals with single quote, without escaping of single quote within string (so there is no way to create a string literal with a single quote in it).
1
500
'string literal'
identifiers
Unquoted identifiers only, an identifier may start with a letter or underscore, and contain letters, underscores and digits.
a
something
_test_
a123
dotted identifiers
Supports two part dotted identifiers only. Both parts must parse according to the rules for regular identifiers.
t.a
something.something_else
star
Star, plus dotted star using the identifier rules for the first part.
*
t.*
function application
The function name must parse as a valid identifier.
f()
g(1)
h(2,'something')
operators
Here is the range of operators supported.
a = b
a > b
a < b
a >= b
a <= b
a != b
a <> b
a and b
a or b
1 + 2
1 - 2
1 * 2
1 / 2
'some' || 'thing'
a like b
not a
case expression
case a
when 3 then 'got three'
when 5 then 'got five'
else 'neither'
end
case
when a = 3 then 'a is three'
when b = 4 then 'b is four'
else 'neither'
end
The else branch is optional in both cases.
parentheses
(1 + 2) * 3
Parentheses are explicit in the abstract syntax.
14.1.3. query expressions
TODO: examples
select queries only, no union, intersect or except. No common table expressions.
select list aliases, with the 'as' optional in the alias
'select * from t' 'select t.* from t' but not the alias version 'select * as (a,b,c) from t'.
from clause
implicit and explicit joins, including keywords natural, inner, outer, left, right, full, cross, on and using, plus parens and simple aliases (e.g. select a from t u, but not select a from t(a,b)).
where
group by lists but not the new group by options in SQL2003 (group by (), grouping sets, cube, rollup).
having
order by, with multiple columns, but not explicit asc or desc , and no 'nulls first' or 'nulls last' syntax.
No support for offset and fetch first, or variations.
14.2. Abstract syntax
data ValueExpr = StringLit String
| NumLit Integer
| Iden String
| DIden String String -- a.b
| Star
| DStar String -- t.*
| App String [ValueExpr]
| PrefOp String ValueExpr
| BinOp ValueExpr String ValueExpr
| Case (Maybe ValueExpr) -- test value
[(ValueExpr,ValueExpr)] -- when branches
(Maybe ValueExpr) -- else value
| Parens ValueExpr
deriving (Eq,Show)
data QueryExpr
= Select
{qeSelectList :: [(ValueExpr,Maybe String)]
,qeFrom :: [TableRef]
,qeWhere :: Maybe ValueExpr
,qeGroupBy :: [ValueExpr]
,qeHaving :: Maybe ValueExpr
,qeOrderBy :: [ValueExpr]
} deriving (Eq,Show)
makeSelect :: QueryExpr
makeSelect = Select {qeSelectList = []
,qeFrom = []
,qeWhere = Nothing
,qeGroupBy = []
,qeHaving = Nothing
,qeOrderBy = []}
data TableRef = TRSimple String
| TRJoin TableRef JoinType TableRef (Maybe JoinCondition)
| TRParens TableRef
| TRAlias TableRef String
| TRQueryExpr QueryExpr
deriving (Eq,Show)
data JoinType = JoinInner | JoinLeft | JoinRight | JoinFull | JoinCross
deriving (Eq,Show)
data JoinCondition = JoinOn ValueExpr
| JoinUsing [String]
| JoinNatural
deriving (Eq,Show)
14.3. Value expression parsing
14.3.1. term components
num :: Parser ValueExpr
num = NumLit <$> integer
iden :: [String] -> Parser ValueExpr
iden blacklist = Iden <$> identifierBlacklist blacklist
parensValue :: Parser ValueExpr
parensValue = Parens <$> parens (valueExpr [])
stringLit :: Parser ValueExpr
stringLit = StringLit <$> stringToken
dIden :: Parser ValueExpr
dIden = DIden <$> identifier <*> (dot *> identifier)
star :: Parser ValueExpr
star = Star <$ symbol "*"
dstar :: Parser ValueExpr
dstar = DStar <$> (identifier <* dot <* symbol "*")
app :: Parser ValueExpr
app = App <$> identifier <*> parens (commaSep $ valueExpr [])
14.3.2. case
caseValue :: Parser ValueExpr
caseValue =
Case
<$> (keyword "case" *> optionMaybe caseVal)
<*> many1 whenClause
<*> optionMaybe elseClause
<* keyword "end"
where
whenClause = (,) <$> (keyword "when" *> caseVal)
<*> (keyword "then" *> caseVal)
elseClause = keyword "else" *> caseVal
caseVal = valueExpr blackList
blackList = ["case", "when", "then", "else", "end"]
14.3.3. term
term :: [String] -> Parser ValueExpr
term blackList = choice [caseValue
,try app
,try dstar
,try dIden
,iden blackList
,num
,parensValue
,stringLit
,star]
14.3.4. operators
table :: [[E.Operator ValueExpr]]
table = [[prefix "-", prefix "+"]
,[binary "^" E.AssocLeft]
,[binary "*" E.AssocLeft
,binary "/" E.AssocLeft
,binary "%" E.AssocLeft]
,[binary "+" E.AssocLeft
,binary "-" E.AssocLeft]
,[binary "<=" E.AssocRight
,binary ">=" E.AssocRight
,binaryK "like" E.AssocNone
,binary "!=" E.AssocRight
,binary "<>" E.AssocRight
,binary "||" E.AssocRight]
,[binary "<" E.AssocNone
,binary ">" E.AssocNone]
,[binary "=" E.AssocRight]
,[prefixK "not"]
,[binaryK "and" E.AssocLeft]
,[binaryK "or" E.AssocLeft]]
where
binary name assoc =
E.Infix (mkBinOp name <$ symbol name) assoc
mkBinOp nm a b = BinOp a nm b
prefix name = E.Prefix (PrefOp name <$ symbol name)
binaryK name assoc =
E.Infix (mkBinOp name <$ keyword name) assoc
prefixK name = E.Prefix (PrefOp name <$ keyword name)
14.3.5. valueExpr
valueExpr :: [String] -> Parser ValueExpr
valueExpr blackList = E.buildExpressionParser table (term blackList)
14.4. Query expression parsing
selectList :: Parser [(ValueExpr, Maybe String)]
selectList = keyword_ "select" *> commaSep1 selectItem
selectItem :: Parser (ValueExpr, Maybe String)
selectItem = (,) <$> valueExpr [] <*> optionMaybe (try alias)
where alias = optional (keyword_ "as") *> identifierBlacklist ["from"]
whereClause :: Parser ValueExpr
whereClause = keyword_ "where" *> valueExpr []
groupByClause :: Parser [ValueExpr]
groupByClause = keyword_ "group" *> keyword_ "by"
*> commaSep1 (valueExpr [])
having :: Parser ValueExpr
having = keyword_ "having" *> (valueExpr [])
orderBy :: Parser [ValueExpr]
orderBy = keyword_ "order" *> keyword_ "by"
*> commaSep1 (valueExpr [])
14.4.1. from clause
from :: Parser [TableRef]
from = keyword_ "from" >> commaSep1 tref
where
tref = nonJoinTref >>= suffixWrapper joinTrefSuffix
joinTrefSuffix t0 = (do
nat <- option False (True <$ keyword_ "natural")
TRJoin t0 <$> joinType
<*> nonJoinTref
<*> optionMaybe (joinCondition nat))
>>= suffixWrapper joinTrefSuffix
nonJoinTref = choice [TRSimple <$> identifier
,try (TRQueryExpr <$> parens queryExpr)
,TRParens <$> parens tref]
>>= suffixWrapper (try . alias)
alias tr = try (TRAlias tr <$> (optional (keyword_ "as") *> aliasIdentifier))
aliasIdentifier = identifierBlacklist
[-- join keywords
"natural"
,"inner"
,"outer"
,"cross"
,"left"
,"right"
,"full"
,"join"
,"on"
,"using"
-- subsequent clause keywords
,"where"
,"group"
,"having"
,"order"
]
joinType :: Parser JoinType
joinType = choice
[JoinCross <$ keyword_ "cross" <* keyword_ "join"
,JoinInner <$ keyword_ "inner" <* keyword_ "join"
,JoinLeft <$ keyword_ "left"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinRight <$ keyword_ "right"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinFull <$ keyword_ "full"
<* optional (keyword_ "outer")
<* keyword_ "join"
,JoinInner <$ keyword_ "join"]
joinCondition :: Bool -> Parser JoinCondition
joinCondition nat =
choice [guard nat >> return JoinNatural
,keyword_ "on" >> JoinOn <$> valueExpr []
,keyword_ "using" >> JoinUsing <$> parens (commaSep1 identifier)
]
14.4.2. queryExpr
queryExpr :: Parser QueryExpr
queryExpr = Select
<$> selectList
<*> option [] from
<*> optionMaybe whereClause
<*> option [] groupByClause
<*> optionMaybe having
<*> option [] orderBy
14.5. tokens
whitespace :: Parser ()
whitespace =
choice [simpleWhitespace *> whitespace
,lineComment *> whitespace
,blockComment *> whitespace
,return ()]
where
lineComment = try (string "--")
*> manyTill anyChar (void (char '\n') <|> eof)
blockComment = try (string "/*")
*> manyTill anyChar (try $ string "*/")
simpleWhitespace = void $ many1 (oneOf " \t\n")
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace
integer :: Parser Integer
integer = read <$> lexeme (many1 digit)
identifier :: Parser String
identifier = lexeme ((:) <$> firstChar <*> many nonFirstChar)
where
firstChar = letter <|> char '_'
nonFirstChar = digit <|> firstChar
symbol :: String -> Parser String
symbol s = try $ lexeme $ do
u <- many1 (oneOf "<>=+-^%/*!|")
guard (s == u)
return s
openParen :: Parser Char
openParen = lexeme $ char '('
closeParen :: Parser Char
closeParen = lexeme $ char ')'
stringToken :: Parser String
stringToken = lexeme (char '\'' *> manyTill anyChar (char '\''))
dot :: Parser Char
dot = lexeme $ char '.'
comma :: Parser Char
comma = lexeme $ char ','
14.6. helper functions
keyword :: String -> Parser String
keyword k = try $ do
i <- identifier
guard (i == k)
return k
parens :: Parser a -> Parser a
parens = between openParen closeParen
commaSep :: Parser a -> Parser [a]
commaSep = (`sepBy` comma)
keyword_ :: String -> Parser ()
keyword_ = void . keyword
symbol_ :: String -> Parser ()
symbol_ = void . symbol
commaSep1 :: Parser a -> Parser [a]
commaSep1 = (`sepBy1` comma)
identifierBlacklist :: [String] -> Parser String
identifierBlacklist bl = do
i <- identifier
guard (i `notElem` bl)
return i
suffixWrapper :: (a -> Parser a) -> a -> Parser a
suffixWrapper p a = p a <|> return a
14.7. the parser api
parseQueryExpr :: String -> Either ParseError QueryExpr
parseQueryExpr = parse (whitespace *> queryExpr <* eof) ""
parseValueExpr :: String -> Either ParseError ValueExpr
parseValueExpr = parse (whitespace *> valueExpr [] <* eof) ""
14.8. tests
data TestItem = Group String [TestItem]
| ValueExpressionTest String ValueExpr
| QueryExpressionTest String QueryExpr
numLitTests :: [(String,ValueExpr)]
numLitTests =
[("1", NumLit 1)
,("54321", NumLit 54321)]
idenTests :: [(String,ValueExpr)]
idenTests =
[("test", Iden "test")
,("_something3", Iden "_something3")]
operatorTests :: [(String,ValueExpr)]
operatorTests =
map (\o -> (o ++ " a", PrefOp o (Iden "a"))) ["not", "+", "-"]
++ map (\o -> ("a " ++ o ++ " b", BinOp (Iden "a") o (Iden "b")))
["=",">","<", ">=", "<=", "!=", "<>"
,"and", "or", "+", "-", "*", "/", "||", "like"]
parensTests :: [(String,ValueExpr)]
parensTests = [("(1)", Parens (NumLit 1))]
basicTests :: [(String,ValueExpr)]
basicTests = numLitTests ++ idenTests ++ operatorTests ++ parensTests
stringLiteralTests :: [(String,ValueExpr)]
stringLiteralTests =
[("''", StringLit "")
,("'test'", StringLit "test")]
dIdenTests :: [(String,ValueExpr)]
dIdenTests =
[("t.a", DIden "t" "a")]
starTests :: [(String,ValueExpr)]
starTests = [("*", Star)]
dStarTests :: [(String,ValueExpr)]
dStarTests = [("t.*", DStar "t")]
appTests :: [(String,ValueExpr)]
appTests = [("f()", App "f" [])
,("f(1)", App "f" [NumLit 1])
,("f(1,a)", App "f" [NumLit 1, Iden "a"])]
caseTests :: [(String,ValueExpr)]
caseTests =
[("case a when 1 then 2 end"
,Case (Just $ Iden "a") [(NumLit 1,NumLit 2)] Nothing)
,("case a when 1 then 2 when 3 then 4 end"
,Case (Just $ Iden "a")
[(NumLit 1, NumLit 2)
,(NumLit 3, NumLit 4)]
Nothing)
,("case a when 1 then 2 when 3 then 4 else 5 end"
,Case (Just $ Iden "a")
[(NumLit 1, NumLit 2)
,(NumLit 3, NumLit 4)]
(Just $ NumLit 5))
,("case when a=1 then 2 when a=3 then 4 else 5 end"
,Case Nothing
[(BinOp (Iden "a") "=" (NumLit 1), NumLit 2)
,(BinOp (Iden "a") "=" (NumLit 3), NumLit 4)]
(Just $ NumLit 5))
]
allValueExprTests :: [(String,ValueExpr)]
allValueExprTests = concat [basicTests
,stringLiteralTests
,dIdenTests
,starTests
,dStarTests
,appTests
,caseTests]
singleSelectItemTests :: [(String,QueryExpr)]
singleSelectItemTests =
[("select 1", makeSelect {qeSelectList = [(NumLit 1,Nothing)]})]
multipleSelectItemsTests :: [(String,QueryExpr)]
multipleSelectItemsTests =
[("select a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]})
,("select a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)]})
,("select 1+2,3+4"
,makeSelect {qeSelectList =
[(BinOp (NumLit 1) "+" (NumLit 2),Nothing)
,(BinOp (NumLit 3) "+" (NumLit 4),Nothing)]})
]
selectListTests :: [(String,QueryExpr)]
selectListTests =
[("select a as a, b as b"
,makeSelect {qeSelectList = [(Iden "a", Just "a")
,(Iden "b", Just "b")]})
,("select a a, b b"
,makeSelect {qeSelectList = [(Iden "a", Just "a")
,(Iden "b", Just "b")]})
] ++ multipleSelectItemsTests
++ singleSelectItemTests
fromTests :: [(String,QueryExpr)]
fromTests =
[("select a from t"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]})]
whereTests :: [(String,QueryExpr)]
whereTests =
[("select a from t where a = 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]
,qeWhere = Just $ BinOp (Iden "a") "=" (NumLit 5)})
]
groupByTests :: [(String,QueryExpr)]
groupByTests =
[("select a,sum(b) from t group by a"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a"]
})
,("select a,b,sum(c) from t group by a,b"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(Iden "b",Nothing)
,(App "sum" [Iden "c"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a",Iden "b"]
})
]
havingTests :: [(String,QueryExpr)]
havingTests =
[("select a,sum(b) from t group by a having sum(b) > 5"
,makeSelect {qeSelectList = [(Iden "a",Nothing)
,(App "sum" [Iden "b"],Nothing)]
,qeFrom = [TRSimple "t"]
,qeGroupBy = [Iden "a"]
,qeHaving = Just $ BinOp (App "sum" [Iden "b"]) ">" (NumLit 5)
})
]
orderByTests :: [(String,QueryExpr)]
orderByTests =
[("select a from t order by a"
,ms [Iden "a"])
,("select a from t order by a, b"
,ms [Iden "a", Iden "b"])
]
where
ms o = makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = [TRSimple "t"]
,qeOrderBy = o}
queryExprJoinTests :: [(String,QueryExpr)]
queryExprJoinTests =
[("select a from t"
,ms [TRSimple "t"])
,("select a from t,u"
,ms [TRSimple "t", TRSimple "u"])
,("select a from t inner join u on expr"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t left join u on expr"
,ms [TRJoin (TRSimple "t") JoinLeft (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t right join u on expr"
,ms [TRJoin (TRSimple "t") JoinRight (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t full join u on expr"
,ms [TRJoin (TRSimple "t") JoinFull (TRSimple "u")
(Just $ JoinOn $ Iden "expr")])
,("select a from t cross join u"
,ms [TRJoin (TRSimple "t") JoinCross (TRSimple "u") Nothing])
,("select a from t natural inner join u"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just JoinNatural)])
,("select a from t inner join u using(a,b)"
,ms [TRJoin (TRSimple "t") JoinInner (TRSimple "u")
(Just $ JoinUsing ["a", "b"])])
,("select a from (select a from t)"
,ms [TRQueryExpr $ ms [TRSimple "t"]])
,("select a from t as u"
,ms [TRAlias (TRSimple "t") "u"])
,("select a from t u"
,ms [TRAlias (TRSimple "t") "u"])
,("select a from (t cross join u) as u"
,ms [TRAlias (TRParens $ TRJoin (TRSimple "t") JoinCross
(TRSimple "u") Nothing) "u"])
]
where
ms f = makeSelect {qeSelectList = [(Iden "a",Nothing)]
,qeFrom = f}
allQueryExprTests :: [(String,QueryExpr)]
allQueryExprTests = concat [selectListTests ++ fromTests ++ whereTests ++ groupByTests ++ havingTests ++ orderByTests ++ queryExprJoinTests]
todo: use external api parsing code
makeTest :: (Eq a, Show a) => Parser a -> (String,a) -> H.Test
makeTest parser (src,expected) = H.TestLabel src $ H.TestCase $ do
let gote = parse (whitespace *> parser <* eof) "" src
case gote of
Left e -> H.assertFailure $ show e
Right got -> H.assertEqual src expected got
15. Pretty printing
Here is a pretty printer for the parser version 0.
It uses Text.PrettyPrint module. I think this module is a bit out of date these days, it would be nice to update this code to a more fashionable pretty printing library. I haven’t put much comments/explanation here.
module PrettyPrinting0 where
import Prelude hiding ((<>))
import SimpleSQLQueryParser0 (ValueExpr(..), QueryExpr(..), TableRef(..)
,JoinType(..), JoinCondition(..))
import qualified SimpleSQLQueryParser0 as S
import Text.PrettyPrint (render, vcat, text, (<>), (<+>), empty, parens,
nest, Doc, punctuate, comma, sep, quotes,
doubleQuotes, hsep)
import Data.Maybe (maybeToList, catMaybes)
import FunctionsAndTypesForParsing
--import Text.Parsec.String.Combinator (eof)
import qualified Test.HUnit as H
import Text.Parsec (parse, ParseError)
import Control.Applicative ((<$>),(<*), (*>),(<*>), (<$), (<|>), many)
--import Text.Parsec.String (Parser)
The basic concept in this pretty printer is we convert the ast into
Doc
values, then convert these into a string. We have a bunch of
functions to convert String
to `Doc`s, and combine these docs with
different layouts.
By using a pretty printer library, we can get human readable source very easily compared with trying to convert directly to strings ourselves.
15.1. api
prettyQueryExpr :: QueryExpr -> String
prettyQueryExpr = render . queryExpr
15.2. value expressions
valueExpr :: ValueExpr -> Doc
valueExpr (StringLit s) = quotes $ text s
valueExpr (NumLit i) = text $ show i
valueExpr (Iden s) = text s
valueExpr (DIden q i) = text q <> text "." <> text i
valueExpr Star = text "*"
valueExpr (DStar q) = text q <> text "." <> text "*"
valueExpr (App f es) = text f <> parens (commaSep $ map valueExpr es)
valueExpr (PrefOp op e) = sep[text op, valueExpr e]
valueExpr (BinOp e0 op e1) = sep [valueExpr e0, text op, valueExpr e1]
valueExpr (Case test whens els) =
sep [text "case" <+> maybe empty valueExpr test
,nest 5 $ sep (map wh whens
++ [maybe empty
(\e -> text "else" <+> valueExpr e)
els])
,text "end"]
where wh (w,t) = sep [text "when" <+> valueExpr w
,text "then" <+> valueExpr t]
valueExpr (Parens e) = parens $ valueExpr e
15.3. query expressions
queryExpr :: QueryExpr -> Doc
queryExpr (Select sl fr wh gb hv ob) = sep
[text "select" <+> nest 7 (commaSep $ map selectItem sl)
-- from
,ml fr $ \f -> text "from" <+> nest 7 (commaSep $ map tref f)
-- where
,me wh $ \w -> text "where" <+> nest 6 (valueExpr w)
-- group by
,ml gb $ \g -> text "group by" <+> nest 9 (commaSep $ map valueExpr g)
-- having
,me hv $ \h -> text "having" <+> nest 6 (valueExpr h)
-- order by
,ml ob $ \o -> text "order by" <+> nest 9 (commaSep $ map valueExpr o)
]
where
selectItem (e,a) = valueExpr e <+> me a (\a' -> text "as" <+> text a')
tref (TRSimple t) = text t
tref (TRParens t) = parens $ tref t
tref (TRAlias t a) = tref t <+> text "as" <+> text a
tref (TRQueryExpr q) = parens $ queryExpr q
tref (TRJoin t0 jt t1 jc) = sep
[tref t0
,joinName jt jc <+> tref t1
,case jc of
Just (JoinOn e) -> text "on" <+> valueExpr e
Just (JoinUsing is) -> text "using" <+> parens (commaSep $ map text is)
Just JoinNatural -> empty
Nothing -> empty]
joinName jt jc =
hsep [case jc of
Just JoinNatural -> text "natural"
_ -> empty
,case jt of
JoinInner -> text "inner join"
JoinCross -> text "cross join"
JoinLeft -> text "left join"
JoinRight -> text "right join"
JoinFull -> text "full join"]
me e r = maybe empty r e
ml [] _ = empty
ml l r = r l
15.4. helpers
commaSep :: [Doc] -> Doc
commaSep = sep . punctuate comma
Have a look at the haddock for this module and see if you can work out how the code above works.
*PrettyPrinting0> either (error . show) valueExpr (parseWithEof (S.valueExpr []) "a and b")
a and b
*PrettyPrinting0> either (error . show) queryExpr (parseWithEof S.queryExpr "select a from t inner join u using(a,b)")
select a from t inner join u using (a, b)
15.5. tests
Now we can do some tests: we take the previous test data, and for each test add an additional test which pretty prints then parses the results to see that it is unchanged.
makeTest :: (Eq a, Show a) =>
(String -> Either ParseError a)
-> (a -> String)
-> (String,a)
-> H.Test
makeTest parser pretty (src,expected) = H.TestLabel src $ H.TestCase $ do
let gote = parser src
case gote of
Left e -> H.assertFailure $ show e
Right got -> do
H.assertEqual src expected got
let prsql = pretty got
gotpretty = parser prsql
case gotpretty of
Left e -> H.assertFailure $ "pretty: " ++ prsql ++ "\n" ++ show e
Right gotp -> H.assertEqual ("pretty: " ++ prsql) expected gotp
TODO: fix parsing issue
*PrettyPrinting0> H.runTestTT $ H.TestList $ map (makeTest S.parseQueryExpr (render . queryExpr)) S.allQueryExprTests
Cases: 26 Tried: 26 Errors: 0 Failures: 0
Counts {cases = 26, tried = 26, errors = 0, failures = 0}
*PrettyPrinting0> H.runTestTT $ H.TestList $ map (makeTest S.parseValueExpr (render . valueExpr)) S.allValueExprTests
### Failure in: 34:case when a=1 then 2 when a=3 then 4 else 5 end
(line 1, column 11):
unexpected "a"
expecting "--" or "/*"
Cases: 35 Tried: 35 Errors: 0 Failures: 1
Counts {cases = 35, tried = 35, errors = 0, failures = 1}
16. Error messages
In this tutorial, we will start looking at the error messages generated by Parsec and how we can influence to improve them.
TODO:
review the different methods of generating failure in parsec and what they look like, including choice.
examine how rearranging parsers can change the error messages, committing to a parsing branch to improve the messages
also possibly look at adding additional parsing to detect common errors to provide a nicer error messages
maybe talk about the idea of being permissive in the parsing proper stage, then doing additional syntax-type checks on the ast after
16.1. ways of generating errors in parsec
16.2. how parsec combines errors or drops them into the void
16.3. Parsing tpch
Let’s try the parser out on the TPC-H queries.
Summary of errors so far: q1: typed literal: type_name 'literal value' q2: scalar subquery q3: typed literal q4: typed literal q5: typed literal q6: typed literal q7: ?? q8: extract?? q9: extract ?? q10: typed literal q11: scalar sub query q12: ?? q13: not like?? q14: decimal literal q15: cte q16: count distinct q17: decimal literal q18: in subquery q19: in literal list q20: in subquery q21: exists subquery q22: substring
TODO: try to figure out each issue. Not very easy: the error messages are not very good.
Do some ad hoc stuff to try to improve each error message? Keep simplifying expressions then try to understand specific principles. Then make some generalizations that can be made across the parser
16.4. value expressions in isolation
We can start with the real error code now by considering the valueexpression parser in isolation.
ideas: error inside function parameter list bad identifier unrecognised operator unrecognised chars unmatching parens keyword errors in case unterminated string lit 3 dot components binary op with missing second arg associativity errors with binary ops? blacklist errors
16.5. query expressions
mispelling keywords at this level clauses in wrong order adding extra tokens or removing them at clause/value expression boundaries