Write You a Haskell 2

A continuation of Stephen Diehl's Write You a Haskell



Lexing

// TODO: context-sensitive lexing for MagicHash // lex will need to take compiler flags.

Just like with our inference system in Poly, there are two systems playing together during parsing. The first system takes the input string and transforms it into a list of meaningful units called Tokens. Tokens are things like TokenLParen and TokenId String. This system is called the lexer. The other system is the parser, which consumes tokens to construct a parse tree. In many cases, the parse tree and the abstract syntax tree of a language are not completely identical. This will be the case for us.
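
For instance, a lexer in this style might turn the input f (x) into a token stream like the one below (the exact constructor names here are only illustrative):

f (x)   ==>   [TokenId "f", TokenLParen, TokenId "x", TokenRParen]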

Rather than continuing to do lexing with Parsec, we’ll use a real lexer generator called Alex. Then we’ll use the token stream from Alex as the input to a Parsec parser. Separating these concerns will simplify using Parsec later.

My main concern with Alex is that it doesn’t support Data.Text. I could wrap it myself but that seems to be a fairly involved effort. So instead we’ll let Alex work on an input String and then output Data.Text.Lazy.Text everywhere. I suspect that this will become a performance bottleneck - feel free to try supporting Data.Text with Alex!

Credit for this lexing strategy (and much of the lexer source) belongs to Simon Marlow, who also maintains Alex (and GHC!). The lexer is based on the Haskell ‘98 lexer presented in the /examples folder of the Alex source, plus several fixes.

Alex

Alex is a powerful lexer generator. With Alex, if we can describe the patterns that various tokens fit into, then Alex can split our text up into those tokens. This task is harder than it seems at first glance. Consider this Haskell example.

main = let letresult = 5 in print letresult

When we encounter the string "let", we should identify it as a reserved word. But when we encounter "letresult", which has the same prefix, we need to identify it as a variable identifier (VarId). There are much more difficult cases to handle correctly; Alex handles this complexity for us.

To use Alex, we’ll need to provide a specification file, Lexer.x, in our source tree. Alex spec files contain a mix of Alex-specific syntax and Haskell source code. Haskell source code is contained inside curly braces. We’ll start with a module declaration.

-- Lexer.x
{
module Compiler.Parser.Lexer (Token(..), Lexeme, lex) where

import Prelude hiding (lex)
import Data.Char (isAlphaNum, chr, ord)
import Data.Text.Lazy (Text)
import qualified Data.Text.Lazy as T

import Utils.Outputable
}

Alex will copy this code directly into Lexer.hs, and the imports will be available to us in Haskell code blocks later in the source file.

Next we need to tell Alex what mode to run in. Alex has a basic mode, which will eventually produce a function alexScanTokens :: String -> [Token]. However, it will be more powerful (and easier to extend in the future, should we need to) to use the monad mode. This makes Alex run inside the Alex monad, which is essentially a custom StateT AlexState (Except String). This way, if Alex runs into an error, it won't crash the program, which is very important for our interpreter.

%wrapper "monad"
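
Concretely, the monad wrapper generates definitions roughly like the following (simplified, shown here only for intuition), so a lexing failure comes back as a Left value rather than a crash:

newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) }

runAlex :: String -> Alex a -> Either String a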

Now we can start specifying macros. Macros can be either character sets or regular expressions. Our rules will be represented by regular expressions, and we can expand macros into those. A character set lets us assign a name to a set of characters. Character sets are prefixed with $.

$whitechar = [\t\n\r\v\f\ ]
$special   = [\(\)\,\;\[\]\{\}]

The brackets denote a union, so the character set [A-Z a-z] is the set of characters that are in either A-Z or a-z. The character sets that we're interested in can be copied almost directly out of the Haskell Report, so I won't reproduce all of that boilerplate here.
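
For reference, here is a simplified, ASCII-only sketch of the character sets used later in this chapter (the real lexer follows the Report more closely, including the Unicode classes):

$digit   = 0-9
$octit   = 0-7
$hexit   = [0-9 A-F a-f]
$large   = [A-Z]
$small   = [a-z \_]
$symbol  = [\!\#\$\%\&\*\+\.\/\<\=\>\?\@\\\^\|\-\~]
$graphic = [$small $large $symbol $digit $special \:\"\']
$idchar  = [$small $large $digit \']
$symchar = [$symbol \:]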

Next are the regexes. A regex is prefixed with @. Alex’s regular expressions aren’t exactly regular expressions, but they’re really close. Take a look at the user guide for specifics.

@reservedid = case|class|data|default|...

@reservedop = ".." | ":" | "::" | "=" |...

Notice that we reserve the default keyword. Even though we might not implement it, it’s a good idea to reserve all keywords that we might want to implement in the future. This way we can throw an error rather than accepting a program that is not future-proof.

At the lexer level, we shouldn't try to separate things by type. Trying to lex an expression of type Bool differently than an expression of type Int would effectively require implementing a type inference engine in our lexer. The same goes for our parser. But at the lexing level we can separate some things into namespaces, as described in the Haskell Report: variable names must start with a lowercase letter; module, type, data constructor, and typeclass names must start with an uppercase letter; variable symbols can't start with :; and constructor symbols must start with :. Our lexer can separate these cases for us, so we'll create distinct patterns for them.

@varid  = $small $idchar*
@conid  = $large $idchar*
@varsym = $symbol $symchar*
@consym = \: $symchar*

We’ll also accept numbers according to the Haskell Report spec, which means we need to support octal, hexadecimal, and exponent notation (for floats).

@decimal     = $digit+
@octal       = $octit+
@hexadecimal = $hexit+
@exponent    = [eE] [\-\+]? @decimal

We also need to provide a regex for @string, which is a bit involved. Our @string pattern should only match the legal characters of a string, but not the string itself. That is, it shouldn’t match the start or end quotes. We’ll put the quotes into the actual lexing rule.
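
One way to write it, following the Haskell '98 example lexer that ships with Alex (this assumes @escape and @gap macros for escape sequences and string gaps, also taken from the Report):

@string = $graphic # [\"\\] | " " | @escape | @gap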

Now that we have our patterns, we can provide lexing rules. Alex looks for the start of rules by looking for the pattern :-. It’s common to include the name of the lexing scheme as well, so we put Haskell :-.

A rule in Alex starts with a start code, enclosed in angle brackets. This lets you tell Alex to only match certain rules in certain circumstances. We’re not concerned with this, so we’ll always use 0.

To start, let’s tell Alex to ignore whitespace (we’ll parse it later using the source positions) and line comments.

<0> $white+ { skip }
<0> "--"\-*[^$symchar].* { skip }

The Haskell source in braces here is a continuation. More on this in a little bit.

ProtoHaskell allows nested block comments, which makes it easier to comment out a region of code that itself contains a block comment. This is rather complicated compared to the rest of the lexing logic, so we'll write it ourselves as a continuation for Alex. We only need to tell Alex to trigger on {-; the rule's continuation will handle skipping the comment.

"{-" { nested_comment }

The continuations in braces are functions that Alex will call once it has a match. skip is provided by Alex and simply ignores the match. We can use our own functions as continuations, as seen above; we need to write nested_comment. The type of the continuation depends on which wrapper is being used. In our case, we need AlexInput -> Int -> Alex Token. Below our rules, we can specify arbitrary Haskell source (in braces), which Alex will copy into the generated lexer. The lex function that we exported from the module belongs here, for example.

There’s a catch with our tokenizing, though. When we parse, we’re going to need position information to support indentation sensitivity. So rather than just creating tokens, we’ll create what we’ll (slightly inaccurately) refer to as a Lexeme, which will store a position as well as the token.

Let’s define some data types that we need and a function to make our Lexemes.

{
-- L position token source
data Lexeme = L AlexPosn TokenType String deriving Eq

data TokenType
     = TokInteger
     | TokFloat
     | TokChar
     | TokString
     | TokSpecial
     | TokReservedId
     | TokReservedOp
     | TokVarId
     | TokQualVarId
     | TokConId
     | TokQualConId
     | TokVarSym
     | TokQualVarSym
     | TokConSym
     | TokQualConSym
     | TokEOF
     deriving (Eq, Show)

-- | Our continuation, for example, @mkL TokVarSym@
mkL :: TokenType -> AlexInput -> Int -> Alex Lexeme
mkL tok (p,_,_,str) len = return $ L p tok (take len str)
}

We’ll want to export Lexeme and TokenType too, for our parser.

Now that we have our continuation function, we can start specifying rules that match tokens. With the macros we’ve defined, this is pretty straightforward.

<0> $special                  { mkL TokSpecial }

<0> @reservedid               { mkL TokReservedId }
<0> (@conid \.)+ @varid       { mkL TokQualVarId }
<0> (@conid \.)+ @conid       { mkL TokQualConId }
<0> @varid                    { mkL TokVarId }
<0> @conid                    { mkL TokConId }

<0> @reservedop               { mkL TokReservedOp }
<0> (@conid \.)+ @varsym      { mkL TokQualVarSym }
<0> (@conid \.)+ @consym      { mkL TokQualConSym }
<0> @varsym                   { mkL TokVarSym }
<0> @consym                   { mkL TokConSym }

<0> @decimal
  | 0[oO] @octal
  | 0[xX] @hexadecimal        { mkL TokInteger }


<0> @decimal \. @decimal @exponent?
  | @decimal @exponent        { mkL TokFloat }

<0> \' ($graphic # [\'\\] | " " | @escape) \'
                              { mkL TokChar }

<0> \" @string* \"            { mkL TokString }

Handling nested comments involves a bit of special-case handling (for example, making sure a comment that ends with --} still closes properly). Note that in ASCII, 45 is -, 123 is {, and 125 is }.

alexMonadScan scans for one lexeme in the Alex monad. alexGetByte gets the next byte of the given AlexInput.

nested_comment :: AlexInput -> Int -> Alex Lexeme
nested_comment _ _ = do
    input <- alexGetInput
    go 1 input
  where go 0 input = alexSetInput input >> alexMonadScan
        go n input = do
            case alexGetByte input of
                Nothing -> err input
                Just (c, input) -> do
                    case chr (fromIntegral c) of
                        '-' -> let temp = input
                               in case alexGetByte input of
                                   Nothing -> err input
                                   Just (125, input) -> go (n-1) input
                                   Just (45,  input) -> go n temp
                                   Just (c, input)   -> go n input
                        '\123' -> case alexGetByte input of
                            Nothing -> err input
                            Just (c, input)
                              | c == fromIntegral (ord '-') -> go (n+1) input
                            Just (c, input) -> go n input
                        c -> go n input
        err input = alexSetInput input >> lexError "error in nested comment"

lexError :: String -> Alex a
lexError s = do
    (p,c,_,input) <- alexGetInput
    alexError $ showPosn p ++ ": " ++ s ++
                      (if not $ null input
                       then " before " ++ show (head input)
                       else " at end of file")

showPosn :: AlexPosn -> String
showPosn (AlexPn _ line col) = show line ++ ':' : show col

Alex also wants a function to call when it finds the end of input. We'll want a lexeme with a TokEOF and no source text. Any position we attach would be inaccurate, since this lexeme contains no source characters that could have a position. We will never inspect the position information for this lexeme, so we can safely leave it undefined.

alexEOF :: Alex Lexeme
alexEOF = return $ L undefined TokEOF ""

Finally, we provide lex.

lex :: String -> Either String [Lexeme]
lex str = runAlex str alexLex

alexLex :: Alex [Lexeme]
alexLex = do lexeme@(L _ tok _) <- alexMonadScan
             if tok == TokEOF
               then return [lexeme]
               else (lexeme:) <$> alexLex

Now we have a more powerful lexer that we can use for parsing.

Telling Alex About Filenames

This lexer is definitely powerful, but we're about to run into a problem. If we were committed to only compiling single-file programs, we would never need to know filenames. However, eventually we're going to support multiple-module compilation. This means we're going to need to know the filename being lexed in order to attach complete location information to it. Alex doesn't know this on its own, so we'll need to provide it. We can do this with a custom user state. To start, we need to change our wrapper from %wrapper "monad" to %wrapper "monadUserState". The generated code is now identical to before, except that it references two new identifiers, AlexUserState and alexInitUserState, which we'll need to define.

The only state we care about is the filename, so type AlexUserState = String will suffice. We'll initialize it to the empty string: alexInitUserState = "".
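
Spelled out in the Haskell block of Lexer.x, that is:

type AlexUserState = String

alexInitUserState :: AlexUserState
alexInitUserState = ""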

Recall that the Alex monad is effectively a StateT AlexState (Except String) monad. To properly initialize our filename, we’ll make lex take an extra filename parameter and then pass it in so we can read it later.

alexInitFilename :: String -> Alex ()
alexInitFilename fname = Alex $ \s -> Right (s { alex_ust = fname }, ())

-- Utility
alexGetFilename :: Alex String
alexGetFilename = Alex $ \s -> Right (s, alex_ust s)

lex :: String -> String -> Either String [Lexeme]
-- 'init' drops the trailing TokEOF lexeme produced by alexLex
lex fname input = runAlex input $ alexInitFilename fname >> init <$> alexLex

A Better Token Type

While we’re certainly better off with our Lexemes than we would be with pure tokens, they don’t quite have all the information we want. Perhaps most importantly, we only have the start location of each token, but when attaching these locations to the AST, we want to know end locations too.

We can also make an optimization. Once mkL knows which type of token we're making, we can sometimes dispatch on the source to get a more specific token. That is, we can change L pos TokSpecial "(" to the much easier-to-handle L pos TokLParen. To beef up our lexer like this, we'll need some new types and a stronger mkL function.

To start, our current TokenType constructor names should really contain the word Type, to disambiguate from the actual Token type that we’ll export to go with Lexeme. So let’s change the Tok prefix to TokType. Then we’ll need our complete type of tokens:

data Token
       -- ^ Special Characters
     = TokLParen
     | TokRParen
     | TokComma
     | TokSemicolon
     | TokLBracket
     | TokRBracket
     | TokLBrace
     | TokRBrace
     | TokBackquote

       -- ^ Literals
     | TokLitInteger Text
     | TokLitFloat   Text
     | TokLitChar    Text
     | TokLitString  Text

       -- ^ Reserved Words
     | TokCase   | TokClass   | TokData   | TokDefault  | TokDeriving
     | TokDo     | TokElse    | TokIf     | TokImport   | TokIn 
     | TokInfix  | TokInfixL  | TokInfixR | TokInstance | TokLet
     | TokModule | TokNewtype | TokOf     | TokThen     | TokType
     | TokWhere

       -- ^ Reserved Operators
     | TokTwoDots -- ".."
     | TokColon | TokDoubleColon | TokEqual  | TokLambda
     | TokBar   | TokLArrow      | TokRArrow | TokAt
     | TokTilde | TokPredArrow

       -- ^ Other
     | TokVarId      Text
     | TokQualVarId  Text Text
     | TokConId      Text
     | TokQualConId  Text Text
     | TokVarSym     Text
     | TokQualVarSym Text Text
     | TokConSym     Text
     | TokQualConSym Text Text
     | TokEOF
     deriving Eq

-- Show source
instance Show Token where
    ...

Tokens with Text fields store the literal source code which denotes the token. This means a TokLitString still contains the escape characters and gaps. We'll cheat a bit and use Parsec's TokenParser to parse our literals out of the Text fields later, since Parsec's parsers are already designed to parse numbers, chars, and strings according to Haskell rules.
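
As a sketch of what that later decoding step could look like, using Parsec's stock haskellDef language definition (the helper name here is hypothetical):

import Text.Parsec (parse, ParseError)
import Text.Parsec.Language (haskellDef)
import qualified Text.Parsec.Token as Tok
import qualified Data.Text.Lazy as T

-- Hypothetical helper: decode the source text of a TokLitString,
-- quotes, escapes, and gaps included, into the actual String value.
decodeStringLit :: T.Text -> Either ParseError String
decodeStringLit =
    parse (Tok.stringLiteral (Tok.makeTokenParser haskellDef)) "<literal>" . T.unpack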

We also want to store better position information. To this end, we’ll add a module, Compiler/BasicTypes/SrcLoc.hs. Here we’ll define SrcLoc and SrcSpan. Each of these is either Real or Unhelpful. A Real location is attached to anything that appears literally in the source. We’ll attach Unhelpful locations to code generated by the compiler.

data RealSrcLoc = SrcLoc !Text !Int !Int
data SrcLoc = RealSrcLoc !RealSrcLoc
            | UnhelpfulLoc !Text

data RealSrcSpan = SrcSpan { srcSpanFile :: !Text
                           , startLine, startCol, endLine, endCol :: !Int 
                           }
data SrcSpan = RealSrcSpan !RealSrcSpan
             | UnhelpfulSpan !Text

We want to separate Real and Unhelpful locations at the type level, because most of our functions for working with locations will be unsafe with Unhelpful locations. We also don’t need laziness with these types, so we make them entirely strict to avoid the overhead.

While we're at it, we can define data Located e = Located SrcSpan e for attaching locations to things. Our Lexeme type becomes type Lexeme = Located Token.
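
The mkL code below uses two smart constructors, mkSrcLoc and mkSrcSpan, from the SrcLoc module. Plausible definitions (the real ones may differ, for example in how they validate their arguments; T is a qualified import of Data.Text.Lazy):

mkSrcLoc :: String -> Int -> Int -> SrcLoc
mkSrcLoc file line col = RealSrcLoc (SrcLoc (T.pack file) line col)

mkSrcSpan :: SrcLoc -> SrcLoc -> SrcSpan
mkSrcSpan (UnhelpfulLoc msg) _ = UnhelpfulSpan msg
mkSrcSpan _ (UnhelpfulLoc msg) = UnhelpfulSpan msg
mkSrcSpan (RealSrcLoc (SrcLoc file l1 c1)) (RealSrcLoc (SrcLoc _ l2 c2)) =
    RealSrcSpan (SrcSpan file l1 c1 l2 c2)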

Then we let mkL dispatch on the TokenType of its first argument.

mkL :: TokenType -> AlexInput -> Int -> Alex Lexeme
mkL toktype (alexStartPos,_,_,str) len = do
    fname <- alexGetFilename
    alexEndPos <- alexGetPos
    let AlexPn _ startLine startCol = alexStartPos
        AlexPn _ endLine endCol = alexEndPos
        startPos = mkSrcLoc fname startLine startCol
        endPos   = mkSrcLoc fname endLine endCol
        srcSpan  = mkSrcSpan startPos endPos
        src = take len str
        cont = case toktype of
            TokTypeInteger    -> mkTok1 TokLitInteger
            TokTypeFloat      -> mkTok1 TokLitFloat
            TokTypeChar       -> mkTok1 TokLitChar
            ...
    return $ cont srcSpan src

mkTok1 :: (Text -> Token) -> SrcSpan -> String -> Located Token
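
Two of the names above are not provided by Alex: alexGetPos and mkTok1. Minimal sketches, assuming the Located type from earlier, might look like this:

-- Not part of Alex; read the current position out of the AlexInput.
alexGetPos :: Alex AlexPosn
alexGetPos = do
    (pos, _, _, _) <- alexGetInput
    return pos

-- Build a one-field token (e.g. TokLitInteger) from the matched source text.
mkTok1 :: (Text -> Token) -> SrcSpan -> String -> Located Token
mkTok1 con sp src = Located sp (con (T.pack src))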

Actually making the cases is tedious, so it is left as an exercise (see the source). Take care with the qualified symbol cases: F.. lexes as TokQualVarSym "F" ".", and not as [TokConId "F", TokTwoDots]. Similarly, F.<.> lexes as TokQualVarSym "F" "<.>".

Compiling with Alex

Cabal (and by extension, Stack) is aware of Alex. By declaring alex as a build tool (build-tool-depends: alex:alex with newer Cabal, or build-tools: alex with older versions), the build system will automatically invoke Alex on your specification file; note that the generated code also needs the array package in build-depends. If you prefer to compile manually, simply run $ alex Lexer.x and Alex will put a Lexer.hs file in the same folder.
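
For example, the relevant parts of a library stanza might look roughly like this (field names for a reasonably recent Cabal; adjust to your own setup):

library
  exposed-modules:    Compiler.Parser.Lexer
  build-depends:      base, array, text
  build-tool-depends: alex:alex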

Closing Remarks

I would like to start including a test tree in the full sources for these chapters, which will make them slower to produce. I'll try to put out the text for a chapter and its src tree first, and correct errors if testing finds them.