Finish Translation

This commit is contained in:
BriFuture 2018-10-21 18:44:30 +08:00
parent b9fbb81314
commit 9492913d36
2 changed files with 637 additions and 639 deletions

View File

@ -1,639 +0,0 @@
BriFuture is translating this article
# Compiling Lisp to JavaScript From Scratch in 350
In this article we will look at a from-scratch implementation of a compiler from a simple LISP-like calculator language to JavaScript. The complete source code can be found [here][7].
We will:
1. Define our language and write a simple program in it
2. Implement a simple parser combinator library
3. Implement a parser for our language
4. Implement a pretty printer for our language
5. Define a subset of JavaScript for our usage
6. Implement a code translator to the JavaScript subset we defined
7. Glue it all together
Let's start!
### 1\. Defining the language
The main attraction of lisps is that their syntax already represent a tree, this is why they are so easy to parse. We'll see that soon. But first let's define our language. Here's a BNF description of our language's syntax:
```
program ::= expr
expr ::= <integer> | <name> | ([<expr>])
```
Basically, our language let's us define one expression at the top level which it will evaluate. An expression is composed of either an integer, for example `5`, a variable, for example `x`, or a list of expressions, for example `(add x 1)`.
An integer evaluate to itself, a variable evaluates to what it's bound in the current environment, and a list evaluates to a function call where the first argument is the function and the rest are the arguments to the function.
We have some built-in special forms in our language so we can do more interesting stuff:
* let expression let's us introduce new variables in the environment of the body of the let. The syntax is:
```
let ::= (let ([<letarg>]) <body>)
letargs ::= (<name> <expr>)
body ::= <expr>
```
* lambda expression: evaluates to an anonymous function definition. The syntax is:
```
lambda ::= (lambda ([<name>]) <body>)
```
We also have a few built in functions: `add`, `mul`, `sub`, `div` and `print`.
Let's see a quick example of a program written in our language:
```
(let
((compose
(lambda (f g)
(lambda (x) (f (g x)))))
(square
(lambda (x) (mul x x)))
(add1
(lambda (x) (add x 1))))
(print ((compose square add1) 5)))
```
This program defines 3 functions: `compose`, `square` and `add1`. And then prints the result of the computation:`((compose square add1) 5)`
I hope this is enough information about the language. Let's start implementing it!
We can define the language in Haskell like this:
```
type Name = String
data Expr
= ATOM Atom
| LIST [Expr]
deriving (Eq, Read, Show)
data Atom
= Int Int
| Symbol Name
deriving (Eq, Read, Show)
```
We can parse programs in the language we defined to an `Expr`. Also, we are giving the new data types `Eq`, `Read`and `Show` instances to aid in testing and debugging. You'll be able to use those in the REPL for example to verify all this actually works.
The reason we did not define `lambda`, `let` and the other built-in functions as part of the syntax is because we can get away with it in this case. These functions are just a more specific case of a `LIST`. So I decided to leave this to a later phase.
Usually, you would like to define these special cases in the abstract syntax - to improve error messages, to unable static analysis and optimizations and such, but we won't do that here so this is enough for us.
Another thing you would like to do usually is add some annotation to the syntax. For example the location: Which file did this `Expr` come from and which row and col in the file. You can use this in later stages to print the location of errors, even if they are not in the parser stage.
* _Exercise 1_ : Add a `Program` data type to include multiple `Expr` sequentially
* _Exercise 2_ : Add location annotation to the syntax tree.
### 2\. Implement a simple parser combinator library
First thing we are going to do is define an Embedded Domain Specific Language (or EDSL) which we will use to define our languages' parser. This is often referred to as parser combinator library. The reason we are doing it is strictly for learning purposes, Haskell has great parsing libraries and you should definitely use them when building real software, or even when just experimenting. One such library is [megaparsec][8].
First let's talk about the idea behind our parser library implementation. In it's essence, our parser is a function that takes some input, might consume some or all of the input, and returns the value it managed to parse and the rest of the input it didn't parse yet, or throws an error if it failed. Let's write that down.
```
newtype Parser a
= Parser (ParseString -> Either ParseError (a, ParseString))
data ParseString
= ParseString Name (Int, Int) String
data ParseError
= ParseError ParseString Error
type Error = String
```
Here we defined three main new types.
First, `Parser a`, is the parsing function we described before.
Second, `ParseString` is our input or state we carry along. It has three significant parts:
* `Name`: This is the name of the source
* `(Int, Int)`: This is the current location in the source
* `String`: This is the remaining string left to parse
Third, `ParseError` contains the current state of the parser and an error message.
Now we want our parser to be flexible, so we will define a few instances for common type classes for it. These instances will allow us to combine small parsers to make bigger parsers (hence the name 'parser combinators').
The first one is a `Functor` instance. We want a `Functor` instance because we want to be able to define a parser using another parser simply by applying a function on the parsed value. We will see an example of this when we define the parser for our language.
```
instance Functor Parser where
fmap f (Parser parser) =
Parser (\str -> first f <$> parser str)
```
The second instance is an `Applicative` instance. One common use case for this instance instance is to lift a pure function on multiple parsers.
```
instance Applicative Parser where
pure x = Parser (\str -> Right (x, str))
(Parser p1) <*> (Parser p2) =
Parser $
\str -> do
(f, rest) <- p1 str
(x, rest') <- p2 rest
pure (f x, rest')
```
(Note:  _We will also implement a Monad instance so we can use do notation here._ )
The third instance is an `Alternative` instance. We want to be able to supply an alternative parser in case one fails.
```
instance Alternative Parser where
empty = Parser (`throwErr` "Failed consuming input")
(Parser p1) <|> (Parser p2) =
Parser $
\pstr -> case p1 pstr of
Right result -> Right result
Left _ -> p2 pstr
```
The forth instance is a `Monad` instance. So we'll be able to chain parsers.
```
instance Monad Parser where
(Parser p1) >>= f =
Parser $
\str -> case p1 str of
Left err -> Left err
Right (rs, rest) ->
case f rs of
Parser parser -> parser rest
```
Next, let's define a way to run a parser and a utility function for failure:
```
runParser :: String -> String -> Parser a -> Either ParseError (a, ParseString)
runParser name str (Parser parser) = parser $ ParseString name (0,0) str
throwErr :: ParseString -> String -> Either ParseError a
throwErr ps@(ParseString name (row,col) _) errMsg =
Left $ ParseError ps $ unlines
[ "*** " ++ name ++ ": " ++ errMsg
, "* On row " ++ show row ++ ", column " ++ show col ++ "."
]
```
Now we'll start implementing the combinators which are the API and heart of the EDSL.
First, we'll define `oneOf`. `oneOf` will succeed if one of the characters in the list supplied to it is the next character of the input and will fail otherwise.
```
oneOf :: [Char] -> Parser Char
oneOf chars =
Parser $ \case
ps@(ParseString name (row, col) str) ->
case str of
[] -> throwErr ps "Cannot read character of empty string"
(c:cs) ->
if c `elem` chars
then Right (c, ParseString name (row, col+1) cs)
else throwErr ps $ unlines ["Unexpected character " ++ [c], "Expecting one of: " ++ show chars]
```
`optional` will stop a parser from throwing an error. It will just return `Nothing` on failure.
```
optional :: Parser a -> Parser (Maybe a)
optional (Parser parser) =
Parser $
\pstr -> case parser pstr of
Left _ -> Right (Nothing, pstr)
Right (x, rest) -> Right (Just x, rest)
```
`many` will try to run a parser repeatedly until it fails. When it does, it'll return a list of successful parses. `many1`will do the same, but will throw an error if it fails to parse at least once.
```
many :: Parser a -> Parser [a]
many parser = go []
where go cs = (parser >>= \c -> go (c:cs)) <|> pure (reverse cs)
many1 :: Parser a -> Parser [a]
many1 parser =
(:) <$> parser <*> many parser
```
These next few parsers use the combinators we defined to make more specific parsers:
```
char :: Char -> Parser Char
char c = oneOf [c]
string :: String -> Parser String
string = traverse char
space :: Parser Char
space = oneOf " \n"
spaces :: Parser String
spaces = many space
spaces1 :: Parser String
spaces1 = many1 space
withSpaces :: Parser a -> Parser a
withSpaces parser =
spaces *> parser <* spaces
parens :: Parser a -> Parser a
parens parser =
(withSpaces $ char '(')
*> withSpaces parser
<* (spaces *> char ')')
sepBy :: Parser a -> Parser b -> Parser [b]
sepBy sep parser = do
frst <- optional parser
rest <- many (sep *> parser)
pure $ maybe rest (:rest) frst
```
Now we have everything we need to start defining a parser for our language.
* _Exercise_ : implement an EOF (end of file/input) parser combinator.
### 3\. Implementing a parser for our language
To define our parser, we'll use the top-bottom method.
```
parseExpr :: Parser Expr
parseExpr = fmap ATOM parseAtom <|> fmap LIST parseList
parseList :: Parser [Expr]
parseList = parens $ sepBy spaces1 parseExpr
parseAtom :: Parser Atom
parseAtom = parseSymbol <|> parseInt
parseSymbol :: Parser Atom
parseSymbol = fmap Symbol parseName
```
Notice that these four function are a very high-level description of our language. This demonstrate why Haskell is so nice for parsing. Still, after defining the high-level parts, we still need to define the lower-level `parseName` and `parseInt`.
What characters can we use as names in our language? Let's decide to use lowercase letters, digits and underscores, where the first character must be a letter.
```
parseName :: Parser Name
parseName = do
c <- oneOf ['a'..'z']
cs <- many $ oneOf $ ['a'..'z'] ++ "0123456789" ++ "_"
pure (c:cs)
```
For integers, we want a sequence of digits optionally preceding by '-':
```
parseInt :: Parser Atom
parseInt = do
sign <- optional $ char '-'
num <- many1 $ oneOf "0123456789"
let result = read $ maybe num (:num) sign of
pure $ Int result
```
Lastly, we'll define a function to run a parser and get back an `Expr` or an error message.
```
runExprParser :: Name -> String -> Either String Expr
runExprParser name str =
case runParser name str (withSpaces parseExpr) of
Left (ParseError _ errMsg) -> Left errMsg
Right (result, _) -> Right result
```
* _Exercise 1_ : Write a parser for the `Program` type you defined in the first section
* _Exercise 2_ : Rewrite `parseName` in Applicative style
* _Exercise 3_ : Find a way to handle the overflow case in `parseInt` instead of using `read`.
### 4\. Implement a pretty printer for our language
One more thing we'd like to do is be able to print our programs as source code. This is useful for better error messages.
```
printExpr :: Expr -> String
printExpr = printExpr' False 0
printAtom :: Atom -> String
printAtom = \case
Symbol s -> s
Int i -> show i
printExpr' :: Bool -> Int -> Expr -> String
printExpr' doindent level = \case
ATOM a -> indent (bool 0 level doindent) (printAtom a)
LIST (e:es) ->
indent (bool 0 level doindent) $
concat
[ "("
, printExpr' False (level + 1) e
, bool "\n" "" (null es)
, intercalate "\n" $ map (printExpr' True (level + 1)) es
, ")"
]
indent :: Int -> String -> String
indent tabs e = concat (replicate tabs " ") ++ e
```
* _Exercise_ : Write a pretty printer for the `Program` type you defined in the first section
Okay, we wrote around 200 lines so far of what's typically called the front-end of the compiler. We have around 150 more lines to go and three more tasks: We need to define a subset of JS for our usage, define the translator from our language to that subset, and glue the whole thing together. Let's go!
### 5\. Define a subset of JavaScript for our usage
First, we'll define the subset of JavaScript we are going to use:
```
data JSExpr
= JSInt Int
| JSSymbol Name
| JSBinOp JSBinOp JSExpr JSExpr
| JSLambda [Name] JSExpr
| JSFunCall JSExpr [JSExpr]
| JSReturn JSExpr
deriving (Eq, Show, Read)
type JSBinOp = String
```
This data type represent a JavaScript expression. We have two atoms - `JSInt` and `JSSymbol` to which we'll translate our languages' `Atom`, We have `JSBinOp` to represent a binary operation such as `+` or `*`, we have `JSLambda`for anonymous functions same as our `lambda expression`, We have `JSFunCall` which we'll use both for calling functions and introducing new names as in `let`, and we have `JSReturn` to return values from functions as that's required in JavaScript.
This `JSExpr` type is an **abstract representation** of a JavaScript expression. We will translate our own `Expr`which is an abstract representation of our languages' expression to `JSExpr` and from there to JavaScript. But in order to do that we need to take this `JSExpr` and produce JavaScript code from it. We'll do that by pattern matching on `JSExpr` recursively and emit JS code as a `String`. This is basically the same thing we did in `printExpr`. We'll also track the scoping of elements so we can indent the generated code in a nice way.
```
printJSOp :: JSBinOp -> String
printJSOp op = op
printJSExpr :: Bool -> Int -> JSExpr -> String
printJSExpr doindent tabs = \case
JSInt i -> show i
JSSymbol name -> name
JSLambda vars expr -> (if doindent then indent tabs else id) $ unlines
["function(" ++ intercalate ", " vars ++ ") {"
,indent (tabs+1) $ printJSExpr False (tabs+1) expr
] ++ indent tabs "}"
JSBinOp op e1 e2 -> "(" ++ printJSExpr False tabs e1 ++ " " ++ printJSOp op ++ " " ++ printJSExpr False tabs e2 ++ ")"
JSFunCall f exprs -> "(" ++ printJSExpr False tabs f ++ ")(" ++ intercalate ", " (fmap (printJSExpr False tabs) exprs) ++ ")"
JSReturn expr -> (if doindent then indent tabs else id) $ "return " ++ printJSExpr False tabs expr ++ ";"
```
* _Exercise 1_ : Add a `JSProgram` type that will hold multiple `JSExpr` and create a function `printJSExprProgram` to generate code for it.
* _Exercise 2_ : Add a new type of `JSExpr` - `JSIf`, and generate code for it.
### 6\. Implement a code translator to the JavaScript subset we defined
We are almost there. In this section we'll create a function to translate `Expr` to `JSExpr`.
The basic idea is simple, we'll translate `ATOM` to `JSSymbol` or `JSInt` and `LIST` to either a function call or a special case we'll translate later.
```
type TransError = String
translateToJS :: Expr -> Either TransError JSExpr
translateToJS = \case
ATOM (Symbol s) -> pure $ JSSymbol s
ATOM (Int i) -> pure $ JSInt i
LIST xs -> translateList xs
translateList :: [Expr] -> Either TransError JSExpr
translateList = \case
[] -> Left "translating empty list"
ATOM (Symbol s):xs
| Just f <- lookup s builtins ->
f xs
f:xs ->
JSFunCall <$> translateToJS f <*> traverse translateToJS xs
```
`builtins` is a list of special cases to translate, like `lambda` and `let`. Every case gets the list of arguments for it, verify that its syntactically valid and translates it to the equivalent `JSExpr`.
```
type Builtin = [Expr] -> Either TransError JSExpr
type Builtins = [(Name, Builtin)]
builtins :: Builtins
builtins =
[("lambda", transLambda)
,("let", transLet)
,("add", transBinOp "add" "+")
,("mul", transBinOp "mul" "*")
,("sub", transBinOp "sub" "-")
,("div", transBinOp "div" "/")
,("print", transPrint)
]
```
In our case, we treat built-in special forms as special and not first class, so will not be able to use them as first class functions and such.
We'll translate a Lambda to an anonymous function:
```
transLambda :: [Expr] -> Either TransError JSExpr
transLambda = \case
[LIST vars, body] -> do
vars' <- traverse fromSymbol vars
JSLambda vars' <$> (JSReturn <$> translateToJS body)
vars ->
Left $ unlines
["Syntax error: unexpected arguments for lambda."
,"expecting 2 arguments, the first is the list of vars and the second is the body of the lambda."
,"In expression: " ++ show (LIST $ ATOM (Symbol "lambda") : vars)
]
fromSymbol :: Expr -> Either String Name
fromSymbol (ATOM (Symbol s)) = Right s
fromSymbol e = Left $ "cannot bind value to non symbol type: " ++ show e
```
We'll translate let to a definition of a function with the relevant named arguments and call it with the values, Thus introducing the variables in that scope:
```
transLet :: [Expr] -> Either TransError JSExpr
transLet = \case
[LIST binds, body] -> do
(vars, vals) <- letParams binds
vars' <- traverse fromSymbol vars
JSFunCall . JSLambda vars' <$> (JSReturn <$> translateToJS body) <*> traverse translateToJS vals
where
letParams :: [Expr] -> Either Error ([Expr],[Expr])
letParams = \case
[] -> pure ([],[])
LIST [x,y] : rest -> ((x:) *** (y:)) <$> letParams rest
x : _ -> Left ("Unexpected argument in let list in expression:\n" ++ printExpr x)
vars ->
Left $ unlines
["Syntax error: unexpected arguments for let."
,"expecting 2 arguments, the first is the list of var/val pairs and the second is the let body."
,"In expression:\n" ++ printExpr (LIST $ ATOM (Symbol "let") : vars)
]
```
We'll translate an operation that can work on multiple arguments to a chain of binary operations. For example: `(add 1 2 3)` will become `1 + (2 + 3)`
```
transBinOp :: Name -> Name -> [Expr] -> Either TransError JSExpr
transBinOp f _ [] = Left $ "Syntax error: '" ++ f ++ "' expected at least 1 argument, got: 0"
transBinOp _ _ [x] = translateToJS x
transBinOp _ f list = foldl1 (JSBinOp f) <$> traverse translateToJS list
```
And we'll translate a `print` as a call to `console.log`
```
transPrint :: [Expr] -> Either TransError JSExpr
transPrint [expr] = JSFunCall (JSSymbol "console.log") . (:[]) <$> translateToJS expr
transPrint xs = Left $ "Syntax error. print expected 1 arguments, got: " ++ show (length xs)
```
Notice that we could have skipped verifying the syntax if we'd parse those as special cases of `Expr`.
* _Exercise 1_ : Translate `Program` to `JSProgram`
* _Exercise 2_ : add a special case for `if Expr Expr Expr` and translate it to the `JSIf` case you implemented in the last exercise
### 7\. Glue it all together
Finally, we are going to glue this all together. We'll:
1. Read a file
2. Parse it to `Expr`
3. Translate it to `JSExpr`
4. Emit JavaScript code to the standard output
We'll also enable a few flags for testing:
* `--e` will parse and print the abstract representation of the expression (`Expr`)
* `--pp` will parse and pretty print
* `--jse` will parse, translate and print the abstract representation of the resulting JS (`JSExpr`)
* `--ppc` will parse, pretty print and compile
```
main :: IO ()
main = getArgs >>= \case
[file] ->
printCompile =<< readFile file
["--e",file] ->
either putStrLn print . runExprParser "--e" =<< readFile file
["--pp",file] ->
either putStrLn (putStrLn . printExpr) . runExprParser "--pp" =<< readFile file
["--jse",file] ->
either print (either putStrLn print . translateToJS) . runExprParser "--jse" =<< readFile file
["--ppc",file] ->
either putStrLn (either putStrLn putStrLn) . fmap (compile . printExpr) . runExprParser "--ppc" =<< readFile file
_ ->
putStrLn $ unlines
["Usage: runghc Main.hs [ --e, --pp, --jse, --ppc ] <filename>"
,"--e print the Expr"
,"--pp pretty print Expr"
,"--jse print the JSExpr"
,"--ppc pretty print Expr and then compile"
]
printCompile :: String -> IO ()
printCompile = either putStrLn putStrLn . compile
compile :: String -> Either Error String
compile str = printJSExpr False 0 <$> (translateToJS =<< runExprParser "compile" str)
```
That's it. We have a compiler from our language to JS. Again, you can view the full source file [here][9].
Running our compiler with the example from the first section yields this JavaScript code:
```
$ runhaskell Lisp.hs example.lsp
(function(compose, square, add1) {
return (console.log)(((compose)(square, add1))(5));
})(function(f, g) {
return function(x) {
return (f)((g)(x));
};
}, function(x) {
return (x * x);
}, function(x) {
return (x + 1);
})
```
If you have node.js installed on your computer, you can run this code by running:
```
$ runhaskell Lisp.hs example.lsp | node -p
36
undefined
```
* _Final exercise_ : instead of compiling an expression, compile a program of multiple expressions.
--------------------------------------------------------------------------------
via: https://gilmi.me/blog/post/2016/10/14/lisp-to-js
作者:[ Gil Mizrahi ][a]
选题:[oska874][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://gilmi.me/home
[b]:https://github.com/oska874
[1]:https://gilmi.me/blog/authors/Gil
[2]:https://gilmi.me/blog/tags/compilers
[3]:https://gilmi.me/blog/tags/fp
[4]:https://gilmi.me/blog/tags/haskell
[5]:https://gilmi.me/blog/tags/lisp
[6]:https://gilmi.me/blog/tags/parsing
[7]:https://gist.github.com/soupi/d4ff0727ccb739045fad6cdf533ca7dd
[8]:https://mrkkrp.github.io/megaparsec/
[9]:https://gist.github.com/soupi/d4ff0727ccb739045fad6cdf533ca7dd
[10]:https://gilmi.me/blog/post/2016/10/14/lisp-to-js

View File

@ -0,0 +1,637 @@
# 用 350 行代码从零开始,将 Lisp 编译成 JavaScript
我们将会在本篇文章中看到从零开始实现的编译器,将简单的类 LISP 计算语言编译成 JavaScript。完整的源代码在 [这里][7].
我们将会:
1. 自定义语言,并用它编写一个简单的程序
2. 实现一个简单的解析器组合器
3. 为该语言实现一个解析器
4. 为该语言实现一个美观的打印器
5. 为我们的需求定义 JavaScript 的一个子集
6. 实现代码转译器,将代码转译成我们定义的 JavaScript 子集
7. 把所有东西整合在一起
开始吧!
### 1. 定义语言
lisps 最迷人的地方在于它们的语法就是树状表示的这就是这门语言很容易解析的原因。我们很快就能接触到它。但首先让我们把自己的语言定义好。关于我们语言的语法的范式BNF描述如下
```
program ::= expr
expr ::= <integer> | <name> | ([<expr>])
```
基本上,我们可以在该语言的最顶层定义表达式并对其进行运算。表达式由一个整数(比如 `5`)、一个变量(比如 `x`)或者一个表达式列表(比如 `(add x 1)`)组成。
整数对应它本身的值,变量对应它在当前环境中绑定的值,表达式列表对应一个函数调用,该列表的第一个参数是相应的函数,剩下的表达式是传递给这个函数的参数。
该语言中,我们保留一些内建的特殊形式,这样我们就能做一些更有意思的事情:
* let 表达式使我们可以在它的 body 环境中引入新的变量。语法如下:
```
let ::= (let ([<letarg>]) <body>)
letargs ::= (<name> <expr>)
body ::= <expr>
```
* lambda 表达式:也就是匿名函数定义。语法如下:
```
lambda ::= (lambda ([<name>]) <body>)
```
还有一些内建函数: `add`、`mul`、`sub`、`div` 和 `print`
让我们看看用我们这门语言编写的入门示例程序:
```
(let
((compose
(lambda (f g)
(lambda (x) (f (g x)))))
(square
(lambda (x) (mul x x)))
(add1
(lambda (x) (add x 1))))
(print ((compose square add1) 5)))
```
这个程序定义了 3 个函数:`compose`、`square` 和 `add1`。然后将计算结果的值 `((compose square add1) 5)` 输出出来。
我相信了解这门语言,这些信息就足够了。开始实现它吧。
在 Haskell 中,我们可以这样定义语言:
```
type Name = String
data Expr
= ATOM Atom
| LIST [Expr]
deriving (Eq, Read, Show)
data Atom
= Int Int
| Symbol Name
deriving (Eq, Read, Show)
```
我们可以解析用该语言用 `Expr` 定义的程序。而且,这里我们添加了新数据类型 `Eq`、`Read` 和 `Show` 等实例用于测试和调试。你能够在 REPL 中使用这些数据类型,验证它们确实有用。
我们不在语法中定义 `lambda`、`let` 或其它的内建函数,原因在于,当前情况下我们没必要用到这些东西。这些函数仅仅是 `LIST` (表达式列表)的更加特殊的用例。所以我决定将它放到后面的部分。
一般来说你想要在抽象语法中定义这些特殊用例 —— 用于改进错误信息、禁用静态分析和优化等等,但在这里我们不会这样做,对我们来说这些已经足够了。
另一件你想做的事情可能是在语法中添加一些注释信息。比如定位:`Expr` 是来自哪个文件的,具体到这个文件的哪一行哪一列。你可以在后面的阶段中使用这一特性,打印出错误定位,即使它们不是处于解析阶段。
* _练习 1_:添加一个 `Program` 数据类型,可以按顺序包含多个 `Expr`
* _练习 2_:向语法树中添加一个定位注解。
### 2. 实现一个简单的解析器组合库
我们要做的第一件事情是定义一个嵌入式领域专用语言Embedded Domain Specific Language 或者 EDSL我们会用它来定义我们的语言解析器。这常常被称为解析器组合库。我们做这件事完全是出于学习的目的Haskell 里有很好的解析库,在实际构建软件或者进行实验时,你应该使用它们。[megaparsec][8] 就是这样的一个库。
首先我们来谈谈解析库的实现的思路。本质上,我们的解析器就是一个函数,接受一些输入,可能会读取输入的一些或全部内容,然后返回解析出来的值和无法解析的输入部分,或者在解析失败时抛出异常。我们把它写出来。
```
newtype Parser a
= Parser (ParseString -> Either ParseError (a, ParseString))
data ParseString
= ParseString Name (Int, Int) String
data ParseError
= ParseError ParseString Error
type Error = String
```
这里我们定义了三个主要的新类型。
第一个,`Parser a` 是之前讨论的解析函数。
第二个,`ParseString` 是我们的输入或携带的状态。它有三个重要的部分:
* `Name`: 这是源的名字
* `(Int, Int)`: 这是源的当前位置
* `String`: 这是等待解析的字符串
第三个,`ParseError` 包含了解析器的当前状态和一个错误信息。
现在我们想让这个解析器更灵活,我们将会定义一些常用类型的实例。这些实例让我们能够将小巧的解析器和复杂的解析器结合在一起(因此它的名字叫做 “解析器组合器”)。
第一个是 `Functor` 实例。我们需要 `Functor` 实例,因为我们要能够对解析值应用函数从而使用不同的解析器。当我们定义自己语言的解析器时,我们将会看到关于它的示例。
```
instance Functor Parser where
fmap f (Parser parser) =
Parser (\str -> first f <$> parser str)
```
第二个是 `Applicative` 实例。该实例的常见用例是在多个解析器中实现一个纯函数。
```
instance Applicative Parser where
pure x = Parser (\str -> Right (x, str))
(Parser p1) <*> (Parser p2) =
Parser $
\str -> do
(f, rest) <- p1 str
(x, rest') <- p2 rest
pure (f x, rest')
```
注意_我们还会实现一个 Monad 实例这样我们才能使用符号_
第三个是 `Alternative` 实例。万一前面的解析器解析失败了,我们要能够提供一个备用的解析器。
```
instance Alternative Parser where
empty = Parser (`throwErr` "Failed consuming input")
(Parser p1) <|> (Parser p2) =
Parser $
\pstr -> case p1 pstr of
Right result -> Right result
Left _ -> p2 pstr
```
第四个是 `Monad` 实例。这样我们就能链接解析器。
```
instance Monad Parser where
(Parser p1) >>= f =
Parser $
\str -> case p1 str of
Left err -> Left err
Right (rs, rest) ->
case f rs of
Parser parser -> parser rest
```
接下来,让我们定义一种的方式,用于运行解析器和防止失败的助手函数:
```
runParser :: String -> String -> Parser a -> Either ParseError (a, ParseString)
runParser name str (Parser parser) = parser $ ParseString name (0,0) str
throwErr :: ParseString -> String -> Either ParseError a
throwErr ps@(ParseString name (row,col) _) errMsg =
Left $ ParseError ps $ unlines
[ "*** " ++ name ++ ": " ++ errMsg
, "* On row " ++ show row ++ ", column " ++ show col ++ "."
]
```
现在我们将会开始实现组合器,这是 EDSL 的 API也是它的核心。
首先,我们会定义 `oneOf`。如果输入列表中的字符后面还有字符的话,`oneOf` 将会成功,否则就会失败。
```
oneOf :: [Char] -> Parser Char
oneOf chars =
Parser $ \case
ps@(ParseString name (row, col) str) ->
case str of
[] -> throwErr ps "Cannot read character of empty string"
(c:cs) ->
if c `elem` chars
then Right (c, ParseString name (row, col+1) cs)
else throwErr ps $ unlines ["Unexpected character " ++ [c], "Expecting one of: " ++ show chars]
```
`optional` 将会抛出异常,停止解析器。失败时它仅仅会返回 `Nothing`
```
optional :: Parser a -> Parser (Maybe a)
optional (Parser parser) =
Parser $
\pstr -> case parser pstr of
Left _ -> Right (Nothing, pstr)
Right (x, rest) -> Right (Just x, rest)
```
`many` 将会试着重复运行解析器,直到失败。当它完成的时候,会返回成功运行的解析器列表。`many1` 做的事情是一样的,但解析失败时它至少会抛出一次异常。
```
many :: Parser a -> Parser [a]
many parser = go []
where go cs = (parser >>= \c -> go (c:cs)) <|> pure (reverse cs)
many1 :: Parser a -> Parser [a]
many1 parser =
(:) <$> parser <*> many parser
```
下面的这些解析器通过我们定义的组合器来实现一些特殊的解析器:
```
char :: Char -> Parser Char
char c = oneOf [c]
string :: String -> Parser String
string = traverse char
space :: Parser Char
space = oneOf " \n"
spaces :: Parser String
spaces = many space
spaces1 :: Parser String
spaces1 = many1 space
withSpaces :: Parser a -> Parser a
withSpaces parser =
spaces *> parser <* spaces
parens :: Parser a -> Parser a
parens parser =
(withSpaces $ char '(')
*> withSpaces parser
<* (spaces *> char ')')
sepBy :: Parser a -> Parser b -> Parser [b]
sepBy sep parser = do
frst <- optional parser
rest <- many (sep *> parser)
pure $ maybe rest (:rest) frst
```
现在为该门语言定义解析器所需要的所有东西都有了。
* _练习_ :实现一个 EOFend of file/input即文件或输入终止符解析器组合器。
### 3. 为我们的语言实现解析器
我们会用自顶而下的方法定义解析器。
```
parseExpr :: Parser Expr
parseExpr = fmap ATOM parseAtom <|> fmap LIST parseList
parseList :: Parser [Expr]
parseList = parens $ sepBy spaces1 parseExpr
parseAtom :: Parser Atom
parseAtom = parseSymbol <|> parseInt
parseSymbol :: Parser Atom
parseSymbol = fmap Symbol parseName
```
注意到这四个函数是在我们这门语言中属于高阶描述。这解释了为什么 Haskell 执行解析工作这么棒。在定义完高级部分后,我们还需要定义低级别的 `parseName``parseInt`
我们能在这门语言中用什么字符作为名字呢?用小写的字母、数字和下划线吧,而且名字的第一个字符必须是字母。
```
parseName :: Parser Name
parseName = do
c <- oneOf ['a'..'z']
cs <- many $ oneOf $ ['a'..'z'] ++ "0123456789" ++ "_"
pure (c:cs)
```
整数是一系列数字,数字前面可能有负号 -
```
parseInt :: Parser Atom
parseInt = do
sign <- optional $ char '-'
num <- many1 $ oneOf "0123456789"
let result = read $ maybe num (:num) sign of
pure $ Int result
```
最后,我们会定义用来运行解析器的函数,返回值可能是一个 `Expr` 或者是一条错误信息。
```
runExprParser :: Name -> String -> Either String Expr
runExprParser name str =
case runParser name str (withSpaces parseExpr) of
Left (ParseError _ errMsg) -> Left errMsg
Right (result, _) -> Right result
```
* _练习 1_ :为第一节中定义的 `Program` 类型编写一个解析器
* _练习 2_ :用 Applicative 的形式重写 `parseName`
* _练习 3_ `parseInt` 可能出现溢出情况,找到处理它的方法,不要用 `read`
### 4. 为这门语言实现一个更好看的输出器
我们还想做一件事,将我们的程序以源代码的形式打印出来。这对完善错误信息很有用。
```
printExpr :: Expr -> String
printExpr = printExpr' False 0
printAtom :: Atom -> String
printAtom = \case
Symbol s -> s
Int i -> show i
printExpr' :: Bool -> Int -> Expr -> String
printExpr' doindent level = \case
ATOM a -> indent (bool 0 level doindent) (printAtom a)
LIST (e:es) ->
indent (bool 0 level doindent) $
concat
[ "("
, printExpr' False (level + 1) e
, bool "\n" "" (null es)
, intercalate "\n" $ map (printExpr' True (level + 1)) es
, ")"
]
indent :: Int -> String -> String
indent tabs e = concat (replicate tabs " ") ++ e
```
* _练习_ :为第一节中定义的 `Program` 类型编写一个美观的输出器
好,目前为止我们写了近 200 行代码,这些代码一般叫做编译器的前端。我们还要写大概 150 行代码,用来执行三个额外的任务:我们需要根据需求定义一个 JS 的子集,定义一个将我们的语言转译成这个子集的转译器,最后把所有东西整合在一起。开始吧。
### 5. 根据需求定义 JavaScript 的子集
首先,我们要定义将要使用的 JavaScript 的子集:
```
data JSExpr
= JSInt Int
| JSSymbol Name
| JSBinOp JSBinOp JSExpr JSExpr
| JSLambda [Name] JSExpr
| JSFunCall JSExpr [JSExpr]
| JSReturn JSExpr
deriving (Eq, Show, Read)
type JSBinOp = String
```
这个数据类型表示 JavaScript 表达式。我们有两个原子类型 `JSInt``JSSymbol`,它们是由我们这个语言中的 `Atom` 转译来的,我们用 `JSBinOp` 来表示二元操作,比如 `+``*`,用 `JSLambda` 来表示匿名函数,和我们语言中的 `lambda expressionlambda 表达式)` 一样,我们将会用 `JSFunCall` 来调用函数,用 `let` 来引入新名字,用 `JSReturn` 从函数中返回值,在 JavaScript 中是需要返回值的。
`JSExpr` 类型是对 JavaScript 表达式的 **抽象表示**。我们会把自己语言中表达式的抽象表示 `Expr` 转译成 JavaScript 表达式的抽象表示 `JSExpr`。但为了实现这个功能,我们需要实现 `JSExpr` ,并从这个抽象表示中生成 JavaScript 代码。我们将通过递归匹配 `JSExpr` 实现,将 JS 代码当作 `String` 来输出。这和我们在 `printExpr` 中做的基本上是一样的。我们还会追踪元素的作用域,这样我们才可以用合适的方式缩进生成的代码。
```
printJSOp :: JSBinOp -> String
printJSOp op = op
printJSExpr :: Bool -> Int -> JSExpr -> String
printJSExpr doindent tabs = \case
JSInt i -> show i
JSSymbol name -> name
JSLambda vars expr -> (if doindent then indent tabs else id) $ unlines
["function(" ++ intercalate ", " vars ++ ") {"
,indent (tabs+1) $ printJSExpr False (tabs+1) expr
] ++ indent tabs "}"
JSBinOp op e1 e2 -> "(" ++ printJSExpr False tabs e1 ++ " " ++ printJSOp op ++ " " ++ printJSExpr False tabs e2 ++ ")"
JSFunCall f exprs -> "(" ++ printJSExpr False tabs f ++ ")(" ++ intercalate ", " (fmap (printJSExpr False tabs) exprs) ++ ")"
JSReturn expr -> (if doindent then indent tabs else id) $ "return " ++ printJSExpr False tabs expr ++ ";"
```
* _练习 1_ :添加 `JSProgram` 类型,它可以包含多个 `JSExpr` ,然后创建一个叫做 `printJSExprProgram` 的函数来生成代码。
* _练习 2_ :添加 `JSExpr` 的新类型:`JSIf`,并为其生成代码。
### 6. 实现到我们定义的 JavaScript 子集的代码转译器
我们快做完了。这一节将会创建函数,将 `Expr` 转译成 `JSExpr`
基本思想很简单,我们会将 `ATOM` 转译成 `JSSymbol` 或者 `JSInt`,然后会将 `LIST` 转译成一个函数调用或者转译的特例。
```
type TransError = String
translateToJS :: Expr -> Either TransError JSExpr
translateToJS = \case
ATOM (Symbol s) -> pure $ JSSymbol s
ATOM (Int i) -> pure $ JSInt i
LIST xs -> translateList xs
translateList :: [Expr] -> Either TransError JSExpr
translateList = \case
[] -> Left "translating empty list"
ATOM (Symbol s):xs
| Just f <- lookup s builtins ->
f xs
f:xs ->
JSFunCall <$> translateToJS f <*> traverse translateToJS xs
```
`builtins` 是一系列要转译的特例,就像 `lambada``let`。每一种情况都可以获得一系列参数,验证它是否合乎语法规范,然后将其转译成等效的 `JSExpr`
```
type Builtin = [Expr] -> Either TransError JSExpr
type Builtins = [(Name, Builtin)]
builtins :: Builtins
builtins =
[("lambda", transLambda)
,("let", transLet)
,("add", transBinOp "add" "+")
,("mul", transBinOp "mul" "*")
,("sub", transBinOp "sub" "-")
,("div", transBinOp "div" "/")
,("print", transPrint)
]
```
我们这种情况,会将内建的特殊形式当作特殊的、非第一类的进行对待,因此不可能将它们当作第一类函数。
我们会把 Lambda 表达式转译成一个匿名函数:
```
transLambda :: [Expr] -> Either TransError JSExpr
transLambda = \case
[LIST vars, body] -> do
vars' <- traverse fromSymbol vars
JSLambda vars' <$> (JSReturn <$> translateToJS body)
vars ->
Left $ unlines
["Syntax error: unexpected arguments for lambda."
,"expecting 2 arguments, the first is the list of vars and the second is the body of the lambda."
,"In expression: " ++ show (LIST $ ATOM (Symbol "lambda") : vars)
]
fromSymbol :: Expr -> Either String Name
fromSymbol (ATOM (Symbol s)) = Right s
fromSymbol e = Left $ "cannot bind value to non symbol type: " ++ show e
```
我们会将 let 转译成带有相关名字参数的函数定义,然后带上参数调用函数,因此会在这一作用域中引入变量:
```
transLet :: [Expr] -> Either TransError JSExpr
transLet = \case
[LIST binds, body] -> do
(vars, vals) <- letParams binds
vars' <- traverse fromSymbol vars
JSFunCall . JSLambda vars' <$> (JSReturn <$> translateToJS body) <*> traverse translateToJS vals
where
letParams :: [Expr] -> Either Error ([Expr],[Expr])
letParams = \case
[] -> pure ([],[])
LIST [x,y] : rest -> ((x:) *** (y:)) <$> letParams rest
x : _ -> Left ("Unexpected argument in let list in expression:\n" ++ printExpr x)
vars ->
Left $ unlines
["Syntax error: unexpected arguments for let."
,"expecting 2 arguments, the first is the list of var/val pairs and the second is the let body."
,"In expression:\n" ++ printExpr (LIST $ ATOM (Symbol "let") : vars)
]
```
我们会将可以在多个参数之间执行的操作符转译成一系列二元操作符。比如:`(add 1 2 3)` 将会变成 `1 + (2 + 3)`
```
transBinOp :: Name -> Name -> [Expr] -> Either TransError JSExpr
transBinOp f _ [] = Left $ "Syntax error: '" ++ f ++ "' expected at least 1 argument, got: 0"
transBinOp _ _ [x] = translateToJS x
transBinOp _ f list = foldl1 (JSBinOp f) <$> traverse translateToJS list
```
然后我们会将 `print` 转换成对 `console.log` 的调用。
```
transPrint :: [Expr] -> Either TransError JSExpr
transPrint [expr] = JSFunCall (JSSymbol "console.log") . (:[]) <$> translateToJS expr
transPrint xs = Left $ "Syntax error. print expected 1 arguments, got: " ++ show (length xs)
```
注意,如果我们将这些代码当作 `Expr` 的特例进行解析,那我们就可能会跳过语法验证。
* _练习 1_ :将 `Program` 转译成 `JSProgram`
* _练习 2_ :为 `if Expr Expr Expr` 添加一个特例,并将它转译成你在上一次练习中实现的 `JSIf` 条件语句。
### 7. 把所有东西整合到一起
最终,我们将会把所有东西整合到一起。我们会:
1. 读取文件
2. 将文件解析成 `Expr`
3. 将文件转译成 `JSExpr`
4. 将 JavaScript 代码发送到标准输出流
我们还会启用一些用于测试的标志位:
* `--e` 将进行解析并打印出表达式的抽象表示(`Expr`
* `--pp` 将进行解析,美化输出
* `--jse` 将进行解析、转译、并打印出生成的 JS 表达式(`JSExpr`)的抽象表示
* `--ppc` 将进行解析,美化输出并进行编译
```
main :: IO ()
main = getArgs >>= \case
[file] ->
printCompile =<< readFile file
["--e",file] ->
either putStrLn print . runExprParser "--e" =<< readFile file
["--pp",file] ->
either putStrLn (putStrLn . printExpr) . runExprParser "--pp" =<< readFile file
["--jse",file] ->
either print (either putStrLn print . translateToJS) . runExprParser "--jse" =<< readFile file
["--ppc",file] ->
either putStrLn (either putStrLn putStrLn) . fmap (compile . printExpr) . runExprParser "--ppc" =<< readFile file
_ ->
putStrLn $ unlines
["Usage: runghc Main.hs [ --e, --pp, --jse, --ppc ] <filename>"
,"--e print the Expr"
,"--pp pretty print Expr"
,"--jse print the JSExpr"
,"--ppc pretty print Expr and then compile"
]
printCompile :: String -> IO ()
printCompile = either putStrLn putStrLn . compile
compile :: String -> Either Error String
compile str = printJSExpr False 0 <$> (translateToJS =<< runExprParser "compile" str)
```
大功告成。将自己的语言编译到 JS 子集的编译器已经完成了。再说一次,你可以在 [这里][9] 看到完整的源文件。
用我们的编译器运行第一节的示例,产生的 JavaScript 代码如下:
```
$ runhaskell Lisp.hs example.lsp
(function(compose, square, add1) {
return (console.log)(((compose)(square, add1))(5));
})(function(f, g) {
return function(x) {
return (f)((g)(x));
};
}, function(x) {
return (x * x);
}, function(x) {
return (x + 1);
})
```
如果你在自己电脑上安装了 node.js你可以用以下命令运行这段代码
```
$ runhaskell Lisp.hs example.lsp | node -p
36
undefined
```
* _最终练习_ : 编译有多个表达式的程序而非仅编译一个表达式。
--------------------------------------------------------------------------------
via: https://gilmi.me/blog/post/2016/10/14/lisp-to-js
作者:[ Gil Mizrahi ][a]
选题:[oska874][b]
译者:[BriFuture](https://github.com/BriFuture)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]:https://gilmi.me/home
[b]:https://github.com/oska874
[1]:https://gilmi.me/blog/authors/Gil
[2]:https://gilmi.me/blog/tags/compilers
[3]:https://gilmi.me/blog/tags/fp
[4]:https://gilmi.me/blog/tags/haskell
[5]:https://gilmi.me/blog/tags/lisp
[6]:https://gilmi.me/blog/tags/parsing
[7]:https://gist.github.com/soupi/d4ff0727ccb739045fad6cdf533ca7dd
[8]:https://mrkkrp.github.io/megaparsec/
[9]:https://gist.github.com/soupi/d4ff0727ccb739045fad6cdf533ca7dd
[10]:https://gilmi.me/blog/post/2016/10/14/lisp-to-js