What is RESSA

2018-12-17

Now that I have released a minimal working version of RESSA, it seems like a good time to go over how someone might use it. RESSA is a library for parsing JavaScript from text into an abstract syntax tree (AST). The goal of the project is to enable users to build JavaScript development tools with Rust.

Before we dig in too deep, it may be worth going back and reading about RESS, which powers the core of RESSA. It isn't required, but it might be helpful. Where RESS covers the first part of the parsing process, tokenization, RESSA builds on that to evaluate what a set of tokens might represent. To be clear, we are still dealing only with syntax; no semantic meaning or evaluation will come into play.

The main point of entry for RESSA is the Parser struct, which takes some JavaScript as a &str and converts it into an AST. There are two ways to construct a Parser: by using Parser::new(&str) or by using the Builder struct. The new method gives you a Parser set up with the default configuration: it assumes the &str should be parsed as a Script rather than a Module, it does not tolerate any errors, and it discards any Comments it finds. The Builder lets you customize these settings more ergonomically.

let mut p = Builder::new()
    .module(true)
    .tolerant(true)
    .js("console.log('things!');")
    .build()
    .expect("failed to create parser");

Once you have a Parser, there are two ways to use it. The simplest is to call the parse() method, which returns a Result<Program, Error>; a Program is either a Script or a Module containing a Vec<ProgramPart>. The other is to use the Parser as an iterator over Result<ProgramPart, Error>. As might be apparent, ProgramPart is the main building block of the tree. Every complete section of code ends up as a ProgramPart, which has three variants: a directive, a declaration (ProgramPart::Decl), or a statement (ProgramPart::Statement).
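
The difference between those two styles of use can be pictured with a toy model. The sketch below does not call RESSA at all: ProgramPart, Error, classify, parts and parse_all are all hypothetical stand-ins written for this illustration, and the "parser" simply splits on semicolons, so it would fall over on anything with a block body. It only aims to show how an iterator of Result<ProgramPart, Error> relates to a single Result holding every part.

```rust
// Toy stand-ins for illustration only. RESSA's real ProgramPart variants wrap
// full AST nodes; here each one just keeps the source text it came from.
#[derive(Debug, PartialEq)]
enum ProgramPart {
    Directive(String),
    Decl(String),
    Statement(String),
}

#[derive(Debug, PartialEq)]
struct Error(String);

// The keywords the post lists as top-level declarations.
const DECL_KEYWORDS: [&str; 7] =
    ["function", "class", "var", "let", "const", "import", "export"];

/// Hypothetical helper: classify one semicolon-delimited chunk of source.
/// Deliberately naive; a real parser works on tokens, not on string prefixes.
fn classify(chunk: &str) -> Result<ProgramPart, Error> {
    // The only "syntax error" this toy knows about is unbalanced parentheses.
    let opens = chunk.matches('(').count();
    let closes = chunk.matches(')').count();
    if opens != closes {
        return Err(Error(format!("unbalanced parentheses in `{}`", chunk)));
    }
    let first_word = chunk.split_whitespace().next().unwrap_or("");
    if chunk == "'use strict'" || chunk == "\"use strict\"" {
        Ok(ProgramPart::Directive(chunk.to_string()))
    } else if DECL_KEYWORDS.contains(&first_word) {
        Ok(ProgramPart::Decl(chunk.to_string()))
    } else {
        Ok(ProgramPart::Statement(chunk.to_string()))
    }
}

/// Lazy style: hand out one Result<ProgramPart, Error> at a time.
fn parts(js: &str) -> impl Iterator<Item = Result<ProgramPart, Error>> + '_ {
    js.split(';')
        .map(str::trim)
        .filter(|chunk| !chunk.is_empty())
        .map(classify)
}

/// Eager style: collect every part, stopping at the first error.
fn parse_all(js: &str) -> Result<Vec<ProgramPart>, Error> {
    parts(js).collect()
}
```

With this model, the eager style is all or nothing, while the lazy style lets a tool keep the parts that came before a failure and decide for itself what to do with the error.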

Of those three variants, a directive is pretty straightforward: technically it could be any literal, but the only one that has any semantic meaning is 'use strict'. Declarations are the top-level items function, class, var, let, const, import and export. Most of these can also appear below the top level; when they are at the top level they are Declarations, otherwise they are Statements. Instead of just listing out all of the different possibilities, let's look at an example.

function print(message) {
    console.log(message)
}

The above example, lifted from the RESS tutorial, when parsed by RESSA would look like this.

ProgramPart::Decl(
    Declaration::Function(
        Function {
            id: Some(String::from("print")),
            params: vec![
                FunctionArg::Pattern(
                    Pattern::Identifier(
                        String::from("message")
                    )
                )
            ],
            body: vec![
                ProgramPart::Statement(
                    Statement::Expr(
                        Expression::Call(
                            CallExpression {
                                callee: Box::new(
                                    Expression::Member(
                                        MemberExpression {
                                            object: Box::new(
                                                Expression::Ident(
                                                    String::from("console")
                                                )
                                            ),
                                            property: Box::new(
                                                Expression::Ident(
                                                    String::from("log")
                                                )
                                            ),
                                            computed: false,
                                        }
                                    )
                                ),
                                arguments: vec![
                                    Expression::Ident(
                                        String::from("message")
                                    )
                                ],
                            },
                        )
                    )
                )
            ],
            generator: false,
            is_async: false,
        }
    )
);

Whoa, that is quite a bit of information for three lines of code! Let's break it down a little. First we have our ProgramPart::Decl, which will always contain a Declaration. A Declaration can be a few different things; in this case it is a Declaration::Function. Pretty straightforward so far. The Function has an id property that could be None, but in this case it is "print". Next are the params: each param can be either a Pattern or an Expression, and this one is a Pattern, specifically a Pattern::Identifier named "message".

Now we are at the function body. A function body is a Vec of ProgramParts; in this case we have one Statement, a Statement::Expr, which will always contain an Expression. Our one expression here is an Expression::Call, meaning it is calling another function, and inside this variant is a CallExpression struct. A CallExpression has two properties, the callee and the arguments. The callee can be any Expression, so we need to wrap it in a Box; otherwise the compiler would tell us that Expression could be infinitely sized, which the compiler hates.

Inside our Box we have an Expression::Member, which contains a MemberExpression. This is how we describe the action of accessing members of something, so in console.log or console['log'], log is a member of console. A MemberExpression has three properties. The first is object, which is the parent, here "console". Since it could be almost anything, we first need to wrap it in an Expression and, given the compiler's disdain for infinity, wrap that in a Box; inside the Box is an Expression::Ident with the value "console". For the property we need to do a similar dance: we have a Box wrapping an Expression::Ident wrapping the value "log". The final property is computed, a flag indicating whether we used index notation (console['log']) instead of dot notation (console.log); in this case it is false.

Moving on to the arguments, this will be a Vec<Expression>. In our case there is only one, an Expression::Ident with the value "message". At this point we get to exit the function's body and go back to the Function properties generator and is_async, both of which are false.
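
To make the Box and computed discussion concrete, here is a minimal sketch using cut-down copies of the types from the tree above. These are stand-ins written for this post rather than RESSA's actual definitions, and ident, member, console_log_call and render are hypothetical helpers. The render function turns a tree back into source text, which is a handy way to see what the computed flag changes.

```rust
// Cut-down stand-ins modeled on the tree printed above; not RESSA's real types.
#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Ident(String),
    Member(MemberExpression),
    Call(CallExpression),
}

#[derive(Debug, Clone, PartialEq)]
struct MemberExpression {
    // Without the Box, Expression would contain itself by value and rustc
    // would reject it with E0072 (recursive type has infinite size).
    object: Box<Expression>,
    property: Box<Expression>,
    computed: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct CallExpression {
    callee: Box<Expression>,
    // A Vec already stores its items on the heap, so no Box is needed here.
    arguments: Vec<Expression>,
}

/// Hypothetical helper: build an identifier expression.
fn ident(name: &str) -> Expression {
    Expression::Ident(name.to_string())
}

/// Hypothetical helper: build a member access such as `a.b` or `a[b]`.
fn member(object: Expression, property: Expression, computed: bool) -> Expression {
    Expression::Member(MemberExpression {
        object: Box::new(object),
        property: Box::new(property),
        computed,
    })
}

/// Build the tree for `console.log(message)`.
fn console_log_call() -> Expression {
    Expression::Call(CallExpression {
        callee: Box::new(member(ident("console"), ident("log"), false)),
        arguments: vec![ident("message")],
    })
}

/// Turn a tree back into source text.
fn render(expr: &Expression) -> String {
    match expr {
        Expression::Ident(name) => name.clone(),
        Expression::Member(m) => {
            let object = render(&m.object);
            let property = render(&m.property);
            if m.computed {
                format!("{}[{}]", object, property) // index notation
            } else {
                format!("{}.{}", object, property) // dot notation
            }
        }
        Expression::Call(c) => {
            let args: Vec<String> = c.arguments.iter().map(render).collect();
            format!("{}({})", render(&c.callee), args.join(", "))
        }
    }
}
```

Rendering console_log_call() gives back console.log(message), and building member(ident("console"), ident("logVar"), true) renders as console[logVar].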

Holy cow, that is verbose! The unfortunate truth is that JavaScript has so many corner cases to cover that truly capturing any part of a program requires this much information. Let's take the MemberExpression as an example: it needs to be able to represent almost any combination of literals and identifiers. Consider the following code block, which illustrates a large number of ways to represent console.log.

console.log;
console['log'];
const logVar = 'log';
console[logVar];
console[['l','o','g'].join('')];
class Log {
    toString() {
        return 'log';
    }
}
const logToString = new Log();
console[logToString];
function logFunc() {
    return 'log';
}
console[logFunc()];
function getConsole() {
    return console
}
getConsole()[logFunc()];
getConsole().log;

And that's just what I could think of in the moment; imagine how many other possibilities there might be. To handle all of that, RESSA leans pretty heavily on enums, which is nice because they provide an inherently structured kind of dynamic value. The two heavy lifters in this space are Statement and Expression; with these two structures RESSA is able to represent the nearly infinite possible combinations of tokens that make up valid JavaScript.
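
As a rough illustration of why enums suit this job, the sketch below asks whether a member expression is syntactically console.log. The types are again cut-down stand-ins rather than RESSA's own (StringLiteral in particular is a made-up variant name), and Verdict and is_console_log are hypothetical. Since we only have syntax to go on, the honest answer for forms like console[logVar] or getConsole().log is "can't tell", and the sketch does not try to detect a shadowed console either.

```rust
// Cut-down stand-in for illustration; StringLiteral in particular is a
// hypothetical variant name, not one taken from RESSA.
#[derive(Debug)]
enum Expression {
    Ident(String),
    StringLiteral(String),
    Member {
        object: Box<Expression>,
        property: Box<Expression>,
        computed: bool,
    },
    Call {
        callee: Box<Expression>,
        arguments: Vec<Expression>,
    },
}

/// What the syntax alone lets us conclude.
#[derive(Debug, PartialEq)]
enum Verdict {
    Yes,
    No,
    Unknown,
}

/// Decide whether `expr` is a spelling of `console.log`, using syntax only.
fn is_console_log(expr: &Expression) -> Verdict {
    // Anything that is not a member access cannot be console.log.
    let (object, property, computed) = match expr {
        Expression::Member { object, property, computed } => {
            (&**object, &**property, *computed)
        }
        _ => return Verdict::No,
    };
    match object {
        Expression::Ident(name) if name == "console" => {}
        Expression::Ident(_) => return Verdict::No,
        // e.g. getConsole(): the value is only known at runtime.
        _ => return Verdict::Unknown,
    }
    match (computed, property) {
        // console.log
        (false, Expression::Ident(name)) if name == "log" => Verdict::Yes,
        // console['log']
        (true, Expression::StringLiteral(value)) if value == "log" => Verdict::Yes,
        // console.error, console['error']
        (false, _) | (true, Expression::StringLiteral(_)) => Verdict::No,
        // console[logVar], console[logFunc()], ...
        (true, _) => Verdict::Unknown,
    }
}
```

The compiler checks that every combination is covered by some arm, which is the "inherently structured" part: adding a new variant to Expression forces each exhaustive match like the last one to account for it.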

At this point I would normally start a little example development tool and walk through how it works; however, that would make this already long-winded and dense post significantly longer. In the coming weeks I am hoping to create an mdbook with one such example.