Writing a Groovy Parser in Go


Hold tight for another post explaining why I did this. I’ll focus here on the technical challenges.

Porting Groovy’s parser to Go was a challenging project. The result is about 18,000 lines of code, located at https://github.com/RefTrace/RefTrace/tree/main/parser.

The stages of a compiler are (roughly):

  1. Lexing - turn strings into tokens
  2. Concrete Syntax Tree (CST) generation - take those tokens and give them structure
  3. Abstract Syntax Tree (AST) generation - take the CST and imbue it with meaning
  4. Take the AST, do optimizations, and generate code
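The stages above chain together into a pipeline. As a minimal sketch, with placeholder types standing in for the real ones (these names are illustrative, not RefTrace's actual API):

```go
package main

import "fmt"

// Placeholder types for each stage's output. The real parser's types are far
// richer; this only shows the shape of the pipeline.
type Token struct{ Type, Text string }

type CSTNode struct {
    Rule     string
    Children []*CSTNode
}

type ASTNode struct{ Kind string }

// Each stage consumes the previous stage's output.
func lex(src string) []Token         { return []Token{{"DEF", "def"}} }
func buildCST(toks []Token) *CSTNode { return &CSTNode{Rule: "compilationUnit"} }
func buildAST(cst *CSTNode) *ASTNode { return &ASTNode{Kind: "ModuleNode"} }

func main() {
    ast := buildAST(buildCST(lex(`def name = "Alice"`)))
    fmt.Println(ast.Kind)
}
```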

Lexing

Lexing takes a completely unstructured string and turns it into a list of tokens.

Groovy, a dynamic JVM language, has the concept of “GStrings”. A GString is a string that allows for variable substitution. For example:

def name = "Alice"
def message = "Hello, ${name}!"
println(message)

message is a GString. Feeding the three-line snippet above into the lexer produces this output, one line of tokens per line of input:

(DEF)(Identifier)(ASSIGN)(StringLiteral)(NL)
(DEF)(Identifier)(ASSIGN)(GStringBegin)(LBRACE)(Identifier)(RBRACE)(GStringEnd)(NL)
(Identifier)(LPAREN)(Identifier)(RPAREN)(EOF)

The point of lexing is that it shrinks the universe of input possibilities for the next stage: we know we’re dealing only with a pre-defined set of tokens, and we know something about their order.
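To make this concrete, here is a toy hand-rolled lexer in Go that handles just enough to tokenize `def name = "Alice"`. The token names mirror the output shown above, but the implementation is a simplified sketch, not Groovy's real lexer:

```go
package main

import (
    "fmt"
    "strings"
    "unicode"
)

// Token kinds for a toy subset of the lexer output shown above.
type TokenType string

const (
    DEF           TokenType = "DEF"
    Identifier    TokenType = "Identifier"
    ASSIGN        TokenType = "ASSIGN"
    StringLiteral TokenType = "StringLiteral"
)

type Token struct {
    Type TokenType
    Text string
}

// lex turns a single line like `def name = "Alice"` into tokens.
func lex(input string) []Token {
    var tokens []Token
    i := 0
    for i < len(input) {
        c := rune(input[i])
        switch {
        case unicode.IsSpace(c):
            i++
        case c == '=':
            tokens = append(tokens, Token{ASSIGN, "="})
            i++
        case c == '"':
            j := strings.IndexByte(input[i+1:], '"')
            if j < 0 {
                // A real lexer (like Groovy's) raises a syntax error here.
                panic("unterminated string literal")
            }
            tokens = append(tokens, Token{StringLiteral, input[i : i+j+2]})
            i += j + 2
        case unicode.IsLetter(c):
            j := i
            for j < len(input) && unicode.IsLetter(rune(input[j])) {
                j++
            }
            word := input[i:j]
            if word == "def" {
                tokens = append(tokens, Token{DEF, word})
            } else {
                tokens = append(tokens, Token{Identifier, word})
            }
            i = j
        default:
            i++
        }
    }
    return tokens
}

func main() {
    for _, t := range lex(`def name = "Alice"`) {
        fmt.Printf("(%s)", t.Type)
    }
    fmt.Println()
}
```

Running this prints `(DEF)(Identifier)(ASSIGN)(StringLiteral)`, matching the first line of lexer output above (minus the trailing newline token).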

Here’s a bad GString:

def name = "Alice"
def message = "Hello, ${name"
println(message)

The GString is not properly closed with a }. Feeding the above input into Groovy’s lexer throws a GroovySyntaxError. The lexer “knows” it can’t possibly build a valid sequence of tokens from this input.

Concrete Syntax Tree

CST generation takes a sequence of tokens and structures them as a tree.
Not all valid token sequences are valid trees. def name "Alice" lexes to (DEF)(Identifier)(StringLiteral) but does not parse because of the missing =. Groovy’s parser interprets def name "Alice" as def name with an extraneous "Alice" hanging off the end.

Here’s the (simplified) tree that would be produced by def name = "Alice":

variableDeclarator
├── variableDeclaratorId
│   └── identifier
│       └── name
└── variableInitializer
    └── stringLiteral
        └── "Alice"

That tree is produced by this parsing rule:

variableDeclarator
    :   variableDeclaratorId (nls ASSIGN nls variableInitializer)?
    ;
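To show what that rule does procedurally, here is a hand-rolled recursive-descent version of variableDeclarator over a toy token stream. The node and rule names follow the tree above, but everything else is an illustrative sketch, not Groovy's actual parser:

```go
package main

import "fmt"

// Token is a toy token, matching the lexer output discussed earlier.
type Token struct{ Type, Text string }

// Node is a bare-bones CST node: a rule name plus children, or a leaf value.
type Node struct {
    Name     string
    Children []*Node
    Leaf     string
}

type parser struct {
    toks []Token
    pos  int
}

func (p *parser) peek() Token { return p.toks[p.pos] }
func (p *parser) next() Token { t := p.toks[p.pos]; p.pos++; return t }

// variableDeclarator : variableDeclaratorId (ASSIGN variableInitializer)? ;
// (newline handling from the real grammar's nls is omitted)
func (p *parser) variableDeclarator() *Node {
    n := &Node{Name: "variableDeclarator"}
    n.Children = append(n.Children, p.variableDeclaratorId())
    if p.pos < len(p.toks) && p.peek().Type == "ASSIGN" {
        p.next() // consume '='
        n.Children = append(n.Children, p.variableInitializer())
    }
    return n
}

func (p *parser) variableDeclaratorId() *Node {
    id := p.next() // expect Identifier
    return &Node{Name: "variableDeclaratorId",
        Children: []*Node{{Name: "identifier", Leaf: id.Text}}}
}

func (p *parser) variableInitializer() *Node {
    lit := p.next() // expect StringLiteral
    return &Node{Name: "variableInitializer",
        Children: []*Node{{Name: "stringLiteral", Leaf: lit.Text}}}
}

func main() {
    // Tokens for: name = "Alice" (DEF is consumed by an outer rule)
    p := &parser{toks: []Token{
        {"Identifier", "name"},
        {"ASSIGN", "="},
        {"StringLiteral", `"Alice"`},
    }}
    tree := p.variableDeclarator()
    fmt.Println(tree.Name, "->", tree.Children[0].Children[0].Leaf,
        "=", tree.Children[1].Children[0].Leaf)
}
```

The optional `(... )?` group in the grammar becomes a plain `if` on the lookahead token, which is the essence of what ANTLR generates for you.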

ANTLR Target Troubles

The above parsing rule is written in ANTLR. ANTLR is a parser generator that has its own grammar.

The idea is you specify your lexing and CST-generation logic in ANTLR’s grammar and ANTLR generates a parser in your favorite target language.

The reality is more muddled.

ANTLR grammars allow for embedded pieces of user code. This feature is used to model states not easily expressed in the grammar itself. For instance:

private boolean isInsideParens() {  
    Paren paren = parenStack.peek();  
  
    // We just care about "(", "[" and "?[", inside which the new lines will be ignored.  
    // Notice: the new lines between "{" and "}" can not be ignored.  
    if (null == paren) {  
        return false;  
    }  
  
    String text = paren.getText();  
  
    return ("(".equals(text) && TRY != paren.getLastTokenType()) // we don't treat try-paren(i.e. try (....)) as parenthesis  
                || "[".equals(text) || "?[".equals(text);  
}

The problem is that this user code is written in a particular target language. I had to translate these code snippets (“actions” and “semantic predicates” in ANTLR parlance) from Java to Go:

func (l *GroovyLexer) isInsideParens() bool {  
    if len(l.parenStack) == 0 {  
        return false  
    }  
    paren := l.parenStack[len(l.parenStack)-1]  
    text := paren.text  
    return (text == "(" && paren.lastTokenType != TRY) || text == "[" || text == "?["  
}

A particular ANTLR grammar may not be language-independent.

Translating Groovy’s ANTLR code snippets was tedious. But more difficult was dealing with Groovy’s use of ANTLR’s Base Context feature, which affects the later CST -> AST conversion stage. Base Context appears to still be waiting to be merged into the official ANTLR build, and I’m honestly not sure how Groovy is using it. The ANTLR Go target doesn’t support it at all.

Go Target Gotchas

First off, the ANTLR Go runtime and code generation are cool. My project wouldn’t have worked without them.

But I did have to work around bugs with the code generation. Some of the bugs were specific to my grammar! For instance, this patch was necessary to correctly handle null pointers:

oldLine := "return !isFollowingArgumentsOrClosure(localctx.(*CommandExpressionContext).Get_expression())"
newLines := `if cmdExprCtx, ok := localctx.(*CommandExpressionContext); ok {
return !isFollowingArgumentsOrClosure(cmdExprCtx.Get_expression())
}
return !isFollowingArgumentsOrClosure(localctx)`

I also had to modify the runtime to expose some additional information. For example, to know when the lexer hit the end of a file:

func (b *BaseLexer) GetHitEOF() bool {
    return b.hitEOF
}
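A sketch of how that accessor gets used: loop over tokens until the lexer reports it has hit EOF. The BaseLexer here is a small stub standing in for the ANTLR runtime type, just to show the shape of the loop:

```go
package main

import "fmt"

// BaseLexer is a stub standing in for ANTLR's runtime type; the real one
// tracks hitEOF internally and, with the patch above, exposes it via GetHitEOF.
type BaseLexer struct {
    tokens []string
    pos    int
    hitEOF bool
}

// GetHitEOF mirrors the accessor added to the runtime.
func (b *BaseLexer) GetHitEOF() bool { return b.hitEOF }

// NextToken returns the next token, flipping hitEOF once input is exhausted.
func (b *BaseLexer) NextToken() string {
    if b.pos >= len(b.tokens) {
        b.hitEOF = true
        return "EOF"
    }
    t := b.tokens[b.pos]
    b.pos++
    return t
}

func main() {
    l := &BaseLexer{tokens: []string{"DEF", "Identifier", "ASSIGN", "StringLiteral"}}
    for {
        tok := l.NextToken()
        if l.GetHitEOF() {
            break
        }
        fmt.Println(tok)
    }
}
```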

ASTempest

Porting Groovy’s AST generation from Java to Go was the hardest part of the project. The entrypoint to this process is about 5,000 lines. AST generation depends on the result from CST generation. For example:

ClassNode variableType = this.visitType(ctx.type());  
ctx.variableDeclarators().putNodeMetaData(VARIABLE_DECLARATION_VARIABLE_TYPE, variableType);

This is part of the code that constructs a Groovy variable. It does some primitive figuring out of the variable’s type and stores it as metadata. The code depends on ctx.type() and ctx.variableDeclarators() which are defined in the ANTLR CST grammar.
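In the Go port, putNodeMetaData amounts to hanging a key/value map off each node. A minimal sketch of that mechanism (the names here are illustrative stand-ins for the real types, not RefTrace's actual API):

```go
package main

import "fmt"

// variableDeclarationVariableType stands in for the metadata key used above;
// the constant name is illustrative.
const variableDeclarationVariableType = "VARIABLE_DECLARATION_VARIABLE_TYPE"

// ClassNode is a stand-in for Groovy's AST type node.
type ClassNode struct{ Name string }

// NodeMetaDataHandler sketches how arbitrary metadata can be attached to a
// CST node, mimicking Groovy's putNodeMetaData/getNodeMetaData pair.
type NodeMetaDataHandler struct {
    metaData map[string]interface{}
}

func (n *NodeMetaDataHandler) PutNodeMetaData(key string, value interface{}) {
    if n.metaData == nil {
        n.metaData = make(map[string]interface{})
    }
    n.metaData[key] = value
}

func (n *NodeMetaDataHandler) GetNodeMetaData(key string) interface{} {
    return n.metaData[key]
}

func main() {
    declarators := &NodeMetaDataHandler{}
    variableType := &ClassNode{Name: "java.lang.String"}
    declarators.PutNodeMetaData(variableDeclarationVariableType, variableType)

    got := declarators.GetNodeMetaData(variableDeclarationVariableType).(*ClassNode)
    fmt.Println(got.Name)
}
```

The AST stage later reads that metadata back off the CST node, which is why the CST types and the AST logic have to line up exactly.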

Modifying the ANTLR grammars to work with the Go target while simultaneously getting the AST logic to line up was challenging.

Conclusion

ANTLR’s goal is noble - write parsing logic once and run everywhere. It’s obviously a cool project and useful tool.

However, this experience makes me think it may often be easier to hand-write a lexer and a parser.

I am very glad I did not try to hand-write a lexer and parser for Groovy. But for a new programming language, hand-writing is a reasonable choice; Rust's compiler, for example, uses a hand-written parser.

Working around ANTLR often left me yearning for a more focused implementation.