Counting nested braces inside comments in Xtext

Oct 24, 2011

So, I finished one playthrough of Dark Souls. Since I’ve exceeded the monthly download limit imposed on my net connection, the speed has been throttled to a quarter of normal, i.e. 256 Kbps, which works out to 20-30 KB/s when browsing and is apparently insufficient for the multiplayer component of Dark Souls. Having experienced the multiplayer, it’s hard to play through the game a second time offline, so I’m back to my Eclipse plugin development for now.

So, I described how I added support for cpptext blocks in a previous post. Well, as it turns out, that only fixed the one particular cpptext block I was looking at. It (quite obviously) doesn’t work for cpptext blocks like the following:

cpptext {
    void Something() { }
    int Gotcha;
}

because after the closing brace of Something(), it looks for either an opening brace or a closing brace, and finds neither.

What I really need here is a way of counting braces. I looked through the Xtext docs and couldn’t find any way of doing this, so I asked around on the forums. Turns out there’s no way of doing this using only Xtext, and the only way I could do it was by either

  1. Writing my own lexer by hand, and using it instead of the generated one, or
  2. Overriding the rule-call that matches cpptext blocks, and manually updating the rule-number every time I change my grammar

Since I’m nowhere near experienced enough to write my own lexer, I went with option 2. Let me explain what that involves (bear in mind that this explanation is based on what I intuitively believe is happening behind the scenes, not on hard facts).

Xtext generates a Java lexer which contains functions that try to match blocks of text in your language model to each rule defined in your grammar. Instead of trying to name each rule, the generated file simply has numbered functions mapped to each rule, so each time you add/remove rules from your grammar, the corresponding numbers might change. In my case, the CPP_TEXT rule is currently numbered 158, and calls the generated mRULE_CPP_TEXT() function. Unfortunately, this function is declared as final, so I can’t override it in my subclass, and am forced to override the calling function, which is mTokens().

Hence, I have the following piece of code in my subclass UnrealscriptLexer that extends the generated InternalUnrealscriptLexer:

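@Override
public void mTokens() throws RecognitionException {
    // 158 is the number currently assigned to the CPP_TEXT rule; it has to be
    // updated by hand whenever the grammar changes.
    if (my_dfa19.predict(input) == 158)
        Custom_mRULE_CPP_TEXT();
    else
        super.mTokens();
}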
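
To show the shape of this workaround on its own, here is a minimal plain-Java sketch with no Xtext or ANTLR dependency. Everything in it is a hypothetical stand-in: GeneratedLexerSketch, CustomLexerSketch, their methods and the fake predict() logic are invented for illustration, and only the number 158 comes from my grammar. It only demonstrates that when the rule method is final, the override has to happen one level up, in the method that dispatches to it.

```java
// Hypothetical stand-in for the generated lexer; not real Xtext/ANTLR code.
class GeneratedLexerSketch {
    static final int CPP_TEXT_ALT = 158; // rule number taken from the post

    // Declared final, like the generated rule method, so it cannot be overridden.
    final String matchCppText() {
        return "generated CPP_TEXT rule";
    }

    // Fake prediction: the real lexer uses a generated DFA here.
    int predict(String input) {
        return input.startsWith("cpptext") ? CPP_TEXT_ALT : 1;
    }

    // The dispatching method is not final, so a subclass can intercept it.
    String nextToken(String input) {
        if (predict(input) == CPP_TEXT_ALT) {
            return matchCppText();
        }
        return "some other rule";
    }
}

class CustomLexerSketch extends GeneratedLexerSketch {
    String customMatchCppText() {
        return "custom CPP_TEXT rule";
    }

    @Override
    String nextToken(String input) {
        // Intercept only the one rule we care about and delegate everything else.
        if (predict(input) == CPP_TEXT_ALT) {
            return customMatchCppText();
        }
        return super.nextToken(input);
    }

    public static void main(String[] args) {
        GeneratedLexerSketch lexer = new CustomLexerSketch();
        System.out.println(lexer.nextToken("cpptext { }"));
        System.out.println(lexer.nextToken("var int Gotcha;"));
    }
}
```

The downside is the same as in the real code: the check depends on a number that the generator is free to reassign.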

The Custom_mRULE_CPP_TEXT() function keeps an integer count of how many opening braces it has encountered that are still unclosed, and allows the rule to match the final closing brace only when this count is zero. The code is as follows:

public final void Custom_mRULE_CPP_TEXT() throws RecognitionException {
    try {
        int _type = RULE_CPP_TEXT;
        int _channel = DEFAULT_TOKEN_CHANNEL;
        {
            int alt2 = 2;
            int LA2_0 = input.LA(1);
            if ( (LA2_0=='c') ) {
                alt2 = 1;
            }
            else if ( (LA2_0=='s') ) {
                alt2 = 2;
            }
            else {
                NoViableAltException nvae = new NoViableAltException("", 2, 0, input);
                throw nvae;
            }

            switch (alt2) {
                case 1 :
                    match("cpptext");
                    break;
                case 2 :
                    match("structcpptext");
                    break;
            }

            loop3:
            do {
                int alt3 = 2;
                alt3 = my_dfa3.predict(input);

                switch (alt3) {
                    case 1 :
                        matchAny();
                        break;
                    default :
                        break loop3;
                }
            } while (true);

            match('{');

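            // Number of nested '{' seen so far that have not been closed yet.
            int open_braces = 0;
            loop4:
            do {
                // 1 = consume this character, 2 = stop before the final '}'
                int alt4 = 2;
                int LA4_0 = input.LA(1);
                if (LA4_0 == '}') {
                    if (open_braces > 0) {
                        // This '}' closes a nested block, so it is part of the body.
                        open_braces -= 1;
                        alt4 = 1;
                    }
                    else {
                        // Nothing nested is open: this '}' ends the cpptext block.
                        alt4 = 2;
                    }
                }
                else if (LA4_0 == '{') {
                    // Entering a nested block.
                    open_braces += 1;
                    alt4 = 1;
                }
                else if ( ((LA4_0>='\u0000' && LA4_0<='|') ||
                           (LA4_0>='~' && LA4_0<='\uFFFF')) ) {
                    // Any other character is part of the body.
                    alt4 = 1;
                }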

                switch (alt4) {
                    case 1 :
                        matchAny();
                        break;
                    default :
                        break loop4;
                }
            } while (true);

            match('}');
        }

        state.type = _type;
        state.channel = _channel;
    }
    finally { }
}
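
Stripped of the ANTLR plumbing, the counting loop above boils down to something like the following. This is only a sketch for illustration: BraceCounter and findClosingBrace are hypothetical names that do not exist in Xtext or ANTLR, and the code works on a plain String instead of the lexer's input stream.

```java
// Hypothetical helper, not part of Xtext or ANTLR: the counting loop in plain Java.
final class BraceCounter {

    /**
     * Returns the index of the '}' that closes the block whose opening '{'
     * is at openIndex, or -1 if the block is never closed.
     */
    static int findClosingBrace(String text, int openIndex) {
        if (openIndex < 0 || openIndex >= text.length() || text.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("openIndex must point at '{'");
        }
        int openBraces = 0; // nested '{' seen after the outer one and not yet closed
        for (int i = openIndex + 1; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                openBraces++;
            } else if (c == '}') {
                if (openBraces == 0) {
                    return i; // nothing nested is open, so this closes the outer block
                }
                openBraces--;
            }
        }
        return -1; // ran out of input before the outer block was closed
    }

    public static void main(String[] args) {
        String src = "cpptext {\n    void Something() { }\n    int Gotcha;\n}\nvar int After;";
        int close = findClosingBrace(src, src.indexOf('{'));
        // Prints the whole cpptext block, including the nested Something() body.
        System.out.println(src.substring(0, close + 1));
    }
}
```

Like the lexer rule, this counts every brace it sees, so a brace inside a C++ string literal or comment would throw the count off.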

And to make Xtext use my custom lexer, I had to use the following hooks in my RuntimeModule class:

public Class<? extends org.eclipse.xtext.parser.antlr.Lexer> bindLexer() {
    return com.wirywolf.parser.lexer.UnrealscriptLexer.class;
}

public void configureRuntimeLexer(com.google.inject.Binder binder) {
    binder.bind(org.eclipse.xtext.parser.antlr.Lexer.class)
          .annotatedWith(com.google.inject.name.Names
          .named(org.eclipse.xtext.parser.antlr.LexerBindings.RUNTIME))
          .to(com.wirywolf.parser.lexer.UnrealscriptLexer.class);
}