Monday, January 24, 2011

Parsing very simple expressions in Java

You might say parsing such a simple expression

function(param1, param2)
(and the like) is easy even if you want escaping of special characters and quoted parts. At least that's what I said.

Obvious approaches include using standard Java classes
  1. StringTokenizer
  2. Scanner
  3. Regular Expressions (sometimes coming in disguise as direct String methods like String.split)
  4. Use a lexer generator like http://www.cs.princeton.edu/~appel/modern/java/JLex/ or http://jflex.de/
  5. Using a full parer generator like http://www.antlr.org/
  6. doing this by hand using low level String functions, e.g. indexOf and substring
  7. remember your C days and still do this by hand, but rather by iterating through the input character by character
What did I do?
  • Options (1) and (2) do not work as they are not powerful enough, especially when it comes to escaping. Besides they are slow
  • Option (3) might acutally work, but besides being slow, dealing with complex regular expressions and the Java API for it is something I can not recommend. Even after reading and groking this tutorial would you want to write and maintain regular expressions? Try it. I don't.
  • (4) and (5): Having the full burden of a generated lexer plus runtime libraries or even a full generated parser is hardly worth the effort considering the simplicity of the expressions, even though both approaches would solve the problem.
Which leaves me doing it by hand - which I wound up doing. First I persued option (6) which did the job, but resulted in code very hard to read and very hard to maintain.

I then turned to option (7) and things became both easy and fast. I even came up with a very simple utility class that not only handles the above case, but other similar inputs. This is a test case for that utility class

@Test
public void scanAndSplit() throws Exception {
String input = "function(param1, param2)";
List segments = MiniParser.defaultInstance().scan(input, "(", ")");
assertEquals(2, segments.size());

String functionName = segments.get(0);
String parameterString = segments.get(1);
assertEquals("function", functionName);
assertEquals("param1, param2", parameterString);
List parameters = MiniParser.trimmedInstance().split(parameterString, ',');
assertEquals(2, parameters.size());
assertEquals("param1", parameters.get(0));
assertEquals("param2", parameters.get(1));
}

Java byte code generation

To speed up template processing for our template engine I recently decided to optionally compile each template into byte code.

Looking for a library to help me I chose ASM, because it
  1. is small with no dependencies
  2. has been released lately
  3. has decent documentation
  4. has a tool (Eclipse plugin) that generates ASM API calls from existing byte code
Especially no 4 inspired me to do the following:
  1. Create a fistful of hand crafted Java classes that do the same as some selected templates
  2. Create ASM code that generates the same byte code as those Java classes by the tool described in no 4
  3. Create a recursive descent compiler that calls the generalized ASM code derived from step 2 that generates the desired byte code
  4. Load and cache the classes at runtime
It turned out to be a pretty viable approach and allowed me to finish my task without too much knowledge of Java byte code. And what I had to learn anyway was quite enlightening :)

Even though a lot of code remains to be dynamic, because I do not know what will be in the model at run time, the caliper micro benchmarks comparing uncompiled reference code against the compiled version shows a 2.5x - 10x speedup:


benchmark ns logarithmic runtime
SimpleExpressionReference 3367 ================
SimpleExpressionCompiled 963 ==
ComplexExpressionReference 6721 =======================
ComplexExpressionCompiled 1331 ======
If 7067 =======================
IfCompiled 805 =
Foreach 12796 =============================
ForeachCompiled 5275 ====================