Huge file support and the JavaCC21 preprocessor

From what I can tell, if you have hugeFileSupport enabled for a parser, you can’t use the preprocessor, since it currently relies on FileLineMap functionality. Support for other languages may well create a greater need for the preprocessor than before. How important is hugeFileSupport? It seems that even if we did away with it altogether (relying only on FileLineMap and retiring TokenBuilder) one could still process sources with tens of thousands of lines on modern machines; do we really need to support files larger than that? Have there been any practical use cases that preclude a FileLineMap-only approach?

Well, first of all, you may be misunderstanding the situation slightly. (Not sure.) It’s true that if you set HUGE_FILE_SUPPORT=true in the settings of a grammar for whatever language/format, the generated parser does not use FileLineMap, so you can’t leverage the existing preprocessor in that language; yes, the preprocessor does assume that we’re using FileLineMap. However, you can still use the preprocessor in your grammar file for that language, since the relevant parser there is JavaCCParser, which does use FileLineMap.

Or, to put it more concretely, if you put HUGE_FILE_SUPPORT=true at the top of the Python.javacc file, Python.javacc itself can still use the preprocessor, since Python.javacc is not written in Python, it’s written in JavaCC! But yes, in that case, the generated Python parser/lexer does not use FileLineMap. Actually, the more practical implication is that certain newer features would not work, such as the more advanced error recovery, since those features tend to assume that you have the full input file in memory and can backtrack or scan forward as needed (i.e. that HUGE_FILE_SUPPORT is false).
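
For concreteness, here is a rough sketch of what the top of such a grammar file might look like. This is illustrative only: the preprocessor symbol experimental and the included file name are made up, and I’m assuming the usual JavaCC21 conventions of key=value settings at the top of the grammar and C#-style #if/#endif directives.

    // Settings at the top of Python.javacc. HUGE_FILE_SUPPORT affects the
    // generated Python parser/lexer, not JavaCCParser itself.
    HUGE_FILE_SUPPORT=true;
    PARSER_PACKAGE="org.parsers.python";

    // These directives are handled by JavaCCParser, which reads this grammar
    // via FileLineMap, so they still work despite the setting above.
    #if experimental
    INCLUDE "ExperimentalGrammar.inc"
    #endif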

But finally, the thing is that hugeFileSupport would only be used in some very rare cases, because generally it’s simpler just to read the whole file into memory. For one thing, once you say that the typical usage of the tool is to build a tree, then you have everything in memory anyway, so effectively hugeFileSupport just makes no sense! But even leaving that point aside, I think that in the modern computing world, in which a megabyte of RAM costs less than a penny, the whole thing can only very rarely be worth the bother. Unless you have some outrageously huge input, you just slurp it all in and hold it in memory.
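
To make the “slurp it all in” point concrete, here is a minimal sketch in plain Java (the file name is just a placeholder); reading even a very large source file is a one-liner and a single cheap allocation:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Slurp {
        public static void main(String[] args) throws IOException {
            // Read the entire input into memory in one shot.
            String source = Files.readString(Path.of("SomeHugeInput.py"));
            // A file of tens of thousands of lines is nothing on a machine
            // with gigabytes of RAM.
            System.out.println("Read " + source.length() + " UTF-16 code units");
        }
    }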

As you probably saw, over a year ago I wrote a blog post entitled “The Gigabyte is the new Megabyte”. In the legacy JavaCC codebase, there’s all this code written on the assumption that 4 KB, say, is a lot of RAM, and that it’s worth writing some very convoluted code to conserve amounts of memory like that. And then you realize that even a Hello, World program in Java requires something like 5 megabytes to run, since it instantiates a new JVM that loads all the core classes and…

Well, I guess to answer your question, it’s very dubious that hugeFileSupport is really worth the effort of maintaining going forward. I’m not even sure it isn’t broken in some key ways right now, since there has been so much refactoring of the code and I don’t have any systematic testing in place to make sure it’s still working correctly!

Oh, one thing I did not point out about this: if you use HUGE_FILE_SUPPORT=true, there is (currently) no support for 32-bit Unicode, because it reverts to some older code that has no awareness of extended Unicode characters, i.e. high/low surrogate pairs and all that. Emojis and so on. Also, as I guess I said before, all the fault-tolerant machinery assumes that we have random access to the entirety of the input, so that is incompatible with HUGE_FILE_SUPPORT too.
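
For anyone who hasn’t run into the surrogate-pair issue, here is a small self-contained Java illustration of why code that thinks purely in 16-bit chars miscounts characters outside the Basic Multilingual Plane:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 (a grinning-face emoji) is stored in a Java String
            // as a high/low surrogate pair: two 16-bit code units.
            String grin = "\uD83D\uDE00";
            System.out.println(grin.length());                            // 2 code units
            System.out.println(grin.codePointCount(0, grin.length()));    // 1 code point
            System.out.println(Integer.toHexString(grin.codePointAt(0))); // "1f600"
        }
    }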

Well, I mention this mostly because it’s not documented anywhere else, so I’m noting it here for now. That, and also to give some sign of life…

I think that, moving forward, our stance will likely be that setting HUGE_FILE_SUPPORT to true is not really supported or encouraged. There is no particular reason to rip out the code, though. It also occurs to me that if somebody does have the use case and is willing to pay money to get HUGE_FILE_SUPPORT working with 32-bit Unicode, then… But really, in general, my honest sense is that if all the older stuff that evolved from SimpleCharStream and so on were just ripped out, nobody would miss it!