Some question about adding new functionality

I am trying to add the handling of C/C++ continuation lines in the lexer.
The idea is that you should have something like:

LINE_CONTINUATION_MODE=CRELAX;
which should add the needed code.
You can see my first try here: Added support for LINE_CONTINUATION_MODE · Gravelbones/javacc21@b3dc892 · GitHub
But it seems that the value from the input file doesn’t get through.
How should I get to the value of the above line from the template?

Also I need to add new imports when creating the parser and the lexer. It doesn’t seem to be possible at the moment. Any suggestions on how I should do that?

Frankly, I don’t think there’s any need to patch the actual lexer.java.ftl template to get the behavior you want. Probably you just need to tweak the token definitions in Preprocessor.javacc to handle the case of a backslash followed by a newline. It needs to skip over those. I mean a newline on its own ends a line, but a newline preceded by a backslash doesn’t. It might stretch your regexp skills a bit… (My own regexp skills are nothing to write home about, but I only need to muck with this kind of thing infrequently and once it is working, it usually stays working.)

As for injecting new imports into the parser and lexer, that should not be much of a problem.

INJECT PARSER_CLASS : import foo.bar.*; 
INJECT LEXER_CLASS : import foo.bar.baz;

I think the above is okay.

You know, generally speaking, a good way of getting a feel of the feature set is just to use the full grammars of Java and Python as sort of models. examples/java/Java.javacc and examples/python/Python.javacc basically.

The continuation line can be anywhere in the code, literally.

I have done some extensive testing and found all of these examples to be valid.

#define VER\
BOSE "verbose" - Token is VERBOSE
'\\
n' - Is a newline character, not an escape of the backslash
5 =\
= 5 - Returns boolean of true

So doing it in the lexer means you would have to handle it in every token that spans more than one character.
That would make most tokens extremely complex. So in my opinion it’s simply not feasible to do the work in the lexer.
True, no sane programmer would do any of this, but a program that generates the include file, like flex :), could do it.
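To make the three examples above concrete, here is a minimal Java sketch of strict splicing done before any tokenizing. `ContinuationSplice` and `splice` are made-up names for illustration, not anything in JavaCC 21, and the sketch assumes line endings have already been normalized to `\n`.

```java
// Hypothetical helper, not part of JavaCC 21: strict continuation-line splicing.
class ContinuationSplice {

    /** Removes every backslash that is immediately followed by '\n', together with that '\n'. */
    static String splice(String content) {
        StringBuilder out = new StringBuilder(content.length());
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            if (c == '\\' && i + 1 < content.length() && content.charAt(i + 1) == '\n') {
                i += 2; // drop the backslash and the newline, insert nothing in their place
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(splice("#define VER\\\nBOSE \"verbose\"")); // identifier split in two
        System.out.println(splice("'\\\\\nn'"));                       // character literal split in two
        System.out.println(splice("5 =\\\n= 5"));                      // operator split in two
    }
}
```

After this pass the three inputs read `#define VERBOSE "verbose"`, `'\n'` and `5 == 5`, so no token definition has to know about continuation lines.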

Without documentation it is hard to work out that each import must be done with its own single INJECT statement ;), but it is working. Anything in a { } block seems to be put into the class body.

It seems the value from the input file was missing in the template because my branch was not created on top of the other branch. Now that the order is correct, I get the correct included code.
Sorry for the inconvenience.

OK, I guess I misunderstood. I thought that the continuation sequence (backslash + newline) only applied in the preprocessor. Also, I am surprised that the continuation character can actually split a token! That is rather funky. When would anybody ever want to do that?

Now, first of all, are we sure that this is the spec, or is this just the way it’s implemented somewhere or other? Particularly the thing where this splits a token is very strange. Somebody actually wrote a specification in which you can split a token this way? What is the use case for that?

Yes, to implement what you describe probably is impractical in the lexical grammar and actually, I think offhand that one would best handle it in the core pre-lexically, in roughly the same spot that one handles PRESERVE_NEW_LINES and TABS_TO_SPACES. I guess offhand it would be a question of patching the mungeContent routine here: javacc21/Lexer.java.ftl at master · javacc21/javacc21 · GitHub

Or possibly you have a separate routine that runs over the content and flags the backslash+newline as something to be ignored. But, in any case, if you want to implement the behavior you describe, then it’s not practical to do it in the lexical grammar probably, and I think you want to handle it pre-lexically.
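A rough Java sketch of the "separate routine that flags" variant, assuming a `BitSet` is used to mark offsets the lexer should step over. `IgnoredChars` and its methods are hypothetical names, not existing JavaCC 21 code. The content itself is left untouched, so original offsets and line numbers stay valid.

```java
import java.util.BitSet;

// Hypothetical sketch: mark backslash+newline pairs as ignored instead of deleting them.
class IgnoredChars {

    /** Returns a bit set with a bit set for every backslash directly followed by '\n', and for that '\n'. */
    static BitSet flagContinuations(CharSequence content) {
        BitSet ignored = new BitSet(content.length());
        for (int i = 0; i + 1 < content.length(); i++) {
            if (content.charAt(i) == '\\' && content.charAt(i + 1) == '\n') {
                ignored.set(i, i + 2); // flags offsets i and i+1
                i++;                   // the newline is already handled
            }
        }
        return ignored;
    }

    /** Offset of the first unflagged character at or after 'from', or -1 if there is none. */
    static int nextUnignored(BitSet ignored, int from, int length) {
        int next = ignored.nextClearBit(from);
        return next < length ? next : -1;
    }

    /** The text as the lexer would see it, built from the unflagged characters only. */
    static String visibleText(CharSequence content, BitSet ignored) {
        StringBuilder out = new StringBuilder();
        for (int i = nextUnignored(ignored, 0, content.length()); i >= 0;
                 i = nextUnignored(ignored, i + 1, content.length())) {
            out.append(content.charAt(i));
        }
        return out.toString();
    }
}
```

The catch with this design is that anything reading a token image straight from the underlying buffer would also have to skip the flagged offsets.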

Actually, to answer my own question of when you would want to split a token that way, okay, I guess in a string literal you might. You might want to write:

  foo = "blah blah \
        blah blah";

But splitting an identifier that way is just crazy. And, regardless, in Java at least (and surely C# too) you can write:

  foo = "blah blah"
           + "     blah blah";

And since JDK 15, there are multiline string literals (text blocks) with triple quotes.

I have been working on the full C/C++ preprocessor to see whether I could find any more issues.
As for what the standard really says, you have to pay to get that answer.
The wikipedia article: C17 (C standard revision) - Wikipedia
The standard page at ISO: ISO - ISO/IEC 9899:2018 - Information technology — Programming languages — C

GCC says this: The C Preprocessor

A continued line is a line which ends with a backslash, \. The backslash is removed and the following line is joined with the current one. No space is inserted, so you may split a line anywhere, even in the middle of a word. (It is generally more readable to split lines only at white space.)

And

If there is white space between a backslash and the end of a line, that is still a continued line. However, as this is usually the result of an editing mistake, and many compilers will not accept it as a continued line, GCC will warn you about it.
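The difference between the two behaviours GCC describes can be sketched in Java as a single routine with a flag. This is only an illustration under my own assumptions: `RelaxedContinuation`, `splice` and the `relaxed` parameter are invented names, "white space" is taken to mean spaces and tabs, and line endings are assumed to be plain `\n`.

```java
// Hypothetical sketch of strict vs. relaxed continuation handling.
class RelaxedContinuation {

    /**
     * Strict mode: a backslash must be directly followed by '\n' to join lines.
     * Relaxed mode: spaces and tabs between the backslash and '\n' are tolerated and removed too.
     */
    static String splice(String content, boolean relaxed) {
        StringBuilder out = new StringBuilder(content.length());
        int n = content.length();
        int i = 0;
        while (i < n) {
            char c = content.charAt(i);
            if (c == '\\') {
                int j = i + 1;
                if (relaxed) {
                    // look past trailing blanks to see whether a newline follows
                    while (j < n && (content.charAt(j) == ' ' || content.charAt(j) == '\t')) {
                        j++;
                    }
                }
                if (j < n && content.charAt(j) == '\n') {
                    i = j + 1; // skip backslash, any blanks, and the newline
                    continue;
                }
            }
            out.append(c); // ordinary character, or a backslash that does not continue the line
            i++;
        }
        return out.toString();
    }
}
```

A caller that wants GCC-like diagnostics could compare the strict and relaxed results and warn when they differ.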

And while a lot of new features give other options, the old way has not been removed. And I will be handling files from the 1990’s and maybe even the 1980’s.

I have looked at moving the processing to mungeContent, but that won’t work fully because the line count will be wrong if any linefeeds are removed in mungeContent. createLineOffsetsTable needs to see them first. So either there needs to be a second mungeContent or this should be done in readChar.

But it is true that only the “strict” version should be in readChar. mungeContent can handle removing any extra whitespace. So I have changed my implementation to use a boolean option, C_CONTINUATION_LINE.

For some reason there seem to be a bunch of line-ending fixes in my commit, and git rebase refuses to let me edit the commit message.

Forgot to reference this one too: Phases of translation - cppreference.com

I have deleted my previous commit as it had a bunch of problems.
Because getImage() by default gets its information from the input, I do have to munge the input by removing the \ and the newline character before readChar sees them.
But rather than trying to pass information from mungeContent to createLineOffsetsTable, I have joined the two functions, since they are done together anyway.
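Roughly, joining the two steps could look like the Java sketch below. This is not the code from the commit; `SplicedContent`, `text`, `lineOffsets` and `lineOf` are names invented here, and line endings are assumed to be plain `\n`. The point is that the offsets table is built while the backslash+newline pairs are being dropped, so it still counts the original lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splice continuation lines and record original line starts in one pass.
class SplicedContent {
    final String text;       // content with backslash+newline pairs removed
    final int[] lineOffsets; // lineOffsets[k] = offset in text where original line k+1 starts

    SplicedContent(String original) {
        StringBuilder out = new StringBuilder(original.length());
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // line 1 starts at offset 0
        int n = original.length();
        int i = 0;
        while (i < n) {
            char c = original.charAt(i);
            if (c == '\\' && i + 1 < n && original.charAt(i + 1) == '\n') {
                i += 2;                   // drop both characters...
                starts.add(out.length()); // ...but still record that a new original line begins here
            } else {
                out.append(c);
                i++;
                if (c == '\n') {
                    starts.add(out.length()); // ordinary line break
                }
            }
        }
        text = out.toString();
        lineOffsets = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    /** 1-based original line number of the character at 'offset' in text. */
    int lineOf(int offset) {
        // find the last line start that is <= offset (entries may repeat, so search for the upper bound)
        int lo = 0, hi = lineOffsets.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (lineOffsets[mid] <= offset) lo = mid; else hi = mid - 1;
        }
        return lo + 1;
    }
}
```

With the input `#define VER\` / `BOSE 1` / `int x;`, the token image read from `text` is the whole `VERBOSE`, while `lineOf` still places `BOSE` on line 2 and `int` on line 3.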