Project 3 — Lexer (Java)

CS-364: Programming Languages

Fall 2024

Programming assignments 3 through 5 will direct you to design and build an interpreter for snail. Each assignment will cover one component of the interpreter: lexical analysis, parsing, and operational semantics. Each assignment will ultimately result in a working interpreter phase which can interface with the other phases.

For this assignment you will write a lexical analyzer, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for snail in an appropriate input format and the analyzer generator will generate actual code. You will then write additional code to serialize the tokens for use by later interpreter stages.

You will be implementing this stage of the interpreter in the Java programming language. You may work in a team of two people for this assignment.

Requirements

You must create a minimum of four artifacts:

  1. A Java program that reads from standard input (System.in) and prints to standard output (System.out). The input will be UTF-8 source code for a snail program. Your program must either indicate that there is an error in the input (e.g., a malformed string) or emit an SL-Lex-formatted serialized list of snail tokens. Your program's main lexer component must be constructed by a lexical analyzer generator. The "glue code" for serializing tokens should be written by hand. If your Java project compiles to lexer.jar, the following two commands should produce the same output (for a valid snail program):
                      
                        java -jar lexer.jar < file.sl > file.sl-lex
                        snail --lex file.sl
                      
                    
    Note that > performs output redirection to save standard output to a file. Your program will consist of a number of Java files.
  2. A plain UTF-8 text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A few paragraphs should suffice.
  3. A plain UTF-8 text file called references.txt providing a citation for each resource you used (excluding class notes, and assigned readings) to complete the assignment. For example, if you found a Stack Overflow answer helpful, provide a link to it. Additionally, provide a brief description of how the resource helped you.
  4. A suite of test cases (test-1-XXX.sl through test-N-XXX.sl>), which exercise lexing corner cases. Your test cases will be evaluated against buggy implementations of the snail lexer.
Warning! You must use JFlex. Do not write your entire lexer by hand. Parts of it must be tool-generated from regular expressions you provide.

Specification

Your lexer should conform to the lexical analysis and SL-LEX specification provided by the snail documentation.

Error Reporting

To report an error, write the string

ERROR: line_number:column_number: Lexer: message

to standard output (System.out) and terminate the program with exit code 0. You may write whatever you want in the message, but it should be fairly indicative.

Your program should print all tokens that were sucessfully discovered before the error.

Example erroneous input:
Backslash not allowed \
Example error report output:
ERROR: 1:23: Lexer: invalid character: '\'

The SL-Lex Format

If there are no errors in file.sl your program should print a list of scanned tokens in SL-Lex format to standard output (System.out). This is the same format used by the reference compiler to create a file.sl-lex when invoked with the --lex flag. You can find a detailed specification of this format here.

Note that the reference implementation (along with your upcoming P4 parser!) will read SL-Lex input instead of .sl files.

Lexical Analyzer Generators

You must use a lexical analyzer generator for this assignment. In class, we discuss JFlex, a lexical analyzer generator for Java. You will find the documentation to be particularly helpful. Because you are producing a Java program for this assignment, it is required that you use JFlex.

Downloading JFlex

JFlex is already provided for you in the CS-364 devcontainer environment. You should be able to use the provided ant build file to compile your lexer.

If you wish to work outside of the development environment, a copy of JFlex is available here. To work with the provided ant build file, you should place this jar in the lib directory at the top level of your project.

Test Suite Development

In this project, we introduce a form of test-driven development or mutation testing into our software development process and require you to construct a high-quality test suite.

The goal is to leave you with a high-quality test suite of programs that you can use to evaluate your own P3 Lexer. While you you can check for correct "positive" behavior by comparing your lexer's output to the reference interpreters's output on the usual "good" snail programs, it is comparatively harder to check for "corner case" behavior.

SLUGS contains 9 variants of the Java-based lexer, each with a secret intentionally-introduced defect related to Lexing. A high-quality test suite is one that reveals each introduced defect by showing a difference between the behavior of the true reference interpreter and the corresponding buggy verison. You desire a high-quality test suite to help you gain confidence in your own P3 submission.

For each bug, if one of your tests causes the reference and the buggy version to produce different output (that is different stdout), you win: that test has revealed that bug. For full credit your tests must reveal all 9 unknown defects.

SLUGS will tell you the correct output for test cases you submit that reveal bugs in your own implementation of P3. This is the same information you can determine by comparing your output with that of the reference compiler.

Tests should include a snail program named test-n-XXX.sl (XXX can be anything, but must be the same for both files) where 1 ≤ n ≤ 20.

Your test files may contain no more than 2048 characters in any one file (including comments and whitspace). You may submit up to 20 tests (though it is possible to get full credit with fewer). Note that the tests the autograder runs on your solution are NOT limited to 2048 characters in a file, so your solution should not impose any size limits (as long as sufficient system memory is available).

Commentary

You can do basic testing with something like the following:

              
                snail --lex file.sl 
                ant 
                java -jar lexer.jar < file.sl > generated.sl-lex
                diff -b -B -E -w file.sl-lex generated.sl-lex
              
            

Since P3 prints SL-Lex to standard output (rather than writing it to a file), we use a redirect to save this output to a file (> generated.sl-lex). Further, diff is a command line tool for comparing the contents of two files. You may also find Diffchecker as well as VSCode's built-in comparison or IntelliJ's file comparison to be helpful.

You may find the reference interpreter's --unlex option useful for debugging your .sl-lex files.

JFlex handles counting lines a little different from the reference interpreter (i.e., more characters count as newlines than the specification allows). To handle this, you can remove the %line directive from your JFlex specification and instead modify yyline direclty in rules that match newlines.

You will benefit from using lexer states in this project. This will likely require reading the JFlex documentation to understand how they work.

Need more testcases? Any snail file you have (including the one you wrote for P2) works fine. The snail-examples repository should also be a good start. There's also one among the P2 resources. You'll want to make more complicated test cases—in particular, you'll want to make negative testcases (e.g., testcases with malformed string constants).

If you are still stuck, you can post on the forum or approach the professor.

Video Guides

Video guides are provided to help you get started with various aspects of this project. Note that there may be errors and mistakes in these videos. You should not blindly follow the steps taken in these videos!

Handling Nested Comments in JFlex

What to Submit

You must turn in a zip file (or folder layout) containing these files:

  • readme.txt: your README file
  • references.txt: your file of citations
  • build.xml: your ant build file for this project
  • source_files:
    • main.java (contains the main method)
    • some_file.flex (or similar)
    • All other files containing code needed for your program to run
    • These source files must include a comment with the project identifier: 3aeee2ffb6bdcec698011572b6bbcaf180807419
  • test_files:
    • Up to 20 test files named test-N-XXX.sl where N is a number in the range 1 through 20 and XXX is any descriptive text without spaces
    • Each testcase you submit must be at most 2048 characters (i.e., wc -c yourtest.sl says 2048 or less). You want each of your testcases to be meaningful so that it helps you narrow down a bug later.
Warning! If your regular expressions and lexer definition are in some other files (e.g., Lexer.flex), be sure to include those as well!
Danger! Do not submit any pre-compiled jar files with your submisison. This includes, but is not limited to, your lexer, JFlex, etc.

The following directory layout is recommended for your zip file. This is the default layout generated by ant (see below) should you following the same project structure from the Lexer example in class.

              
                unzip -l fullsubmit.zip
                (out)src/Lexer.flex
                (out)src/Main.java
                (out)build.xml
                (out)readme.txt
                (out)references.txt
                (out)test-1-xxx.sl
                (out)test-2-something.sl
              
            

Using the Ant Build File

Danger! The ant build file has not been updated yet to generate zip files! Check back soon.

Ant is a build system for Java (much like make or dune). You may download a (mostly) pre-configured build file (build.xml) here. Add this file to the root of your java project.

Update the following entries in build.xml:

  • The main-class property (line 8) to be the name of the class in your project containing main()
  • The flex-file property (line 9) to be the name of your JFlex specification
  • Note: the project-name must remain lexer for SLUGS to correctly compile and run your code.

There are three (3) key operations that build.xml supports:

  • ant dist: compile your project (this is the same as just running ant)
  • ant clean: remove all generated Java files and compiled Java classes
  • ant fullsubmit: generate the zip submission archive. Note that ant will print out the list of files that are found in this archive. Review this list to be sure you are submitting what you think you are submitting!

Working in Pairs

You may complete this assignment in a team of two. Teamwork imposes burdens of communication and coordination, but has the benefits of more thoughtful designs and cleaner programs. Team programming is also the norm in the professional world.

Students on a team are expected to participate equally in the effort and to be thoroughly familiar with all aspects of the joint work. Both members bear full responsibility for the completion of assignments. Partners turn in one solution for each programming assignment; each member receives the same grade for the assignment. If a partnership is not going well, the instructor will help to negotiate new partnerships. Teams may not be dissolved in the middle of an assignment.

If you are working in a team, one team member should updload a single submission to Gradescope and add both members to the submission. There is an option to add members to a group once you upload a submission. Both students in the team will receive the same grade for their submission.

Grading Rubric

P3 Grading (out of 60 points):

  • 43 points: autograder tests
  • 9 points: test files (1 point per bug revealed)
  • 4 points: clear description in your README and References
    • 4 — thorough discussion of design decisions and choice of test cases; a few paragraphs of coherent English sentences should be fine. Citations provided are properly formatted.
    • 2 — vague or hard to understand; omits important details. Citations provided are properly formatted.
    • 0 — little to no effort. Citations do not provide correct information.
  • 4 points: code cleanliness
    • 4 — code is mostly clean and well-commented
    • 2 — code is sloppy and/or poorly commented in places
    • 0 — little to no effort to organize and document code
  • -5 points: neglecting to include the lexer definition
    • 0 — included lexer definition in submission (e.g., Lexer.flex file defining lexer)
    • -5 — only submitted machine-generated lexer; failed to submit definition from which lexer was generated