Project 4 — Debugging Automation

CS-340: Software Engineering

Summer 2021

For Project 4, you will develop a debugging support tool that reduces the effort associated with software maintenance. Then, you will compare your implementation with a production tool.

Reducing the size of a test input can significantly improve a developer's fault localization efficiency. In this assignment, you will implement the Delta Debugging algorithm to minimize "interesting" sets. Then, you will define an "interesting" function that helps you identify patches that cause compilation to fail. Finally, you will use a production Delta Debugging implementation to minimize a test suite for libpng.

[Image: XKCD comic 1319. Comic credit: Randall Munroe]

Requirements

This project is not-so-secretly three mini projects with common themes. Thus, we will define the requirements for each part separately. Note that 4B depends (in some sense) on 4A; however, you might be able to use 4C in place of 4A!

  • Project 4A:
    • delta.py: a Python 3 implementation of the basic Delta Debugging Algorithm.
  • Project 4B:
    • is-failure-inducing-change.sh: a shell (Bash) script for testing patches to the Wireworld program.
  • Project 4C:
    • A PDF report detailing your experiences with debugging automation.

4A: Delta Debugging

You must write a Python 3 program, delta.py, that implements delta debugging to find a one-minimal interesting subset of a given set. Your program takes two arguments:

  1. The size n of the set to be minimized — for example, if n is 5, the program will find an interesting subset of {0, 1, 2, 3, 4}
  2. A command that determines if a given subset is interesting — for example, if the command is is-interesting.sh, you may invoke is-interesting.sh 0 2 4 to probe if the subset {0, 2, 4} is interesting.
    • The command returns the exit code 1 if the subset is interesting and 0 otherwise
    • Note that the command may be "bash ./is-interesting.sh" (or something similarly long) on Gradescope for security/permission reasons. Your program should correctly handle the case where the single command string passed in contains spaces by treating it as one command; do not break it up.
    • Do not add your own "bash" or "./" via string concatenation; simply use the command you are given and append the necessary arguments (see the sketch below).
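
As a non-authoritative illustration, one way to probe the "interesting" command from Python is to hand the entire command string, plus the candidate subset, to the shell. The helper function below is an assumption for illustration only, not part of the specification:

                import subprocess

                # Hypothetical helper: returns True if `subset` is interesting.
                def is_interesting(command, subset):
                    # Use the command string exactly as given (it may contain
                    # spaces, e.g. "bash ./is-interesting.sh") and append the
                    # subset elements as arguments.
                    full = command + " " + " ".join(str(x) for x in subset)
                    result = subprocess.run(full, shell=True,
                                            stdout=subprocess.DEVNULL,
                                            stderr=subprocess.DEVNULL)
                    return result.returncode == 1  # exit code 1 means "interesting"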

Your program should print a one-minimal interesting subset in standard Python list format in sorted order. This is the only output your program should produce on stdout. Submitting a program with debugging output may cause your grade to be reduced.

Your implementation of delta.py will be evaluated both for correctness and efficiency (number of probes to the interesting command).

Example Usage

A standard use of Delta Debugging is to minimize failure-inducing inputs — inputs which cause a program to crash. For example, consider this (inefficient) implementation of is-interesting.sh that checks to see if its command-line arguments contain both 3 and 6:

              
                #!/bin/bash
                FIRST=0
                SECOND=0
                for i in $* # $* is a list of all the arguments
                do
                  if [ $i -eq 3 ]
                  then 
                    FIRST=1
                  fi
                  if [ $i -eq 6 ]
                  then 
                    SECOND=1
                  fi
                done
                if [ $FIRST -eq 1 ]
                then
                  if [ $SECOND -eq 1 ] 
                  then
                    exit 1 # interesting
                  fi
                fi
                exit 0
              
              
            

Note that running this script will produce an exit code of 1 if both 3 and 6 are present as arguments. Otherwise, the exit code will be 0.

              
              ./is-interesting.sh 1 2 3
              echo $?
              (out)0
              ./is-interesting.sh 1 2 3 4 5 6 7 8
              echo $?
              (out)1
              
            

To find a one-minimal interesting subset of {0, 1, 2, 3, 4, 5, 6, 7, 8}, we can run:

              
                python3 delta.py 9 ./is-interesting.sh
                (out)[3, 6]
              
            

The above example shows the exact output that your program is required to produce (the single line with a list of elements in sorted order). With additional debugging output (i.e., a version that the autograder does not accept), we can trace the delta debugging algorithm:

              
                python3 delta.py --verbose 9 ./is-interesting.sh 
                (out)DEBUG:delta.py:Command is: ./is-interesting.sh
                (out)DEBUG:delta.py:dd([0, 1, 2, 3, 4, 5, 6, 7, 8], required=[])
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 0 1 2 3`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 4 5 6 7 8`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:dd([0, 1, 2, 3], required=[4, 5, 6, 7, 8])
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 0 1 4 5 6 7 8`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 2 3 4 5 6 7 8`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([2, 3], required=[4, 5, 6, 7, 8])
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 2 4 5 6 7 8`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 3 4 5 6 7 8`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([3], required=[4, 5, 6, 7, 8])
                (out)DEBUG:delta.py:dd([4, 5, 6, 7, 8], required=[0, 1, 2, 3])
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 4 5 0 1 2 3`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 6 7 8 0 1 2 3`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([6, 7, 8], required=[0, 1, 2, 3])
                (out)DEBUG:delta.py:Executing `./is-interesting.sh 6 0 1 2 3`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([6], required=[0, 1, 2, 3])
                (out)DEBUG:delta.py:Called `./is-interesting.sh` 9 times
                (out)[3, 6]
              
            

In this example we see that Delta Debugging made 9 probes to is-interesting.sh to find a one-minimal subset (compared to the 2^9 = 512 probes required to exhaustively check every subset). Delta Debugging is usually very efficient.

4B: Minimizing Failure-Inducing Changes

 Download wireworld-example.zip

To use Delta Debugging to minimize a different type of set, such as a set of failure-inducing changes, it suffices to change your is-interesting.sh script.

Consider the program wireworld.c from Rosetta Code.

Write is-failure-inducing-change.sh (a bash shell script) that determines whether the given patches (named patch.i, where i is an integer argument to the script), when applied to wireworld-original.c, result in a GCC compilation error. Your script is responsible for restoring wireworld-original.c to its pristine state after each run. You will likely have to read up on how the Unix patch utility works.

The inputs, outputs, and arguments to is-failure-inducing-change.sh are exactly the same as those of is-interesting.sh from 4A; a rough sketch of one possible structure is shown below.
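
The following is a minimal sketch of one possible structure, assuming a simple backup-and-restore approach; the backup file name is an arbitrary choice, and you remain responsible for any clean-up your own approach requires:

                #!/bin/bash
                # Sketch only: apply the requested patches, test compilation,
                # then restore wireworld-original.c to its pristine state.

                cp wireworld-original.c wireworld-backup.c   # arbitrary backup name

                # Each argument is a patch number i; apply patch.i in turn.
                for i in "$@"
                do
                  patch -p0 wireworld-original.c < "patch.$i" > /dev/null
                done

                # "Interesting" means the patched program fails to compile.
                if gcc -c wireworld-original.c 2> /dev/null
                then
                  RESULT=0   # compiles cleanly: not interesting
                else
                  RESULT=1   # compilation error: interesting
                fi

                # Restore the pristine original and clean up any artifacts.
                mv wireworld-backup.c wireworld-original.c
                rm -f wireworld-original.o wireworld-original.c.rej
                exit $RESULT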

Example Usage

As an example, you are provided with an original version of wireworld.c as well as 10 patches:

              
                unzip -l wireworld-example.zip
                (out)Archive:  wireworld-example.zip
                (out)  Length      Date    Time    Name
                (out)---------  ---------- -----   ----
                (out)       21  03-03-2018 12:31   patch.0
                (out)       35  03-03-2018 12:31   patch.1
                (out)       58  03-03-2018 12:31   patch.2
                (out)       58  03-03-2018 12:31   patch.3
                (out)       76  03-03-2018 12:31   patch.4
                (out)      235  03-03-2018 12:32   patch.5
                (out)      112  03-03-2018 12:33   patch.6
                (out)       65  03-03-2018 12:32   patch.7
                (out)       32  03-03-2018 12:32   patch.8
                (out)       40  03-03-2018 12:32   patch.9
                (out)     1393  03-03-2018 12:33   wireworld-final.c
                (out)     1314  03-03-2018 12:33   wireworld-original.c
                (out)---------                     -------
                (out)     3439                     12 files
                
                cat patch.0
                (out)6a7
                (out)> this looks bad
                
                cat patch.2
                (out)11c13
                (out)<     "+-----------+\n"
                (out)---
                (out)>     "+===========+\n"
                
                cat patch.4
                (out)25c27
                (out)<   for (i = 0; i < w*h; i++) {
                (out)---
                (out)>   for (i = 0+0; i < w*h; i++) {
              
            

We can observe that the original version compiles correctly, but application of all the patches produces compilation failures:

              
                gcc -c wireworld-original.c
                echo $?
                (out)0
                
                cat patch.* | patch -p0 wireworld-original.c
                (out)patching file wireworld-original.c
                
                gcc wireworld-original.c
                (out)wireworld-original.c: In function 'next_world':
                (out)wireworld-original.c:36:11: error: 'j' undeclared (first use in this function)
                (out)   36 |       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
                (out)      |           ^
                (out)wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears in
                echo $?
(out)1
              
            

Given the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:

              
                python3 delta.py --verbose $(ls patch.* | wc -l) ./is-failure-inducing-change.sh
                (out)DEBUG:delta.py:Command is: ./is-failure-inducing-change.sh
                (out)DEBUG:delta.py:dd([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], required=[])
                (out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 0 1 2 3 4`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5 6 7 8 9`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([5, 6, 7, 8, 9], required=[])
                (out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5 6`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([5, 6], required=[])
                (out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5`
                (out)DEBUG:delta.py:Return code: 0
                (out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 6`
                (out)DEBUG:delta.py:Return code: 1
                (out)DEBUG:delta.py:dd([6], required=[])
                (out)DEBUG:delta.py:Called `./is-failure-inducing-change.sh` 5 times
                (out)[6]
                
                cat patch.6
                (out)34c36
                (out)<       out[i] = (hc == 1 || hc == 2) ? 'H' : '.';
                (out)---
                (out)>       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
              
            

The wireworld example shown above is available for local testing.

4C: Minimizing Test Suites

 Download libpng-1.6.34.tar.gz  Download large-png-suite.zip

We can also use Delta Debugging to find one-minimal subsets of test suites that retain the same coverage (or other behavior). We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since you likely have experience manually minimizing test suites for it from a previous project.

You should use Delta Debugging with the goal of finding a one-minimal subset of these 1639 test cases such that the subset has the same line coverage as the complete set, as reported by gcovr (just as in Project 3). Read the PDF report guidelines carefully before beginning.

WARNING! Be sure to read carefully so you understand what is needed for the report.

We could attempt to use our own delta.py implementation for this task; however, let's take this opportunity to use a production implementation of the Delta Debugging algorithm with more features: Picire.

Unlike our version of Delta Debugging, Picire reads a file that contains the set to be minimized. Similarly, the "interesting" script is given the path to a file containing the current set of elements to test. Further, Picire inverts the exit codes (0 is now interesting).

You will need to write a script that:

  1. reads the file provided by Picire and runs each image listed in it as an input to pngtest.
  2. reads the coverage information reported by gcovr.
  3. determines whether the line coverage has decreased.
WARNING! Be sure to run Picire inside of a tmux session so that you can disconnect if needed without killing the process.

Example Usage

To get started, we will need to download the suite of test images. To make sure everything is working, let's check the coverage of the full suite.

              
                cd /path/to/libpng
                # make sure libpng is compiled for coverage before proceeding
                
                rm -rf *.gcda pngout.png
                
                curl -Ok https://myslu.stlawu.edu/~kangstadt/teaching/summer21/340/files/p4/large-png-suite.zip
                unzip large-png-suite.zip
                
                for img in large-png-suite/*.png
                do
                  ./pngtest "$img"
                done
                
                gcovr -s -e contrib -e intel -e mips -e powerpc -r .
                (out)------------------------------------------------------------------------------
                (out)               GCC Code Coverage Report
                (out)Directory: .
                (out)------------------------------------------------------------------------------
                (out)File                                       Lines    Exec  Cover   Missing
                (out)------------------------------------------------------------------------------
                (out)         ...
                (out)------------------------------------------------------------------------------
                (out)TOTAL                                      11219    4094    36%
                (out)------------------------------------------------------------------------------
                (out)lines: 36.5% (4094 out of 11219)
                (out)branches: 30.2% (2325 out of 7687)
              
            

Picire requires that we provide the "set" to minimize as a single file. By default, it minimizes the "lines" in that file. Therefore, we can list each test (i.e., image) on a separate line in our test file. Picire will algorithmically remove lines (i.e., image paths) from this file and test for interestingness.

              
                for img in *.png
                do
                  echo "$(pwd)/$img" >> tests.txt
                done
              
            

Next, let's install Picire and run a sample minimization test. We'll just pick a handful of images and say that any subset containing all of them is "interesting". We'll call this script interesting_sanity.sh:

              
                #!/bin/bash
                
                # $1 is the "test" file
                # we will look for 10.png, 24.png, and 1024.png
                
                ONE=0
                TWO=0
                THREE=0
                
                for f in $(cat $1)
                do
                  if [ $(basename $f) = "10.png" ]
                  then
                    ONE=1
                  fi
                  
                  if [ $(basename $f) = "24.png" ]
                  then
                    TWO=1
                  fi
                  
                  if [ $(basename $f) = "1024.png" ]
                  then
                    THREE=1
                  fi
                done
                
                if [ $ONE -eq 1 ] && [ $TWO -eq 1 ] && [ $THREE -eq 1 ]
                then
                  exit 0 # interesting <= note this is for picire
                else
                  exit 1
                fi
              
            

After installing Picire, we see that it successfully uses our sanity script to find a subset of images:

              
                # make sure that pip is installed
                sudo apt update
                sudo apt install python3-pip
                sudo pip3 install --upgrade pip
                
                sudo pip3 install picire
                
                ./interesting_sanity.sh path/to/libpng-1.6.34/large-png-suite/tests.txt
                echo $?
                (out)0
                
                # test with no images
                ./interesting_sanity.sh /dev/null 
                echo $?
                (out)1
                
                picire --input path/to/libpng-1.6.34/large-png-suite/tests.txt --test interesting_sanity.sh 
                (out)Reduce session starts for /path/to/libpng-1.6.34/large-png-suite/tests.txt
                (out)        atom: line
                (out)        cache_class: picire.outcome_cache.ConfigCache
                (out)        cleanup: True
                (out)        encoding: ascii
                (out)        input: /path/to/libpng-1.6.34/large-png-suite/tests.txt
                (out)        out: /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE
                (out)        reduce_class: picire.light_dd.LightDD
                (out)        reduce_config:
                (out)                complement_iterator: builtins.range
                (out)                split: picire.config_splitters.ZellerSplit(n=2)
                (out)                subset_first: True
                (out)                subset_iterator: builtins.range
                (out)        tester_class: picire.subprocess_test.SubprocessTest
                (out)        tester_config:
                (out)                cleanup: True
                (out)                command_pattern: /path/to/tests/interesting_sanity.sh %s
                (out)                encoding: ascii
                (out)
                (out)Initial test contains 1639 lines
                (out)Run #0
                (out)        Config size: 1639
                (out)        Granularity: 2
                (out)        Reduced
                (out)  ...
                (out)Run #33
                (out)        Config size: 4
                (out)        Granularity: 4
                (out)        Reduced
                (out)Run #34
                (out)        Config size: 3
                (out)        Granularity: 3
                (out)        Done
                (out)Result is saved to /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE/tests.txt.
                
                cat /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE/tests.txt
                (out)/path/to/libpng-1.6.34/large-png-suite/1024.png
                (out)/path/to/libpng-1.6.34/large-png-suite/10.png
                (out)/path/to/libpng-1.6.34/large-png-suite/24.png
              
            

Now, you just need to write a suitable "interesting" script for this part of the project, but the setup should be similar.

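As a rough sketch only: the libpng path below is a placeholder, and the baseline of 4094 covered lines is taken from the full-suite run shown earlier (you should compute and justify your own baseline). One way to structure the script is:

                #!/bin/bash
                # Sketch: exit 0 ("interesting" to Picire) if the subset of
                # tests listed in $1 still reaches the full-suite line coverage.

                TESTS=$(readlink -f "$1")              # resolve before changing directory
                cd /path/to/libpng-1.6.34 || exit 1    # placeholder path

                # Start from a clean slate so only this subset's coverage is counted.
                rm -f *.gcda pngout.png

                # The test file contains one image path per line.
                while read -r img
                do
                  ./pngtest "$img" > /dev/null
                done < "$TESTS"

                # Pull the covered-line count out of gcovr's summary line,
                # e.g. "lines: 36.5% (4094 out of 11219)".
                COVERED=$(gcovr -s -e contrib -e intel -e mips -e powerpc -r . \
                          | grep '^lines:' | cut -d'(' -f2 | cut -d' ' -f1)

                # Remember: Picire inverts the convention (0 is interesting).
                if [ "$COVERED" -ge 4094 ]    # 4094 = baseline from the full suite
                then
                  exit 0
                else
                  exit 1
                fi
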
DANGER! Make sure you read the requirements for the report. You may need additional commands and scripting to successfully complete the report.

Written Report

You must write a multi-paragraph report for Project 4C. Your report should focus on using Picire to reduce the provided test suite. In particular, your report should address:

  • What was your high-level notion of interesting for this activity? How did you implement it?
  • What was the original coverage of the test suite? What were the size and coverage of your final result? Were you able to use Delta Debugging to produce a one-minimal subset of the test suite? Why or why not?
    • Note that either an affirmative or a negative answer may receive full credit. This depends on the logical argument you present and the mastery of the course material you demonstrate.
    • This is (intentionally) a challenging question.
    • Be sure to also address "how you know" whether your test suite is one-minimal.
    • WARNING! This question is NOT asking if you were able to run Picire to minimize the set of tests, but rather how successful Picire (Delta Debugging) was.
  • How many probes did Picire make to determine if sets were interesting? How many "runs" did Picire need to minimize the suite? What was the overall runtime? How does this compare to the expected asymptotic runtime ("Big-O") performance? How does this informally compare to a manual effort to minimize the test suite?
  • Are there other situations for which you would want to use Delta Debugging in the future? Are there past experiences (apart from Project 3) where Delta Debugging would have helped you? If you would not use Delta Debugging, why not?
    • Again, either an affirmative or a negative answer may receive full credit. Your logical argument is the critical factor.

Commentary

You might notice that a common theme in these projects is scripting and writing code to run other code. This is a common task for professional developers. One of our course goals is to help you increase your productivity through automation and scripting. One "meta" analysis you might want to conduct is looking for patterns in the scripts you have written throughout this course. This sort of introspection can help you recognize similar problems in the future and save you time.

Similarly, this course has higher reading and writing expectations than other computing courses. Again, this reflects the real world. If you are struggling to read documentation efficiently, ask for pointers! There's no expectation that you are a documentation expert without practice.

Testing

You might notice that the autograder does not provide much output for this project — this is intentional. We have spent a significant portion of the semester discussing quality assurance. You are expected to test your own code! Consider writing unit and integration tests to gain confidence that aspects of your implementation adhere to this specification.

What if Test Suite Minimization Takes Too Long?

Time management and planning again play an important role in this project. Running Picire for Project 4C is likely an "overnight" task, and that assumes your "interesting" script is coded correctly. The script you need to write for 4C is non-trivial; you might consider starting here or (at minimum) giving yourself a week to implement it correctly.

You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the asymptotic runtime complexity (Big-O) for Delta Debugging.
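
For example, you could time a single full-suite coverage run (the loop from the earlier example, run from the root of your libpng build) and multiply by a rough estimate of the number of probes Picire will make; the timings you observe will of course differ:

                # Time one full-suite coverage run (one "probe").
                time ( rm -f *.gcda pngout.png
                       for img in large-png-suite/*.png
                       do
                         ./pngtest "$img" > /dev/null
                       done
                       gcovr -s -e contrib -e intel -e mips -e powerpc -r . > /dev/null )
                # If one probe takes tens of minutes and Picire makes dozens of
                # probes, the whole reduction can easily become an overnight job.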

If you're feeling adventurous, you might also cache the coverage achieved by each individual test image and use this information in your "interesting" script (i.e., you can union the lines covered by test #1 and test #2 to determine the coverage of tests #1 and #2 together). This might allow you to run Picire in parallel mode, although it might not be worth the extra time.

If you fear you started too late to minimize the full test suite, you may use a subset of the test suite for partial credit in your report. That is, you may use the first 500 tests (for some value of 500 that allows you to complete your report in time). The smaller the subset you use, the lower the partial credit. Note that to do this you will have to change your definition of "full coverage" to match what is provided by those 500 tests. If you do this, your report must explicitly state the time taken to compute coverage on all 1639 test cases, the reduced number of tests you chose (e.g., 500), and the maximum total coverage associated with your test subset.
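
For example, assuming you keep the first 500 tests (500 is only the illustrative number used above), head can build the smaller test list:

                # Keep only the first 500 test images (illustrative number).
                head -n 500 tests.txt > tests-small.txt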

Helpful Python Modules

It is your responsibility to research helpful Python modules (and cite your resources, such as Stack Overflow); however, I recognize that it can be daunting to get started. Here are a few modules that might be helpful for this project:

Helpful Unix Utilities

Similarly, there are many ways to implement a shell script; however, you might find these utilities helpful:

  • bc
  • cut
  • date
  • head
  • patch
  • tail
  • time
  • tr

Supported Platforms

If you are having trouble using your cs-linuxlab server, I will help you get things working. I will not support any other platform.

What to Submit for P4

Be sure that you are submitting to the correct assignment on Gradescope! Further, be sure to read the output provided by Gradescope after your submission is run by the grading scripts. Don't wait until the last minute to submit.

For each of the sub-projects, submit the following:

  • Project 4A:
    • delta.py: your implementation of Delta Debugging
  • Project 4B:
    • is-failure-inducing-change.sh: your "is interesting" test for finding a failure-inducing patch on Wireworld.
  • Project 4C:
    • A PDF containing your written report and your references for ALL parts of this assignment.

You can submit these by dragging the files into the submission window on the Gradescope interface.

Grading Rubric

P4 Grading (out of 45 points):

  • 15 points: 4A Delta Debugging Implementation
    • 12 points — autograder tests
    • 3 points — code quality and commenting
  • 15 points: 4B Interesting Script
    • 10 points — autograder tests
    • 5 points — script quality and commenting
  • 15 points: 4C Report
    • 3 points — discussion of "interesting" script
    • 6 points — coverage information and discussion of one-minimality
    • 4 points — discussion of runtime
    • 2 points — concluding remarks on Delta Debugging
    • -1 — English prose or grammatical errors
    • -all points — submission does not provide a references or works cited section.

FAQ and Troubleshooting

This section details common pitfalls, questions, and resolutions.

  1. Q: Can I ask for clarification if I don't understand all of the commands we are running for this project?

    A: Absolutely! I'd be happy to help clarify what various commands do.

  2. Q: For P4B, when minimizing failure-inducing changes, my approach appears to work locally but not on the autograder.

    A: You might have better luck applying patches one by one (using a loop rather than using cat to apply them all at once).
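
    For instance, a loop along these lines (a sketch, not a requirement) applies the patches one at a time:

                      for i in "$@"
                      do
                        patch -p0 wireworld-original.c < "patch.$i"
                      done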

  3. Q: I'm seeing an error like

                      
                      python3 delta.py 10 ./is-interesting.sh
                      (out)1: ./is-interesting.sh: Permission denied
                      
                    

    A1: Does your script have execution permissions (ls -l is-interesting.sh and look for the x permission)? You might need to run chmod +x is-interesting.sh.

    A2: Make sure you're treating the command passed to you as a single command, even if it is a string that contains spaces. Try both /bin/bash ./is-interesting.sh and ./is-interesting.sh as commands; your program should work for both.

  4. Q: I'm seeing /bin/bash: /bin/bash: cannot execute binary file.

    A: You don't need to add an "extra" bash to the beginning of the command you run. Just run the command you are given.

  5. Q: I'm getting weird results for gcovr.

    A: Your best bet is to run gcovr from the root of the libpng folder. Perhaps consider changing to this directory as part of your script. Alternatively, make sure that your excluded paths are relative to the current directory.

  6. Q: For Project 4C, I'm getting assertion failures when I try to run Picire.

    A1: Picire does some sanity checks of your "interesting" script. One possibility is that your script is not answering sensibly for standard inputs (e.g., does your script exit correctly for the full test suite? How about for an empty list?)

    A2: Are you trying to run Picire in parallel mode? Since pngtest always outputs to pngout.png, what could happen if you try to run multiple copies of pngtest? Similarly, since coverage information is additive, what else could go wrong?

Acknowledgements

This assignment has been adapted, updated, and revised (with permission) from an assignment used in Wes Weimer's EECS 481 Software Engineering course at the University of Michigan [1].