For project 4, you develop a debugging support tool to reduce the effort associated with software maintenance. Then, you will compare your implementation with a production tool.
Reducing the size of a test input can singnificantly improve a
developer's fault localization efficiency. In this assignment,
you will implement the Delta Debugging algorithm to minimize
"interesting" sets. Then, you will define an "interesting"
function to aid you in identifying patches that cause
compilation to fail. Finally, you will use a production Delta
Debugging implementation to minimize a test suite for
libpng
.
This project is not-so-secretly three mini projects with common themes. Thus, we will define the requirements for each part separately. Note that 4B depends (in some sense) on 4A; however, you might be able to use 4C in place of 4A!
delta.py
: a Python 3 implementation of the basic
Delta Debugging Algorithm.
is-failure-inducing-change.sh
: a shell (Bash) script
for testing patches to the Wireworld program.
You must write a Python 3 program, delta.py
, that
implements delta debugging to find a one-minimal interesting
subset of a given set. Your program takes two arguments:
is-interesting.sh
, you
may invoke is-interesting.sh 0 2 4
to probe if the subset {0,
2, 4} is interesting.
"bash
./is-interesting.sh"
(or something similarly long) on Gradescope
for security/permission reasons; your program should
correctly handle the case where the single
command string passed in
contains spaces (by treating it as one command: do not break it up)
Your program should print a one-minimal interesting subset in
standard Python list format in sorted order. This is the only output your program should produce
on stdout
. Submitting a program with debugging output
may cause your grade to be reduced.
Your implementation of delta.py
will be evaluated both
for correctness and efficiency (number of probes to the interesting command).
A standard use of Delta Debugging is to minimize failure-inducing
inputs — inputs which cause a program to crash. For example,
consider this (inefficient) implementation of
is-interesting.sh
that checks to see if its command-line
arguments contain both 3 and 6:
#!/bin/bash
FIRST=0
SECOND=0
for i in $* # $* is a list of all the arguments
do
if [ $i -eq 3 ]
then
FIRST=1
fi
if [ $i -eq 6 ]
then
SECOND=1
fi
done
if [ $FIRST -eq 1 ]
then
if [ $SECOND -eq 1 ]
then
exit 1 # interesting
fi
fi
exit 0
Note that running this script will produce an exit code of 1 if both 3 and 6 are present as arguments. Otherwise, the exit code will be 0.
./is-interesting.sh 1 2 3
echo $?
(out)0
./is-interesting.sh 1 2 3 4 5 6 7 8
echo $?
(out)1
To find a one-minimal interesting subset of {0, 1, 2, 3, 4, 5, 6, 7, 8}, we can run:
python3 delta.py 9 ./is-interesting.sh
(out)[3, 6]
The above example shows the exact output that your program is required to produce (the single line with a list of elements in sorted order). With additional debugging output (i.e., a version that the autograder does not accept), we can trace the delta debugging algorithm:
python3 delta.py --verbose 9 ./is-interesting.sh
(out)DEBUG:delta.py:Command is: ./is-interesting.sh
(out)DEBUG:delta.py:dd([0, 1, 2, 3, 4, 5, 6, 7, 8], required=[])
(out)DEBUG:delta.py:Executing `./is-interesting.sh 0 1 2 3`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-interesting.sh 4 5 6 7 8`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:dd([0, 1, 2, 3], required=[4, 5, 6, 7, 8])
(out)DEBUG:delta.py:Executing `./is-interesting.sh 0 1 4 5 6 7 8`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-interesting.sh 2 3 4 5 6 7 8`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([2, 3], required=[4, 5, 6, 7, 8])
(out)DEBUG:delta.py:Executing `./is-interesting.sh 2 4 5 6 7 8`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-interesting.sh 3 4 5 6 7 8`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([3], required=[4, 5, 6, 7, 8])
(out)DEBUG:delta.py:dd([4, 5, 6, 7, 8], required=[0, 1, 2, 3])
(out)DEBUG:delta.py:Executing `./is-interesting.sh 4 5 0 1 2 3`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-interesting.sh 6 7 8 0 1 2 3`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([6, 7, 8], required=[0, 1, 2, 3])
(out)DEBUG:delta.py:Executing `./is-interesting.sh 6 0 1 2 3`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([6], required=[0, 1, 2, 3])
(out)DEBUG:delta.py:Called `./is-interesting.sh` 9 times
(out)[3, 6]
In this example we see that Delta Debugging made 9 probes to
is-interesting.sh
to find a one-minimal subset
(compared to 29 = 512 to exhaustively find a minimal
subset). Delta Debugging is usually very efficient.
Download wireworld-example.zip
To use Delta Debugging to minimize a different type of set, such
as a set of failure-inducing changes, it suffices to change your
is-interesting.sh
script.
Consider the program wireworld.c
from Rosetta Code.
Write is-failure-inducing-change.sh
(a bash shell
script) that determines if the given patches (named
patch.i
where i
is an integer argument
to the script, when applied to wireworld-original.c
,
result in a GCC compilation error. Your script is responsible for
restoring wireworld-original.c
to its pristine state
after each run. You will likely have to read up on how the Unix
patch
utility works.
The inputs, outputs and arguments to
is-failure-inducing-change.sh
are exactly the same as
is-interesting.sh
from 4A.
As an example, you are provided with an original version of
wireworld.c
as well as 10 patches:
unzip -l wireworld-example.zip
(out)Archive: wireworld-example.zip
(out) Length Date Time Name
(out)--------- ---------- ----- ----
(out) 21 03-03-2018 12:31 patch.0
(out) 35 03-03-2018 12:31 patch.1
(out) 58 03-03-2018 12:31 patch.2
(out) 58 03-03-2018 12:31 patch.3
(out) 76 03-03-2018 12:31 patch.4
(out) 235 03-03-2018 12:32 patch.5
(out) 112 03-03-2018 12:33 patch.6
(out) 65 03-03-2018 12:32 patch.7
(out) 32 03-03-2018 12:32 patch.8
(out) 40 03-03-2018 12:32 patch.9
(out) 1393 03-03-2018 12:33 wireworld-final.c
(out) 1314 03-03-2018 12:33 wireworld-original.c
(out)--------- -------
(out) 3439 12 files
cat patch.0
(out)6a7
(out)> this looks bad
cat patch.2
(out)11c13
(out)< "+-----------+\n"
(out)---
(out)> "+===========+\n"
cat patch.4
(out)25c27
(out)< for (i = 0; i < w*h; i++) {
(out)---
(out)> for (i = 0+0; i < w*h; i++) {
We can observe that the original version compiles correctly, but application of all the patches produces compilation failures:
gcc -c wireworld-original.c
echo $?
(out)0
cat patch.* | patch -p0 wireworld-original.c
(out)patching file wireworld-original.c
gcc wireworld-original.c
(out)wireworld-original.c: In function 'next_world':
(out)wireworld-original.c:36:11: error: 'j' undeclared (first use in this function)
(out) 36 | out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
(out) | ^
(out)wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears in
echo $?
(out) 1
Given the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:
python3 delta.py --verbose $(ls patch.* | wc -l) ./is-failure-inducing-change.sh
(out)DEBUG:delta.py:Command is: ./is-failure-inducing-change.sh
(out)DEBUG:delta.py:dd([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], required=[])
(out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 0 1 2 3 4`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5 6 7 8 9`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([5, 6, 7, 8, 9], required=[])
(out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5 6`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([5, 6], required=[])
(out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 5`
(out)DEBUG:delta.py:Return code: 0
(out)DEBUG:delta.py:Executing `./is-failure-inducing-change.sh 6`
(out)DEBUG:delta.py:Return code: 1
(out)DEBUG:delta.py:dd([6], required=[])
(out)DEBUG:delta.py:Called `./is-failure-inducing-change.sh` 5 times
(out)[6]
cat patch.6
(out)34c36
(out)< out[i] = (hc == 1 || hc == 2) ? 'H' : '.';
(out)---
(out)> out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
The wireworld example shown above is available for local testing.
Download libpng-1.6.34.tar.gz
Download large-png-suite.zip
We can also use Delta Debugging to find one-minimal subsets of test suites that retain the same coverage (or other behavior). We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since you likely have experience manually minimizing test suites for it in an previous project.
You should use Delta Debugging with the goal of finding a
one-minimal subset of these 1639 test cases such
that that subset has the same line coverage as the complete set
as reported by gcovr
(just as in Project 3).
Read the PDF report guidelines carefully before beginning.
We could attempt to use our own delta.py
implementation for this task; however, let's use this opportunity
to use a production version of the Delta Debugging
algorithm with more features, Picire
.
Unlike our version of Delta Debugging, Picire
reads a
file that contains the set to be minimized. Similarly, the
"interesting" script is given the path to a file containing the
current set of elements to test. Further, Picire
inverts the exit codes (0 is now interesting).
You will need to write a script that:
Picire
and run these files as inputs to pngtest
.
gcovr
.
tmux
session so that you can disconnect if
needed without killing the process.
To get started, we will need to download the suite of test images. To make sure everything is working, let's check the coverage of the full suite.
cd /path/to/libpng
# make sure libpng is compiled for coverage before proceeding
rm -rf *.gcda pngout.png
curl -Ok https://myslu.stlawu.edu/~kangstadt/teaching/spring23/340/files/p4/large-png-suite.zip
unzip large-png-suite.zip
for img in large-png-suite/*.png
do
./pngtest "$img"
done
gcovr -s -e contrib -e intel -e mips -e powerpc -e arm -r .
(out)------------------------------------------------------------------------------
(out) GCC Code Coverage Report
(out)Directory: .
(out)------------------------------------------------------------------------------
(out)File Lines Exec Cover Missing
(out)------------------------------------------------------------------------------
(out) ...
(out)------------------------------------------------------------------------------
(out)TOTAL 11219 4094 36%
(out)------------------------------------------------------------------------------
(out)lines: 36.5% (4094 out of 11219)
(out)branches: 30.2% (2325 out of 7687)
Picire
requires that we provide the "set" to minimize
as a single file. By default it will minimizes the "lines" in the
file. Therefore, we can list each test (i.e., image) on a separate
line in our test file. Picire
will algorithmically
remove lines (i.e., image paths) from this file and test for
interesting-ness.
for img in *.png
do
echo "$(pwd)/$img" >> tests.txt
done
Next, let's install Picire
and run a sample
minimization test. We'll just pick a handful of images and say
that this is an "interesting" subset. We'll call this
interesting_sanity.sh
:
#!/bin/bash
# $1 is the "test" file
# we will look for 10.png, 24.png, and 1024.png
ONE=0
TWO=0
THREE=0
for f in $(cat $1)
do
if [ $(basename $f) = "10.png" ]
then
ONE=1
fi
if [ $(basename $f) = "24.png" ]
then
TWO=1
fi
if [ $(basename $f) = "1024.png" ]
then
THREE=1
fi
done
if [ $ONE -eq 1 ] && [ $TWO -eq 1 ] && [ $THREE -eq 1 ]
then
exit 0 # interesting <= note this is for picire
else
exit 1
fi
After installing Picire
, we see that it successfully
uses our sanity script to find a subset of images:
# make sure that pip is installed
sudo apt update
sudo apt install python3-pip
python3 -m pip install --upgrade pip
python3 -m pip install picire
./interesting_sanity.sh path/to/libpng-1.6.34/large-png-suite/tests.txt
echo $?
(out)0
# test with no images
./interesting_sanity.sh /dev/null
echo $?
(out)1
picire --input path/to/libpng-1.6.34/large-png-suite/tests.txt --test interesting_sanity.sh
(out)Reduce session starts for /path/to/libpng-1.6.34/large-png-suite/tests.txt
(out) atom: line
(out) cache_class: picire.outcome_cache.ConfigCache
(out) cleanup: True
(out) encoding: ascii
(out) input: /path/to/libpng-1.6.34/large-png-suite/tests.txt
(out) out: /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE
(out) reduce_class: picire.light_dd.LightDD
(out) reduce_config:
(out) complement_iterator: builtins.range
(out) split: picire.config_splitters.ZellerSplit(n=2)
(out) subset_first: True
(out) subset_iterator: builtins.range
(out) tester_class: picire.subprocess_test.SubprocessTest
(out) tester_config:
(out) cleanup: True
(out) command_pattern: /path/to/tests/interesting_sanity.sh %s
(out) encoding: ascii
(out)
(out)Initial test contains 1639 lines
(out)Run #0
(out) Config size: 1639
(out) Granularity: 2
(out) Reduced
(out) ...
(out)Run #33
(out) Config size: 4
(out) Granularity: 4
(out) Reduced
(out)Run #34
(out) Config size: 3
(out) Granularity: 3
(out) Done
(out)Result is saved to /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE/tests.txt.
cat /path/to/libpng-1.6.34/large-png-suite/tests.txt.DATE/tests.txt
(out)/path/to/libpng-1.6.34/large-png-suite/1024.png
(out)/path/to/libpng-1.6.34/large-png-suite/10.png
(out)/path/to/libpng-1.6.34/large-png-suite/24.png
Now, you just need to write a suitable "interesting" script for this part of the project, but the setup should be similar.
You must write a multi-paragraph report for Project 4C. Your report
should focus on using Picire
to reduce the provided
test suite. In particular, your report should address:
Picire
to minimize the set of tests, but rather
how successful Picire
(Delta Debugging) was.
Picire
make to determine if sets were interesting?
How many "runs" did Picire
need to minimize the suite?
What was the overall runtime? How does this compare to the expected asymptotic
runtime ("Big-O") performance? How does this informally compare to a manual
effort to minimize the test suite?
You might notice that a common theme in these projects is scripting and writing code to run other code. This is a common task for professional developers. One of our course goals is to help you increase your productivity through automation and scripting. One "meta" analysis you might want to conduct is looking for patterns in the scripts you have written throughout this course. This sort of introspection can help you recognize similar problems in the future and save you time.
Similarly, this course has higher reading and writing expectations than other computing courses. Again, this reflects the real world. If you are struggling to read documentation efficiently, ask for pointers! There's no expectation that you are a documentation expert without practice.
You might notice that the autograder does not provide much output for this project — this is intentional. We have spent a significant portion of the semester discussing quality assurance. You are expected to test your own code! Consider writing unit and integration tests to gain confidence that aspects of your implementation adhere to this sepecification.
Time management and planning again plays an important role in this
project. Running Picire
for Project 4C is likely an
"overnight" task. This assumes that your "interesting" script is
coded correctly. The script you need to write for 4C is non-trivial;
you might consider starting here or (at minimum) giving yourself a
week to implement it correctly.
You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the asymptotic runtime complexity (Big-O) for Delta Debugging.
If you're feeling adventurous, you might also cache the per-test
coverage for each image and use this information for your
"interesting" script (i.e., you can union the lines covered by
test #1 and test #2 to determine the coverage of tests #1 and
#2). This might allow you to run Picire
in parallel
mode. This might
not be worth the extra time.
If you fear you started too late to minimize the full test suite, you may use a subset of the test suite for partial credit in your report. That is, you may use the first 500 tests (for some value of 500 that allows you to complete your report in time). The lower the quantity you use, the lower the partial credit. Note that to do this you will have to change your definition of "full coverage" to match what is provided by those 500 tests. If you do this, your reporst must explicitly state the time taken to compute coverage on all 1639 test cases, the reduced number of tests chosen (e.g., 500), and the maximum total coverage associated with your test subset in your report.
It is your responsibility to research helpful Python modules (and cite your resources, such as Stack Overflow); however, I recognize that it can be daunting to get started. Here are few modules that might be helpful for this project:
Similarly, there are many ways to implement a shell script; however, you might find these utilities helpful:
bc
cut
date
head
patch
tail
time
tr
If you are having trouble using your cs-linuxlab
server,
I will help you get things working. I will not
support any other platform.
Be sure that you are submitting to the correct assignment on Gradescope! Further, be sure to read the output provided by Gradescope after your submission is run by the grading scripts. Don't wait until the last minute to submit.
For each of the sub-projects, submit the following:
delta.py
: your implementation of Delta Debugging
is-failure-inducing-change.sh
: your "is interesting" test for
finding a failure-inducing patch on Wireworld.
PDF
containing your written report and
your references for ALL parts of
this assignment.
You can submit these by dragging the files into the submission window on the Gradescope interface.
P4 Grading (out of 45 points):
This section details common pitfalls, questions, and resolutions.
Q: Can I ask for clarification if I don't understand all of the commands we are running for this project?
A: Absolutely! I'd be happy to help clarify what various commands do.
Q: For P4B, when minimizing failure-inducing changes, my approach appears to work locally but not on the autograder.
A: You might have better luck applying patches
one by one (using a loop rather than using cat
to apply
them all at once).
Q: I'm seeing an error like
python3 delta.py 10 ./is-interesting.sh
(out)1: ./is-interesting.sh: Permission denied
A1: Does your script have execution permissions
(ls -l is-interesting.sh
and look for the x
permission)?
You might need to run chmod +x is-interesting.sh
.
A2: Make sure you're treating the command passed in to you as a
single command, even if it is a string that contains spaces.
Try both /bin/bash ./is-interesting.sh
and
./is-interesting.sh
as commands; your script should
work for both.
Q: I'm seeing /bin/bash: /bin/bash: cannot execute binary file
.
A: You don't need to add an "extra" bash
to the
beginning of the command you run. Just run the command you are
given.
Q: I'm getting weird results for gcovr
.
A: Your best bet is to run gcovr
from the root of
the libpng folder. Perhaps consider changing to this directory
as part of your script. Alternatively, make sure that your
excluded paths are relative to the current directory.
Q: For Project 4C, I'm getting assertion failures when I try
to run Picire
.
A1: Picire
does some sanity checks of your "interesting"
script. One possibility is that your script is not answering
sensibly for standard inputs (i.e., Does your script exit correctly
for the full test suite? How about for an empty list?)
A2: Are you trying to run Picire
in parallel mode?
Since pngtest
always outputs to pngout.png
,
what could happen if you try to run multiple copies of pngtest
?
Similarly, since coverage information is addititve, what else could go wrong?
This assignment has be adapted, updated, and revised (with permission) from an assignment used in Wes Weimer's EECS 481 Software Engineering course at the University of Michigan [1].