Do you love writing unit tests to find bugs in your programs? Or, would you prefer that the tests were written for you automatically?
The Randoop tool writes tests for you.
Writing tests is a difficult and time-consuming activity, and yet it is a crucial part of good software engineering. If you are interested in improving the state of the art in test generation and bug finding, read on.
Our system, Randoop, automatically generates unit tests for Java classes. For example, here is a test case generated by Randoop that reveals an error in the Sun JDK 1.5:
public static void test1() {
LinkedList l1 = new LinkedList();
Object o1 = new Object();
l1.addFirst(o1);
TreeSet t1 = new TreeSet(l1);
Set s1 = Collections.unmodifiableSet(t1);
// This assertion fails
Assert.assertEquals(s1, s1);
}
The ultimate goal is a testing system that at the push of a button creates a solid JUnit test suite providing good coverage of arbitrary classes, with no manual effort on your part. Randoop is a step in the direction of this vision, and it already does a very good job, especially with library classes (such as java.util). Randoop is already used at companies like ABB and Microsoft, and on open-source projects.
Some of the projects are interesting and useful engineering that will make Randoop more effective for developers. Some projects are publishable research that extends scientific knowledge about test generation approaches. And many projects include both aspects.
Required skills: These projects require facility with Java, because Randoop is written in Java. You should be very comfortable with the idea of unit testing. You should know how unit testing helps you and where it can hinder you, and you should be passionate about improving the tradeoff. You should be willing to dive into and understand a moderate-sized codebase.
If you want to discuss these projects, feel free to contact the Randoop developers at randoop-developers@googlegroups.com. We look forward to hearing from you!
One way to get familiar with the Randoop codebase is to fix some issue in the issue tracker. Particularly good choices are issues labeled as "good first issue".
Randoop selects a method to test at random. Uniform random selection is the simplest, but not always the best, option. This project would augment Randoop with different, and potentially novel, strategies for selecting the next method to test. Here are a few example strategies; the project would involve fleshing out the details.
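For instance, one such strategy could weight each method inversely by how often it has already been selected, so that under-exercised methods get more attention. The class and weighting rule below are a sketch for illustration, not Randoop's actual algorithm.

import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch: bias selection toward methods that have been chosen less often so far.
// The weight 1/(1+count) is an illustrative choice, not Randoop's behavior.
class WeightedMethodSelector {
  private final Random random = new Random();

  Method select(List<Method> methods, Map<Method, Integer> timesSelected) {
    double total = 0;
    for (Method m : methods) {
      total += 1.0 / (1 + timesSelected.getOrDefault(m, 0));
    }
    double r = random.nextDouble() * total;
    for (Method m : methods) {
      r -= 1.0 / (1 + timesSelected.getOrDefault(m, 0));
      if (r <= 0) {
        timesSelected.merge(m, 1, Integer::sum);
        return m;
      }
    }
    Method last = methods.get(methods.size() - 1); // floating-point fallback
    timesSelected.merge(last, 1, Integer::sum);
    return last;
  }
}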
Randoop's generated test sequences are random and may not be characteristic of real use. One problem is that there may be restrictions on usage of a class: close should only be called after open, or arguments must have particular relationships, such as the requirement that the non-array argument to binary search must appear in the array argument. Another problem is that random invocations may not create interesting data structures, such as very large ones or ones with particular qualities such as duplicate elements.
Randoop does a good job of generating tests for library classes (such as java.util.*) that have few restrictions on their calling sequences and parameters. Classes that make up programs often have a number of restrictions on both the sequences of calls and their parameters.
We want to extend the system to provide better coverage for such classes. Randoop could infer which sequences of calls are valid and invalid. For example, Randoop could observe what call sequences real code makes, perhaps using a tool like Palulu. Then, bias could be introduced into the generator to make call sequences that are more realistic, and therefore better test the most commonly-used and important parts of the code.
A possible downside is reduced diversity in tests. A research question is whether mimicking real executions is worthwhile overall.
Suppose that Randoop wants to create a call to a method with a parameter of type T1. Currently, Randoop selects an object of type T1 at random.

Randoop can do better. Here is one idea; other approaches are possible. Randoop could determine what fields a method will read, and then choose an object for which those fields have been recently side-effected. As an example, when obtaining a Point to pass to the getX() method, a Point produced by the setX() or move() method is more interesting than one produced by the setY() method.
As another example, Randoop could choose objects that are most likely to achieve additional coverage in the called method, perhaps because the values will cause particular branches to be taken.
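Here is a minimal sketch of the first idea, under the assumption that Randoop records which fields were written by the call that produced each pool object, and that a separate analysis reports which fields the method under test reads; all names below are hypothetical.

import java.util.List;
import java.util.Set;

// Hypothetical pool entry: a candidate argument plus the fields written by
// the call that produced it.
class Candidate {
  Object value;
  Set<String> recentlyWrittenFields;
}

class InputSelector {
  // Prefer a candidate whose recently written fields overlap the fields that
  // the method under test reads; otherwise fall back to the first candidate.
  Candidate choose(List<Candidate> pool, Set<String> fieldsReadByMethod) {
    for (Candidate c : pool) {
      for (String field : c.recentlyWrittenFields) {
        if (fieldsReadByMethod.contains(field)) {
          return c;
        }
      }
    }
    return pool.isEmpty() ? null : pool.get(0);
  }
}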
Sometimes Randoop creates tests that do not terminate or that take a long time to execute. For example, Randoop once tried to create a test that executed an enormously large exponentiation:
bigFraction3 = -104 / 5
bigInteger25 = 1561321067192303346130420089702102441699871430851380131368332415577876285112504379523988395794851806993803452448222179606997671319406528676747268949561732973027887
bigFraction3.pow(bigInteger25);
Such a call will not terminate in a reasonable amount of time.
To detect infinitely-looping tests, Randoop can run with the --usethreads command-line option. However, that reduces performance.
When a programmer can characterize which inputs to a method are reasonable, it would be nice for Randoop to avoid creating such tests.
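As a point of comparison, JUnit 4 already lets a test declare a per-test timeout, which turns a non-terminating call into an ordinary test failure rather than a hung run. The one-second limit and the exponent below are just illustrative values.

import java.math.BigInteger;

import org.junit.Test;

public class TimeoutExampleTest {
  // If the body does not finish within one second, JUnit fails the test
  // instead of letting it run forever.
  @Test(timeout = 1000)
  public void expensiveComputation() {
    BigInteger base = BigInteger.valueOf(2);
    base.pow(100000); // large but bounded; a truly enormous exponent would time out
  }
}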
Randoop creates regression tests that capture the current behavior of a program. A regression test fails if a programmer changes the program's behavior. However, the program's behavior might already be wrong; that is, the program might be buggy. How can we detect when Randoop has created a test case that captures incorrect/buggy behavior?
The only way to find a mistake is to compare two things. Here are some things that could be compared to Randoop test assertions:
Scenario: we know there is a bug, and we want better tests.
Here are three challenges or research questions:
Here are some differences; understanding them might help with solving the problems:
It is easy to generate more test inputs and to accept their results as correct. But are they correct? One possibility: Daikon is surprisingly good at generalizing; it could be used in conjunction with other resources (e.g., the bug report).
Is it good or bad to add more regression tests to a test suite before APR? Here is a paper that did this evaluation, for plausible patches but not for correct patches: “Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system”, https://link.springer.com/article/10.1007/s10664-018-9619-4. From its abstract: “The main result is that automatic test generation is effective in alleviating one kind of overfitting, issue–regression introduction, but due to oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue–incomplete fixing.”
APR (automated program repair) fixes bugs without human intervention. The input to an APR tool is a program and a test suite, such that some of the tests pass and some of the tests fail. The goal of APR is to edit the program so that all of the tests pass.
Many evaluations of APR tools have been performed, using version control histories that contain real bugs that have been fixed. However, these evaluations are a bit unrealistic:
How well do APR techniques do on more realistic scenarios?
This is related to test generation. If a test generation tool, when run on the post-fix version, cannot generate error-revealing regression tests, then the problem is not the oracle problem; rather, we need more work on generating test inputs.
Related: A result by Jacques Klein (in work under submission) is that about 50% of bugs are detected by a failure of a regression test generated on the fixed version.
When a method is nondeterministic, such as any method that returns a Set, the user currently needs to exclude that method from consideration by Randoop. It would be better to permit Randoop to call methods that do not depend on the nondeterminism. For example, rather than avoiding construction of all Sets, Randoop could create them, and call methods like size() and contains(). Randoop would not iterate through the Set nor print it (or, better, the Set could iterate/print in a deterministic order).
Programmers do not enjoy writing unit tests. The result is under-tested software and lingering bugs. Tools exist to automatically generate tests, but they are not widely used. The goal of this project is to understand why, and to improve the test generation tools.
In particular, there is a paper on Guided Random Testing (GRT) that proposes improvements and shows experimentally that these improvements are helpful. In a recent competition among testing tools, GRT won. However, the authors have refused to release their tools and experiments. The goal of this project is to reproduce that work and determine whether the techniques are useful. If they are, make the tool and experiments publicly available. If they are not, then improve the techniques.
The research question is how to improve test generation techniques, with an initial focus on the techniques proposed in the "Guided Random Testing" paper.
The methodology is to re-implement and reproduce the GRT paper's techniques and results, and then to evaluate the benefits of each one. Incorporate the ones that help into an open-source unit test generator, and understand and improve the other ones. As a stretch goal, go beyond the GRT techniques: invent new improvements to unit testing.
GRT's techniques are
The goal of this project is to make Randoop output independent unit tests, rather than tests that must be run in a specific order.
Randoop makes test sequences out of methods that may have side effects. This can cause running one test to affect the outcome of a subsequent, putatively independent, test.
For example, suppose that method m1 modifies a global variable, and method m2 reads it. Further suppose that test t1 includes a call to m1, and test t2 includes a call to m2. The result of m2 (and, thus, whether test t2 succeeds) depends on whether test t1 has already been run in the same JVM. All of the tests are entangled, which makes it difficult to run a subset of them, to understand them, or to debug failures.
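Here is a concrete illustration with hypothetical names: a static counter couples the two tests, so their outcomes depend on the order in which they run within one JVM.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Hypothetical class under test with global (static) state.
class Counter {
  static int count = 0;
  static void increment() { count++; } // plays the role of m1
  static int get() { return count; }   // plays the role of m2
}

public class EntangledTests {
  @Test
  public void t1() {
    Counter.increment();
    assertEquals(1, Counter.get()); // assumes no earlier test incremented the counter
  }

  @Test
  public void t2() {
    assertEquals(0, Counter.get()); // fails if t1 already ran in the same JVM
  }
}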
Ideally, all of the tests in a test suite should be independent, in that they can be run in any order.
One approach is to undo state changes: for each global field that a Randoop test sets (directly or indirectly), the test should reset it at the end of the test.
This requires determining which fields the test changes and restoring them to their original values.
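A minimal sketch of the undo idea, assuming some analysis has already identified which static field a test may write: the value is captured before the test body runs and restored afterward. Note that restoring a reference this way is shallow, so mutations inside the referenced object would need a deeper snapshot.

import java.lang.reflect.Field;

public class StaticFieldRestorer {
  // Save a static field's value, run the test body, then restore the value so
  // that later tests observe the original global state.
  public static void runWithRestore(Class<?> owner, String fieldName, Runnable testBody)
      throws ReflectiveOperationException {
    Field f = owner.getDeclaredField(fieldName);
    f.setAccessible(true);
    Object saved = f.get(null);
    try {
      testBody.run();
    } finally {
      f.set(null, saved);
    }
  }
}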
A research question is whether this degrades the fault-finding ability of the test suite. Setting global fields may change the behavior of subsequent tests; is that a positive or a negative?

It would also be interesting to exclude the fields from being modified by Randoop, via the --omit-methods and --omit-field command-line options. How much does that degrade Randoop's fault-finding ability?
The goal of this project is to convert the Randoop output, which is currently a set of interdependent tests, into a set of independent tests.
Suppose Randoop outputs 100 small unit tests. An idea is to split each test into the part that depends on other tests and the independent part, then move the dependent parts into a new test. The result is 101 tests: 100 of them are even smaller than before, but are completely independent of one another, and the last one is relatively large but captures all the dependencies.
Either a static or a dynamic analysis could perform this task.
Here are some additional details.
A test consists of alternating actions and observations:
a1  o11 o12 ... o1m_1
a2  o21 o22 ... o2m_2
...
Each action is a method invocation, that Randoop chose because Randoop believed that the action method would affect the state of some object. Each observation is a method invocation, that Randoop chose because Randoop believed that the observer method would NOT affect the state of any object but would reveal information about it. The test oracle checks the result of each method call (actions and observers) against an expected value that was obtained by running the method at test generation time.
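Here is a small concrete example of this structure; the action/observation labels in the comments are annotations for illustration, not Randoop output.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

public class ActionsAndObservations {
  public static void test1() {
    List<String> l = new ArrayList<>(); // action a1: creates or changes object state
    assertEquals(0, l.size());          // observation o11: reads state without changing it
    assertTrue(l.isEmpty());            // observation o12

    l.add("x");                         // action a2
    assertEquals(1, l.size());          // observation o21
    assertTrue(l.contains("x"));        // observation o22
  }
}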
The input is n tests that might depend on one another. The output is n independent tests that may be run in any order (each of these is a subset of one of the original tests), plus one big test that concatenates all n of the original tests. The big test must be run in a fresh JVM or sandbox, or at least as the last test of the suite.
The big test may seem problematic -- too hard to understand and debug -- but it is no worse than the original suite, and at least it is honest in putting all the possible effects together. I suspect that slicing, delta debugging, or other test case minimization approaches will be extremely effective on the big test for those situations where it fails but none of the little tests does.
One could imagine putting only a subset of the original tests into the big test suite, but including everything is more likely to catch serendipitous interactions among rarely-called methods. We could empirically evaluate whether that benefit is worthwhile.
An analysis can compute two distinct properties of a test case:
Only the latter property matters: whether the test is independent, that is, whether it can be run in any order and has no side effects on global (static) state. We could compute the property via a static analysis, or by executing the tests to see what happens.
Static analysis seems cleaner to me. Compute what action methods can write to static fields. For each test t, construct a new unit test that is just like t, but it omits all actions (& their associated observers) that side-effect static fields. Now, those unit tests can be run in any order. Also concatenate all omitted actions (or, more likely, all actions whatsoever) into one big test that must be run in a new JVM.
In a new JVM, run all tests and compute oracles. For each test t, run (in a new JVM) t followed by all tests. If any test does not satisfy its oracle, then say that the test t has side effects on global state. (This isn't quite right: The effect of test t may be masked by the effect of the first test, for example.) Retain all side-effect-free tests as unit tests. (Verify that they don't affect one another by running them all in a new JVM, in random order; do this (say) 5 times.) Concatenate all other tests (or maybe all tests whatsoever) as one big test that must be run in a new JVM, or at least as the last test of the suite.
What percentage of tests are separable? This will affect whether the technique is able to create independent tests, or whether those end up empty and the algorithm's output is just the big hairball test. That would be no worse than the current situation, and at least the user would know that interdependent tests are a problem.
Randoop can call APIs whose value changes from run to run of tests, and this can cause flaky tests (tests that spuriously fail). Examples of such APIs are ones that retrieve the current date, the username, etc. Two ways to mitigate such flakiness are:
EvoSuite follows approach #1 so far, and you can see its source code for the APIs it mocks. Approach #2 would avoid the need for a dedicated runtime library.
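As an illustration of the mocking idea (approach #1), here is an entirely hypothetical wrapper, not part of Randoop or EvoSuite: if generated tests and code under test call the wrapper instead of the JDK API directly, the value can be frozen during test runs.

// Hypothetical wrapper used for illustration only.
public class CurrentTime {
  private static Long fixedMillis = null; // when non-null, time is frozen for tests

  public static void freeze(long millis) { fixedMillis = millis; }

  public static void unfreeze() { fixedMillis = null; }

  // Called instead of System.currentTimeMillis() so tests see a stable value.
  public static long millis() {
    return fixedMillis != null ? fixedMillis : System.currentTimeMillis();
  }
}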
Randoop executes methods under test with random inputs, and this can cause usability problems. Code under test may exhaust memory or other resources to the point that Randoop cannot make progress, or even crashes outright. Making Randoop more robust in the face of these kinds of failures would greatly improve the tool's usability and attract more users.
One approach is to create a wrapper around Randoop: a process that monitors test generation and can terminate and restart the generator if it fails to make progress or aborts unexpectedly. For more details on the wrapper idea, see Finding Errors in .NET with Feedback-Directed Random Testing (Section 3).
Another approach is to use a safe security manager that can easily be extended by the user. Currently, the security manager has to be specified by the user in the normal fashion on the command line.
This project would modify Randoop so that it can run on multiple machines. This would greatly increase the efficiency of the system, by distributing the generation effort across multiple processors. The test generation techniques that Randoop uses are amenable to parallelization: Randoop could spawn multiple test generation engine instances, each with a different random seed.
There are a number of challenges to overcome in parallelizing Randoop. Two examples:
Coordinating multiple instances. Making Randoop parallel requires writing code that can spawn, monitor, and combine the results of multiple processes running in separate machines or processors.
Managing quantity of output. A parallel version of Randoop might produce many repetitive test cases. Avoiding repetitive tests is a general challenge for a tool like Randoop, but managing it is even more important in a parallel version of the tool.
A downside of this project is that generation speed is not the most important problem for Randoop; effort would probably be better invested elsewhere.
Additional required skills: Knowledge of Java APIs for spawning processes, creating streams, and reading/writing files would also be helpful.
Randoop's --side-effect-free-methods command-line argument informs Randoop which methods are pure (perform no side effects). There are some enhancements that would make Randoop's use of it even more compelling.
Perhaps use Soot's side effect analysis, or some other side effect analysis.
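One payoff of knowing which methods are pure is that Randoop can call them freely inside assertions without perturbing the state being checked. A small sketch of the kind of regression assertions this enables:

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

public class PureObserverExample {
  public static void test1() {
    List<Integer> l = new ArrayList<>();
    l.add(42);

    // size() and get() are side-effect free, so repeated calls in assertions
    // cannot change the state that is being observed.
    assertEquals(1, l.size());
    assertEquals(Integer.valueOf(42), l.get(0));
  }
}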
Debugging is difficult. When a test fails, it can be very time-consuming to understand the source of the failure in order to fix the code (or the test). For example, consider the following test generated by Randoop:
public static void test1() {
ArrayList l = new ArrayList(7);
Object o = new Object();
l.add(o);
TreeSet t = new TreeSet(l);
Set s = Collections.synchronizedSet(t);
// This assertion fails
Assert.assertEquals(s, s);
}
This test reveals an error in Sun's JDK (version 1.5). It shows a short sequence of calls leading up to the creation of an object that is not equal to itself! Unfortunately, it is difficult to tell, simply by looking at the test, what the source of the failure is.
This project would augment Randoop with the ability to provide debugging clues: comments within a failing test that provide potentially-useful facts about the failure and can help the user with debugging. For example, an improved version of the above test may look something like this:
public static void test1() {
// Test fails for any capacity >= 0.
// Test throws IllegalArgumentException for any capacity < 0.
ArrayList l = new ArrayList(7);
Object o = new Object();
// Test passes when o is a Comparable.
l.add(o);
TreeSet t = new TreeSet(l);
Set s = Collections.synchronizedSet(t);
// This assertion fails
Assert.assertEquals(s, s);
}
The new test provides several clues that can save the user debugging time. The key observation is that the test passes if we use a Comparable object in the call l.add(o). Further investigation would reveal that the TreeSet constructor should not accept a list containing a non-comparable object, but it does, and this leads to the error.
To output tests with debugging comments, Randoop would have to be modified in two ways:
In theory this sounds simple, but two issues make the problem interesting: creating a set of variations of a test, and synthesizing the resulting executions into useful comments, are both non-trivial tasks.
These projects (some of which are small, and others of which are quite ambitious and would make research contributions) would be a good way to get started with Randoop development and make a contribution.
Randoop can call non-public methods (controlled by the --public-only=boolean command-line option), but the resulting JUnit tests do not compile. This task involves modifying Randoop's JUnit output format so that non-public methods are called via reflection, making them accessible for execution.
A key difference is that QuickCheck requires the user to provide generators for values, whereas Randoop creates them automatically. Which is more effective in practice?
Randoop could be extended to make it easier to specify a routine that returns a set of legal values for a parameter (that Randoop wouldn't otherwise guess).
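One possible shape for such an extension, with an entirely hypothetical interface name and example provider:

import java.util.Set;

// Hypothetical extension point: the user supplies legal values for a parameter
// type that Randoop would be unlikely to guess on its own.
interface LegalValueProvider<T> {
  Set<T> legalValues();
}

// Example: a provider for TCP port numbers within the valid range.
class PortNumberProvider implements LegalValueProvider<Integer> {
  @Override
  public Set<Integer> legalValues() {
    return Set.of(1, 80, 443, 8080, 65535);
  }
}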