Do you love writing unit tests to find bugs in your programs? Or, would you prefer that the tests were written for you automatically?
The Randoop tool writes tests for you.
Writing tests is a difficult and time-consuming activity, and yet it is a crucial part of good software engineering. If you are interested in improving the state of the art in test generation and bug finding, read on.
Our system, Randoop, automatically generates unit tests for Java classes. For example, here is a test case generated by Randoop that reveals an error in the Sun JDK 1.5:
public static void test1() {
LinkedList l1 = new LinkedList();
Object o1 = new Object();
l1.addFirst(o1);
TreeSet t1 = new TreeSet(l1);
Set s1 = Collections.unmodifiableSet(t1);
// This assertion fails
Assert.assertEquals(s1, s1);
}
The ultimate goal is a testing system that at the push of a button creates a solid JUnit test suite providing good coverage of arbitrary classes, with no manual effort on your part. Randoop is a step in the direction of this vision, and it already does a very good job, especially with library classes (such as java.util). Randoop is already used at companies like ABB and Microsoft, and on open-source projects.
Some of the projects are interesting and useful engineering that will make Randoop more effective for developers. Some projects are publishable research that extends scientific knowledge about test generation approaches. And many projects include both aspects.
Required skills: These projects require facility with Java, because Randoop is written in Java. You should be very comfortable with the idea of unit testing. You should know how unit testing helps you and where it can hinder you, and you should be passionate about improving the tradeoff. You should be willing to dive into and understand a moderate-sized codebase.
If you want to discuss these projects, feel free to contact the Randoop developers at randoop-developers@googlegroups.com. We look forward to hearing from you!
One way to get familiar with the Randoop codebase is to fix some issue in the issue tracker. Particularly good choices are issues labeled as "good first issue".
Randoop selects a method to test at random. Uniform random selection is the simplest, but not always the best, option. This project would augment Randoop with different, and potentially novel, strategies for selecting the next method to test. Here are a few example strategies; the project would involve fleshing out the details.
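For instance, one such strategy could weight each method inversely by how often it has already been selected, so that under-exercised methods get more attention. The class and weighting rule below are a sketch for illustration, not Randoop's actual algorithm.

import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch: bias selection toward methods that have been chosen less often so far.
// The weight 1/(1+count) is an illustrative choice, not Randoop's behavior.
class WeightedMethodSelector {
  private final Random random = new Random();

  Method select(List<Method> methods, Map<Method, Integer> timesSelected) {
    double total = 0;
    for (Method m : methods) {
      total += 1.0 / (1 + timesSelected.getOrDefault(m, 0));
    }
    double r = random.nextDouble() * total;
    for (Method m : methods) {
      r -= 1.0 / (1 + timesSelected.getOrDefault(m, 0));
      if (r <= 0) {
        timesSelected.merge(m, 1, Integer::sum);
        return m;
      }
    }
    Method last = methods.get(methods.size() - 1); // floating-point fallback
    timesSelected.merge(last, 1, Integer::sum);
    return last;
  }
}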
Randoop's generated test sequences are random and may not be characteristic of real use. One problem is that there may be restrictions on usage of a class: close should only be called after open, or arguments must have particular relationships, such as the requirement that the non-array argument to binary search must appear in the array argument. Another problem is that random invocations may not create interesting data structures, such as very large ones or ones with particular qualities such as duplicate elements.
Randoop does a good job of generating tests for library classes (such as java.util.*) that have few restrictions on their calling sequences and parameters. Classes that make up programs often have a number of restrictions on both the sequences of calls and their parameters.
We want to extend the system to provide better coverage for such classes. Randoop could infer which sequences of calls are valid and invalid. For example, Randoop could observe what call sequences real code makes, perhaps using a tool like Palulu. Then, bias could be introduced into the generator to make call sequences that are more realistic, and therefore better test the most commonly-used and important parts of the code.
A possible downside is reduced diversity in tests. A research question is whether mimicking real executions is worthwhile overall.
Suppose that Randoop wants to create a call to a method with a parameter of type T1. Currently, Randoop selects an object of type T1 at random.

Randoop can do better. Here is one idea; other approaches are possible. Randoop could determine what fields a method will read, and then choose an object for which those fields have been recently side-effected. As an example, when obtaining a Point to pass to the getX() method, a Point produced by the setX() or move() method is more interesting than one produced by the setY() method.
As another example, Randoop could choose objects that are most likely to achieve additional coverage in the called method, perhaps because the values will cause particular branches to be taken.
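Here is a minimal sketch of the first idea, under the assumption that Randoop records which fields were written by the call that produced each pool object, and that a separate analysis reports which fields the method under test reads; all names below are hypothetical.

import java.util.List;
import java.util.Set;

// Hypothetical pool entry: a candidate argument plus the fields written by
// the call that produced it.
class Candidate {
  Object value;
  Set<String> recentlyWrittenFields;
}

class InputSelector {
  // Prefer a candidate whose recently written fields overlap the fields that
  // the method under test reads; otherwise fall back to the first candidate.
  Candidate choose(List<Candidate> pool, Set<String> fieldsReadByMethod) {
    for (Candidate c : pool) {
      for (String field : c.recentlyWrittenFields) {
        if (fieldsReadByMethod.contains(field)) {
          return c;
        }
      }
    }
    return pool.isEmpty() ? null : pool.get(0);
  }
}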
Sometimes Randoop creates tests that do not terminate or that take a long time to execute. For example, Randoop once tried to create a test that executed an enormously large exponentiation:
bigFraction3 = -104 / 5
bigInteger25 = 1561321067192303346130420089702102441699871430851380131368332415577876285112504379523988395794851806993803452448222179606997671319406528676747268949561732973027887
bigFraction3.pow(bigInteger25);
Such a call will not terminate in a reasonable amount of time.
To detect infinitely-looping tests, Randoop can run with the --usethreads command-line option. However, that reduces performance.
When a programmer can characterize which inputs to a method are reasonable, it would be nice for Randoop to avoid creating such tests.
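As a point of comparison, JUnit 4 already lets a test declare a per-test timeout, which turns a non-terminating call into an ordinary test failure rather than a hung run. The one-second limit and the exponent below are just illustrative values.

import java.math.BigInteger;

import org.junit.Test;

public class TimeoutExampleTest {
  // If the body does not finish within one second, JUnit fails the test
  // instead of letting it run forever.
  @Test(timeout = 1000)
  public void expensiveComputation() {
    BigInteger base = BigInteger.valueOf(2);
    base.pow(100000); // large but bounded; a truly enormous exponent would time out
  }
}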
Randoop creates regression tests that capture the current behavior of a program. A regression test fails if a programmer changes the program's behavior. However, the program's behavior might already be wrong; that is, the program might be buggy. How can we detect when Randoop has created a test case that captures incorrect/buggy behavior?
The only way to find a mistake is to compare two things. Here are some things that could be compared to Randoop test assertions:
Scenario: we know there is a bug, and we want better tests.
Here are three challenges or research questions:
Here are some differences; understanding them might help with solving the problems:
It is easy to generate more test inputs and to accept their results as correct. But are they correct? One possibility: Daikon is surprisingly good at generalizing; it could be used in conjunction with other resources (e.g., the bug report).
Is it good or bad to add more regression tests to a test suite before APR? Here is a paper that did this evaluation, for plausible patches but not for correct patches: “Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system”, https://link.springer.com/article/10.1007/s10664-018-9619-4. From its abstract: “The main result is that automatic test generation is effective in alleviating one kind of overfitting, issue–regression introduction, but due to oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue–incomplete fixing.”
APR (automated program repair) fixes bugs without human intervention. The input to an APR tool is a program and a test suite, such that some of the tests pass and some of the tests fail. The goal of APR is to edit the program so that all of the tests pass.
Many evaluations of APR tools have been performed, using version control histories that contain real bugs that have been fixed. However, these evaluations are a bit unrealistic:
How well do APR techniques do on more realistic scenarios?
This is related to test generation. If a test generation tool, when run on the post-fix version, cannot generate error-revealing regression tests, then the problem is not the oracle problem; rather, we need more work on generating test inputs.
Related: A result by Jacques Klein (in work under submission) is that about 50% of bugs are detected by a failure of a regression test generated on the fixed version.
When a method is nondeterministic, such as any method that returns a Set, the user currently needs to exclude that method from consideration by Randoop. It would be better to permit Randoop to call methods that do not depend on the nondeterminism. For example, rather than avoiding construction of all Sets, Randoop could create them, and call methods like size() and contains(). Randoop would not iterate through the Set nor print it (or, better, the Set could iterate/print in a deterministic order).
Programmers do not enjoy writing unit tests. The result is under-tested software and lingering bugs. Tools exist to automatically generate tests, but they are not widely used. The goal of this project is to understand why, and to improve the test generation tools.
In particular, there is a paper on Guided Random Testing (GRT) that proposes improvements and shows experimentally that these improvements are helpful. In a recent competition among testing tools, GRT won. However, the authors have refused to release their tools and experiments. The goal of this project is to reproduce that work and determine whether the techniques are useful. If they are, make the tool and experiments publicly available. If they are not, then improve the techniques.
The research question is how to improve test generation techniques, with an initial focus on the techniques proposed in the "Guided Random Testing" paper.
The methodology is to re-implement and reproduce the GRT paper's techniques and results, and then to evaluate the benefits of each one. Incorporate the ones that help into an open-source unit test generator, and understand and improve the other ones. As a stretch goal, go beyond the GRT techniques: invent new improvements to unit testing.
GRT's techniques are
The goal of this project is to make Randoop output independent unit tests, rather than tests that must be run in a specific order.
Randoop makes test sequences out of methods that may have side effects. This can cause running one test to affect the outcome of a subsequent, putatively independent, test.
For example, suppose that method m1 modifies a global variable, and method m2 reads it. Further suppose that test t1 includes a call to m1, and test t2 includes a call to m2. The result of m2 (and, thus, whether test t2 succeeds) depends on whether test t1 has already been run in the same JVM. All of the tests are entangled, which makes it difficult to run a subset of them, to understand them, or to debug failures.
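Here is a concrete illustration with hypothetical names: a static counter couples the two tests, so their outcomes depend on the order in which they run within one JVM.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Hypothetical class under test with global (static) state.
class Counter {
  static int count = 0;
  static void increment() { count++; } // plays the role of m1
  static int get() { return count; }   // plays the role of m2
}

public class EntangledTests {
  @Test
  public void t1() {
    Counter.increment();
    assertEquals(1, Counter.get()); // assumes no earlier test incremented the counter
  }

  @Test
  public void t2() {
    assertEquals(0, Counter.get()); // fails if t1 already ran in the same JVM
  }
}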
Ideally, all of the tests in a test suite should be independent, in that they can be run in any order.
One approach is to undo state changes: for each global field that a Randoop test sets (directly or indirectly), the test should reset it at the end of the test.
This requires determining which fields the test changes and restoring them to their original values.
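A minimal sketch of the undo idea, assuming some analysis has already identified which static field a test may write: the value is captured before the test body runs and restored afterward. Note that restoring a reference this way is shallow, so mutations inside the referenced object would need a deeper snapshot.

import java.lang.reflect.Field;

public class StaticFieldRestorer {
  // Save a static field's value, run the test body, then restore the value so
  // that later tests observe the original global state.
  public static void runWithRestore(Class<?> owner, String fieldName, Runnable testBody)
      throws ReflectiveOperationException {
    Field f = owner.getDeclaredField(fieldName);
    f.setAccessible(true);
    Object saved = f.get(null);
    try {
      testBody.run();
    } finally {
      f.set(null, saved);
    }
  }
}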
A research question is whether this degrades the fault-finding ability of the test suite. Setting global fields may change the behavior of subsequent tests; is that a positive or a negative?

It would also be interesting to exclude the fields from being modified by Randoop, via the --omit-methods and --omit-field command-line options. How much does that degrade Randoop's fault-finding ability?
The goal of this project is to convert the Randoop output, which is currently a set of interdependent tests, into a set of independent tests.
Suppose Randoop outputs 100 small unit tests. An idea is to split each test into the part that depends on other tests and the independent part, then move the dependent parts into a new test. The result is 101 tests: 100 of them are even smaller than before, but are completely independent of one another, and the last one is relatively large but captures all the dependencies.
Either a static or a dynamic analysis could perform this task.
Here are some additional details.
A test consists of alternating actions and observations:
a1  o11 o12 ... o1m_1
a2  o21 o22 ... o2m_2
...
Each action is a method invocation, that Randoop chose because Randoop believed that the action method would affect the state of some object. Each observation is a method invocation, that Randoop chose because Randoop believed that the observer method would NOT affect the state of any object but would reveal information about it. The test oracle checks the result of each method call (actions and observers) against an expected value that was obtained by running the method at test generation time.
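Here is a small concrete example of this structure; the action/observation labels in the comments are annotations for illustration, not Randoop output.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.List;

public class ActionsAndObservations {
  public static void test1() {
    List<String> l = new ArrayList<>(); // action a1: creates or changes object state
    assertEquals(0, l.size());          // observation o11: reads state without changing it
    assertTrue(l.isEmpty());            // observation o12

    l.add("x");                         // action a2
    assertEquals(1, l.size());          // observation o21
    assertTrue(l.contains("x"));        // observation o22
  }
}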
The input is n tests that might depend on one another. The output is n independent tests that may be run in any order (each of these is a subset of one of the original tests), plus one big test that concatenates all n of the original tests. The big test must be run in a fresh JVM or sandbox, or at least as the last test of the suite.
The big test may seem problematic -- too hard to understand and debug -- but it is no worse than the original suite, and at least it is honest in putting all the possible effects together. I suspect that slicing, delta debugging, or other test case minimization approaches will be extremely effective on the big test for those situations where it fails but none of the little tests does.
One could imagine putting only a subset of the original tests into the big test suite, but including everything is more likely to catch serendipitous interactions among rarely-called methods. We could empirically evaluate whether that benefit is worthwhile.
An analysis can compute two distinct properties of a test case:
Only the latter property matters: whether the test is independent, that is, whether it can be run in any order and has no side effects on global (static) state. We could compute the property via a static analysis, or by executing the tests to see what happens.
Static analysis seems cleaner to me. Compute what action methods can write to static fields. For each test t, construct a new unit test that is just like t, but it omits all actions (& their associated observers) that side-effect static fields. Now, those unit tests can be run in any order. Also concatenate all omitted actions (or, more likely, all actions whatsoever) into one big test that must be run in a new JVM.
In a new JVM, run all tests and compute oracles. For each test t, run (in a new JVM) t followed by all tests. If any test does not satisfy its oracle, then say that the test t has side effects on global state. (This isn't quite right: The effect of test t may be masked by the effect of the first test, for example.) Retain all side-effect-free tests as unit tests. (Verify that they don't affect one another by running them all in a new JVM, in random order; do this (say) 5 times.) Concatenate all other tests (or maybe all tests whatsoever) as one big test that must be run in a new JVM, or at least as the last test of the suite.
What percentage of tests are separable? This will affect whether the technique is able to create independent tests, or whether those end up empty and the algorithm's output is just the big hairball test. That would be no worse than the current situation, and at least the user would know that interdependent tests are a problem.
Randoop can call APIs whose value changes from run to run of tests, and this can cause flaky tests (tests that spuriously fail). Examples of such APIs are ones that retrieve the current date, the username, etc. Two ways to mitigate such flakiness are:
EvoSuite follows approach #1 so far, and you can see its source code for the APIs it mocks. Approach #2 would avoid the need for a dedicated runtime library.
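As an illustration of the mocking idea (approach #1), here is an entirely hypothetical wrapper, not part of Randoop or EvoSuite: if generated tests and code under test call the wrapper instead of the JDK API directly, the value can be frozen during test runs.

// Hypothetical wrapper used for illustration only.
public class CurrentTime {
  private static Long fixedMillis = null; // when non-null, time is frozen for tests

  public static void freeze(long millis) { fixedMillis = millis; }

  public static void unfreeze() { fixedMillis = null; }

  // Called instead of System.currentTimeMillis() so tests see a stable value.
  public static long millis() {
    return fixedMillis != null ? fixedMillis : System.currentTimeMillis();
  }
}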
Randoop executes methods under test with random inputs, and this can cause usability problems. Code under test may exhaust memory or other resources to the point that Randoop cannot make progress, or even crashes outright. Making Randoop more robust in the face of these kinds of failures would greatly improve the tool's usability and attract more users.
One approach is to create a wrapper around Randoop: a process that monitors test generation and can terminate and restart the generator if it fails to make progress or aborts unexpectedly. For more details on the wrapper idea, see Finding Errors in .NET with Feedback-Directed Random Testing (Section 3).
Another approach is to use a safe security manager that can easily be extended by the user. Currently, the security manager has to be specified by the user in the normal fashion on the command line.
This project would modify Randoop so that it can run on multiple machines. This would greatly increase the efficiency of the system, by distributing the generation effort across multiple processors. The test generation techniques that Randoop uses are amenable to parallelization: Randoop could spawn multiple test generation engine instances, each with a different random seed.
There are a number of challenges to overcome in parallelizing Randoop. Two examples:
Coordinating multiple instances. Making Randoop parallel requires writing code that can spawn, monitor, and combine the results of multiple processes running in separate machines or processors.
Managing quantity of output. A parallel version of Randoop might produce many repetitive test cases. Avoiding repetitive tests is a general challenge for a tool like Randoop, but managing it is even more important in a parallel version of the tool.
A downside of this project is that generation speed is not the most important problem for Randoop; effort would probably be better invested elsewhere.
Additional required skills: Knowledge of Java APIs for spawning processes, creating streams, and reading/writing files would also be helpful.
Randoop's --side-effect-free-methods command-line argument informs Randoop which methods are pure (perform no side effects). There are some enhancements that would make Randoop's use of it even more compelling.
Perhaps use Soot's side effect analysis, or some other side effect analysis.
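One payoff of knowing which methods are pure is that Randoop can call them freely inside assertions without perturbing the state being checked. A small sketch of the kind of regression assertions this enables:

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

public class PureObserverExample {
  public static void test1() {
    List<Integer> l = new ArrayList<>();
    l.add(42);

    // size() and get() are side-effect free, so repeated calls in assertions
    // cannot change the state that is being observed.
    assertEquals(1, l.size());
    assertEquals(Integer.valueOf(42), l.get(0));
  }
}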
Debugging is difficult. When a test fails, it can be very time-consuming to understand the source of the failure in order to fix the code (or the test). For example, consider the following test generated by Randoop:
public static void test1() {
ArrayList l = new ArrayList(7);
Object o = new Object();
l.add(o);
TreeSet t = new TreeSet(l);
Set s = Collections.synchronizedSet(t);
// This assertion fails
Assert.assertEquals(s, s);
}
This test reveals an error in Sun's JDK (version 1.5). It shows a short sequence of calls leading up to the creation of an object that is not equal to itself! Unfortunately, it is difficult to tell, simply by looking at the test, what the source of the failure is.
This project would augment Randoop with the ability to provide debugging clues: comments within a failing test that provide potentially-useful facts about the failure and can help the user with debugging. For example, an improved version of the above test may look something like this:
public static void test1() {
// Test fails for any capacity >= 0.
// Test throws IllegalArgumentException for any capacity < 0.
ArrayList l = new ArrayList(7);
Object o = new Object();
// Test passes when o is a Comparable.
l.add(o);
TreeSet t = new TreeSet(l);
Set s = Collections.synchronizedSet(t);
// This assertion fails
Assert.assertEquals(s, s);
}
The new test provides several clues that can save the user debugging time. The key observation is that the test passes if we use a Comparable object in the call l.add(o). Further investigation would reveal that the TreeSet constructor should not accept a list containing a non-comparable object, but it does, and this leads to the error.
To output tests with debugging comments, Randoop would have to be modified in two ways:
In theory this sounds simple, but two issues make the problem interesting: creating a set of variations of a test, and synthesizing the resulting executions into useful comments, are both non-trivial tasks.
These projects (some of which are small, and others of which are quite ambitious and would make research contributions) would be a good way to get started with Randoop development and make a contribution.
Randoop can call non-public methods (controlled by the --public-only=boolean command-line option), but the resulting JUnit tests do not compile. This task involves modifying Randoop's JUnit output format so that non-public methods are called via reflection, making them accessible for execution.
A key difference is that QuickCheck requires the user to provide generators for values, whereas Randoop creates them automatically. Which is more effective in practice?
Randoop could be extended to make it easier to specify a routine that returns a set of legal values for a parameter (that Randoop wouldn't otherwise guess).
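One possible shape for such an extension, with an entirely hypothetical interface name and example provider:

import java.util.Set;

// Hypothetical extension point: the user supplies legal values for a parameter
// type that Randoop would be unlikely to guess on its own.
interface LegalValueProvider<T> {
  Set<T> legalValues();
}

// Example: a provider for TCP port numbers within the valid range.
class PortNumberProvider implements LegalValueProvider<Integer> {
  @Override
  public Set<Integer> legalValues() {
    return Set.of(1, 80, 443, 8080, 65535);
  }
}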