Randoop project ideas


Do you love writing unit tests to find bugs in your programs? Or, would you prefer that the tests were written for you automatically?

The Randoop tool writes tests for you.

Writing tests is a difficult and time-consuming activity, and yet it is a crucial part of good software engineering. If you are interested in improving the state of the art in test generation and bug finding, read on.

Our system, Randoop, automatically generates unit tests for Java classes. For example, here is a test case generated by Randoop that reveals an error in the Sun JDK 1.5:

public static void test1() {
  LinkedList l1 = new LinkedList();
  Object o1 = new Object();
  l1.addFirst(o1);
  TreeSet t1 = new TreeSet(l1);
  Set s1 = Collections.unmodifiableSet(t1);
  // This assertion fails
  Assert.assertEquals(s1, s1);
}

The ultimate goal is a testing system that at the push of a button creates a solid JUnit test suite providing good coverage of arbitrary classes, with no manual effort on your part. Randoop is a step in the direction of this vision and it already does a very good job, especially with library classes (such as java.util). Randoop is already used at companies like ABB and Microsoft, and on open-source projects.

Some of the projects involve interesting and useful engineering that will make Randoop more effective for developers. Some projects are publishable research that extends scientific knowledge about test generation approaches. And many projects include both aspects.

Required skills: These projects require facility with Java, because Randoop is written in Java. You should be very comfortable with the idea of unit testing. You should know how unit testing helps you and where it can hinder you, and you should be passionate about improving the tradeoff. You should be willing to dive into and understand a moderate-sized codebase.

If you want to discuss these projects, feel free to contact the Randoop developers at randoop-developers@googlegroups.com. We look forward to hearing from you!

Test generation projects

Starter project: fix an issue

One way to get familiar with the Randoop codebase is to fix some issue in the issue tracker. Particularly good choices are issues labeled as "good first issue".

Better choice of methods

Randoop selects a method to test at random. Uniform random selection is the simplest, but not always the best, option. This project would augment Randoop with different, and potentially novel, strategies for selecting the next method to test. Here are a few example strategies; the project would involve fleshing out the details.

Mimic real call sequences

Randoop's generated test sequences are random and may not be characteristic of real use. One problem is that there may be restrictions on usage of a class: close should only be called after open, or arguments must have particular relationships, such as the non-array argument to binary search must appear in the array argument. Another problem is that random invocations may not create interesting data structures, such as very large ones or ones with particular qualities such as duplicate elements.
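As a minimal illustration, consider a hypothetical class with an open/close usage protocol (the Connection class below is invented for this example, not taken from any real API):

// Hypothetical class with a usage protocol: read() and close() are only
// legal after open() has been called.
public class Connection {
  private boolean open = false;

  public void open() { this.open = true; }

  public int read() {
    if (!open) throw new IllegalStateException("not open");
    return 42; // dummy payload
  }

  public void close() {
    if (!open) throw new IllegalStateException("not open");
    this.open = false;
  }
}

// A random sequence such as
//   Connection c = new Connection();
//   c.read();                 // throws IllegalStateException immediately
// exercises only the guard, whereas a realistic sequence
//   Connection c = new Connection();
//   c.open();
//   c.read();
//   c.close();
// reaches the interesting behavior.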

Randoop does a good job of generating tests for library classes (such as java.util.*) that have few restrictions on their calling sequences and parameters. Classes that make up programs often have a number of restrictions on both the sequences of calls and their parameters.

We want to extend the system to provide better coverage for such classes. Randoop could infer which sequences of calls are valid and invalid. For example, Randoop could observe what call sequences real code makes, perhaps using a tool like Palulu. Then, bias could be introduced into the generator to make call sequences that are more realistic, and therefore better test the most commonly-used and important parts of the code.

A possible downside is reduced diversity in tests. A research question is whether mimicking real executions is worthwhile overall.

Better choice of arguments

Suppose that Randoop wants to create a call to a method with a parameter of type T1. Currently, Randoop selects an object of type T1 at random. Randoop can do better. Here is one idea; other approaches are possible.

Randoop could determine which fields a method will read, and then choose an object for which those fields have been recently side-effected. As an example, when obtaining a Point to pass to the getX() method, a Point produced by the setX() or move() method is more interesting than one produced by the setY() method.
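A minimal sketch of this intuition, using a hypothetical mutable Point class (not any particular library's Point):

// Hypothetical mutable Point used to illustrate argument selection.
public class Point {
  private int x, y;
  public void setX(int x) { this.x = x; }
  public void setY(int y) { this.y = y; }
  public void move(int dx, int dy) { this.x += dx; this.y += dy; }
  public int getX() { return x; }
}

// When Randoop wants a receiver for getX(), a Point whose x field was
// recently written is a more interesting choice:
//   Point p1 = new Point(); p1.setX(5);   // x was side-effected
//   Point p2 = new Point(); p2.setY(7);   // x untouched; less interesting
// Preferring p1 over p2 makes the call to getX() more likely to observe
// the effect of a prior mutation.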

As another example, Randoop could choose objects that are most likely to achieve additional coverage in the called method, perhaps because the values will cause particular branches to be taken.

Avoid known bad arguments

Sometimes Randoop creates tests that do not terminate or that take a long time to execute. For example, Randoop once tried to create a test that executed an enormously large exponentiation:


bigFraction3 = -104 / 5
bigInteger25 = 1561321067192303346130420089702102441699871430851380131368332415577876285112504379523988395794851806993803452448222179606997671319406528676747268949561732973027887
bigFraction3.pow(bigInteger25);

Such a call will not terminate in a reasonable amount of time.

To detect infinitely-looping tests, Randoop can run with the --usethreads command-line option. However, that reduces performance.

When a programmer can characterize which inputs to a method are reasonable, it would be nice for Randoop to avoid creating such tests.
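For example, a programmer might characterize reasonable inputs with a simple predicate that the generator consults before building a call; the filter below is a hypothetical sketch of such a characterization, not an existing Randoop feature:

import java.math.BigInteger;

// Hypothetical argument filter: reject exponents that would make a call
// such as bigFraction3.pow(bigInteger25) run effectively forever.
public class PowArgumentFilter {

  // An exponent is "reasonable" if its magnitude is modest.
  private static final BigInteger MAX_EXPONENT = BigInteger.valueOf(10_000);

  public static boolean isReasonableExponent(BigInteger exponent) {
    return exponent.abs().compareTo(MAX_EXPONENT) <= 0;
  }
}

// A generator aware of this predicate would skip building the pow() call
// whenever isReasonableExponent(...) returns false, rather than spending
// minutes (or forever) executing it.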

Determining when Randoop assertions are incorrect

Randoop creates regression tests that capture the current behavior of a program. A regression test fails if a programmer changes the program's behavior. However, the program's behavior might already be wrong; that is, the program might be buggy. How can we detect when Randoop has created a test case that captures incorrect/buggy behavior?

The only way to find a mistake is to compare two things. Here are some things that could be compared to Randoop test assertions:

Generating tests when we know there is a bug

Scenario: we know there is a bug, and we want better tests.

Here are three challenges or research questions:

Here are some differences; understanding them might help with solving the problems:

It is easy to generate more test inputs and to accept their results as correct. But are they? One possibility: Daikon is surprisingly good at generalizing; it could be used in conjunction with other resources (e.g., the bug report).

Is it good or bad to add more regression tests to a test suite before APR? Here is a paper that did this evaluation, for plausible patches but not for correct patches: “Alleviating patch overfitting with automatic test generation: a study of feasibility and effectiveness for the Nopol repair system” https://link.springer.com/article/10.1007/s10664-018-9619-4 . From its abstract: “The main result is that automatic test generation is effective in alleviating one kind of overfitting issue–regression introduction, but due to oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue–incomplete fixing.”

Side note: how realistic is APR from failing tests?

APR (automated program repair) fixes bugs without human intervention. The input to an APR tool is a program and a test suite, such that some of the tests pass and some of the tests fail. The goal of APR is to edit the program so that all of the tests pass.

Many evaluations of APR tools have been performed, using version control histories that contain real bugs that have been fixed. However, these evaluations are a bit unrealistic:

How well do APR techniques do on more realistic scenarios?

This is related to test generation. If a test generation tool, given the post-fix tests, cannot generate error-revealing regression tests when run on the post-fix version, then the problem is not the oracle problem; rather, more work is needed on generating test inputs.

Related: A result by Jacques Klein (in work under submission) is that about 50% of bugs are detected by a failure of a regression test generated on the fixed version.

Nondeterminism

When a method is nondeterministic, such as any method that returns a Set, the user currently needs to exclude that method from consideration by Randoop. It would be better to permit Randoop to call methods that do not depend on the nondeterminism. For example, rather than avoiding construction of all Sets, Randoop could create them, and call methods like size() and contains(). Randoop would not iterate through the Set nor print it (or, better, the Set could iterate/print in a deterministic order).
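For example, here is the kind of hand-written test (a sketch, assuming a java.util.HashSet whose iteration order is unspecified) that only uses order-independent observers and therefore remains deterministic:

import java.util.HashSet;
import java.util.Set;
import org.junit.Assert;

public class SetObserverTest {

  public static void test1() {
    Set<String> s = new HashSet<>();
    s.add("a");
    s.add("b");

    // These observers do not depend on iteration order, so their results
    // are stable across runs and JVMs.
    Assert.assertEquals(2, s.size());
    Assert.assertTrue(s.contains("a"));

    // By contrast, asserting on s.toString() or s.iterator().next()
    // would depend on iteration order and should be avoided.
  }
}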

Improved unit test generation via GRT techniques

Programmers do not enjoy writing unit tests. The result is under-tested software and lingering bugs. Tools exist to automatically generate tests, but they are not widely used. The goal of this project is to understand why, and to improve the test generation tools.

In particular, there is a paper on Guided Random Testing (GRT) that proposes improvements and shows experimentally that these improvements are helpful. In a recent competition among testing tools, GRT won. However, the authors have refused to release their tools and experiments. The goal of this project is to reproduce that work and determine whether the techniques are useful. If they are, make the tool and experiments publicly available. If they are not, improve the techniques.

The research question is how to improve test generation techniques, with an initial focus on the techniques proposed in the "Guided Random Testing" paper.

The methodology is to re-implement and reproduce the GRT paper's techniques and results, and then to evaluate the benefits of each one. Incorporate the ones that help into an open-source unit test generator, and understand and improve the other ones. As a stretch goal, go beyond the GRT techniques: invent new improvements to unit testing.

GRT's techniques are

Constant mining (static)
Extract constants from the program under test, both for global usage (seeding the main object pool) and for local usage as inputs to specific methods. (A rough sketch of constant mining appears after this list.)
Impurity (static + dynamic)
Perform static purity analysis for all methods under test. At run-time, fuzz selected input from the object pool based on a Gaussian distribution and purity results.
Elephant brain (dynamic)
Manage method sequences (to create inputs) in the object pool with the exact types obtained at run-time.
Detective (static + dynamic)
Analyze the method input and return type dependency statically, and construct missing input data on demand at run-time.
Orienteering (dynamic)
Favor method sequences that require lower cost to execute, which accelerates testing for other components.
Bloodhound (dynamic)
Guide method selection and sequence generation by coverage information at run-time.
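As a rough illustration of the first technique, constant mining, the sketch below harvests public static final constants via reflection to seed an input pool. This is only an assumption about one possible implementation, not GRT's actual code:

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Rough sketch of constant mining: harvest public static final field
// values from the classes under test and use them to seed the pool of
// candidate inputs.
public class ConstantMiner {

  public static List<Object> mineConstants(List<Class<?>> classesUnderTest) {
    List<Object> pool = new ArrayList<>();
    for (Class<?> c : classesUnderTest) {
      for (Field f : c.getDeclaredFields()) {
        int mods = f.getModifiers();
        if (Modifier.isStatic(mods) && Modifier.isFinal(mods) && Modifier.isPublic(mods)) {
          try {
            pool.add(f.get(null)); // value of a static constant
          } catch (IllegalAccessException e) {
            // Skip constants we cannot read.
          }
        }
      }
    }
    return pool;
  }
}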

Creating independent tests

The goal of this project is to make Randoop output independent unit tests, rather than tests that must be run in a specific order.

Randoop makes test sequences out of methods that may have side effects. This can cause running one test to affect the outcome of a subsequent, putatively independent, test.

For example, suppose that method m1 modifies a global variable, and method m2 reads it. Further suppose that test t1 includes a call to m1, and test t2 includes a call to m2. The result of m2 (and, thus, whether test t2 succeeds) depends on whether test t1 has already been run in the same JVM. All of the tests are entangled, which makes it difficult to run a subset of them, to understand them, or to debug failures.
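A minimal sketch of that scenario (Counter, m1, and m2 are hypothetical names used only for illustration):

import org.junit.Assert;

// Hypothetical class with global (static) state.
class Counter {
  static int count = 0;

  static void m1() { count++; }        // side-effects global state
  static int m2() { return count; }    // reads global state
}

public class EntangledTests {

  public static void test1() {
    Counter.m1();  // leaves count == 1 behind
  }

  public static void test2() {
    // Passes when run alone (count == 0), but fails if test1 already ran
    // in the same JVM (count == 1).
    Assert.assertEquals(0, Counter.m2());
  }
}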

Ideally, all of the tests in a test suite should be independent, in that they can be run in any order.

Approach 1: Undo state changes

One approach is to undo state changes: for each global field that a Randoop test sets (directly or indirectly), the test should reset it at the end of the test.

This requires determining which fields the test changes and restoring them to their original values.
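A minimal sketch of the reset step, assuming some other analysis has already identified which static fields the test writes:

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Save and restore the static fields that a test is known to modify.
// Which fields to track would come from a separate (static or dynamic)
// analysis; here they are passed in explicitly.
public class StaticFieldSnapshot {

  private final Map<Field, Object> saved = new HashMap<>();

  // Called at the start of the test.
  public void save(Iterable<Field> fieldsWrittenByTest) throws IllegalAccessException {
    for (Field f : fieldsWrittenByTest) {
      f.setAccessible(true);
      saved.put(f, f.get(null));
    }
  }

  // Called at the end of the test to undo its global side effects.
  public void restore() throws IllegalAccessException {
    for (Map.Entry<Field, Object> e : saved.entrySet()) {
      e.getKey().set(null, e.getValue());
    }
  }
}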

A research question is whether this degrades the fault-finding ability of the test suite. Setting global fields may change the behavior of subsequent tests — is that a positive or a negative? It would also be interesting to exclude the fields from being modified by Randoop, via the --omit-methods and --omit-field command-line options. How much does that degrade Randoop's fault-finding ability?

Approach 2: convert a set of interdependent tests into independent ones

The goal of this project is to convert the Randoop output, which is currently a set of interdependent tests, into a set of independent tests.

Suppose Randoop outputs 100 small unit tests. An idea is to split each test into the part that depends on other tests and the independent part, then move the dependent parts into a new test. The result is 101 tests: 100 of them are even smaller than before but completely independent of one another, and the last one is relatively large but captures all the dependencies.
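A small hand-constructed illustration of the split (using java.util.Locale's default locale as the global state):

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.junit.Assert;

public class SplitExample {

  // Original test: mixes a call that changes global (static) state with
  // statements that are self-contained.
  public static void test7_original() {
    Locale.setDefault(Locale.FRANCE);     // writes global state: dependent part
    List<String> l = new ArrayList<>();
    l.add("x");
    Assert.assertEquals(1, l.size());     // independent part
  }

  // After splitting: a smaller test that is independent of all others...
  public static void test7_independent() {
    List<String> l = new ArrayList<>();
    l.add("x");
    Assert.assertEquals(1, l.size());
  }

  // ...and one big test that concatenates the dependent parts of every
  // original test and must run last (or in its own JVM).
  public static void test_allDependentParts() {
    Locale.setDefault(Locale.FRANCE);
    // (dependent parts of the remaining tests would follow here)
  }
}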

Either a static or a dynamic analysis could perform this task.

Here are some additional details.

Background

A test consists of alternating actions and observations:

a1
o11
o12
...
o1m_1
a2
o21
o22
...
o2m_2
...

Each action is a method invocation that Randoop chose because it believed the action method would affect the state of some object. Each observation is a method invocation that Randoop chose because it believed the observer method would NOT affect the state of any object but would reveal information about it. The test oracle checks the result of each method call (actions and observers) against an expected value that was obtained by running the method at test generation time.
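Here is a concrete, hand-written instance of this structure:

import java.util.ArrayList;
import java.util.List;
import org.junit.Assert;

public class ActionObservationExample {

  public static void test1() {
    // a1: action -- believed to change the state of some object.
    List<String> list = new ArrayList<>();
    list.add("x");

    // o11, o12: observations -- believed not to change any state, but to
    // reveal information about it; each is checked against the value
    // recorded at generation time.
    Assert.assertEquals(1, list.size());
    Assert.assertTrue(list.contains("x"));

    // a2: another action.
    list.remove("x");

    // o21: observation after the second action.
    Assert.assertTrue(list.isEmpty());
  }
}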

Input

n tests that might depend on one another

Output

n independent tests that may be run in any order (each of these is a subset of one of the original tests), plus one big test that concatenates all n of the original tests. The big test must be run in a fresh JVM or sandbox, or at least as the last test of the suite.

The big test may seem problematic -- too hard to understand and debug -- but it is no worse than the original suite, and at least it is honest in putting all the possible effects together. I suspect that slicing, delta debugging, or other test case minimization approaches will be extremely effective on the big test for those situations where it fails but none of the little tests does.

One could imagine putting only a subset of the original tests into the big test suite, but including everything is more likely to catch serendipitous interactions among rarely-called methods. We could empirically evaluate whether that benefit is worthwhile.

An analysis can compute 2 distinct properties of a test case:

Only the latter property matters: whether the test is independent, that is, whether it can be run in any order and has no side effects on global (static) state. We could compute the property via a static analysis, or by executing the tests to see what happens.

Static analysis seems cleaner to me. Compute which action methods can write to static fields. For each test t, construct a new unit test that is just like t but omits all actions (and their associated observers) that side-effect static fields. Now, those unit tests can be run in any order. Also concatenate all omitted actions (or, more likely, all actions whatsoever) into one big test that must be run in a new JVM.
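A minimal sketch of the first step, finding methods that directly write static fields, using the ASM bytecode library (how the result feeds into test construction is omitted):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Reports which methods of a class contain a PUTSTATIC instruction,
// i.e., directly write a static field. (A complete analysis would also
// follow calls to other methods.)
public class StaticWriteFinder {

  public static Set<String> methodsWritingStaticFields(String className) throws IOException {
    Set<String> writers = new HashSet<>();
    ClassReader reader = new ClassReader(className);
    reader.accept(
        new ClassVisitor(Opcodes.ASM9) {
          @Override
          public MethodVisitor visitMethod(
              int access, String name, String descriptor, String signature, String[] exceptions) {
            return new MethodVisitor(Opcodes.ASM9) {
              @Override
              public void visitFieldInsn(int opcode, String owner, String fieldName, String desc) {
                if (opcode == Opcodes.PUTSTATIC) {
                  writers.add(name + descriptor);
                }
              }
            };
          }
        },
        0);
    return writers;
  }
}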

Dynamic analysis

In a new JVM, run all tests and compute oracles. For each test t, run (in a new JVM) t followed by all tests. If any test does not satisfy its oracle, then say that the test t has side effects on global state. (This isn't quite right: The effect of test t may be masked by the effect of the first test, for example.) Retain all side-effect-free tests as unit tests. (Verify that they don't affect one another by running them all in a new JVM, in random order; do this (say) 5 times.) Concatenate all other tests (or maybe all tests whatsoever) as one big test that must be run in a new JVM, or at least as the last test of the suite.

Questions

What % of tests are separable? This will affect whether the technique is able to create independent tests, or whether those end up empty and the algorithm's output is just the big hairball test. That would be no worse than the current situation, and at least the user would know that interdependent tests are a problem.

Flakiness

Randoop can call APIs whose value changes from run to run of tests, and this can cause flaky tests (tests that spuriously fail). Examples of such APIs are ones that retrieve the current date, the username, etc. Two ways to mitigate such flakiness are: (1) mock the nondeterministic APIs so that they return fixed values when the tests run, and (2) avoid generating assertions whose expected values come from such APIs.

EvoSuite follows approach #1 so far, and you can see its source code for the APIs it mocks. Approach #2 would avoid the need for a dedicated runtime library.
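As a minimal sketch of approach #1, code under test could go through a seam that lets tests pin the nondeterministic value; the TimeSource class below is a hypothetical illustration, not EvoSuite's or Randoop's mechanism:

// Hypothetical seam for the current time: production code calls
// TimeSource.currentTimeMillis() instead of System.currentTimeMillis(),
// so generated tests can install a fixed clock and stay deterministic.
public class TimeSource {

  private static long fixedTime = -1; // -1 means "use the real clock"

  public static void useFixedTime(long millis) { fixedTime = millis; }

  public static void useRealTime() { fixedTime = -1; }

  public static long currentTimeMillis() {
    return (fixedTime >= 0) ? fixedTime : System.currentTimeMillis();
  }
}

// A generated test would then begin with, e.g.:
//   TimeSource.useFixedTime(1_000_000L);
// so that assertions over time-dependent values do not flake.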

Robustness

Randoop executes methods under test with random inputs, and this can cause usability problems. Code under test may exhaust memory or other resources to the point that it impedes Randoop from making progress, or even causes Randoop itself to crash. Making Randoop more robust in the face of these kinds of failures would greatly improve the tool's usability and attract more users.

One approach is to create a wrapper around Randoop: a process that monitors test generation and can terminate/restart the generator if it fails to make progress or aborts unexpectedly. For more details on the wrapper idea, see Finding Errors in .NET with Feedback-Directed Random Testing (Section 3).
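A minimal sketch of such a wrapper, assuming Randoop is launched as an external process; the command-line arguments and time budget below are placeholders:

import java.util.concurrent.TimeUnit;

// Launch Randoop as a child process and restart it if it exceeds a time
// budget or exits abnormally. The classpath and arguments below are
// placeholders, not a complete Randoop invocation.
public class RandoopMonitor {

  public static void main(String[] args) throws Exception {
    int attempt = 0;
    while (attempt < 3) {
      attempt++;
      Process randoop =
          new ProcessBuilder("java", "-cp", "randoop.jar:myclasses", "randoop.main.Main", "gentests")
              .inheritIO()
              .start();

      boolean finished = randoop.waitFor(10, TimeUnit.MINUTES);
      if (!finished) {
        randoop.destroyForcibly(); // stuck: kill and retry with a fresh process
        continue;
      }
      if (randoop.exitValue() == 0) {
        return; // generation succeeded
      }
      // Abnormal exit: fall through and retry.
    }
  }
}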

Another approach is to use a safe security manager that can easily be extended by the user. Currently, the security manager has to be specified by the user in the normal fashion on the command line.

Parallelizing Randoop

This project would modify Randoop so that it can run on multiple machines. This would greatly increase the efficiency of the system, by distributing the generation effort across multiple processors. The test generation techniques that Randoop uses are amenable to parallelization: Randoop could spawn multiple test generation engine instances, each with a different random seed.

There are a number of challenges to overcome in parallelizing Randoop. Two examples:

A downside of this project is that generation speed is not the most important problem for Randoop; effort would probably be better invested elsewhere.

Additional required skills: knowledge of Java APIs for spawning processes, creating streams, and reading/writing files would also be helpful.

Side effect analysis

Randoop's --side-effect-free-methods command-line argument informs Randoop which methods are pure (perform no side effects).

There are some enhancements that would make Randoop's use of it even more compelling.

Perhaps use Soot's side effect analysis, or some other side effect analysis.

User interface and user experience projects

Test cases with debugging clues

Debugging is difficult. When a test fails, it can be very time-consuming to understand the source of the failure in order to fix the code (or the test). For example, consider the following test generated by Randoop:

public static void test1() {

  ArrayList l = new ArrayList(7);
  Object o = new Object();
  l.add(o);
  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);
  // This assertion fails
  Assert.assertEquals(s, s);
}

This test reveals an error in Sun's JDK (version 1.5). It shows a short sequence of calls leading up to the creation of an object that is not equal to itself! Unfortunately, it is difficult to tell, simply by looking at the test, what the source of the failure is.

This project would augment Randoop with the ability to provide debugging clues: comments within a failing test that provide potentially-useful facts about the failure and can help the user with debugging. For example, an improved version of the above test may look something like this:

public static void test1() {

  // Test fails for any capacity >= 0.
  // Test throws IllegalArgumentException for any capacity <0.
  ArrayList l = new ArrayList(7);

  Object o = new Object();

  // Test passes when o is a Comparable.
  l.add(o);

  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);

  // This assertion fails
  Assert.assertEquals(s, s);
}

The new test provides several clues that can help the user avoid spending time debugging. The key observation is that the test passes if we use a Comparable object in the call l.add(o). Further investigation would reveal that the TreeSet constructor should not accept a list containing a non-comparable object, but it does, and this leads to the error.
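For instance, the passing variation behind the second comment might look like the following (hand-written for illustration, assuming the same imports as the tests above):

public static void test1Variation() {

  ArrayList l = new ArrayList(7);
  Object o = "hello";             // a Comparable (String) instead of a bare Object
  l.add(o);
  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);
  // With a Comparable element, the assertion passes.
  Assert.assertEquals(s, s);
}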

To output tests with debugging comments, Randoop would have to be modified in two ways: it would need to generate and execute variations of a failing test, and it would need to synthesize the results of those executions into useful comments.

In theory this sounds simple, but several issues make the problem interesting: creating a good set of variations of a test, and synthesizing the results of test executions into useful comments, are both non-trivial tasks.

Other projects

These projects, some of which are small and others of which are quite ambitious and would make research contributions, are a good way to get started with Randoop development and to make a contribution.