Randoop project ideas

Do you love writing unit tests to find bugs in your programs? Or, would you prefer that the tests were written for you automatically?

The Randoop tool writes tests for you.

Writing tests is a difficult and time-consuming activity, and yet it is a crucial part of good software engineering. If you are interested in improving the state of the art in test generation and bug finding, read on.

Our system, Randoop, automatically generates unit tests for Java classes. For example, here is a test case generated by Randoop that reveals an error in the Sun JDK 1.5:

public static void test1() {
  LinkedList l1 = new LinkedList();
  Object o1 = new Object();
  l1.addFirst(o1);
  TreeSet t1 = new TreeSet(l1);
  Set s1 = Collections.unmodifiableSet(t1);
  // This assertion fails
  Assert.assertTrue(s1.equals(s1));
}

The ultimate goal is a testing system that at the push of a button creates a solid JUnit test suite providing good coverage of arbitrary classes, with no manual effort on your part. Randoop is a step in the direction of this vision and it already does a very good job, especially with library classes (such as java.util). Randoop is already used at companies like ABB and Microsoft, and on open-source projects.

Some of the projects are interesting and useful engineering that will make Randoop more effective for developers. Some projects are publishable research that extends scientific knowledge about test generation approaches. And many projects include both aspects.

Required skills: These projects require facility with Java, because Randoop is written in Java. You should be very comfortable with the idea of unit testing. You should know how unit testing helps you and where it can hinder you, and you should be passionate about improving the tradeoff. You should be willing to dive into and understand a moderate-sized codebase.

If you want to discuss these projects, feel free to contact the Randoop developers at randoop-developers@googlegroups.com. We look forward to hearing from you!

Test generation projects

Starter project: fix an issue

One way to get familiar with the Randoop codebase is to fix some issue in the issue tracker. Particularly good choices are issues labeled as "help wanted".

Better choice of methods

Randoop selects a method to test at random. Uniform random selection is the simplest, but not always the best, option. This project would augment Randoop with different, and potentially novel, strategies for selecting the next method to test. Here are a few example strategies; the project would involve fleshing out the details.

Mimic real call sequences

Randoop's generated test sequences are random and may not be characteristic of real use. One problem is that there may be restrictions on the usage of a class: close should be called only after open, or arguments must have particular relationships, such as the requirement that the non-array argument to binary search appear in the array argument. Another problem is that random invocations may not create interesting data structures, such as very large ones or ones with particular qualities such as duplicate elements.

Randoop does a good job of generating tests for library classes (such as java.util.*) that have few restrictions on their calling sequences and parameters. Classes that make up programs often have a number of restrictions on both the sequences of calls and their parameters.

We want to extend the system to provide better coverage for such classes. Randoop could infer which sequences of calls are valid and invalid. For example, Randoop could observe what call sequences real code makes, perhaps using a tool like Palulu. Then, bias could be introduced into the generator to make call sequences that are more realistic, and therefore better test the most commonly-used and important parts of the code.
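For instance, the bias could be implemented as a first-order Markov model over call pairs observed in real client code. Here is a minimal sketch of that idea; the MarkovMethodChooser class and its integration point are hypothetical, not part of Randoop:

import java.lang.reflect.Method;
import java.util.*;

// Hypothetical sketch: bias next-method selection using call-pair
// frequencies observed in real client executions (e.g., collected by
// a tool like Palulu). Not part of Randoop.
class MarkovMethodChooser {
  // counts.get(prev).get(next) = how often `next` followed `prev`
  // in the observed executions.
  private final Map<Method, Map<Method, Integer>> counts = new HashMap<>();
  private final Random random = new Random();

  void observe(Method prev, Method next) {
    counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(next, 1, Integer::sum);
  }

  // Choose the next method to call, weighted by observed frequency.
  // Unobserved pairs get weight 1, so every method stays possible.
  Method choose(Method prev, List<Method> candidates) {
    Map<Method, Integer> next = counts.getOrDefault(prev, Collections.emptyMap());
    int total = candidates.stream().mapToInt(m -> next.getOrDefault(m, 1)).sum();
    int r = random.nextInt(total);
    for (Method m : candidates) {
      r -= next.getOrDefault(m, 1);
      if (r < 0) return m;
    }
    return candidates.get(candidates.size() - 1);
  }
}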

A possible downside is reduced diversity in tests. A research question is whether mimicking real executions is worthwhile overall.

Better choice of arguments

Suppose that Randoop wants to create a call to a method with a parameter of type T1. Currently, Randoop selects an object of type T1 at random. Randoop can do better. Here is one idea; other approaches are possible.

Randoop could determine which fields a method will read, and then choose an object for which those fields have been recently side-effected. As an example, when obtaining a Point to pass to the getX() method, a Point produced by the setX() or move() method is more interesting than one produced by the setY() method.

As another example, Randoop could choose objects that are most likely to achieve additional coverage in the called method, perhaps because the values will cause particular branches to be taken.
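As a sketch of the first idea (choosing objects whose fields were recently side-effected), the generator could track which fields were recently written to each candidate object and prefer candidates whose written fields overlap the fields the method under test reads. All names below are hypothetical; fieldsRead would come from a static analysis and recentlyWritten from bookkeeping during generation:

import java.util.*;

// Hypothetical sketch: prefer a candidate argument whose
// recently-written fields overlap the fields the method under test
// will read. Neither input exists in Randoop today.
class ArgumentChooser {
  static Object pickArgument(List<Object> candidates,            // assumed non-empty
                             Set<String> fieldsRead,             // from static analysis
                             Map<Object, Set<String>> recentlyWritten) {
    Object best = candidates.get(0);
    int bestOverlap = -1;
    for (Object candidate : candidates) {
      Set<String> written = new HashSet<>(
          recentlyWritten.getOrDefault(candidate, Collections.emptySet()));
      written.retainAll(fieldsRead);  // fields both recently written and about to be read
      if (written.size() > bestOverlap) {
        bestOverlap = written.size();
        best = candidate;
      }
    }
    return best;
  }
}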

Avoid known bad arguments

Sometimes Randoop creates tests that do not terminate or that take a long time to execute. For example, Randoop once tried to create a test that executed an enormously large exponentiation:


bigFraction3 = -104 / 5
bigInteger25 = 1561321067192303346130420089702102441699871430851380131368332415577876285112504379523988395794851806993803452448222179606997671319406528676747268949561732973027887
bigFraction3.pow(bigInteger25);

Such a call will not terminate in a reasonable amount of time.

To detect infinitely-looping tests, Randoop can run with the --usethreads command-line option. However, that reduces performance.

When a programmer can characterize which inputs to a method are reasonable, it would be nice for Randoop to avoid creating such tests.
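One possible shape for this is a programmer-supplied precondition that the generator consults before emitting a call. The Preconditions registry below is illustrative only, not an existing Randoop API; it would reject the enormous pow call shown above:

import java.math.BigInteger;
import java.util.function.Predicate;

// Hypothetical sketch of programmer-supplied preconditions that
// Randoop could consult before emitting a call.
class Preconditions {
  // Reject enormous exponents for BigFraction.pow (see example above):
  // allow only exponents whose magnitude fits in 16 bits.
  static final Predicate<Object[]> POW_PRECONDITION =
      args -> ((BigInteger) args[0]).abs().bitLength() <= 16;

  static boolean isReasonableCall(String methodName, Object[] args) {
    if (methodName.equals("pow")) {
      return POW_PRECONDITION.test(args);
    }
    return true;  // by default, all inputs are considered reasonable
  }
}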

Improved test generation via GRT techniques

GRT (Guided Random Testing) is a variant of the Randoop test generator that adds some features to Randoop. In a recent competition among testing tools, GRT won.

Programmers who want to use GRT are out of luck, because GRT's source code isn't available. Furthermore, some of the features that GRT claims as contributions were already part of Randoop, which casts doubt on the experimental methodology. Some of GRT's ideas are certainly good ones, but we don't know which ones.

GRT's techniques are:

Constant mining (static): extract constants from the program under test, both for global use (seeding the main object pool) and for local use as inputs to specific methods.
Impurity (static + dynamic): perform a static purity analysis of all methods under test; at run time, fuzz selected inputs from the object pool, guided by a Gaussian distribution and the purity results.
Elephant brain (dynamic): manage the method sequences (used to create inputs) in the object pool with the exact types obtained at run time.
Detective (static + dynamic): statically analyze dependencies between method inputs and return types, and construct missing input data on demand at run time.
Orienteering (dynamic): favor method sequences that cost less to execute, which accelerates testing of the other components.
Bloodhound (dynamic): guide method selection and sequence generation by run-time coverage information.

The goal of this project is to re-implement some or all of the GRT techniques in the open-source Randoop implementation, and then to evaluate the benefits of each one.
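As one concrete example of what a re-implementation might look like, here is a minimal sketch of the Orienteering idea: record each sequence's execution cost, and when choosing a sequence to extend, prefer cheap ones. The class below is illustrative and simplifies GRT's actual weighting:

import java.util.*;

// Hypothetical sketch of GRT's Orienteering: favor cheap-to-execute
// sequences when picking an existing sequence to extend.
class Orienteering<S> {
  private final Map<S, Long> costNanos = new HashMap<>();
  private final Random random = new Random();

  // Record how long a sequence took when it was last executed.
  void recordCost(S sequence, long nanos) {
    costNanos.put(sequence, Math.max(1, nanos));
  }

  // Select a sequence with probability proportional to 1/cost,
  // so cheap sequences are extended more often.
  S select(List<S> sequences) {
    double total = 0;
    for (S s : sequences) total += 1.0 / costNanos.getOrDefault(s, 1L);
    double r = random.nextDouble() * total;
    for (S s : sequences) {
      r -= 1.0 / costNanos.getOrDefault(s, 1L);
      if (r <= 0) return s;
    }
    return sequences.get(sequences.size() - 1);
  }
}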

Creating independent tests

The goal of this project is to make Randoop output independent unit tests, rather than tests that must be run in a specific order.

Randoop makes test sequences out of methods that may have side effects. This can cause running one test to affect the outcome of a subsequent, putatively independent, test.

For example, suppose that method m1 modifies a global variable, and method m2 reads it. Further suppose that test t1 includes a call to m1, and test t2 includes a call to m2. The result of m2 (and, thus, whether test t2 succeeds) depends on whether test t1 has already been run in the same JVM. All of the tests are entangled, which makes it difficult to run a subset of them, to understand them, or to debug failures.
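In code, the entanglement might look like this (hypothetical classes and tests):

import org.junit.Assert;

// Hypothetical illustration: t2's outcome depends on whether t1
// has already run in the same JVM.
class Counter {
  static int count = 0;                  // global (static) state
  static void increment() { count++; }   // m1: writes the global
  static int get() { return count; }     // m2: reads the global
}

class CounterTest {
  public static void test1() {  // t1 calls m1
    Counter.increment();
  }

  public static void test2() {  // t2 calls m2
    // Passes only if test1 has not already run in this JVM.
    Assert.assertTrue(Counter.get() == 0);
  }
}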

Ideally, all of the tests in a test suite should be independent, in that they can be run in any order.

Approach 1: Undo state changes

One approach is to undo state changes: for each global field that a Randoop test sets (directly or indirectly), the test should reset it at the end of the test.

This requires determining which fields the test changes and restoring them to their original values.
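A minimal sketch of the mechanism, using reflection to snapshot and restore static fields; this ignores final fields and state reachable through objects, which a real implementation would need to handle:

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.*;

// Sketch of the undo idea: snapshot the static fields of the classes
// a test touches, run the test, then restore the snapshot.
class StaticStateSnapshot {
  private final Map<Field, Object> saved = new HashMap<>();

  void save(Class<?>... classes) throws IllegalAccessException {
    for (Class<?> c : classes) {
      for (Field f : c.getDeclaredFields()) {
        if (Modifier.isStatic(f.getModifiers()) && !Modifier.isFinal(f.getModifiers())) {
          f.setAccessible(true);
          saved.put(f, f.get(null));  // static field: receiver is null
        }
      }
    }
  }

  void restore() throws IllegalAccessException {
    for (Map.Entry<Field, Object> e : saved.entrySet()) {
      e.getKey().set(null, e.getValue());
    }
  }
}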

A research question is whether this degrades the fault-finding ability of the test suite. Setting global fields may change the behavior of subsequent tests — is that a positive or a negative? It would also be interesting to prevent Randoop from modifying the fields in the first place, via the --omitmethods and --omit-field-list command-line options. How much does that degrade Randoop's fault-finding ability?

Approach 2: Convert a set of interdependent tests into independent ones

The goal of this project is to convert the Randoop output, which is currently a set of interdependent tests, into a set of independent tests.

Suppose Randoop outputs 100 small unit tests. One idea is to split each test into the part that depends on other tests and the independent part, then move the dependent parts into a new test. The result is 101 tests: 100 of them are even smaller than before but completely independent of one another, and the last one is relatively large but captures all the dependencies.

Either a static or a dynamic analysis could perform this task.

Here are some additional details.

Background

A test consists of alternating actions and observations:

a1
o11
o12
...
o1m_1
a2
o21
o22
...
o2m_2
...

Each action is a method invocation that Randoop chose because Randoop believed the action method would affect the state of some object. Each observation is a method invocation that Randoop chose because Randoop believed the observer method would NOT affect the state of any object but would reveal information about it. The test oracle checks the result of each method call (actions and observers) against an expected value that was obtained by running the method at test generation time.
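Concretely, a test with this structure might look as follows (the expected values in the assertions would have been recorded when the test was generated):

public static void test() {
  List<String> list = new ArrayList<>();  // setup
  list.add("a");                          // a1: action (mutates the list)
  Assert.assertEquals(1, list.size());    // o11: observation
  Assert.assertFalse(list.isEmpty());     // o12: observation
  list.remove("a");                       // a2: action
  Assert.assertEquals(0, list.size());    // o21: observation
}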

Input

n tests that might depend on one another

Output

n independent tests that may be run in any order (each of these is a subset of one of the original tests), plus one big test that concatenates all n of the original tests. The big test must be run in a fresh JVM or sandbox, or at least as the last test of the suite.

The big test may seem problematic -- too hard to understand and debug -- but it is no worse than the original suite, and at least it is honest in putting all the possible effects together. I suspect that slicing, delta debugging, or other test case minimization approaches will be extremely effective on the big test for those situations where it fails but none of the little tests does.

One could imagine putting only a subset of the original tests into the big test suite, but including everything is more likely to catch serendipitous interactions among rarely-called methods. We could empirically evaluate whether that benefit is worthwhile.

An analysis can compute two distinct properties of a test case: whether the test is affected by global (static) state that other tests may have modified, and whether the test itself side-effects global state. Only the latter property matters: if no test side-effects global state, then all the tests can be run in any order. We could compute the property via a static analysis, or by executing the tests to see what happens.

Static analysis

Static analysis seems cleaner to me. Compute which action methods can write to static fields. For each test t, construct a new unit test that is just like t, but omits all actions (and their associated observers) that side-effect static fields. Now, those unit tests can be run in any order. Also concatenate all omitted actions (or, more likely, all actions whatsoever) into one big test that must be run in a new JVM.
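For example, the core of the write analysis could be a bytecode scan for PUTSTATIC instructions. A sketch, assuming the ASM bytecode library (a complete analysis would also follow calls transitively):

import java.io.IOException;
import java.util.*;
import org.objectweb.asm.*;

// Report which methods of a class contain a PUTSTATIC instruction,
// i.e., may directly write a static field.
class StaticWriteFinder {
  static Set<String> methodsWritingStatics(String className) throws IOException {
    Set<String> writers = new HashSet<>();
    ClassReader reader = new ClassReader(className);
    reader.accept(new ClassVisitor(Opcodes.ASM9) {
      @Override
      public MethodVisitor visitMethod(int access, String name, String desc,
                                       String signature, String[] exceptions) {
        return new MethodVisitor(Opcodes.ASM9) {
          @Override
          public void visitFieldInsn(int opcode, String owner, String fieldName,
                                     String fieldDesc) {
            if (opcode == Opcodes.PUTSTATIC) {
              writers.add(name + desc);  // record the enclosing method
            }
          }
        };
      }
    }, ClassReader.SKIP_DEBUG);
    return writers;
  }
}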

Dynamic analysis

In a new JVM, run all tests and compute oracles. For each test t, run (in a new JVM) t followed by all the other tests. If any test then fails to satisfy its oracle, say that test t has side effects on global state. (This isn't quite right: the effect of test t may be masked by the effect of an earlier test, for example.) Retain all side-effect-free tests as unit tests. (Verify that they don't affect one another by running them all in a new JVM, in random order; do this, say, 5 times.) Concatenate all other tests (or maybe all tests whatsoever) into one big test that must be run in a new JVM, or at least as the last test of the suite.

Questions

What percentage of tests are separable? This will affect whether the technique is able to create independent tests, or whether they end up empty and the algorithm's output is just the big hairball test. That would be no worse than the current situation, and at least the user would know that interdependent tests are a problem.

Flakiness

Randoop can call APIs whose return values change from run to run, and this can cause flaky tests (tests that spuriously fail). Examples of such APIs are ones that retrieve the current date, the username, etc. Two ways to mitigate such flakiness are:

  1. Mock the nondeterministic APIs, so that tests run against deterministic replacement implementations.

  2. Detect calls to nondeterministic APIs, and do not output assertions over their results.

EvoSuite follows approach #1 so far, and you can see its source code for the APIs it mocks. Approach #2 would avoid the need for a dedicated runtime library.
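For illustration, here is what approach #1 looks like for one nondeterministic API, using java.time.Clock from the JDK; the Greeter class is a made-up example of code under test:

import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import org.junit.Assert;

// Made-up class under test that takes its clock as a parameter
// rather than reading the system clock directly.
class Greeter {
  private final Clock clock;
  Greeter(Clock clock) { this.clock = clock; }
  boolean isAfterEpoch() { return Instant.now(clock).isAfter(Instant.EPOCH); }
}

class GreeterTest {
  public static void test() {
    // A fixed clock never varies, so the assertion cannot flake.
    Clock fixed = Clock.fixed(Instant.EPOCH, ZoneOffset.UTC);
    Assert.assertFalse(new Greeter(fixed).isAfterEpoch());
  }
}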

Robustness

Randoop executes methods under test with random inputs, and this can cause usability problems. Code under test may exhaust memory or other resources to the point that it impedes Randoop's progress, or even causes Randoop itself to crash. Making Randoop more robust in the face of these kinds of failures would greatly improve the tool's usability and attract more users.

One approach is to create a wrapper around Randoop: a process that monitors test generation and can terminate or restart the generator if it fails to make progress or aborts unexpectedly. For more details on the wrapper idea, see Finding Errors in .NET with Feedback-Directed Random Testing (Section 3).
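A minimal sketch of such a wrapper, using ProcessBuilder; the command line and time budget are illustrative:

import java.util.concurrent.TimeUnit;

// Run Randoop in a subprocess; restart it with a fresh seed if it
// exits abnormally or stalls past its time budget.
class RandoopWrapper {
  public static void main(String[] args) throws Exception {
    for (int seed = 0; seed < 10; seed++) {
      Process p = new ProcessBuilder(
              "java", "-cp", "randoop.jar:myclasses", "randoop.main.Main",
              "gentests", "--testclass=MyClass", "--randomseed=" + seed)
          .inheritIO()
          .start();
      // Give the generator a time budget; kill it if it stalls.
      if (!p.waitFor(10, TimeUnit.MINUTES)) {
        p.destroyForcibly();
        continue;  // restart with the next seed
      }
      if (p.exitValue() == 0) break;  // generation succeeded
    }
  }
}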

Another approach is to use a safe security manager that can easily be extended by the user. Currently, the security manager has to be specified by the user in the normal fashion on the command line.

Parallelizing Randoop

This project would modify Randoop so that it can run on multiple machines. This would greatly increase the efficiency of the system, by distributing the generation effort across multiple processors. The test generation techniques that Randoop uses are amenable to parallelization: Randoop could spawn multiple test generation engine instances, each with a different random seed.
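On a single machine, the same idea can be sketched with a thread pool running one generation engine per core; generateTests below is a placeholder for a full Randoop generation run:

import java.util.*;
import java.util.concurrent.*;

// Run several generation engines in parallel, each with its own
// random seed, and pool their outputs.
class ParallelGeneration {
  static List<String> generateTests(int seed) {
    return Collections.emptyList();  // placeholder for a real engine
  }

  public static void main(String[] args) throws Exception {
    int engines = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(engines);
    List<Future<List<String>>> results = new ArrayList<>();
    for (int seed = 0; seed < engines; seed++) {
      final int s = seed;
      results.add(pool.submit(() -> generateTests(s)));
    }
    List<String> allTests = new ArrayList<>();
    for (Future<List<String>> f : results) allTests.addAll(f.get());
    pool.shutdown();
    // A real implementation would deduplicate equivalent tests here.
  }
}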

There are a number of challenges to overcome in parallelizing Randoop.

A downside of this project is that generation speed is not the most important problem for Randoop; effort would probably be better invested elsewhere.

Additional useful skills: knowledge of Java APIs for spawning processes, creating streams, and reading/writing files.

Side effect analysis

Randoop's --observers command-line argument informs Randoop which methods are pure (perform no side effects). Here are two distinct benefits:

  1. Benefit 1: avoiding useless calls in tests.

    Ordinarily, Randoop assumes that a method call may change any object that is passed to it. For example, Randoop considers x to be different after
      x = foo();
    
    than after
      x = foo();
      y = bar(x);
    
    Randoop will treat the x values produced by the two sequences as different, and will use both in longer tests. By ignoring the x value produced by the second sequence, Randoop can spend half as long to produce equally good tests. Also, Randoop's tests are easier to understand, because they are shorter and do not contain mysterious, useless calls to pure methods.

    This is implemented in part, but needs additional testing and validation.

  2. Benefit 2: more test assertions.

    When Randoop creates regression tests, the tests contain assertions about the objects that the test creates. For instance, Randoop might output
      assertTrue(someObject.itsField == 22);
    In addition to reading fields of an object, Randoop should invoke observer methods. For instance, it should output
      assertTrue(someObject.observerMethod() == 22);
    This is only possible for pure methods — methods that have no side effect and return the same value every time they are called on the same arguments.

The goal of this project is to improve how Randoop utilizes pure methods for both of the purposes outlined above.

There are some other enhancements that would make this project even more compelling.

User interface and user experience projects

Test cases with debugging clues

Debugging is difficult. When a test fails, it can be very time-consuming to understand the source of the failure in order to fix the code (or the test). For example, consider the following test generated by Randoop:

public static void test1() {
  ArrayList l = new ArrayList(7);
  Object o = new Object();
  l.add(o);
  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);
  // This assertion fails
  Assert.assertTrue(s.equals(s));
}

This test reveals an error in Sun's JDK (version 1.5). It shows a short sequence of calls leading up to the creation of an object that is not equal to itself! Unfortunately, it is difficult to tell, simply by looking at the test, what the source of the failure is.

This project would augment Randoop with the ability to provide debugging clues: comments within a failing test that provide potentially-useful facts about the failure and can help the user with debugging. For example, an improved version of the above test may look something like this:

public static void test1() {

  // Test fails for any capacity >= 0.
  // Test throws IllegalArgumentException for any capacity < 0.
  ArrayList l = new ArrayList(7);

  Object o = new Object();

  // Test passes when o is a Comparable.
  l.add(o);

  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);

  // This assertion fails
  Assert.assertTrue(s.equals(s));
}

The new test provides several clues that can help the user avoid spending time debugging. The key observation is that the test passes if we use a Comparable object in the call l.add(o). Further investigation would reveal that the TreeSet constructor should not accept a list containing a non-comparable object, but it does, and this leads to the error.

To output tests with debugging comments, Randoop would have to be modified in two ways: it would have to generate and execute variations of a failing test, and it would have to synthesize the results of those executions into useful comments.

In theory this sounds simple, but both tasks are non-trivial: creating an informative set of variations of a test, and synthesizing the resulting executions into useful comments.
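A sketch of the first task, for the capacity input in the example above; all names are hypothetical, and the comments in the output test would be synthesized from the recorded pass/fail results:

import java.util.*;
import java.util.function.Predicate;

// Rerun a parameterized version of a failing test with systematic
// variations of one input, and record which variants pass.
class VariationRunner {
  // For the example above, `runTest` would build the ArrayList with
  // the given capacity and return whether the assertion held.
  static Map<Integer, Boolean> tryCapacities(Predicate<Integer> runTest) {
    Map<Integer, Boolean> passes = new LinkedHashMap<>();
    for (int capacity : new int[] {0, 1, 7, 100}) {
      passes.put(capacity, runTest.test(capacity));
    }
    return passes;  // e.g., {0=false, 1=false, 7=false, 100=false}
  }
}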

Other projects

These projects (some of which are small, and others of which are quite ambitious and would make research contributions) would be a good way to get started with Randoop development and make a contribution.