Java Comparability


Contents:

Introduction

Variables of the same declared type (such as int or float) may contain unrelated information. In this case we say that the two variables have different abstract data types. For example the integer variable x may represent the time in seconds since 1970 and the integer variable y may represent the outside temperature in Celsius. The variables x and y represent two different abstract types. Variables of the same abstract type are said to be comparable. Two variables are presumed to be of the same abstract type if they are used together in an expression (eg, x+y, x>y, x=y, etc). We often use the term 'interact' as a shorthand for this.

Comparability is important to dynamic invariant detection. Dynamic invariant detection looks for invariants between each possible combination of variables. If variables are not comparable, then it makes no sense to look for invariants between them. For example, using the definitions of x and y from above, it makes no sense to look for invariants between x and y (even though some (eg, x > y) are almost certainly true).

Our dynamic invariant detection tool (Daikon) looks for invariants over a subset of the variables, fields, and expressions in the target program. We refer to these as Daikon variables. We are interested in determining for each Daikon variable what other Daikon variables it is comparable to. Consider the following class:

    class A {
      int x;
      int y;
      void m1(float f) {}
    }
There are two Daikon class variables: A.x and A.y. There are also three Daikon variables associated with A.m1: A.m1.this.x, A.m1.this.y, and A.m1.f. A.x and A.y represent all of the values of x and y across all instances of A. The m1 variables represent all of the values of x, y, and f across all calls to m1().

Currently, we make use of the static tools Ajax (Java) and Lackwit (C) to determine comparability. Unfortunately, both of these tools are slow, buggy, and can handle only a relatively small subset of programs. It is Mike's opinion that the design/implementation of these tools makes fixing their shortcomings impossible (or at least extremely difficult).

We are thus interested in developing a comparability analysis of our own. We are considering both static and dynamic approaches. The attributes of each approach and possible design approaches are discussed below.

Definition of Comparability

The definition of comparability presented above is an intuitive one. Many definitions are possible, but a thorough understanding of the goal is necessary in order to understand how to achieve it.

Here is a possible more formal definition: comparability captures bidirectional dataflow reachability. Dataflow reachability indicates when one value can affect another. It indicates which definitions flow to (or affect) which uses; equivalently, it determines, for a given use, which definitions might have defined the value that is obtained at the use. We wish to relate all definitions and uses that are (transitively) related.

Here are some questions about the definition, with proposed answers in italics.

  1. If fields in an object interact before that object is assigned to a variable, are the corresponding fields of the variable comparable?

    For example, in a.x += a.y; c = a; Is c.x comparable to c.y?

    Yes, previous information should be included in comparability

  2. If the fields of an object interact after a variable is assigned away from that object, are the corresponding fields in the variable comparable?

    For example, in c = a; c = null; a.x += a.y; Is c.x comparable to c.y?

    No. Variable c's fields never interacted. Interactions that occur after an object is referenced by a variable are not relevant.

  3. Is comparability between formal parameters determined only by the code within the method? Or, if two actual parameters are comparable are the corresponding formal parameters comparable as well?

    Consider the method m1(A a, B b). Formal parameters a and b do not interact within m1(). If we call m1() with actual parameters a1 and b1 that are comparable, do the formal parameters become comparable?

    I think that the comparability of actual parameters should flow to the formal parameters. This implies that if the parameters to a method are comparable in any call that the formal parameters are comparable in all calls.

  4. If the actual parameters interact after the call to the routine, are the formal parameters still related?

    For example, consider m1(a1, b1); m1(a2, b2); a1 < b1; Are the formal parameters for m1 comparable?

    Again, I think they are. This is not exactly the same situation as previously discussed for interactions that occur after a variable refers to an object. A variable can be assigned away from an object, but the formal parameter is in some sense always related to its actual parameters.

  5. If two formal parameters become comparable because their actual parameters in one call are comparable (see previous issue), does this imply that all actual parameters should become comparable?

    For example, consider:

        int a, b, c, d;
        a == b;
        m(a,b);
        m(c,d);
        
    Are c and d comparable?

    No. There was no interaction between c and d. A similar situation exists if the formal parameters interact in only some calls (due to a branch). Only in the calls where they actually interact should the corresponding actual parameters become comparable.

  6. If two variables ever point to the same object, should that make each of their fields comparable?

    For example, consider a1 = a2. Does that make a1.x and a2.x comparable?

    It seems like it should make them comparable. Or you could argue that it really makes a1 and a2 comparable, and has no impact on the fields of a1 and a2.

  7. There is a hierarchy of Daikon variables at different program points. Can Daikon variables which are children of the same parent have different comparability?

    For example, in class A with fields x and y and methods m1() and m2(), the Daikon variables A.m1.this.x and A.m2.this.x are both children of A.x. The same is true for field y. Can A.m1.this.x and A.m1.this.y be comparable while A.m2.this.x and A.m2.this.y are not?

    The answer to this (and several other issues) depends on the definition of comparability. If comparability is defined as capturing bidirectional dataflow reachability then it is reasonable for the comparability of instance variables to be different at different methods. There may be no instances of A in m2() where this.x and this.y were comparable. But there may be other instances of A in m1() where this.x and this.y were comparable.

    If, however, two variables are considered to be comparable if they are of the same abstract type, then it wouldn't seem to be possible for two children of the same parent to have different comparability (because each child would have the same abstract type).

  8. Are fields in a class always comparable with themselves across different instantiations?

    For example, should a1.x and a2.x always be comparable since they are the same field in the same class?

    These should not be comparable. There are many circumstances where a field can be used to hold completely unrelated values. We shouldn't assume the values of a field are always related.

  9. Does assignment (alone) make two variables comparable?

    For example, consider the following:

        Class T {
          int a = 7;
          int b = 15;
          void m1() {
            a = b;
            b = 0;
          }
          void swap() {
            int tmp = a;
            a = b;
            b = tmp;
          }
        }
        
    At the end of m1() or swap() are a and b comparable?

    Under either definition of comparability it seems that a and b should be comparable as a result of the assignment. If comparability is based on abstract types, then the types for a and b would be the same based on the assignment. If it is based on bidirectional dataflow reachability, it seems that the data from b flowed into a and thus they should be comparable.

  10. How do constants affect comparability?

    For example, does a1.x = 7; a2.x = 7; imply that a1.x and a2.x are comparable?

    Certainly for common constants such as 0, 1, -1, not-a-number, etc, assigning like constants should not imply comparability. It is not clear what should happen for more unusual numbers. Certainly if the number represents some common symbolic value it should imply comparability.

Static Analysis

This approach tracks data flow statically through the entire program. Conceptually, it is very similar to the type analysis used by Alan and Adam in Jiggetai, except that sets of variables are flowed rather than sets of types.

The approach can be completely sound, but it must make conservative assumptions. The result is that it may be imprecise (depending, of course, on the program). It's my feeling (based partially on the results from Lackwit) that this imprecision may be a significant problem (i.e., a significant number of variables may be indicated as comparable that are not). For example, consider the following code:

    Class A {
      int x;
      int y;
    }

    List alist = new ArrayList();
    A a1 = new A(); A a2 = new A(); A a3 = new A();
    alist.add (a1); alist.add (a2); alist.add (a3)
    int z = 7;
    A a4 = alist.get(1);
    z += a4.x
It is unlikely that a static analysis will be able to determine which A was retrieved from the list. While only a4.x and a2.x are truly comparable to z, a static analysis will probably conclude that a1.x, a2.x, a3.x, a4.x are comparable to z (and each other).

The advantage of a static approach is that it is not necessary to run the program and the results don't depend on the test suite. We also may be able to take advantage of Alan and Adams work to implement this relatively quickly.

Dynamic Analysis

A dynamic analysis executes the program over some set of input data (perhaps a test suite). All operations are tracked and when two values interact they become comparable. This would probably be implemented by rewriting the Java byte codes to include extra operations to track comparability. It could also be implemented via the JVM debugging interface, but our feeling is that this would be too slow (since many statements would be changed, this would require virtually single stepping through the program).

This document presents two different possible approaches to dynamically determining comparability, called "tracking variables" and "tracking objects". Each approach requires certain information to be tracked about each value (allocated object or primitive) in the program. Each approach also keeps a separate data structure that maps Daikon variables to comparability information, using the tracked information about each value.

The presentation makes two simplifying assumptions. First, we assume that there is some mechanism to associate the tracked information with each object (for example, each object could have extra space allocated within it for the tracked information, or a separate table mapping objects to the tracked information could be used). Second, we assume that primitives can be treated in the same way. The "Maintaining tracking information for each value" section discusses implementation strategies.

Tracking variables

As mentioned, our goal is to determine comparability for Daikon variables. Unfortunately, operations within the program do not take place on Daikon variables. When an operation occurs in the program, it is not known which Daikon variables (if any) are involved. This approach keeps track, for each object, of the set of Daikon variables that refer to it. This is referred to as the object's variable set. The variable set for each object is maintained as the program is running. Any time an interaction between two objects occurs, all of the variables in the corresponding variable sets are marked as comparable.

For example, in the following code, the object created by 'new A()' is referenced by both a1 and a2. The variable set for that object is thus {a1, a2}

    A a1 = new A();
    A a2 = a1;
The variable set for each object is maintained as follows. When an object is created, its variable set is initialized to its classname. In the example above, the variable set for the object created by 'new A()' is initialized to "A".

Variable sets are also updated when an assignment statement (var = obj) is executed. Any Daikon variable names associated with left hand side of the assignment must be removed from the variable set of the object currently referenced by var. The same names also must be added to the variable set of object. The names are also recursively applied to any fields within the object up to Daikon's dereference limit.

Determining the Daikon variable names associated with the left hand side (lhs) depends on the type of the lhs. If the lhs is a program variable (not a field), its Daikon variable name is the fully qualified name of the program variable if that name matches a Daikon variable. If the lhs is a field expression, the object containing the field may itself be referred to by multiple Daikon variable names. The Daikon variable names for the lhs are thus formed by appending the field name to each element of the container objects variable set.

Operational Details

There are two primary data structures:

The following details what the algorithm needs to do for each construct in the program:

Variable Tracking Issues

Tracking objects

In this approach, each value (object or primitive) tracks its equivalence set of other objects that are comparable to it. This can be thought of as a tag, such that all comparable objects have the same tag. It is implemented as a reference into a union-find data structure.

For example, consider:

    class A {
      int x;
      int y;

      void m1() {
      }
    }

    A a1 = new A();
    double z = 705;
    }
When 'new A()' is executed an object of class A is created and assigned a new, unique tag. Fields such as x and y are not assigned tags -- but if they are assigned values, the values will (already) have tags. Likewise, 'double z = 705' assigns a value (including its tag) to variable z.

Whenever two values interact, their tags are merged (indicating that their values are in the same equivalence set) using the union-find algorithm.

There is a global table that contains comparability information for each Daikon variable. At each instrumented program point, each Daikon variable is evaluated (as an expression) to determine the tag of its value. Two variables are made comparable if the values they reference have the same tags. In the previous example, when m1() is called, the instrumented code will determine the value (and thus the tag) referred to by this.x and this.y.

Operational Details

There are two primary data structures.

The procedure update(ppt) copies information from objsets to varsets[ppt]. Each pair of variables (eg, x and y) at program point ppt are compared:

  update(ppt) {
    if find(x.tag) == find(y.tag)
        varsets[ppt].union ("x", "y")
  }
The following details what the algorithm needs to do for each construct in the program:

At the end of the run: For each program point ppt, read off comparability information from varsets[ppt]. Use a bottom-up algorithm to compute comparability for non-leaf program points. Make a final pass across all program points so that if two variables are comparable at a higher point, any versions of them at a lower program point are also related. (For instance, m1.this.x and m1.this.y would be comparable iff m2.this.x and m2.this.y were.)

Object Tracking Issues

Maintaining tracking information for each value

We have assumed that we can add extra information (a union-find tag, or a list of variables) to each value computed by a program.

For objects, this is easy: place the information in a field, or create a separate table mapping objects to the tracked information could be used. In both approaches, the tracked information goes away when the object is garbage-collected -- so long as the separate table uses weakly held keys, as Java's WeakHashMap does. (Likewise, union-find data structures permit elements to be added and removed at will, and this will be done automatically by the garbage collector.)

Adding extra information to a primitive value is more involved. One approach is to convert all primitives to wrappers (e.g., int becomes Integer) and treat them like other objects.

A more efficient implementation is to create a shadow variable for each variable of primitive type. For every operation on the real variable, the instrumenter inserts corresponding operations on the shadow variable.

In the context of the Java Virtual Machine, an elegant implementation of the shadow variables is to maintain a shadow stack. For each primitive value on the main stack, the shadow stack contains its tracked information. Whenever a primitive value is pushed onto the main stack (for any reason), its tracked information is pushed onto the shadow stack. Because procedure call and return use the stack, the tracked information accompanies primitives automatically, without any special mechanism. The same is true for intermediate results of computations, like "(a+b)*c".

Issues/Questions