Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code
November 27, 2001

This recent paper from a group of researchers at Stanford University presents a new method for detecting and inferring bugs in code. Previous research from this group had established a method for rule-based checking of code -- that is, by specifying a rule (:a: must follow :b:), the checker could verify that every occurrence of :a: was preceded by an occurrence of :b:. This paper extends that work with a method for inferring these correctness rules from the source code itself, removing programmers' lossy memories from the rule-creation process and allowing any program to be checked without a priori knowledge of its correctness rules.

The main intuition behind this work is that the beliefs of the programmers can be inferred from common coding practice and common sense. For instance, dereferencing a pointer implies a belief that the pointer is non-null. The checker derives what it takes to be the programmers' beliefs; if it encounters a contradiction, it suspects an error and flags the site for the user to examine. As the paper emphasizes, the key point is that contradictory beliefs can be found without any knowledge of which belief is correct.

The paper is motivated by the authors' self-professed desire to "find as many serious bugs as possible." Six checkers of the sort described above are implemented, covering a variety of error classes. These checkers consider two kinds of programmer beliefs -- MAY beliefs and MUST beliefs. MUST beliefs are fairly cut and dried -- if a programmer dereferences a pointer, he must believe it to be non-null. MAY beliefs are more ambiguous -- the fact that foo() always follows bar() implies only that the programmer may believe the two go together.

MUST beliefs are generally checked by internal consistency analysis. That is, a belief inferred at one place in the code is propagated to all related locations; if two beliefs conflict at any location, the conflict signals a possible bug. MAY beliefs lend themselves to a technique the authors term statistical analysis: statistical trends are derived across the codebase, and where code deviates from a strong trend, it is flagged as a possible error. As one might expect, there needs to be a ranking metric for errors and some threshold cutoff, lest all lines of code be flagged as buggy. The chosen metric compares the number of examples supporting a belief to the number of counterexamples, and ranks beliefs by how far the observed ratio departs from an expected one. Again, as one would expect, the larger the codebase, the more reliable this statistical analysis becomes.

Four of the checkers are examined in depth. The first checks internal null consistency, looking for three types of errors: check-then-use (a pointer is tested against null, implying it may be null, yet is dereferenced on a path where it still may be), use-then-check (a pointer is dereferenced and only afterwards tested against null), and redundant checks (a pointer is uselessly tested when its value is already known). The second, a security checker, looks for places where the kernel may unsafely dereference a pointer supplied by user space. The third attempts to infer failure conditions, using statistical analysis to find places where possible error returns are not appropriately handled. The fourth attempts to derive temporal rules of the form no :a: after :b:, which capture relative temporal relationships in code (e.g., no use of a pointer after it has been freed). Sketches of the ranking computation and of the kind of code each checker flags appear below.
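To make the ranking concrete, here is a minimal sketch of a deviance score, assuming the standard z statistic for a proportion (which I believe is what the paper uses); the function name, the expected ratio p0, and the example counts below are my own choices, not the paper's.

    #include <math.h>
    #include <stdio.h>

    /* Score a candidate belief by how far the observed ratio of
     * supporting examples departs from an expected ratio p0.
     * n = examples + counterexamples observed in the codebase;
     * e = observations that support the belief.  Higher scores mean
     * stronger evidence that the counterexamples are real bugs. */
    static double deviance_score(int e, int n, double p0)
    {
        double observed = (double)e / (double)n;
        return (observed - p0) / sqrt(p0 * (1.0 - p0) / (double)n);
    }

    int main(void)
    {
        /* Suppose 998 of 1000 call sites check a function's return
         * value: the belief "callers must check" ranks very high,
         * and the 2 deviant sites are worth auditing first. */
        printf("z = %.2f\n", deviance_score(998, 1000, 0.9));
        return 0;
    }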
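The null-consistency checker is easiest to see in code. The fragment below is my own contrived example (the paper's examples come from real kernels), packing all three error patterns into one function; it compiles, but every flagged line is a deliberate bug.

    #include <stdio.h>

    struct req { int len; };

    static void log_error(void) { fprintf(stderr, "bad request\n"); }

    int handle(struct req *a, struct req *b)
    {
        /* check-then-use: the test implies a belief that a MAY be
         * null; the dereference just below implies a MUST NOT be
         * null.  The contradiction is flagged without knowing which
         * belief is the right one. */
        if (a == NULL)
            log_error();          /* bug: forgot to return here */
        int len = a->len;         /* flagged: possible null dereference */

        /* use-then-check: b is dereferenced before it is tested, so
         * the test comes too late to prevent a crash. */
        len += b->len;
        if (b == NULL)            /* flagged */
            return -1;

        /* redundant check: a was already dereferenced above, so it is
         * known non-null on every path that reaches this test. */
        if (a == NULL)            /* flagged: dead test */
            return -1;

        return len;
    }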
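The security checker targets code like the following. This is a userland-compilable sketch, not real kernel code: copy_from_user() is the genuine Linux interface for moving data across the user/kernel boundary, but the stub, struct params, and dev_ioctl here are invented for illustration.

    #include <string.h>

    struct params { int op; };
    #define EFAULT 14

    /* Stub standing in for the kernel's copy_from_user(), which
     * safely copies n bytes from a user-supplied address and returns
     * nonzero if the address is bad. */
    static unsigned long copy_from_user(void *to, const void *from,
                                        unsigned long n)
    {
        memcpy(to, from, n);   /* the real one validates `from` first */
        return 0;
    }

    long dev_ioctl(unsigned int cmd, unsigned long arg)
    {
        struct params p;

        /* flagged: direct dereference of a pointer handed in by an
         * untrusted user process -- the user controls this address,
         * so the kernel can be made to read arbitrary memory or crash. */
        p = *(struct params *)arg;

        /* the consistent, safe form the checker expects: */
        if (copy_from_user(&p, (const void *)arg, sizeof(p)))
            return -EFAULT;

        (void)cmd;              /* command dispatch omitted */
        return 0;
    }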
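The failure checker's inference is purely statistical: if nearly every call site guards a function's result, the unguarded sites are suspect. A contrived userland analogue, with names of my own choosing:

    #include <stdlib.h>
    #include <string.h>

    /* If hundreds of call sites across the codebase test malloc()'s
     * result against NULL, the checker infers the MAY belief "this
     * function can fail" and ranks the deviant, unchecked sites as
     * likely bugs. */
    int save_name(const char *name, char **out)
    {
        char *buf = malloc(strlen(name) + 1);
        strcpy(buf, name);    /* flagged: buf is used unchecked, so the
                                 program crashes if malloc() failed */
        *out = buf;
        return 0;
    }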
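Finally, a sketch of what the temporal checker catches, again with invented names; the rule "no use of p after free(p)" is the classic instance of no :a: after :b:.

    #include <stdio.h>
    #include <stdlib.h>

    struct conn { int fd; };

    /* If free() is the last touch of its argument at almost every
     * call site, the checker infers the rule "no use of p after
     * free(p)" and flags the rare violations. */
    void close_conn(struct conn *c)
    {
        free(c);
        printf("closed fd %d\n", c->fd);   /* flagged: use after free */
    }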
This paper seemed incredibly successful -- one of the authors' goals was to find as many serious bugs as possible, and they proved the approach by finding a number of bugs in the OpenBSD and Linux kernels. Their work also seemed largely successful in minimizing the false positive rate. For statistical analysis of the type employed in this paper, specifying appropriate metrics and thresholds would seem to be the most difficult part, but this work succeeded admirably. That the method requires no a priori knowledge of the system's semantics is what makes it brilliant; most user-specifiable rules about correctness seem derivable from this type of analysis.

One of the cutest and most interesting parts of this work is that it can identify places in the code where a programmer might not fully understand the interface, yet has managed to remove all compiler errors and warnings and get the program to work in the common case. By discovering the inconsistent (but syntactically correct) use of APIs, the tool flags suspicious code for audit. In general, this seems to be one of those great-ideas papers -- the ideas are so obvious in retrospect that I find it hard to believe the work had not already been done. I anticipate the day when gcc, Project Builder for Mac OS X, or Visual Studio for Windows has this feature.