Cautions on Interpreting and Using the SATE Data

SATE 2008 was the first such exposition that we conducted, and it taught
us many valuable lessons. Most importantly, our analysis should NOT be
used as a direct source for rating or choosing tools; this was never the
goal of SATE.

No metric or set of metrics is accepted by the research community as
capturing all aspects of tool performance. We caution readers not to
apply unjustified metrics based on the SATE data.

Because security weaknesses vary widely in type and nature, defining
clear and comprehensive analysis criteria is difficult. As SATE progressed,
we realized that our analysis criteria were not adequate, so we adjusted
the criteria during the analysis phase. As a result, the criteria were
not applied consistently. For instance, we were inconsistent in marking
the severity of warnings where we disagreed with the tool's assessment.
In addition, our analysis of the tool reports contains errors.

The test data and analysis procedure employed have serious limitations
and may not indicate how these tools perform in practice. The results
may not generalize to other software because the choice of test cases,
as well as the size of test cases, can greatly influence tool performance.
Also, we analyzed a small, non-random subset of tool warnings and in
many cases did not associate warnings that refer to the same weakness.

The tools were used in this exposition differently from their use
in practice. In practice, users write special rules, suppress false
positives, and write code in certain ways to minimize tool warnings.

We did not consider the user interface, integration with the development
environment, and many other aspects of the tools. In particular, the
tool interface is important for a user to efficiently and correctly
understand a weakness report.

Participants ran their tools against the test sets in February 2008.
The tools continue to progress rapidly, so some observations from
the SATE data may already be obsolete.

Because of the above limitations, SATE should not be interpreted as a
tool testing exercise. The results should not be used to draw
conclusions about which tools are best for a particular application
or about the general benefit of using static analysis tools.

