Cautions on Interpreting and Using the SATE Data

SATE 2009, as well as its predecessor, SATE 2008, taught us many valuable 
lessons.  Most importantly, our analysis should NOT be used as a basis for 
rating or choosing tools; this was never the goal of SATE.

There is no single metric or set of metrics that is considered by the 
research community to indicate or quantify all aspects of tool performance. 
We caution readers not to apply unjustified metrics based on the SATE data.

Because security weaknesses vary widely in kind and nature, defining
clear and comprehensive analysis criteria is difficult. While the analysis
criteria have been improved since SATE 2008, further refinements are
necessary and are in progress.

The test data and analysis procedure employed have limitations and might not 
indicate how these tools perform in practice. The results may not generalize 
to other software because the choice of test cases, as well as the size of 
test cases, can greatly influence tool performance. Also, we analyzed only
a small subset of tool warnings.

The tools were used in this exposition differently from their use in 
practice. We analyzed tool warnings for correctness and looked for related 
warnings from other tools, whereas developers use tools to determine what 
changes need to be made to software, and auditors look for evidence of 
assurance. Also, in practice, users write special rules, suppress false
positives, and write code in certain ways to minimize tool warnings.

We did not consider the user interface, integration with the development
environment, or many other aspects of the tools, which are important for a
user to efficiently and correctly understand a weakness report.

Teams ran their tools against the test sets from late August to early
September 2009. The tools continue to progress rapidly, so some observations
from the SATE data may already be out of date.

Because of the stated limitations, SATE should not be interpreted as a tool
testing exercise. The results should not be used to draw conclusions
regarding which tools are best for a particular application or the general
benefit of using static analysis tools.

