Note: The planning meeting for SATE V was held on Monday, March 4, 2013 at NIST, from 1 to 4pm.
The Static Analysis Tool Exposition (SATE) is designed to advance research, based on large test sets, in static analysis tools that find security-relevant defects in source code, and to encourage improvement of such tools. Briefly, participating tool makers run their tools on a set of programs. Researchers led by NIST analyze the tool reports. The results and experiences are reported at a workshop. The tool reports and analysis are made publicly available later.
SATE's purpose is NOT to evaluate or choose the "best" tools. Rather, it is aimed at exploring the following characteristics of tools: relevance of warnings to security, their correctness, and prioritization. Its goals are:
Note. A warning is an issue (usually, a weakness) identified by a tool. A (tool) report is the output from a single run of a tool on a test case. A tool report consists of warnings.
"Report on the Static Analysis Tool Exposition (SATE) IV," Vadim Okun, Aurelien Delaitre, and Paul E. Black, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-297, January, 2013
The NIST Software Assurance Metrics And Tool Evaluation (SAMATE) project conducted the fourth Static Analysis Tool Exposition (SATE IV) to advance research in static analysis tools that find security defects in source code. The main goals of SATE were to enable empirical research based on large test sets, encourage improvements to tools, and promote broader and more rapid adoption of tools by objectively demonstrating their use on production software.
Briefly, eight participating tool makers ran their tools on a set of programs. The programs were four pairs of large code bases selected with regard to entries in the Common Vulnerabilities and Exposures (CVE) dataset and approximately 60 000 synthetic test cases, the Juliet 1.0 test suite. NIST researchers analyzed approximately 700 warnings by hand, matched tool warnings to the relevant CVE entries, and analyzed over 180 000 warnings for Juliet test cases by automated means. The results and experiences were reported at the SATE IV Workshop in McLean, VA, in March 2012.
SATE IV, as well as its predecessors, taught us many valuable lessons. Most importantly, our analysis should NOT be used as a basis for rating or choosing tools; this was never the goal.
There is no single metric or set of metrics that is considered by the research community to indicate or quantify all aspects of tool performance. We caution readers not to apply unjustified metrics based on the SATE data.
Due to the nature and variety of security weaknesses, defining clear and comprehensive analysis criteria is difficult. While the analysis criteria have been much improved since the first SATE, further refinements are necessary.
The test data and analysis procedure employed have limitations and might not indicate how these tools perform in practice. The results may not generalize to other software because the choice of test cases, as well as the size of test cases, can greatly influence tool performance. Also, we analyzed a small subset of tool warnings.
The procedure that was used for finding CVE locations in the CVE-selected test cases and selecting related tool warnings, though improved since SATE 2010, has limitations, so the results may not indicate tools’ ability to find important security weaknesses.
Synthetic test cases are much smaller and less complex than production software. Weaknesses may not occur with the same frequency in production software. Additionally, for every synthetic test case with a weakness, there is one test case without a weakness, whereas in practice, sites with weaknesses appear much less frequently than sites without weaknesses. Due to these limitations, tool results, including false positive rates, on synthetic test cases may differ from results on production software.
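The effect of the balanced design can be illustrated with a small calculation. This is a sketch under assumed numbers, not SATE data: the detection rates and the 1-in-100 prevalence below are hypothetical. It shows that a tool with fixed per-site behavior produces a very different share of correct warnings when weak sites are rare.

```python
def precision(tpr, fpr, prevalence):
    """Share of warnings that point at real weaknesses.

    tpr: fraction of weak sites the tool flags
    fpr: fraction of sound sites the tool flags
    prevalence: fraction of all sites that are weak
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical tool: flags 80% of weak sites and 10% of sound sites.
TPR, FPR = 0.8, 0.1

# One flawed case per non-flawed case, as in the synthetic suite.
balanced = precision(TPR, FPR, 0.5)
# Hypothetical production code: 1 weak site per 100 sites.
skewed = precision(TPR, FPR, 0.01)

print(f"balanced: {balanced:.3f}, skewed: {skewed:.3f}")
```

With these assumed rates, most warnings are correct on the balanced set, while most are false on the skewed one, even though the tool itself has not changed.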
The tools were used in this exposition differently from their use in practice. We analyzed tool warnings for correctness and looked for related warnings from other tools, whereas developers use tools to determine what changes need to be made to software, and auditors look for evidence of assurance. Also in practice, users write special rules, suppress false positives, and write code in certain ways to minimize tool warnings.
We did not consider the tools’ user interfaces, integration with the development environment, or many other aspects of the tools that are important for a user to efficiently and correctly understand a weakness report.
Teams ran their tools against the test sets in August through October 2011. The tools continue to progress rapidly, so some observations from the SATE data may already be out of date.
Because of the stated limitations, SATE should not be interpreted as a tool testing exercise. The results should not be used to draw conclusions regarding which tools are best for a particular application or the general benefit of using static analysis tools.
We invite participation from makers of static analysis tools that find weaknesses relevant to security. We welcome commercial, research, and open source tools. Participation in SATE is FREE.
The following summarizes the steps in the SATE procedure. The dates are subject to change.
The exposition consists of 3 language tracks: C/C++, Java and PHP (Alternatives: C#, Python).
SATE addresses different aspects of static analysis tools by using complementary kinds of test cases and analysis methods.
Teams run their tools and submit reports following specified conditions.
Finding all weaknesses in a reasonably large program is impractical. Also, due to the likely high number of tool warnings, analyzing all warnings may be impractical. Therefore, we select subsets of tool warnings for analysis.
Generally, the analyst first selects issues for analysis, then finds the associated warnings from tools. This results in a subset of tool warnings, which is then analyzed.
Statistically select the same number of warnings from each tool report, assigning higher weight to categories of warnings with higher severity and avoiding categories of warnings with low severity.
We selected 30 warnings from each tool report using the following procedure:
If a tool did not assign severity, we assigned severity based on weakness names and our understanding of their relevance to security.
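A severity-weighted selection of this kind might look like the sketch below. It is an illustration only: the weights, the choice to skip severities 4 and 5, the convention that severity 1 is the most severe, and the dictionary layout of a warning are all assumptions, not the procedure SATE actually used.

```python
import random

# Hypothetical weights. Severity 1 is assumed to be the most severe;
# severities 4 and 5 are left out, so they are never selected.
SEVERITY_WEIGHTS = {1: 4.0, 2: 2.0, 3: 1.0}

def select_warnings(warnings, n=30, seed=0):
    """Pick up to n distinct warnings, favoring higher-severity ones."""
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    pool = [w for w in warnings if w["severity"] in SEVERITY_WEIGHTS]
    chosen = []
    while pool and len(chosen) < n:
        weights = [SEVERITY_WEIGHTS[w["severity"]] for w in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))  # pop so no warning is picked twice
    return chosen

# Made-up report: 200 warnings with severities cycling through 1..5.
report = [{"id": i, "severity": 1 + i % 5} for i in range(200)]
sample = select_warnings(report)
print(len(sample), sorted({w["severity"] for w in sample}))
```

Drawing without replacement and with a fixed seed keeps the sample free of duplicates and reproducible, which matters when several analysts review the same selection.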
Security experts manually analyze the test cases and identify the most important weaknesses (manual findings). Analyze for both design weaknesses and source code weaknesses, focusing on the latter. Since manual analysis combines multiple weaknesses with the same root cause, we anticipate a small number of manual findings, e.g., 10-25 per test case. Take special care to confirm that the manual findings are indeed weaknesses. Tools may be used to aid human analysis, but static analysis tools cannot be the main source of manual findings.
Check the tool reports to find warnings related to the manual findings. For each manual finding, for each tool: find at least one related warning, or conclude that there are no related warnings.
This method is useful because it is largely independent of tools and thus includes weaknesses that may not be found by any tools. It also focuses analysis on weaknesses found most important by security experts.
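The bookkeeping for this step, one entry per manual finding and tool, can be sketched as follows. The matching rule (same file, nearby line), the field names, and the sample data are hypothetical; in SATE, deciding whether a warning is related to a finding is a judgment made by an analyst, which a heuristic like this could at most pre-filter.

```python
def related(finding, warning, max_line_gap=5):
    """Hypothetical heuristic: same file and a nearby line number."""
    return (finding["file"] == warning["file"]
            and abs(finding["line"] - warning["line"]) <= max_line_gap)

def match_findings(findings, reports):
    """Map (finding id, tool name) to one related warning id, or None."""
    table = {}
    for finding in findings:
        for tool, warnings in reports.items():
            hit = next((w["id"] for w in warnings if related(finding, w)), None)
            table[(finding["id"], tool)] = hit
    return table

# Made-up data: one manual finding and two tool reports.
findings = [{"id": "M1", "file": "auth.c", "line": 120}]
reports = {
    "toolA": [{"id": "A7", "file": "auth.c", "line": 122}],
    "toolB": [{"id": "B3", "file": "util.c", "line": 10}],
}
table = match_findings(findings, reports)
print(table)
```

Recording `None` explicitly distinguishes "no related warning" from "not yet checked", mirroring the two outcomes named above.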
For each CVE-selected pair of test cases, check the tool reports to find warnings that identify the CVEs in the vulnerable version. Check whether the warnings are still reported for the fixed version.
This method is useful because it focuses analysis on exploited weaknesses.
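The comparison between the vulnerable and fixed versions can be sketched as below. Keying a CVE location by file and function, the placeholder labels `CVE-A` to `CVE-C`, and the outcome names are assumptions made for illustration; in SATE, analysts match warnings to CVE entries.

```python
def cve_outcomes(cve_locations, vulnerable_warnings, fixed_warnings):
    """Classify each CVE by whether a tool flags it before and after the fix."""
    def flagged(warnings, location):
        return any((w["file"], w["function"]) == location for w in warnings)

    outcomes = {}
    for cve, location in cve_locations.items():
        before = flagged(vulnerable_warnings, location)
        after = flagged(fixed_warnings, location)
        if not before:
            outcomes[cve] = "not found"
        elif after:
            outcomes[cve] = "found, still reported after fix"
        else:
            outcomes[cve] = "found, gone after fix"
    return outcomes

# Made-up locations and warnings for one tool on one test case pair.
cve_locations = {
    "CVE-A": ("parse.c", "read_header"),
    "CVE-B": ("net.c", "recv_packet"),
    "CVE-C": ("log.c", "format_line"),
}
vulnerable = [
    {"file": "parse.c", "function": "read_header"},
    {"file": "net.c", "function": "recv_packet"},
]
fixed = [{"file": "net.c", "function": "recv_packet"}]

outcomes = cve_outcomes(cve_locations, vulnerable, fixed)
print(outcomes)
```

A warning that persists in the fixed version is not necessarily wrong, since the fix may be incomplete, so that outcome is a prompt for closer review rather than a verdict.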
SATE IV will use the same guidelines as SATE 2010. See the detailed criteria for analysis of correctness and significance and criteria for associating warnings.
Assign one of the following categories to each warning analyzed.
For each tool warning in the list of selected warnings, find warnings from other tools that refer to the same (or related) weakness. For each selected warning instance, our goal is to find at least one related warning instance (if it exists) from each of the other tools. While there may be many warnings reported by a tool that are related to a particular warning, we do not attempt to find all of them.
We will use the following degrees of association:
Mark tool warnings related to manual findings with one of the following:
We plan to analyze the data collected and present the following in our report:
The SATE IV output format is the same as the SATE 2010 format. SATE 2008 and 2009 outputs are subsets and are therefore compliant with the latest version.
In the SATE tool output format, each warning includes:
The latest SATE XML schema file can be downloaded.
Teams are encouraged to use the schema file for validation, for example:
To save you the time of figuring out how to compile the test cases, we provide a virtual machine (VM). It contains all the test cases for all the tracks, along with all dependencies required to compile them. Follow the compilation instructions below to compile the test cases.
Participants will need to download the software VMware Player to run the VM. It is available for free on several operating systems.
The VM runs Ubuntu Linux 11.04. Sun JavaEE 5 is installed in the directory "/opt". You may want to tune the number of virtual CPUs and the amount of memory of the VM. The virtual machine needs to be shut down to make these changes.
The main account on the VM is "sate" and its password is "sate". It has administration privileges through the "sudo" command.
Merging on Windows:
> copy /b SATE4-VM.tar.bz2.part0+SATE4-VM.tar.bz2.part1+SATE4-VM.tar.bz2.part2 SATE4-VM.tar.bz2
Merging on Linux/Mac:
$ cat SATE4-VM.tar.bz2.part0 SATE4-VM.tar.bz2.part1 SATE4-VM.tar.bz2.part2 > SATE4-VM.tar.bz2
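Where neither command is convenient, the same concatenation can be done with a short script. This is a minimal sketch, assuming the three part files sit in the current directory; the function name `merge_parts` is ours, not part of any SATE tooling.

```python
import shutil
from pathlib import Path

def merge_parts(parts, output):
    """Concatenate the part files, in the order given, into one archive."""
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large parts are fine
    return Path(output).stat().st_size

# Part names as used in the commands above.
names = [f"SATE4-VM.tar.bz2.part{i}" for i in range(3)]
if all(Path(name).exists() for name in names):
    size = merge_parts(names, "SATE4-VM.tar.bz2")
    print(size, "bytes written")
```

The parts must be passed in numeric order, exactly as in the `copy /b` and `cat` commands.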
For each test case, we provide the download link(s), additional information about test cases if applicable, and compilation instructions. While we provide the compilation instructions for Ubuntu Linux 10.04, the test cases should compile on other operating systems.
$ ./configure
$ make
NOTE. Dovecot does memory allocation differently from other C programs. Its memory management is described here.
Compilation, on a fresh installation of Ubuntu 11.04 with GCC 4.5.2:
$ sudo apt-get install bison flex libgtk2.0-dev libgnutls-dev libpcap-dev
$ ./configure
$ make
Download the C/C++ synthetic test cases from our server.
NOTE. For the Java test cases, you need to download and install JDK 5.0 with Java EE. Then point your "JAVA_HOME" environment variable to where the JRE is installed ("/opt/SDK/jdk/jre" by default).
$ sudo su
# apt-get install ant
# export JAVA_HOME=/opt/SDK/jdk/jre
# ant
NOTE. We updated some of the compilation scripts for the 5.5.13 version of Tomcat. The code remains unchanged.
NOTE. To compile different versions of Tomcat on the same computer, it may be necessary to remove files left over from a previous compilation in /usr/share/java.
$ sudo apt-get install maven2
$ export JAVA_HOME=/opt/SDK/jdk/jre
$ mvn compile test-compile
Download the Java synthetic test cases from our server.
The SATE IV organizing meeting was Friday, 4 March 2011 in McLean, Virginia co-located with the 14th Semi-Annual Software Assurance Forum.