The value of a sound analyzer is that every one of its findings can be assumed to be correct, even if it cannot handle enormous programs or does not handle dozens of weakness classes. During the sixth Static Analysis Tool Exposition (SATE VI), we will analyze results and report tools that satisfy the SATE VI Ockham Sound Analysis Criteria. In brief, the criteria are: (1) the tool is claimed to be sound [1], (2) for at least one weakness class and one test case, the tool produces findings for appropriate [2] sites, and (3) even one incorrect finding disqualifies the tool [3].
These are the second Ockham Sound Analysis Criteria; the first were the SATE V Ockham Sound Analysis Criteria.
The report has been released: "SATE VI Ockham Sound Analysis Criteria," NIST Internal Report (IR) 8304, available from DOI 10.6028/NIST.IR.8304. The data and programs needed to reproduce the results are available at DOI 10.18434/M32187 or https://nist-sate-ockham-sound-analysis-criteria-evaluation-material.s3.amazonaws.com/ockham-sate-VI-2020/ockhamCriteriaSATEVIdata2020.tar.xz (4.5 MB download, 128 MB uncompressed). A README file is available at https://nist-sate-ockham-sound-analysis-criteria-evaluation-material.s3.amazonaws.com/ockham-sate-VI-2020/README
In November 2017, all participants decided to draw test cases from Juliet 1.3 C/C++. It is available from the SARD Test Suites page as a stand-alone suite. Test suite 108 is also available to browse or to download test cases one by one.
The SATE VI Ockham Sound Analysis Criteria committee plans to be generous in interpreting the rules. We want tools to satisfy the criteria, so we will give tools the benefit of the doubt.
Indented paragraphs, which are emphasized, are informative, not normative.
No manual editing of the tool output is allowed. No automated filtering specialized to a test case or to SATE is allowed, either.
A site is a location in code where a fault might occur. A buggy site is one that has an instance of the weakness, that is, there is some input that will cause a violation. A non-buggy site is one that does not have an instance of the weakness, in other words, is safe or not vulnerable.
For instance, every buffer access in a C program is a site where buffer overflow might occur if the code is incorrect or buggy.
For data-flow weaknesses, such as SQL injection, a site is a connected source/sink pair. A tool is allowed to use a different definition of a site, as long as the definition is stated.
A program may accept input at several places, but access SQL in just one or two places. Other programs may have just a few inputs from which data flows to many SQL calls. Counting pairs may balance these styles.
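Counting connected pairs can be sketched as follows. This is a minimal illustration, not from any actual test case; the source and sink names are invented.

```python
# Count data-flow sites as connected source/sink pairs.
# Each pair means data can flow from that source to that sink.

def count_sites(flows):
    """flows: set of (source, sink) pairs with a feasible data flow."""
    return len(flows)

# Hypothetical program A: three inputs, each flowing to a single SQL call.
program_a = {("form_field", "query"), ("url_param", "query"), ("cookie", "query")}

# Hypothetical program B: one input flowing to three different SQL calls.
program_b = {("stdin", "query1"), ("stdin", "query2"), ("stdin", "query3")}

# Counting pairs gives both styles the same number of sites.
print(count_sites(program_a))  # 3
print(count_sites(program_b))  # 3
```

Counting inputs alone would favor program A, and counting SQL calls alone would favor program B; counting pairs treats both styles alike.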
A finding is a definitive report about a site: either that the site has a specific weakness (is buggy) or that the site does not have a specific weakness (is not buggy).
Uncertain reports like "a site is likely to have this weakness" or "a site is almost certainly safe" are at best ignored (not counted) and may be considered incorrect.
Sound means every finding is correct. The tool need not produce a finding for every site; that is completeness.
[1] A tool may have optional settings that cause unsound analysis. The tool still qualifies if it has clearly sound settings.
For example, for fast analysis or for some classes of weaknesses, the user may be able to select unsound (approximate) function signatures. A more inclusive statement of criterion 1 is: the tool is claimed to be sound or has a mode in which analysis is sound.
Each tool may have its own test case(s).
Although it would be easier and more enlightening if all tools processed the same test case(s), we realize that different tools have strengths in different areas. We offer a variety of test cases, and the tool maker may select those that are most appropriate or useful.
[2] Appropriate means those sites for which the tool may produce findings.
A tool may produce findings for buggy sites, for non-buggy sites, or for some combination. The threshold mentioned and the minimum below apply to whichever set of sites applies.
There must be a minimum of 100 appropriate sites in the test case.
Reporting no possible buffer overflows for a Java program or no uncaught exceptions for a C program is not grounds for satisfying SATE VI Ockham criteria.
A test case may be one of the large programs from the classic track. Counts from small synthetic test cases, such as Juliet or SV-COMP, may be grouped to reach the threshold.
All synthetic test cases in the appropriate language and weakness class(es) are considered one test case. The tool maker may exclude specific test cases if the tool cannot handle them because they are too big or have unsupported constructs. To satisfy the SATE VI Ockham criteria, any such exclusions must be publicly reported.
For instance, a tool cannot satisfy the criteria by processing just a handful of buffer overflow (BOF) or memory allocation (MAL) cases; it must run on all BOF or MAL cases in the collection.
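Grouping counts from synthetic cases to check the minimum can be sketched as below. The per-case site counts and the excluded case are invented for illustration; any real exclusion would have to be publicly reported, as stated above.

```python
# Per-case site counts for a hypothetical synthetic suite
# (the case names and counts are invented for illustration).
counts = {"BOF_case_%02d" % i: 3 for i in range(1, 50)}  # 49 cases, 3 sites each

# Cases excluded because the tool cannot handle them, e.g., an
# unsupported construct; such exclusions must be publicly reported.
excluded = {"BOF_case_07"}

# All remaining cases in the weakness class count as one test case.
total = sum(n for case, n in counts.items() if case not in excluded)
print(total)         # 144
print(total >= 100)  # True: meets the minimum of 100 appropriate sites
```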
We will estimate the number of sites in a test case by simple programs if there is significant disagreement or uncertainty about the number.
We anticipate accepting the tools' counts of the number of sites. If there are concerns, we will use ad hoc "grep" or similarly simple methods.
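A "grep"-style estimate might look like the following Python sketch. The C fragment and the pattern are illustrative assumptions only; a real estimate would need a more careful pattern for each weakness class.

```python
import re

# Illustrative C fragment with two buffer-access sites.
c_source = """
void f(int *a, int i) {
    a[i] = 0;        /* site: buffer write */
    int x = a[i+1];  /* site: buffer read */
}
"""

# Crude pattern for array subscript expressions; a fuller estimate
# would also count pointer dereferences, strcpy-family calls, etc.
site_pattern = re.compile(r"\w+\s*\[[^\]]+\]")

sites = site_pattern.findall(c_source)
print(len(sites))  # 2
```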
We will determine that findings are correct (or incorrect) by simple programs for small synthetic test cases. For large test cases, we plan to compare tool findings by simple programs and manually review differences.
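Comparing two tools' findings by a simple program can be sketched as set operations over (site, verdict) records. The site identifiers here are invented; a site where the tools give opposite verdicts means at least one finding is incorrect, so such conflicts are natural candidates for manual review.

```python
# A finding is a (site, verdict) record; verdict is "buggy" or "not-buggy".
# Site identifiers are invented for illustration.
tool_a = {("io.c:42", "buggy"), ("io.c:57", "not-buggy"), ("net.c:10", "buggy")}
tool_b = {("io.c:42", "buggy"), ("io.c:57", "buggy")}

verdict_a = dict(tool_a)
verdict_b = dict(tool_b)

# Sites reported by both tools but with opposite verdicts: at least
# one of the two findings must be incorrect, so review these by hand.
conflicts = {s for s in verdict_a.keys() & verdict_b.keys()
             if verdict_a[s] != verdict_b[s]}

print(sorted(conflicts))  # ['io.c:57']
```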
[3] Even theorem provers have had coding errors, so a proof of the soundness of the analysis at the mathematical specification level may excuse incorrect findings.
All reasoning is based on models, assumptions, definitions, etc. (collectively, "models"). Unexpected findings resulting from model differences need not disqualify a tool. In consultation with the tool maker, we will decide whether an unexpected finding results from a reasonable model difference or whether it is incorrect. To satisfy the SATE VI Ockham criteria, any such differences must be publicly reported.
For instance, one tool may assume that file system permissions are set as the test case requires, while another tool makes no assumptions about permissions (that is, assumes the worst case). In this case, the tools could have different findings, yet both are correct.
However, if a tool models the "+" operator as subtraction, it is incorrect. Yes, models are hard to build and validate, but the value of a sound analyzer is its correctness.
The criteria are named for William of Ockham, of Occam's Razor fame. Since the details of the criteria will likely change in the future, the criteria name always includes a time reference: SATE VI Ockham Sound Analysis Criteria.