The value of a sound analyzer is that every one of its findings can be assumed to be correct, even if it cannot handle enormous programs or does not handle dozens of weakness classes. During the fifth Static Analysis Tool Exposition (SATE V), we will analyze results and report tools that satisfy the SATE V Ockham Sound Analysis Criteria. In brief, the criteria are:
1. The tool is claimed to be sound.¹
2. For at least one weakness class and one test case, the tool produces findings for a minimum number of sites.
3. Even one incorrect finding disqualifies the tool.²
The report has been released as "SATE V Ockham Sound Analysis Criteria," NIST Internal Report (IR) 8113, available from DOI 10.6028/NIST.IR.8113. The data and programs to reproduce these results are available at DOI 10.18434/T4WC7V or https://nist-sate-ockham-sound-analysis-criteria-evaluation-material.s3.amazonaws.com/ockham-sate-V-2016/ockhamCriteriaSATEVdata2017.tar.xz (9.3 MB download, 390 MB uncompressed). A README file is available at https://nist-sate-ockham-sound-analysis-criteria-evaluation-material.s3.amazonaws.com/ockham-sate-V-2016/README
The second edition of the criteria is the SATE VI Ockham Sound Analysis Criteria.
The SATE V Ockham Sound Analysis Criteria committee plans to be generous in interpreting the rules. We want tools to satisfy the criteria, so we will give tools the benefit of the doubt.
Indented paragraphs, which are emphasized, are informative, not normative.
No manual editing of the tool output is allowed. No automated filtering specialized to a test case or to SATE is allowed, either.
A site is a location in code where a weakness might occur. A buggy site is one that has an instance of the weakness, that is, there is some input that will cause a violation. A non-buggy site is one that does not have an instance of the weakness; in other words, it is safe or not vulnerable.
For instance, every buffer access in a C program is a site where buffer overflow might occur if the code is incorrect or buggy.
For data flow weaknesses, such as SQL injection, each connected source/sink pair is a site. A tool is allowed to use a different definition of site, as long as the definition is stated explicitly.
A program may accept input at several places, but access SQL in just one or two places. Other programs may have just a few inputs from which data flows to many SQL calls. Counting pairs may balance these styles.
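The balancing effect of counting pairs can be sketched as follows. This is a minimal illustration with hypothetical source and sink names, assuming (for simplicity) that every source reaches every sink:

```python
from itertools import product

# Program A: many inputs, but data flows to only one SQL call.
a_sources = ["param_id", "param_name", "param_date"]
a_sinks = ["executeQuery_1"]

# Program B: one input from which data flows to many SQL calls.
b_sources = ["param_id"]
b_sinks = ["executeQuery_1", "executeQuery_2", "executeQuery_3"]

def count_pair_sites(sources, sinks):
    """Count every connected source/sink pair as one site.
    Illustrative assumption: every source reaches every sink."""
    return len(list(product(sources, sinks)))

print(count_pair_sites(a_sources, a_sinks))  # 3 sites
print(count_pair_sites(b_sources, b_sinks))  # 3 sites
```

Under this counting, both program styles yield the same number of sites, which is the balance the definition aims for.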
A finding is a definitive report about a site: either that the site has a specific weakness (is buggy) or that it does not have a specific weakness (is not buggy).
Uncertain reports like "a site is likely to have this weakness" or "a site is almost certainly safe" are at best ignored (not counted) and may be considered incorrect.
Sound means every finding is correct. The tool need not produce a finding for every site; that is completeness.
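The distinction between soundness and completeness can be sketched with a toy check. The site names and ground-truth table below are hypothetical, assuming a ground truth is known for every site:

```python
# Ground truth: site name -> True if the site is buggy.
ground_truth = {"site1": True, "site2": False, "site3": True, "site4": False}

# Tool findings are definitive verdicts only; sites the tool
# does not decide are simply absent.
findings = {"site1": True, "site2": False}

def is_sound(findings, truth):
    # Sound: every finding the tool makes is correct.
    return all(truth[site] == verdict for site, verdict in findings.items())

def is_complete(findings, truth):
    # Complete: the tool produces a finding for every site.
    return set(findings) == set(truth)

print(is_sound(findings, ground_truth))     # True: both findings are correct
print(is_complete(findings, ground_truth))  # False: site3 and site4 undecided
```

A tool can thus be sound without being complete, which is exactly the case the criteria address.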
¹ A tool may have optional settings that cause unsound analysis. The tool still qualifies if it has clearly sound settings.
For example, for fast analysis or for some classes of weaknesses, the user may be able to select unsound (approximate) function signatures. A more inclusive statement of criterion 1 is: "The tool is claimed to be sound or has a mode in which analysis is sound."
² Even theorem provers have had coding errors, so a proof at the mathematical specification level that the analysis is sound may excuse incorrect findings.
A test case may be one of the large programs. Counts from the Juliet test cases may be grouped to reach the threshold.
All the Juliet test cases in the appropriate language and weakness class(es) are considered one test case.
For instance, a tool cannot satisfy the criteria by running just a selection of buffer overflow cases; it must run on all CWE 121, 122, 123, 124, 126, 127, 129, and 131 cases.
There must be a minimum of 10 sites and two findings in the test case.
Reporting no possible buffer overflows for a Java program or no uncaught exceptions for a C program is not grounds for satisfying SATE V Ockham criteria.
We will estimate the number of sites in a test case by simple programs if there is significant disagreement or uncertainty about the number.
We anticipate accepting tools' counts of the number of sites. If there are concerns, we will use "grep" or similar ad hoc, simple methods.
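A "grep"-style estimate of site counts might look like the following sketch. The C snippet and the regular expression are illustrative assumptions: the pattern crudely matches array-subscript expressions as candidate buffer-access sites, with no claim of precision:

```python
import re

# Hypothetical C snippet to be scanned for buffer-access sites.
c_source = """
int sum(int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];      /* site: buffer access */
    a[0] = s;           /* site: buffer access */
    return s;
}
"""

# Very rough pattern: an identifier followed by a subscript expression.
site_pattern = re.compile(r"\b[A-Za-z_]\w*\s*\[[^\]]+\]")
sites = site_pattern.findall(c_source)
print(len(sites))  # 2 candidate sites: a[i] and a[0]
```

Such a count is only an estimate, which is why it would be used only when there is disagreement about the number of sites.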
For the Juliet test cases, we will determine by simple programs whether findings are correct. For other test cases, we anticipate comparing tool findings by simple programs and manually reviewing differences.
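Comparing findings mechanically and reserving only the differences for manual review can be sketched as follows. The file names, line numbers, and verdicts are hypothetical; a finding is modeled as a (file, line, verdict) tuple:

```python
# Hypothetical findings from two tools.
tool_a = {("x.c", 10, "buggy"), ("x.c", 25, "not buggy"), ("y.c", 7, "buggy")}
tool_b = {("x.c", 10, "buggy"), ("x.c", 25, "buggy")}

agreed = tool_a & tool_b     # identical findings need no manual review
to_review = tool_a ^ tool_b  # differences are set aside for manual review

print(sorted(agreed))
print(sorted(to_review))
```

Here the tools agree on one finding, while the three remaining findings (including the conflicting verdicts on x.c line 25) would go to a human reviewer.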
All reasoning is based on models, assumptions, definitions, etc. (collectively, "models"). Unexpected findings resulting from model differences need not disqualify a tool. In consultation with the tool maker, we will decide whether an unexpected finding results from a reasonable model difference or is incorrect. To satisfy the SATE V Ockham criteria, any such differences must be publicly reported.
For instance, one tool may assume that file system permissions are set as the test case requires, while another tool makes no assumptions about permissions (that is, assumes the worst case). In this case, the tools could have different findings, yet both are correct.
However, if a tool models the "+" operator as subtraction, it is incorrect. Yes, models are hard to build and validate, but the value of a sound analyzer is its correctness.
The criteria are named for William of Ockham, of Occam's Razor fame. Since the details of the criteria will likely change in the future, the criteria name always includes a time reference: SATE V Ockham Sound Analysis Criteria.