Discovery of digital forensic dataset characteristics with CASE-Corpora



Alexander Nelson


The digital forensics community has generated training and reference data over the course of decades. However, significant challenges persist today in the usage pipeline for that data, from research problem formulation, through discovery of applicable shared data, to local processing and analysis. The problems include classic afflictions of Internet resources such as link rot and maintainer departure. Greater challenges remain in covering the gap between research questions in natural language and the structured metadata in available dataset annotations. A dataset that describes a picture sent between two accounts in a chat app entails many levels of pattern abstraction, and should be discoverable whether a user is searching datasets for, e.g., that app, or for any instance of picture transmission. Yet, disparate datasets today lack a unifying language that can be aggregated and queried in one location. We present CASE-Corpora, a community index of available forensic reference and training datasets. CASE-Corpora aggregates dataset descriptions into a single ontological knowledge graph. Several ontologies are exercised to enable representation of a dataset, its downloadable resources, and how those resources may have migrated between hosts over time. These are represented in the Data Catalog (DCAT) and Provenance (PROV) Ontologies. However, these ontologies intentionally abstain from rich representation in focused domains such as digital forensics. They can express metadata of where to find data, but not why a forensic analyst would want to find it. CASE-Corpora describes forensically relevant qualities of datasets by way of the Cyber-investigation Analysis Standard Expression (CASE) Ontology and the Unified Cyber Ontology (UCO).
CASE-Corpora uses all of the above ontologies to describe not only what one would download, but also what devices, actions, and environmental captures went into the download; the hashes of both the downloadable resources and what would be extracted for analysis, verifying chain of custody; and, where available, ground truth for cross-verification of analytic results. CASE and UCO exercise ontology-level interoperability, so that expertise and practice with domain-agnostic dataset discovery and provenance review can apply to finely specified forensic detail. This presentation introduces CASE-Corpora and its support for curation and growth by the community. Data review mechanisms assist with ensuring conformant usage of the employed ontologies. Queries written for the corpora show the richness of what the community has already developed for research, such as what devices have been involved in any dataset. Our experience with developing CASE has repeatedly and consistently shown immense value in defining questions and encoding them as queries. For CASE, this has grown the ontology. For the community, CASE-Corpora can grow our collective knowledge on discoverable and testable patterns in data that many in the community have put effort into making available. This presentation will improve ontological competency, analysis interoperability, and data discovery for the community.
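
To make the query idea above concrete, here is a minimal sketch of asking "what devices have been involved in any dataset" against a handful of subject-predicate-object triples. It is not the CASE-Corpora implementation: it uses a toy in-memory list in place of an RDF store and SPARQL, and all data in it is invented. The `dcat:` names follow the DCAT vocabulary, while every `ex:` name (including the `ex:involvesDevice` property) is a hypothetical placeholder and not a term from CASE or UCO.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Invented example data. dcat:* terms follow the DCAT vocabulary;
# every ex:* identifier is a hypothetical placeholder for illustration.
TRIPLES: List[Triple] = [
    ("ex:dataset-1", "rdf:type", "dcat:Dataset"),
    ("ex:dataset-1", "dcat:distribution", "ex:dataset-1-image"),
    ("ex:dataset-1-image", "rdf:type", "dcat:Distribution"),
    ("ex:dataset-1-image", "dcat:downloadURL", "https://example.org/dataset-1.img"),
    ("ex:dataset-1", "ex:involvesDevice", "ex:phone-a"),
    ("ex:dataset-2", "rdf:type", "dcat:Dataset"),
    ("ex:dataset-2", "ex:involvesDevice", "ex:laptop-b"),
    ("ex:dataset-2", "ex:involvesDevice", "ex:phone-a"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if (s is None or subj == s) and (p is None or pred == p) and (o is None or obj == o):
            yield (subj, pred, obj)


def devices_in_any_dataset(triples: List[Triple]) -> List[str]:
    """Return the sorted, de-duplicated devices linked to any dcat:Dataset."""
    datasets = {s for s, _, _ in match(triples, p="rdf:type", o="dcat:Dataset")}
    devices = {o for s, _, o in match(triples, p="ex:involvesDevice") if s in datasets}
    return sorted(devices)


def datasets_involving(triples: List[Triple], device: str) -> List[str]:
    """Return the sorted datasets that list the given device."""
    datasets = {s for s, _, _ in match(triples, p="rdf:type", o="dcat:Dataset")}
    hits = {s for s, _, _ in match(triples, p="ex:involvesDevice", o=device)}
    return sorted(hits & datasets)


if __name__ == "__main__":
    print(devices_in_any_dataset(TRIPLES))
    print(datasets_involving(TRIPLES, "ex:phone-a"))
```

Because the question is encoded as a pattern over typed relationships instead of free text, the same query keeps working as more dataset descriptions are added to the graph, which is the property the abstract attributes to aggregating descriptions in one place.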
Conference Title: Digital Forensic Research Workshop USA 2022
Conference Dates: July 11-14, 2022
Conference Location: Virtual, DC, US


Keywords: digital forensics, ontology, dataset


Nelson, A. (2022), Discovery of digital forensic dataset characteristics with CASE-Corpora, DFRWS USA 2022, Virtual, DC, US, [online], (Accessed May 27, 2024)



Created July 11, 2022, Updated April 24, 2024