Discovery of digital forensic dataset characteristics with CASE-Corpora

Alexander Nelson

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

PUBLICATIONS

Discovery of digital forensic dataset characteristics with CASE-Corpora

Published

July 11, 2022

Author(s)

Alexander Nelson

Abstract

The digital forensics community has generated training and reference data over the course of decades. However, significant challenges persist today in the usage pipeline for that data, from research problem formulation, through discovery of applicable shared data, through local processing and analysis. The problems include classic afflictions of Internet resources such as link rot and maintainer departure. Greater challenges remain in covering the gap between research questions in natural language, and the structured metadata in available dataset annotations. A dataset that describes a picture sent between two accounts in a chat app entails many levels of pattern abstraction, and should be discoverable whether a user is searching datasets for, e.g., that app, or for any instance of picture transmission. Yet, disparate datasets today lack a unifying language which can be aggregated and queried in one location. We present CASE-Corpora, a community index of available forensic reference and training datasets. CASE-Corpora aggregates dataset descriptions into a single ontological knowledge graph. Several ontologies are exercised to enable representation of a dataset, its downloadable resources, and how those resources may have migrated between hosts over time. These are represented in the Data Catalog (DCAT) and Provenance (PROV) Ontologies. However, these ontologies intentionally abstain from rich representation in focused domains such as digital forensics. They can express metadata of where to find data, but not why a forensic analyst would want to find it. CASE-Corpora describes forensically-relevant qualities of datasets by way of the Cyber-investigation Analysis and Standard Expression (CASE) Ontology and the Unified Cyber Ontology (UCO). CASE-Corpora uses all of the above ontologies to describe not only what one would download, but what devices, actions, and environmental captures went into the download; the hashes of both downloadable resources and what would be extracted for analysis, verifying chain of custody; and, where available, ground truth for analytic results' cross-verification. CASE and UCO exercise ontology-level interoperability to expertise and practice with domain-agnostic dataset discovery and provenance review to apply to finely-specified forensic detail. This presentation introduces CASE-Corpora and its support for curation and growth by the community. Data review mechanisms assist with ensuring conformant usage of the employed ontologies. Queries written for the corpora show the richness of what the community has already developed for research, such as what devices have been involved in any dataset. Our experience with developing CASE has repeatedly and consistently shown immense value in defining questions and encoding them as queries. For CASE, this has grown the ontology. For the community, CASE-Corpora can grow our collective knowledge on discoverable and testable patterns in data that many in the community have put effort into making available. This presentation will improve ontological competency, analysis interoperability, and data discovery for the community.

Proceedings Title

DFRWS USA 2022

Conference Dates

July 11-14, 2022

Conference Location

Virtual, DC, US

Conference Title

Digital Forensic Research Workshop USA 2022

Pub Type

Conferences

Download Paper

Local Download

Keywords

digital forensics, ontology, dataset

Video analytics, Standards education, Software testing, Reference data, Cybersecurity and privacy, Mobile, Interoperability testing, Internet of Things (IoT), Information retrieval, Frameworks, Digital evidence, Cybersecurity measurement, Cybersecurity education and workforce development, Cyber-physical systems, Conformity assessment, Conformance testing, Complex systems, Cloud computing and virtualization and Artificial intelligence

Citation

Nelson, A. (2022), Discovery of digital forensic dataset characteristics with CASE-Corpora, DFRWS USA 2022, Virtual, DC, US, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=934725 (Accessed July 29, 2026)

Additional citation formats

Issues

If you have any questions about this publication or are having problems accessing it, please contact [email protected].

Created July 11, 2022, Updated August 29, 2025

Was this page helpful?