Making Semantic Structures Explicit: Developing and Evaluating Tools and Techniques to Support Understanding of Large Cybersecurity Corpora

Ira Monarch; Jacob Collard; Sangjin Shin; Eswaran Subrahmanian; Talapady N. Bhat; Ram Sriram

Published

December 1, 2022

Author(s)

Ira Monarch, Jacob Collard, Sangjin Shin, Eswaran Subrahmanian, Talapady N. Bhat, Ram Sriram

Abstract

This report describes the adaptation, composition, and use of natural language processing, machine learning, and other computational tools to help make explicit the implicit informational structures in very large technical corpora. Applied to a corpus, the tools automatically build normalized multi-word term structures, which in turn are used to build taxonomies, semantic schemas, topic models, and knowledge graphs. In our cybersecurity use case, we apply these tools to help us understand the threat landscape as exhibited in the Common Vulnerabilities and Exposures (CVE; https://cve.mitre.org/ and https://nvd.nist.gov/) corpus, with the aim of proactively anticipating threats. The use case provides the context for the development, use, and evaluation of the automated tools and the processes based on them. These tools and processes are incrementally and iteratively evaluated and improved as they are developed and used, and local evaluation and improvement come before global evaluation. We believe that global evaluation of text processing methodologies and processes, currently exemplified by the Text REtrieval Conference (TREC), Document Understanding Conference (DUC), and Text Analysis Conference (TAC), is worth pursuing, and we have done so using TREC. In this report, however, we focus mostly on local development, evaluation, and improvement.
We articulate various aspects of these approaches by describing and showing: 1) our multi-word-term-based process for topic modeling, which can be supported by semantic schemas, taxonomic structures, and knowledge graphs built out of the same multi-word terms; 2) the heuristic methods used to evaluate the performance of the multi-word-term-based topic modeling, including suggestions for measuring how well a topic is represented in the documents that are indexed to it; and 3) how these local heuristic methods might be transformed into a new, full-blown, rigorous evaluation standard like TREC, DUC, and TAC, but with emphasis on their contribution to the interpretation and understanding of very large corpora. With respect to 3), we also briefly explore which parts of the heuristic process would need to be automated and indicate the algorithms needed to do so.
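As an illustration only, the following Python sketch shows one simple way to extract normalized multi-word terms from text and to estimate how well a topic is represented in a document. It is not the report's implementation. The stopword list, the crude plural stripping, and the names normalize, multiword_terms, and topic_coverage are all assumptions introduced for this example, and the coverage score (the fraction of a topic's terms found in the document) is just one possible heuristic.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a far larger one.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is of on or that the to via with "
    "allow allows".split()
)


def normalize(token: str) -> str:
    """Lowercase a token and strip a trailing plural 's' (a crude stand-in
    for real morphological normalization)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def multiword_terms(text: str, max_len: int = 3) -> Counter:
    """Count normalized n-grams (2..max_len words) that contain no stopwords
    and do not cross punctuation."""
    counts = Counter()
    for segment in re.split(r"[.,;:!?()]+", text):
        run = []
        words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", segment)
        # The trailing None flushes the last run of each segment.
        for word in words + [None]:
            if word is not None and word.lower() not in STOPWORDS:
                run.append(normalize(word))
                continue
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[" ".join(run[i:i + n])] += 1
            run = []
    return counts


def topic_coverage(topic_terms, doc_text: str, max_len: int = 3) -> float:
    """Fraction of a topic's multi-word terms that occur in the document."""
    doc_terms = multiword_terms(doc_text, max_len)
    wanted = {" ".join(normalize(w) for w in t.split()) for t in topic_terms}
    if not wanted:
        return 0.0
    return sum(1 for t in wanted if t in doc_terms) / len(wanted)


if __name__ == "__main__":
    doc = "Buffer overflows in the web server allow remote code execution."
    topic = ["remote code execution", "buffer overflows", "SQL injection"]
    print(sorted(multiword_terms(doc)))
    print(topic_coverage(topic, doc))
```

A score like this could be computed for every document indexed to a topic and then averaged, giving a rough local check on whether the topic's terms actually appear in the documents assigned to it.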

Citation

NIST Interagency/Internal Report (NISTIR) - 8414

Report Number

8414

NIST Pub Series

NIST Interagency/Internal Report (NISTIR)

Pub Type

NIST Pubs

Download Paper

https://doi.org/10.6028/NIST.IR.8414

Keywords

Natural language processing, Cybersecurity, Topic models, Common Vulnerabilities and Exposures, Semantics, Root and Rule-based, Knowledge Graphs

Information technology, Cybersecurity and privacy and Artificial intelligence

Citation

Monarch, I., Collard, J., Shin, S., Subrahmanian, E., Bhat, T. and Sriram, R. (2022), Making Semantic Structures Explicit: Developing and Evaluating Tools and Techniques to Support Understanding of Large Cybersecurity Corpora, NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.IR.8414, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=933620 (Accessed March 9, 2026)

Created December 1, 2022, Updated December 10, 2025
