Making Semantic Structures Explicit: Developing and Evaluating Tools and Techniques to Support Understanding of Large Cybersecurity Corpora

Author(s)

Ira Monarch, Jacob Collard, Sangjin Shin, Eswaran Subrahmanian, Talapady N. Bhat, Ram D. Sriram

Abstract

This report describes the adaptation, composition and use of natural language processing, machine learning and other computational tools to make implicit informational structures in very large technical corpora explicit. Applied to a corpus, the tools automatically build normalized multi-word term structures, which in turn are used to build taxonomies, semantic schema, topic models and knowledge graphs. In our cybersecurity use case, we apply these tools to understand the threat landscape as exhibited in the Common Vulnerabilities and Exposures (CVE; https://cve.mitre.org/ and https://nvd.nist.gov/) corpus, with the aim of proactively anticipating threats. The use case provides the context for the development, use and evaluation of the automated tools and the processes based on them. The latter are incrementally and iteratively evaluated and improved as they are developed and used, with local evaluation and improvement coming before global evaluation. We believe that global evaluation of text processing methodologies and processes, currently exemplified by the Text REtrieval Conference (TREC), Document Understanding Conference (DUC) and Text Analysis Conference (TAC), is worth pursuing, and we have done so using TREC. In this report, however, we focus mostly on local development, evaluation and improvement.
We articulate various aspects of these approaches by describing and showing 1) our multi-word term-based process for topic modeling, which can be supported by semantic schema, taxonomic structures and knowledge graphs built out of the same multi-word terms; 2) the heuristic methods used to evaluate the performance of the multi-word term-based topic modeling, including suggestions for measuring how well a topic is represented in the documents that are indexed to it; and 3) how these local heuristic methods might be transformed into a new, full-blown, rigorous evaluation standard like TREC, DUC and TAC, but with emphasis on their contribution to the interpretation and understanding of very large corpora. With respect to 3), we also briefly explore which parts of the heuristic process would need to be automated and indicate the algorithms needed to do so.
Citation
NIST Interagency/Internal Report (NISTIR) - 8414
Report Number
8414

Keywords

Natural language processing, Cybersecurity, Topic models, Common Vulnerabilities and Exposures, Semantics, Root and Rule-based, Knowledge Graphs

Citation

Monarch, I., Collard, J., Shin, S., Subrahmanian, E., Bhat, T. and Sriram, R. (2022), Making Semantic Structures Explicit: Developing and Evaluating Tools and Techniques to Support Understanding of Large Cybersecurity Corpora, NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=933620 (Accessed April 18, 2024)
Created February 4, 2022, Updated June 27, 2023