A novel root and rule-based natural language processing (NLP) approach to information indexing and searching


A natural language processing approach to information indexing and searching


Whether at the level of the individual, team, project or program, research and engineering work must keep abreast of huge amounts of published information that might contain discoveries, or needed elements for discoveries, crucial to its success. We, in collaboration with the Software and Systems Division (ITL), have been developing next-generation technologies for the automated creation of terminological and semantic resources from published information. Individual users can use these resources to create their own terminology, which creates a common communication ground for groups of all sizes, from small teams to large-scale multidisciplinary and multinational research and design projects and programs.

The technologies have been designed to include a set of default modules that can easily be modified or replaced by ones better adapted to user needs or domain requirements. These modules cover domain-dependent preprocessing rules (for example, filtering out non-linguistic content); language models providing information about dependency structure, vector semantics, and syntactic annotations such as part of speech; and terminology generation rules that can be added to or replaced as users gain increased understanding of the information being tracked.

The resources produced can easily be plugged into information seeking tools, providing the basis for thesauri that support expanding searches, structured indices (including key words, snippets and phrases) for browsing collections of online text and databases, and ontologies helpful for organizing vast amounts of information. These resources could also help detect important changes in the information as it is updated, namely content that no longer fits adequately with the current terminological indices of the ontology. These changes may represent new knowledge paradigms in the information module.

Although our technology shares some common elements with that of general-purpose search engines such as Google, it is fundamentally different in its emphasis on adaptability to different knowledge domains, its capability of evolving, and the ease with which the terminological and semantic resources can be plugged into individual research and engineering systems.
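
As a rough illustration of the modular design described above, the following Python sketch shows how replaceable preprocessing rules and terminology generation rules could feed a phrase index, and how a thesaurus could expand a search. It is not the RandR implementation: every name here (`Pipeline`, `strip_non_linguistic`, `bigram_rule`, `build_index`, `search`), the stopword list, and the bigram rule itself are hypothetical stand-ins chosen only to make the idea concrete.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Hypothetical module signatures: each module is a plain function,
# so a user can swap in a domain-specific replacement.
Preprocessor = Callable[[str], str]
TermRule = Callable[[List[str]], List[str]]

# Assumed, deliberately tiny stopword list (not from the source).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "for", "is", "by"}


def strip_non_linguistic(text: str) -> str:
    """Illustrative preprocessing rule: drop URLs and [n] citation markers."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\[\d+\]", " ", text)


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bigram_rule(tokens: List[str]) -> List[str]:
    """Illustrative terminology rule: adjacent pairs with no stopword."""
    return [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    ]


@dataclass
class Pipeline:
    """Default modules that callers may extend or replace."""
    preprocessors: List[Preprocessor] = field(
        default_factory=lambda: [strip_non_linguistic])
    term_rules: List[TermRule] = field(
        default_factory=lambda: [bigram_rule])

    def terms(self, text: str) -> List[str]:
        for preprocess in self.preprocessors:
            text = preprocess(text)
        tokens = tokenize(text)
        found: List[str] = []
        for rule in self.term_rules:
            found.extend(rule(tokens))
        return found


def build_index(docs: Dict[str, str], pipeline: Pipeline) -> Dict[str, Set[str]]:
    """Map each generated term to the ids of the documents containing it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        for term in pipeline.terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(term: str, index: Dict[str, Set[str]],
           thesaurus: Dict[str, Set[str]]) -> Set[str]:
    """Look up a term plus any related terms listed in the thesaurus."""
    hits: Set[str] = set()
    for candidate in {term} | thesaurus.get(term, set()):
        hits |= index.get(candidate, set())
    return hits
```

Because each module is an ordinary function held in a list, replacing the default behavior is a matter of passing a different list, for example `Pipeline(term_rules=[my_domain_rule])`, where `my_domain_rule` is whatever rule suits the domain. The thesaurus is kept separate from the index so that it can evolve without re-indexing the documents.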

Given below are examples of some of our ongoing projects supporting information seeking in our databases and knowledge bases.

RandR (Root and Rule), a machine learning/artificial intelligence data curation and discovery tool patented by NIST (BBD and ITL), is used in the following database projects:

  1. The COVID-19 Open Research Dataset (CORD-19), a collection of nearly a million scholarly research articles about coronaviruses, SARS-CoV-2 and COVID-19. More details are given in:

    Collard, J., Bhat, T., Subrahmanian, E. et al. A Web Resource for Exploring the CORD-19 Dataset Using Root- and Rule-Based Phrases. J Indian Inst Sci 100, 725–731 (2020).
  2. In response to 1996 NIST Research Advisory Committee (RAC) recommendations, the NIST executive board approved the development of an electronic form of the NIST ‘BlueBook’ of NIST expertise. One of the big challenges in creating such a resource was the automatic generation of semantically relevant search terms across multiple disciplines. RandR was developed with particular emphasis on this challenge. It was first tested on MGI projects and CORD-19, and subsequently used to create a prototype of the NIST ‘BlueBook’ (available only inside the NIST firewall). This project is now being reviewed by OISM as a candidate for integration with the resources maintained by OISM.
  3. Enzyme thermodynamics database

One previous work done at NIST by Dr. Bhat on data management systems (Protein Data Bank) has been cited over 41,426 times.

Created May 5, 2017, Updated March 7, 2023