
SCORE (System, Component and Operationally Relevant Evaluations)


SCORE (System, Component and Operationally Relevant Evaluations) is a unified set of criteria and software tools for defining a performance evaluation approach for complex intelligent systems. It provides a comprehensive evaluation blueprint that assesses the technical performance of a system and its components by isolating and varying individual variables, and that captures the end-user utility of the system in realistic use-case environments.

The SCORE framework has proven widely applicable and is equally relevant to technologies ranging from manufacturing to military systems. It has been applied to the evaluation of technologies in DARPA programs ranging from soldier-worn sensors used on patrol to speech-to-speech translation systems, and it is currently being applied to assessing the control of autonomous vehicles on a shop floor.


Intelligent systems tend to be complex and non-deterministic, involving numerous components that work together to accomplish an overall goal. Existing approaches to measuring such systems often focus either on evaluating the system as a whole or on evaluating some of its individual components under very controlled, but limited, conditions. These approaches do not comprehensively and quantitatively assess how environmental variables (e.g., lighting, external distances) and system variables (e.g., processing power, memory size) affect the system’s overall performance. Through its comprehensive evaluation criteria and software tools, the SCORE framework has greatly enhanced the ability to evaluate intelligent systems, both quantitatively and qualitatively, at the component level and at the system level in operationally relevant environments.

SCORE is unique in that:

  • It is applicable to a wide range of technologies, from manufacturing to defense systems
  • Elements of SCORE can be decoupled and customized based upon evaluation goals  
  • It has the ability to evaluate a technology at various stages of development, from conceptual to full maturation
  • It combines the results of targeted evaluations to produce a comprehensive picture of a system’s capabilities and utility

SCORE was initially applied to intelligent systems developed under the DARPA (Defense Advanced Research Projects Agency) ASSIST and TRANSTAC programs, in eight evaluations (each involving over 60 personnel) that assessed the performance of technologies developed by twelve independent research teams. The SCORE-based evaluations also gave researchers and end users the information they needed to determine if and when the technology would be ready for use. SCORE allowed developers to identify the key components of each system and evaluate them both independently and as a whole, which helped determine the impact of the individual components on the performance of the overall system. This detailed analysis made it possible to target more accurately the aspects of the systems that were shown to provide the greatest benefit to the overall advancement of the technology, and therefore helped identify where program funding should be applied to get the most “bang for the buck.”


SCORE Framework




The Advanced Soldier Sensor Information Systems and Technology (ASSIST) program is an advanced technology research and development program whose objective is to exploit soldier-worn sensors to augment a Soldier’s mission recall and reporting capability, thereby enhancing situational knowledge in Military Operations in Urban Terrain (MOUT) environments.

The ASSIST evaluation design was driven by the two programmatic metrics laid out by DARPA:  

  • Measure the progressive development of ASSIST system technical capabilities over multiple evaluations
  • Predict the impact these technologies will have on warfighter performance in a variety of missions and job functions  

Guided by the SCORE framework, three types of tests were developed to satisfy these metrics. Elemental tests were used to measure technical performance at both the component and system levels. Vignette tests were created to perform System Level Testing: Utility Assessment. Task tests were generated to perform Capability Level Testing: Utility Assessment.


ASSIST Framework





Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) is a DARPA advanced technology and research program whose goal is to demonstrate capabilities to rapidly develop and field free-form, two-way speech-to-speech translation systems that enable English and foreign language speakers to communicate with one another in real-world tactical situations where an interpreter is unavailable. Technical performance evaluations are intended to support quantitative assessment of the performance of TRANSTAC technologies.

The TRANSTAC evaluation design was driven by the two programmatic metrics laid out by DARPA:  

  • System usability testing: providing overall scores and assessments of the capabilities of the whole system
  • Software component testing: evaluating individual components of a system to see how well they perform in isolation

The SCORE framework was implemented to address these metrics through five evaluation types. Offline Evaluations were developed to measure technical performance at the component level. Main Live Evaluations and Utility-Lab Evaluations were both developed for System Level Testing, covering both Technical Performance and Utility Assessment. The Names Evaluation was developed to measure both Technical Performance at the Component Level and Utility Assessment at the Capability Level. Lastly, Utility-Field Evaluations are specifically designed to measure Utility Assessment at the System Level.
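The TRANSTAC mapping above can be made concrete as a small lookup table. The sketch below is a hypothetical illustration, not NIST's SCORE software tools: it assumes the SCORE levels named in the text (component, system, capability) and the two assessment types (technical performance, utility assessment) can be treated as a simple grid, and it encodes each TRANSTAC evaluation type as the grid cells the paragraph assigns to it. The names `TRANSTAC_EVALUATIONS` and `coverage_matrix` are invented for this example.

```python
from collections import defaultdict

# Hypothetical grid axes, taken from the level and assessment names in the text.
LEVELS = ("Component", "System", "Capability")
ASSESSMENTS = ("Technical Performance", "Utility Assessment")

# Each TRANSTAC evaluation type -> the (level, assessment) cells it addresses,
# as described in the paragraph above. This data structure is illustrative only.
TRANSTAC_EVALUATIONS = {
    "Offline": {("Component", "Technical Performance")},
    "Main Live": {("System", "Technical Performance"), ("System", "Utility Assessment")},
    "Utility-Lab": {("System", "Technical Performance"), ("System", "Utility Assessment")},
    "Names": {("Component", "Technical Performance"), ("Capability", "Utility Assessment")},
    "Utility-Field": {("System", "Utility Assessment")},
}


def coverage_matrix(evaluations):
    """Invert evaluation -> cells into cell -> sorted list of evaluation names."""
    matrix = defaultdict(list)
    for name, cells in evaluations.items():
        for cell in cells:
            matrix[cell].append(name)
    # Return a plain dict so that missing cells are not silently created on lookup.
    return {cell: sorted(names) for cell, names in matrix.items()}


if __name__ == "__main__":
    matrix = coverage_matrix(TRANSTAC_EVALUATIONS)
    # Print one row per grid cell; "-" marks a cell the text assigns no evaluation to.
    for level in LEVELS:
        for assessment in ASSESSMENTS:
            names = matrix.get((level, assessment), [])
            print(f"{level:<11} {assessment:<22} {', '.join(names) or '-'}")
```

Inverting the mapping this way shows at a glance that, for example, system-level utility is addressed by three separate evaluation types. A cell with no entry only means the paragraph does not assign an evaluation to it; it should not be read as a gap in the TRANSTAC design.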


TRANSTAC Framework




The impact of this work has been far-reaching and substantial, as shown by the following:

  • The SCORE framework has been adopted by multiple programs within DARPA, which has greatly enhanced their ability to quantitatively and qualitatively evaluate intelligent systems at multiple levels.
  • The approaches used in SCORE are starting to redefine the way performance evaluation is conducted on intelligent systems. As a result of the DARPA evaluations, the SCORE Evaluation Team has been asked to advise other programs on how to apply these techniques for their own purposes.
  • Research teams are starting to use the SCORE evaluation approach to evaluate their own systems. One researcher stated, “We switched to NIST’s evaluation procedures because we found them superior to our own.”

Lead Organizational Unit:





Craig Schlenoff
Brian Weiss

Ann Virts
Tony Downs
Fred Proctor
Greg Sanders (ITL)
Micky Steves (ITL)

Related Programs and Projects:

MOAST Project


General Information:
301 975 3456 Telephone
301 990 9688 Facsimile

100 Bureau Drive, M/S 8230
Gaithersburg, MD 20899-823