Because UTC(NIST) is a national measurement standard and a national resource important to many users and industries, its reliability is of the utmost importance. NIST aims for (and has demonstrated) continuity of UTC(NIST) for many years with a combination of monitoring/alarming, “failsafe-by-design” and “defense-in-depth” engineering guidelines, and the operation of redundant systems. In this section, we’ll look at the elements that keep UTC(NIST) reliable, including alarm systems and on-call staff, alternate and contingency time scales located in Boulder that supplement the primary time scale, and physical and cyber security.
Alarm Systems and On-Call Staff – Multiple computers, located in different buildings, running different operating systems, and administered by different staff members, monitor the environmental sensors, clock data, and time scale validation data for several defined alarm conditions. The implementation details of the alarm computers differ, but the primary alarm mechanism for each is a modem, attached to an analog telephone line, that calls a dedicated phone number which is multiplexed to several staff members’ personal mobile devices, laboratory phones, and NIST-owned mobile devices. At least one trained staff member is scheduled to be “on-call” to respond to such alarms at all times, a duty that typically rotates biweekly. Secondary alarm messages are also sent by text message and email (both dependent on the availability of the Internet), and these messages generally contain additional information about the alarm condition. The alarm computers send test messages daily to confirm they are working, and the staff members who are “on-call” are trained to notify the alarm computer operators if the alarm test messages are not received.
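The daily test message acts as a heartbeat: silence from an alarm computer is itself a reason to investigate. At NIST this check is performed by the on-call staff; the following is only a minimal sketch of the same idea in software. The function name, the computer names, and the 26-hour allowance are hypothetical and are not taken from the NIST systems.

```python
from datetime import datetime, timedelta

def overdue_alarm_computers(last_test_seen, now, max_gap=timedelta(hours=26)):
    """Return names of alarm computers whose last test message is older than max_gap.

    last_test_seen maps a (hypothetical) alarm computer name to the time its
    most recent daily test message was received. The 26-hour default is an
    assumed allowance: one day plus some slack for delivery delays.
    """
    return sorted(
        name for name, seen in last_test_seen.items() if now - seen > max_gap
    )

# Illustrative data only.
now = datetime(2024, 1, 2, 12, 0)
last_test_seen = {
    "alarm-a": datetime(2024, 1, 2, 6, 0),    # test received 6 hours ago
    "alarm-b": datetime(2023, 12, 31, 6, 0),  # test received 54 hours ago
}
print(overdue_alarm_computers(last_test_seen, now))
```

In this example only the second computer is reported, because its last test message is more than 26 hours old.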
The alarm systems were designed to monitor all parameters of UTC(NIST) that can potentially impact its reliability and performance. Hundreds of system parameters are continuously monitored, including the health of each of the cesium and hydrogen maser clocks, the temperature and humidity of the laboratories where the time scale is located, the status of electrical power and backup power systems, the status of data acquisition and measurement systems, and of course, the performance of the physical signals output by the time scale.
Alternate and Contingency UTC(NIST) Time Scales – Redundancy provisions of the NIST time scale can be analyzed in the popular “PACE” framework originally used to describe redundant communication plans. PACE is an acronym for: Primary, Alternate, Contingency, Emergency.
In our usage, the Primary system is employed to continuously produce the best publicly available UTC(NIST) signals and data. An Alternate system is operated continuously to produce indefinitely indistinguishable UTC(NIST) signals. If the Primary system fails, a switchover to the Alternate system occurs, either automatically or semi-automatically, through the use of physical or software switches. Wherever possible, NIST designed the Primary and Alternate systems to avoid common points of failure.
A Contingency system is also operated continuously, with performance similar to that of the Primary/Alternate UTC(NIST) (at least for short holdover intervals) but without being completely indistinguishable to outside observers. The Contingency system shares few common points of failure with the Primary/Alternate systems. However, if the Contingency system needs to be put into service, some additional effort is required (compared to a pre-engineered switch to the Alternate system), as some physical hardware and cabling must be re-arranged.
Finally, an Emergency system is one that exhibits inferior performance compared to the others but shares no common points of failure with them and provides some amount of geographic diversity. The secondary NIST time scales in Fort Collins, Colorado and Gaithersburg, Maryland (described in the next section) serve as the Emergency systems, but would only become the Primary UTC(NIST) time scale in the event of a catastrophic failure that disabled the Primary, Alternate, and Contingency time scales in Boulder.
The Alternate time scale in the PACE approach is implemented with a parallel system of duplicated hardware components that produces TA(NIST)alt continuously (Figure 12). The atomic clocks comprising the Primary ensemble also comprise the Alternate ensemble, but independent MCMS hardware is interrogated by an independent TSPC over independent network devices for data acquisition. An independent AOG is programmed to realize a 5 MHz UTC(NIST)alt signal; the Alternate AOG is referenced by a different clock than the Primary AOG. An independent counter/divider converts the Alternate AOG output into a PPS realization of UTC(NIST)alt. Two hardware switches, one for the 5 MHz signals and one for the PPS signals, allow NIST operators to route either the Primary or Alternate outputs to all broadcast channels and downstream users. During normal operation, the Alternate system maintains a time offset of < 50 ps (0.05 ns) between TA(NIST) and TA(NIST)alt.
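The 50 ps figure lends itself to a simple consistency check between the two time scales. The sketch below assumes a list of measured TA(NIST) minus TA(NIST)alt offsets in picoseconds; the function name and the sample values are hypothetical, and the text does not describe how NIST actually performs this comparison.

```python
LIMIT_PS = 50.0  # from the text: the offset stays below 50 ps in normal operation

def samples_over_limit(offsets_ps, limit_ps=LIMIT_PS):
    """Return (index, offset) pairs whose magnitude reaches or exceeds limit_ps."""
    return [(i, x) for i, x in enumerate(offsets_ps) if abs(x) >= limit_ps]

# Illustrative offsets in picoseconds, not measured data.
measured = [12.0, -31.5, 48.9, 50.2, -7.3]
print(samples_over_limit(measured))
```

Here only the fourth sample is flagged. The check uses the magnitude of the offset, since a drift in either direction would make the Alternate output distinguishable from the Primary.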
As shown in Figure 14, a third AOG is deployed whose output is also divided into a PPS signal. This Contingency AOG is sourced by a maser not used by either the Primary or Alternate AOGs. It is also not automatically programmed by a TSPC, so without manual operator programming, linear frequency drift of its source maser will degrade it as a stand-in for UTC(NIST). However, the Contingency time scale can still provide UTC(NIST) broadcast signals of sufficient quality for most, but not all, users. The Contingency time scale will only be put into use on a temporary basis, during periods when both the Primary and Alternate systems require repair. NIST recently added further redundancy by constructing a second Contingency time scale, located in another building, that operates with a different set of atomic clocks.
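To see why uncorrected linear frequency drift matters, recall the standard clock model in which a constant fractional frequency drift D produces a time error that grows quadratically, x(t) = x0 + y0·t + ½·D·t². The sketch below evaluates that model. The drift value of 1 × 10⁻¹⁵ per day is an illustrative assumption, not a measured property of the NIST masers, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400.0

def time_error_seconds(elapsed_days, drift_per_day,
                       freq_offset=0.0, initial_offset_s=0.0):
    """Time error x(t) = x0 + y0*t + 0.5*D*t**2 for a clock with linear drift.

    drift_per_day is the fractional frequency drift D per day (dimensionless
    per day); freq_offset is the initial fractional frequency offset y0.
    """
    t = elapsed_days * SECONDS_PER_DAY      # elapsed time in seconds
    d = drift_per_day / SECONDS_PER_DAY     # drift converted to per-second
    return initial_offset_s + freq_offset * t + 0.5 * d * t ** 2

# Assumed drift of 1e-15 per day, left uncorrected for 30 days.
error_ns = time_error_seconds(30, 1e-15) * 1e9
print(f"{error_ns:.2f} ns")
```

Under these assumed numbers the accumulated error after 30 days is about 39 ns, which illustrates why the Contingency output is suited to temporary use unless an operator reprograms the AOG.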
Physical and Cyber Security of UTC(NIST) – UTC(NIST) is protected by internal NIST policies and systems that ensure both physical and cyber security. The time scale components and systems are physically secured in buildings and laboratories that can only be accessed by authorized and trained NIST personnel. Time scale cyber security is evaluated in the information security framework known as the “CIA triad,” where CIA is an acronym for Confidentiality, Integrity, and Availability. Let’s take a brief look at what the CIA triad means in the context of UTC(NIST) cyber security.
Confidentiality is generally not a high-risk security component, because the correct time is not considered to be classified information, nor are data products resulting from NIST measurements of atomic clocks and time scale outputs. For example, NIST routinely publishes time scale data and methods in publicly available journals and conferences, and seeks to support U.S. industry and commerce by freely discussing and promulgating best practices. However, the computers employed in the NIST time scale operations are not usable as general-purpose computing tools, are secured by physical means, and lack public network connectivity except for the Hub PC (Figure 5), which has limited network access subject to special NIST-managed security controls.
Integrity means that the atomic clocks are not exposed to disruption, and that measured data and time scale computation results are accurately acquired and preserved. Our approach begins with air-gapping all atomic clock devices: clock "health" data is collected without the use of general-purpose computer networks, and remote command/control of the atomic clocks is not allowed. In addition, the atomic clock measurement hardware (MCMS machines) and data acquisition computers (TSPC) are deployed on non-public networks that reside behind the highest practical physical security. All critical TSPC command-and-control functions require physical operator presence at the keyboard and are not remotely accessible.
Availability translates to the goal that the Primary, Alternate, and Contingency systems operate as continuously as possible, minimizing avoidable downtime due to preventative maintenance, development, upgrades, faulty software, denial-of-service attempts, etc. Availability is increased by several of our design choices. For example, isolating all critical functions from public networks removes the possibility of outside attacks. In addition, all computer systems have redundant storage devices, redundant power supplies, and backup power systems. Finally, the relatively simple local area networks used by UTC(NIST) feature unmanaged switching, no routing, static IP addressing, and no DNS services. For these and other reasons, they have demonstrated unavailability rates many times lower than the general-purpose NIST network.