We provide datasets with certified values for the mean, standard deviation, and (lag-1) autocorrelation coefficient to assess the accuracy of Univariate Summary Statistic calculations in statistical software. Computational inaccuracy has 3 sources:

- truncation error;
- cancellation error;
- accumulation error.

Truncation error is the representation error incurred when decimal numbers are stored in inexact binary form under IEEE standard arithmetic. Once these digits are lost, they cannot be recovered; at best their effect is held constant, and at worst it propagates into larger errors.
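As a small illustration of this representation error, the following Python sketch prints the exact binary value stored for the decimal literal `0.1`. The variable names are illustrative only, and the sketch assumes the usual IEEE double-precision `float` of standard Python builds.

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, with no rounding,
# so it reveals what the machine actually holds for the literal 0.1.
stored = Decimal(0.1)
print(stored)

# Difference between the stored value and the intended decimal value.
representation_error = stored - Decimal("0.1")
print(representation_error)

# Values that are sums of powers of two, such as 0.5, are stored exactly.
print(Decimal(0.5) == Decimal("0.5"))
```

The error is tiny relative to the value itself; it becomes a problem only when later arithmetic magnifies it.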

Cancellation error occurs when analyzing data that has low relative variation; that is, data with a high level of "stiffness". In "Assessing the Accuracy of ANOVA Calculations in Statistical Software" (*Computational Statistics & Data Analysis* 8 (1989), pp. 325-332), Simon and Lesage noted that as the number of constant leading digits in a dataset increases and the data grow more nearly constant (i.e., as the stiffness increases), accurate computation of standard deviations becomes increasingly difficult. The same holds for other similarly computed summary statistics, such as the autocorrelation coefficient. In both cases, computation is hindered by subtracting from each data element a mean quite close to it, which leaves behind the digits of the mantissa that are most likely to have been misrepresented.
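The effect can be seen on a single stiff value. The sketch below is an illustration only: the value `10000000.2` and the names used are assumptions chosen to mimic data with eight constant leading digits. It compares the relative error of the stored value with the relative error of what remains after the leading constant is subtracted.

```python
from fractions import Fraction

leading_constant = 10000000.0
x = 10000000.2                     # intended decimal value, stored inexactly
residual = x - leading_constant    # what is left after the leading digits cancel

# Exact rational arithmetic is used only to measure the errors.
true_x = Fraction("10000000.2")
true_residual = Fraction("0.2")

rel_err_stored = abs(Fraction(x) - true_x) / true_x
rel_err_residual = abs(Fraction(residual) - true_residual) / true_residual

print(residual)
print(float(rel_err_stored))
print(float(rel_err_residual))
```

The absolute error is unchanged by the subtraction, but it is now measured against a quantity that is many orders of magnitude smaller, so the relative error of the residual is far larger than that of the stored value.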

Accumulation error (also noted by Simon and Lesage) is the error that grows in direct proportion to the total number of arithmetic computations, which in this univariate case is proportional to the number of observations. As the number of observations grows, small errors accumulate, making accurate computation difficult.
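A rough way to observe accumulation is to add the same value many times and compare the running total with a carefully rounded reference. The sketch below uses Python's `math.fsum` as that reference; the helper names `naive_sum` and `accumulated_error` are hypothetical, and the growth seen for any particular dataset need not be exactly proportional to the number of observations.

```python
import math

def naive_sum(values):
    """Left-to-right summation; each addition may round."""
    total = 0.0
    for v in values:
        total += v
    return total

def accumulated_error(n):
    """Gap between naive summation and math.fsum for n copies of 0.1."""
    values = [0.1] * n
    return abs(naive_sum(values) - math.fsum(values))

for n in (10, 1_000, 1_000_000):
    print(n, accumulated_error(n))
```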

We include both generated and "real world" datasets so as to allow computational accuracy to be examined at different stiffness levels and different accumulation error levels. In a fashion similar to the ANOVA datasets, we have drawn from the benchmark work of Simon and Lesage (1989), and have 4 "generated" datasets with the number of constant leading digits set to 7, 1, 7, and 8, respectively, and with the number of observations set to 3, 1001, 1001, and 1001, respectively. Five "real world" datasets were borrowed from the dataset repository of the Dataplot Statistics/Graphics software system; two of these are from NIST statistical consulting, and the other three are "classic" general-interest sets drawn from outside NIST.

Datasets are ordered by level of difficulty (lower, average, and higher) according to their stiffness, that is, the number of constant leading digits. This ordering is simply meant to provide rough guidance for the user; producing correct results on a dataset of higher difficulty does not imply that your software will correctly solve all datasets of average or even lower difficulty. Of the 9 datasets, 6 (5 "real world" and 1 generated) are of lower difficulty, 2 (both generated) are of average difficulty, and 1 (generated) is of higher difficulty.

In computing general summary statistics, if your software gives less-than-desirable results for the sample standard deviation, two simple remedial measures are available. The first is to subtract the leading constant from all the observations in the dataset before analyzing it. The second is to make sure that the sample standard deviation is computed by the formula that first computes deviations about the mean before squaring and summing, rather than by the old desk-calculator formula of a generation ago, which involves the (computationally unstable) difference of 2 large numbers: the sum of squares of the raw (uncentered) data and the number of observations times the squared sample mean.
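Both remedies can be sketched in a few lines of Python. The data below are an illustrative assumption patterned on the description above (1001 observations with eight constant leading digits); the decimal values as written have a sample variance of exactly 0.01. The function names are hypothetical and this is not the method of any particular package.

```python
import math

def two_pass_variance(data):
    """Sample variance from deviations about the mean (first pass: mean)."""
    n = len(data)
    mean = math.fsum(data) / n
    return math.fsum((x - mean) ** 2 for x in data) / (n - 1)

def desk_calculator_variance(data):
    """Sample variance as (sum of squares - n * mean^2) / (n - 1).

    Returns the variance rather than the standard deviation because the
    result can come out negative on stiff data.
    """
    n = len(data)
    mean = math.fsum(data) / n
    sum_of_squares = math.fsum(x * x for x in data)
    return (sum_of_squares - n * mean * mean) / (n - 1)

# Illustrative stiff data: 1001 values near 10000000.2 (an assumption).
stiff_data = [10000000.2] + [10000000.1, 10000000.3] * 500

# Remedy 1: remove the leading constant before analysis.
shifted_data = [x - 10000000.0 for x in stiff_data]

print("two-pass, raw data:        ", two_pass_variance(stiff_data))
print("desk calculator, raw data: ", desk_calculator_variance(stiff_data))
print("desk calculator, shifted:  ", desk_calculator_variance(shifted_data))
```

On the raw data the two large terms in the desk-calculator formula agree in nearly all of their stored digits, so their difference retains little or no correct information. Centering the data first, or shifting out the leading constant, avoids that subtraction.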

As noted in the General Background Information, producing correct results for all datasets in this collection does not imply that your software will do the same for your own particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software.

We plan to update this collection of datasets in the future, and welcome your feedback on specific datasets to include, and on other ways to improve this web service.

Created August 15, 2018