# A Bayesian approach for denoising one-dimensional data

## Guide to reproduce the results

We applied our Bayesian technique to several noisy X-ray scattering datasets, and submitted the results to the Journal of Applied Crystallography. (The paper is currently under review; to request a preprint, please e-mail charles.hogg@nist.gov)

Here, we explain to readers of that paper how to reproduce its results.

### Background and disclaimer

We are presently working to make GPPois into a polished R package, suitable for public consumption. However, the results in our paper pre-date this effort. They stem from an earlier version of the software which is more *ad hoc* and exploratory.

One theme in our paper was that the basic Gaussian Process approach must be extended to handle crystallographic data. In the new software, we implement these extensions as functions which *use* the basic interface, rather than as modifications to that interface. Presently, we have written such functions for the non-stationary lengthscale ℓ(Q) case (simulated Au dataset), but not for the multiple-lengthscales spiky-on-flat case (experimental TiO2 dataset).

### Case 1: Simulated X-ray total scattering from 2 nm Au nanoparticles

Presently, we only provide instructions for reproducing our results with the new software, since it is much easier to use. We are happy to explain in detail how we obtained the results in the paper, but we're unlikely to take the trouble unless requested. In any case, we could find no significant differences between the output of the new and old software. (We *have* noticed one bug: in the old software, the definition of τ^{2} was off by a factor of 2. This did not materially affect the results; we simply used the equivalent value with the correct definition.)

To reproduce our Au results, please download this `.zip` with the following files:

`GPPois.R`
- The main source code for the GPPois package
`testing.R`
- A file containing the specialized functions to optimize ℓ(Q) using
`GPPois`
`simulated_Au_data.txt`
- A tab-delimited file with three columns -- Q, noisy, and true -- giving both the noisy data and the true signal as a function of Q.

The following R code may then be copy-pasted into an R terminal:

source("GPPois.R")
# Setup the Dataset object to store our data:
au.data.frame <- read.table("simulated_Au_data.txt", header=TRUE, sep="\t")
au.data <- Dataset(id="Au", data=au.data.frame, X.names="Q", column="noisy",
poisson.names="noisy", data.offset=0)
# Find optimal hyperparameters (ell(Q) and sigma.f) which describe the data:
ell.bounds <- c(0.1, 20)
sigma.f.bounds <- c(1e-7, 30)
source("testing.R")
opt.hyp <- TrainSEVaryingEllFocusRegions(d=au.data,
n.pts.wide=200, decay.length=20, tau.sq=1e-4,
hyp.norm=FALSE, hyp.lin=FALSE, model=Model(id='VaryingEll'),
varying.ell.bounds=ell.bounds, varying.sigma.f.bounds=sigma.f.bounds);
# Construct a Model using the optimized hyperparameters
M <- Model(id="varyingEll")
M$AddCovariance(CovarianceSEVaryingEll(id="vary", X.ell=opt.hyp$X,
ell =VariableParameterValues('ell' , opt.hyp$opt$par),
sigma.f=VariableParameterValues('sigma.f', opt.hyp$opt$par)))
Au.fit <- M$PosteriorInterval(d=au.data)
write.table(file="Au_results.txt", sep="\t", row.names=FALSE, quote=FALSE,
Au.fit)

At this point, the file `Au_results.txt` will contain four columns: the Q-values, the posterior mean, and the curves for plus/minus one standard deviation of the posterior. (i.e., at each point, the true curve lies between `lower` and `upper` with the standard 1-sigma probability of 68.3%.)

**NOTE:** There is currently a bug in the automatic optimization of the hyperparameters. The Cholesky decomposition will occasionally fail, complaining that "the leading minor of order XXX is not positive definite". Usually this means

the matrix can't be factored in this way, but that's not what's happening here. I haven't figured out what's causing this bug yet, but

*here's what to do about it*. The program will automatically enter debug mode. In the

`R` window, each time you see a prompt of the form

`D(##)>`, hit Enter to step to the next instruction. Alternatively, enter

`go()` at the prompt to execute the remaining instructions automatically. The matrix is

*not* singular, and the program will continue as usual.

### Case 2: X-ray powder diffraction from 20 nm TiO_{2} nanoparticles

This dataset cannot yet be analyzed with the new software. To reproduce the results, we are supplying a legacy version of the code. This older code is tentative, messy, and poorly documented. It contains some ideas which are only partly implemented, and others which simply didn't work out. Nevertheless, it does analyze this dataset: whatever polish it may lack, the underlying ideas are the same.

To reproduce the TiO_{2} results, please download this .zip file containing the necessary scripts and datafiles.

`GPPois_legacy.R`
- The legacy code for the GPPois approach
`TiO2_data.txt`
- A tab-delimited file with three columns:
`Q`, `hi.noise`, and `lo.noise`. `hi.noise` is the short-count-time data which we fit, and `lo.noise` is a (processed) longer-count-time datafile, useful for checking the results.
`manual_boundary_estimates.txt`
- A tab-delimited file with four columns:
`XL.min`, `XL.max`, `XR.min`, and `XR.max`. Each row corresponds to a region which appears to contain peaks. The first two values represent a Q-range expected to contain the left boundary of this peak-region; the next two values bound its right boundary.
`reproduce.R`
- An
`R`-script giving instructions for reproducing this data.

To reproduce the data, simply unzip the `.zip` file, run `R`, and type

source("reproduce.R")

at the prompt. Alternatively, the commands can be copy-pasted from the script, one at a time, to gain understanding.

*NOTE:* This datafile contains (N = 10679) datapoints. Analysis requires numerous (N x N) matrices, each of which takes up almost half a gigabyte. Beware running out of RAM! We suggest a bare minimum of 4 GB RAM; more is much better. The author finds the Au dataset is easily analyzed on his personal workstation (4 GB RAM), but the TiO_{2} dataset is better off submitted to a cluster.