A Bayesian approach for denoising one-dimensional data
Guide to reproduce the results
We applied our Bayesian technique to several noisy X-ray scattering datasets, and submitted the results to the Journal of Applied Crystallography. (The paper is currently under review; to request a preprint, please e-mail email@example.com)
Here, we explain to readers of that paper how to reproduce its results.
Background and disclaimer
We are presently working to make GPPois into a polished R package, suitable for public consumption. However, the results in our paper pre-date this effort. They stem from an earlier version of the software which is more ad hoc and exploratory.
One theme in our paper was that the basic Gaussian Process approach must be extended to handle crystallographic data. In the new software, we implement these extensions as functions which use the basic interface, rather than as modifications to that interface. Presently, we have written such functions for the non-stationary lengthscale ℓ(Q) case (simulated Au dataset), but not for the multiple-lengthscales spiky-on-flat case (experimental TiO2 dataset).
Case 1: Simulated X-ray total scattering from 2 nm Au nanoparticles
Presently, we only provide instructions for reproducing our results with the new software, since it is much easier to use. We are happy to explain in detail how we obtained the results in the paper, but we're unlikely to take the trouble unless requested. In any case, we could find no significant differences between the output of the new and old software. (We have noticed one bug: in the old software, the definition of τ2 was off by a factor of 2. This did not materially affect the results; we simply used the equivalent value with the correct definition.)
To reproduce our Au results, please download this .zip with the following files:
- The main source code for the GPPois package
- A file containing the specialized functions to optimize ℓ(Q) using GPPois
- A tab-delimited file with three columns -- Q, noisy, and true -- giving both the noisy data and the true signal as a function of Q.
The following R code may then be copy-pasted into an R terminal:
# Setup the Dataset object to store our data:
au.data.frame <- read.table("simulated_Au_data.txt", header=TRUE, sep="\t")
au.data <- Dataset(id="Au", data=au.data.frame, X.names="Q", column="noisy",
# Find optimal hyperparameters (ell(Q) and sigma.f) which describe the data:
ell.bounds <- c(0.1, 20)
sigma.f.bounds <- c(1e-7, 30)
opt.hyp <- TrainSEVaryingEllFocusRegions(d=au.data,
n.pts.wide=200, decay.length=20, tau.sq=1e-4,
hyp.norm=FALSE, hyp.lin=FALSE, model=Model(id='VaryingEll'),
# Construct a Model using the optimized hyperparameters
M <- Model(id="varyingEll")
ell =VariableParameterValues('ell' , opt.hyp$opt$par),
Au.fit <- M$PosteriorInterval(d=au.data)
write.table(file="Au_results.txt", sep="\t", row.names=FALSE, quote=FALSE,
At this point, the file Au_results.txt will contain four columns: the Q-values, the posterior mean, and the curves for plus/minus one standard deviation of the posterior. (i.e., at each point, the true curve lies between lower and upper with the standard 1-sigma probability of 68.3%.)
There is currently a bug in the automatic optimization of the hyperparameters. The Cholesky decomposition will occasionally fail, complaining that "the leading minor of order XXX is not positive definite". Usually this means the matrix can't be factored in this way
, but that's not what's happening here. I haven't figured out what's causing this bug yet, but here's what to do about it
. The program will automatically enter debug mode. In the R
window, each time you see a prompt of the form D(##)>
, hit Enter to step to the next instruction. Alternatively, enter go()
at the prompt to execute the remaining instructions automatically. The matrix is not
singular, and the program will continue as usual.
Case 2: X-ray powder diffraction from 20 nm TiO2 nanoparticles
This dataset cannot yet be analyzed with the new software. To reproduce the results, we are supplying a legacy version of the code. This older code is tentative, messy, and poorly documented. It contains some ideas which are only partly implemented, and others which simply didn't work out. Nevertheless, it does analyze this dataset: whatever polish it may lack, the underlying ideas are the same.
To reproduce the TiO2 results, please download this .zip file containing the necessary scripts and datafiles.
- The legacy code for the GPPois approach
- A tab-delimited file with three columns: Q, hi.noise, and lo.noise. hi.noise is the short-count-time data which we fit, and lo.noise is a (processed) longer-count-time datafile, useful for checking the results.
- A tab-delimited file with four columns: XL.min, XL.max, XR.min, and XR.max. Each row corresponds to a region which appears to contain peaks. The first two values represent a Q-range expected to contain the left boundary of this peak-region; the next two values bound its right boundary.
- An R-script giving instructions for reproducing this data.
To reproduce the data, simply unzip the .zip file, run R, and type
at the prompt. Alternatively, the commands can be copy-pasted from the script, one at a time, to gain understanding.
NOTE: This datafile contains (N = 10679) datapoints. Analysis requires numerous (N x N) matrices, each of which takes up almost half a gigabyte. Beware running out of RAM! We suggest a bare minimum of 4 GB RAM; more is much better. The author finds the Au dataset is easily analyzed on his personal workstation (4 GB RAM), but the TiO2 dataset is better off submitted to a cluster.