We applied our Bayesian technique to several noisy X-ray scattering datasets, and submitted the results to the Journal of Applied Crystallography. (The paper is currently under review; to request a preprint, please e-mail charles.hogg [at] nist.gov)
Here, we explain to readers of that paper how to reproduce its results.
We are presently working to make GPPois into a polished R package, suitable for public consumption. However, the results in our paper pre-date this effort. They stem from an earlier version of the software which is more ad hoc and exploratory.
One theme in our paper was that the basic Gaussian Process approach must be extended to handle crystallographic data. In the new software, we implement these extensions as functions which use the basic interface, rather than as modifications to that interface. Presently, we have written such functions for the non-stationary lengthscale ℓ(Q) case (simulated Au dataset), but not for the multiple-lengthscales spiky-on-flat case (experimental TiO2 dataset).
Presently, we only provide instructions for reproducing our results with the new software, since it is much easier to use. We are happy to explain in detail how we obtained the results in the paper, but we're unlikely to take the trouble unless requested. In any case, we could find no significant differences between the output of the new and old software. (We have noticed one bug: in the old software, the definition of τ2 was off by a factor of 2. This did not materially affect the results; we simply used the equivalent value with the correct definition.)
To reproduce our Au results, please download this .zip with the following files:
The following R code may then be copy-pasted into an R terminal:source("GPPois.R") # Setup the Dataset object to store our data: au.data.frame <- read.table("simulated_Au_data.txt", header=TRUE, sep="\t") au.data <- Dataset(id="Au", data=au.data.frame, X.names="Q", column="noisy", poisson.names="noisy", data.offset=0) # Find optimal hyperparameters (ell(Q) and sigma.f) which describe the data: ell.bounds <- c(0.1, 20) sigma.f.bounds <- c(1e-7, 30) source("testing.R") opt.hyp <- TrainSEVaryingEllFocusRegions(d=au.data, n.pts.wide=200, decay.length=20, tau.sq=1e-4, hyp.norm=FALSE, hyp.lin=FALSE, model=Model(id='VaryingEll'), varying.ell.bounds=ell.bounds, varying.sigma.f.bounds=sigma.f.bounds); # Construct a Model using the optimized hyperparameters M <- Model(id="varyingEll") M$AddCovariance(CovarianceSEVaryingEll(id="vary", X.ell=opt.hyp$X, ell =VariableParameterValues('ell' , opt.hyp$opt$par), sigma.f=VariableParameterValues('sigma.f', opt.hyp$opt$par))) Au.fit <- M$PosteriorInterval(d=au.data) write.table(file="Au_results.txt", sep="\t", row.names=FALSE, quote=FALSE, Au.fit)
At this point, the file Au_results.txt will contain four columns: the Q-values, the posterior mean, and the curves for plus/minus one standard deviation of the posterior. (i.e., at each point, the true curve lies between lower and upper with the standard 1-sigma probability of 68.3%.)NOTE: There is currently a bug in the automatic optimization of the hyperparameters. The Cholesky decomposition will occasionally fail, complaining that "the leading minor of order XXX is not positive definite". Usually this means the matrix can't be factored in this way, but that's not what's happening here. I haven't figured out what's causing this bug yet, but here's what to do about it. The program will automatically enter debug mode. In the R window, each time you see a prompt of the form D(##)>, hit Enter to step to the next instruction. Alternatively, enter go() at the prompt to execute the remaining instructions automatically. The matrix is not singular, and the program will continue as usual.
This dataset cannot yet be analyzed with the new software. To reproduce the results, we are supplying a legacy version of the code. This older code is tentative, messy, and poorly documented. It contains some ideas which are only partly implemented, and others which simply didn't work out. Nevertheless, it does analyze this dataset: whatever polish it may lack, the underlying ideas are the same.
To reproduce the TiO2 results, please download this .zip file containing the necessary scripts and datafiles.
To reproduce the data, simply unzip the .zip file, run R, and typesource("reproduce.R") at the prompt. Alternatively, the commands can be copy-pasted from the script, one at a time, to gain understanding.
NOTE: This datafile contains (N = 10679) datapoints. Analysis requires numerous (N x N) matrices, each of which takes up almost half a gigabyte. Beware running out of RAM! We suggest a bare minimum of 4 GB RAM; more is much better. The author finds the Au dataset is easily analyzed on his personal workstation (4 GB RAM), but the TiO2 dataset is better off submitted to a cluster.