Author(s)
Kamal Choudhary
Abstract
Determining complete atomic structures directly from microscopy images remains a longstanding challenge in materials science. MicroscopyGPT is a vision-language model (VLM) that leverages multimodal generative pre-trained transformers to predict full atomic configurations, including lattice parameters, element types, and atomic coordinates, from Scanning Transmission Electron Microscopy (STEM) images. The model is trained on a chemically and structurally diverse dataset of simulated STEM images generated using the AtomVision tool and the JARVIS-DFT as well as the C2DB two-dimensional (2D) materials databases. The fine-tuning set comprises approximately 5,000 2D materials, enabling the model to learn complex mappings from image features to crystallographic representations. I fine-tune the 11-billion-parameter LLaMA model, allowing efficient training on resource-constrained hardware. The rise of VLMs and the growth of materials datasets offer a major opportunity for microscopy-based analysis. This work highlights the potential of automated structure reconstruction from microscopy, with broad implications for materials discovery, nanotechnology, and catalysis.
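The paper's own code is not reproduced on this page. As an illustration only, the sketch below shows how an image-to-structure "caption" could be generated with such a fine-tuned VLM, assuming the 11-billion-parameter model is Llama 3.2 11B Vision served through the Hugging Face transformers library; the checkpoint id, prompt wording, and image file name are hypothetical assumptions, not the paper's setup.

    # Hypothetical sketch only, not the paper's code. Assumes the 11B model is
    # Llama 3.2 11B Vision via Hugging Face transformers; checkpoint id,
    # prompt text, and image path are illustrative assumptions.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, MllamaForConditionalGeneration

    model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed base checkpoint
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # A simulated STEM image, e.g. one rendered with AtomVision.
    image = Image.open("stem_image.png")
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": (
                "Predict the atomic structure for this STEM image: "
                "lattice parameters, element types, and fractional coordinates."
            )},
        ]}
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, add_special_tokens=False,
                       return_tensors="pt").to(model.device)

    # The generated text is the "structure caption": a textual serialization
    # of the lattice and the atomic coordinates.
    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0], skip_special_tokens=True))

In this framing, structure prediction is ordinary conditional text generation: the model emits a serialized crystal description token by token, conditioned on the image features.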
Citation
Journal of Physical Chemistry Letters

Choudhary, K. (2025), MicroscopyGPT: Generating 3D Atomic Structure Captions from Microscopy Images Using Vision-Language Transformers, Journal of Physical Chemistry Letters, [online], https://doi.org/10.1021/acs.jpclett.5c01257, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=960025 (Accessed April 25, 2026)
Issues
If you have any questions about this publication or are having problems accessing it, please contact [email protected].