The NSRL keeps one copy of each unique (as defined by a SHA-1 hash string) file encountered in processing.
Each file is assigned an integer identification number as it is encountered, and the file is stored in a directory and filename structure based on that integer.
The numbering starts at one (1). A directory is created for every 1,000,000 files, and within each of those directories, a directory is created for every 1,000 files. The filename is a nine-character string, left-padded with zeros (e.g. $filename = sprintf("%09d", $fileID) ).
Thus file number 1 is stored in the directory/filename "000/000/000000001". File number 12,345,678 is stored in "012/345/012345678".
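The mapping from file number to corpus location can be sketched in Python as follows. This is a minimal illustration, not the NSRL's own code: the function name corpus_path is hypothetical, and the upper bound of 999,999,999 is an assumption inferred from the nine-character filename.

```python
def corpus_path(file_id: int) -> str:
    """Return the relative directory/filename for an NSRL file number."""
    # Assumption: IDs start at 1 and must fit in nine digits.
    if not 1 <= file_id <= 999_999_999:
        raise ValueError("file ID must be between 1 and 999,999,999")

    # Nine-character, zero-padded name, like sprintf("%09d", $fileID).
    name = f"{file_id:09d}"

    # The first three digits name the per-1,000,000 directory,
    # the next three digits name the per-1,000 directory inside it.
    return f"{name[0:3]}/{name[3:6]}/{name}"


print(corpus_path(1))
print(corpus_path(12345678))
```

Slicing the padded string rather than dividing the integer keeps the directory names and the filename in agreement by construction.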
A tab-delimited file is available which contains the corpus location, SHA-1, byte count and full original path of each file in the corpus. It is provided as a 650MB Zip file, along with a hash signature for that file. An example view of this data is also available.
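A minimal Python sketch for reading such a file is shown below. It assumes that the four fields appear in the order listed above (corpus location, SHA-1, byte count, original path), that there is no header row, and that each record occupies one line. None of these details are confirmed here, so check them against the example view before relying on it. The names CorpusRecord and read_corpus_index are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class CorpusRecord(NamedTuple):
    location: str       # corpus location, e.g. "000/000/000000001"
    sha1: str           # SHA-1 hash as a hex string
    size: int           # byte count
    original_path: str  # full original path of the file


def read_corpus_index(lines: Iterable[str]) -> Iterator[CorpusRecord]:
    """Yield one record per tab-delimited line (assumed field order)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split into at most four fields, so any tab inside the
        # original path stays part of the last field.
        fields = line.split("\t", 3)
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        location, sha1, size, original_path = fields
        yield CorpusRecord(location, sha1, int(size), original_path)
```

Because an open text file is itself an iterable of lines, the function can be passed a file object directly once the Zip file has been extracted.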