We have converted the open part of the NCI database to SMILES format. It will be made available to the public very shortly, probably at http://epnws1.ncifcrf.gov:2345/dis3d/3ddatabase/pubstruc.html.
This SMILES database contains essentially all open structures of the NCI database up to about a year ago. The exact cutoff date is still being decided, but the database will contain on the order of 220,000 compounds.
No structures were excluded a priori based on chemical grounds. This means, for example, that metal-containing compounds were left in the database, and it is up to the user to ascertain the usefulness of these and other SMILES strings.
The program Corina v.1.7 was used to generate 3D coordinates from the connection tables. The conversion program Babel v.1.6, driven by a little Perl script, was then used to generate the SMILES strings from the 3D coordinates, which Babel needs (see below). (See http://schiele.organik.uni-erlangen.de/corina/corina.html and http://mercury.aichem.arizona.edu/babel.html for the home pages of Corina and Babel, respectively.)
No other patches were applied. Together with the presence of metals and other 'weird' things in the database, this means that the database is used at one's own risk. So, caveat emptor! Of course, if people find specific problems, we'd like to know about them so that they can eventually be fixed (but no promises are made!).
A few more technical details: The file with the SMILES database is about 15 MB (uncompressed). Each record contains the NSC number and the CAS RN at the beginning; a simple colrm 1 23 will create a file with just the SMILES strings. The longest SMILES string in the database is just under 600 characters long.
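For those without colrm at hand, the same column stripping can be sketched in Python. This is only an illustration: the function name and the sample record are made up here, and the assumption is simply that the NSC number and CAS RN occupy the first 23 columns, as the colrm 1 23 command implies.

```python
def strip_id_columns(line: str, n_cols: int = 23) -> str:
    """Drop the first n_cols characters of a record, like `colrm 1 23`."""
    # Remove the line terminator first so it is not counted as a column.
    return line.rstrip("\r\n")[n_cols:]


# Invented sample record (NOT taken from the database): the identifier
# fields are padded so that together they fill exactly 23 columns.
sample = "12345".ljust(10) + "64-17-5".ljust(13) + "CCO\n"

smiles_only = strip_id_columns(sample)
print(smiles_only)
```

Records shorter than 23 characters simply come back as empty strings, which matches what colrm does with short lines.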
The promised SMILES version of the open part of the NCI database is up on http://epnws1.ncifcrf.gov:2345/dis3d/3ddatabase/nci_smil.html and ready for download. (Thanks to Dan Zaharevitz for his help with this.)
In fact, two versions of the SMILES database are available. Thanks to Wolf-Dietrich Ihlenfeldt, we re-converted the NCI database with the program CACTVS v. 3.2, using conversion scripts provided by Wolf-Dietrich that handle formal charge problems and take care of other 'weird stuff' in the NCI database.
The other conversion program used is Babel, v. 1.6. The formal charge problem regarding nitro groups was fixed by simple string substitution in the Babel output; none of the other potential problems was addressed.
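To give an idea of what such a string substitution can look like, here is a minimal Python sketch. The two patterns are an assumption made for illustration (an uncharged, pentavalent nitro group rewritten in charge-separated form); they are not necessarily the strings that were actually substituted in the Babel output, and the function name is hypothetical.

```python
# Assumed patterns, for illustration only.
PENTAVALENT_NITRO = "N(=O)=O"
CHARGED_NITRO = "[N+](=O)[O-]"


def fix_nitro(smiles: str) -> str:
    """Replace every occurrence of the assumed pentavalent nitro pattern."""
    return smiles.replace(PENTAVALENT_NITRO, CHARGED_NITRO)


# Nitrobenzene written with a pentavalent nitrogen.
print(fix_nitro("c1ccccc1N(=O)=O"))
```

A plain textual replacement like this only catches the one spelling it looks for; a nitro group written in a different atom order (for example with the O=N(=O) part leading the string) would pass through unchanged.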
We found that the output of the two programs differs substantially if one just does a simple string comparison. For the first 1000 NSC numbers, 64% of the SMILES strings showed differences between Babel and CACTVS. This does not mean, however, that either one of the differing SMILES strings for the same compound need be wrong, since neither conversion program claims to generate Unique SMILES (to our knowledge).
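A simple string comparison of this kind can be sketched as follows in Python. The function name and the three demo entries are invented for illustration and are not taken from either database; the sketch assumes each version has already been read into a mapping from NSC number to SMILES string.

```python
from typing import Mapping


def fraction_differing(a: Mapping[str, str], b: Mapping[str, str]) -> float:
    """Fraction of shared keys whose SMILES strings are not identical text."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    differing = sum(1 for key in shared if a[key] != b[key])
    return differing / len(shared)


# Invented demo data: keys stand in for NSC numbers.
babel_demo = {"1": "CCO", "2": "c1ccccc1", "3": "CC(=O)O"}
cactvs_demo = {"1": "OCC", "2": "c1ccccc1", "3": "OC(C)=O"}

rate = fraction_differing(babel_demo, cactvs_demo)
print(f"{rate:.0%} of the shared entries differ as strings")
```

In the demo, CCO and OCC both describe ethanol, and CC(=O)O and OC(C)=O both describe acetic acid, so two of the three entries count as "different" although neither string is wrong. Deciding whether two strings denote the same structure requires a canonicalizing toolkit, which this sketch deliberately leaves out.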
For this reason, we decided to make both versions available. If you compare them to each other, we'd be interested to know which one you find better, more correct, more useful for a certain purpose, etc. No further evaluation was conducted on either database, so please PERFORM YOUR OWN CHECKS before you use the SMILES strings for any given purpose.
Both versions contain SMILES strings for the same set of 237,771 structures. The files are just over 4 MB compressed, and uncompress to about 15 MB each.
Marc C. Nicklaus & Bruno Bienfait
Marc C. Nicklaus, Ph.D.