There is a public-domain database of SMILES strings for commonly occurring small organic molecules and functional groups: ftp://schiele.organik.uni-erlangen.de/pub/nci/nci4000.smi.gz

It is small, just 45K.

These are the first 4000 entries from the NCI database that do not contain a metal atom (just to keep the structures simple). Because the original file does not contain stereochemical information, the SMILES strings do not have any either. Each SMILES string is followed by a tab and the corresponding CAS number. Most SMILES readers will treat only the first word as the SMILES string and either ignore the second part or assign it to a name property.
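For readers who want to split such lines themselves, here is a minimal sketch in Python. The function names (parse_smi_line, read_smi, read_smi_gz) are hypothetical and not part of any toolkit mentioned here, and the two sample lines are illustrative only, not copied from the file. It assumes one record per line, with the SMILES string as the first whitespace-delimited word and everything after it as the name field.

```python
import gzip


def parse_smi_line(line):
    """Split one .smi line into (smiles, name); name is None if absent.

    Returns None for blank lines. Hypothetical helper, for illustration.
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    # The first word is the SMILES string; the rest (here the CAS number,
    # separated by a tab) is treated as the name field.
    parts = line.split(None, 1)
    smiles = parts[0]
    name = parts[1].strip() if len(parts) > 1 else None
    return smiles, name


def read_smi(lines):
    """Parse an iterable of text lines, skipping blank ones."""
    records = []
    for line in lines:
        record = parse_smi_line(line)
        if record is not None:
            records.append(record)
    return records


def read_smi_gz(path):
    """Read a gzip-compressed .smi file from a local path."""
    with gzip.open(path, "rt") as handle:
        return read_smi(handle)


if __name__ == "__main__":
    # Illustrative sample lines (ethanol and benzene with their CAS numbers).
    sample = ["CCO\t64-17-5\n", "c1ccccc1\t71-43-2\n"]
    for smiles, cas in read_smi(sample):
        print(smiles, cas)
```

After downloading the file, read_smi_gz("nci4000.smi.gz") would return a list of (SMILES, CAS number) pairs under the same assumptions.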

At the same site you will find the full original NCI database as an SD file, and the same structures in another file with added 3D coordinates. The full database contains about 126,705 structures and is public domain.

Since we just had a discussion on the CCL about using scripting languages to control modular programs, here is the short CACTVS Tcl script used to produce this file:

set outhandle [molfile open "|gzip >nci4000.smi.gz" w \
   format smiles \
   hydrogens add \
   writeflags {writearo writename}]
set cnt 0
molfile loop ftp://schiele/pub/nci/nciopen.mol.gz enshandle {
   if {![lempty [ens atoms $enshandle metal]]} continue
   ens assign $enshandle E_*CAS_RN* E_NAME
   molfile write $outhandle $enshandle
   if {[incr cnt] >= 4000} break
}

BTW, beware if you want to repeat this with the popular Babel conversion program (V1.6). First, the program will not be able to read the original data file, because it contains neither display coordinates nor 3D coordinates, only connectivity. Babel will report the cryptic message 'No atoms found in this structure', although the file is syntactically correct. You must use the 3D-enhanced file.

Second, among the first 1000 records, the CACTVS and Babel results are structurally different for 69 records (7%). As far as I have checked, this is all because the MDL read routines of Babel do not read the atomic charge field in the SD file. This is of course fatal when the SMILES string is read back without charge information, since the SMILES standard contains hydrogen addition rules. A nitro group that was coded with a positively charged N and a negatively charged O will suddenly contain an N-O-H group, because the SMILES string N(=O)O without charge information demands the addition of a hydrogen atom to the single-bonded oxygen. The N will also receive an extra hydrogen, or end up as a radical, depending on your reader software.
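The hydrogen addition rule behind this can be illustrated with a small Python sketch. It is a simplified model, not the code of CACTVS, Babel, or any other reader: the function name implicit_hydrogens is hypothetical, only C, N and O are covered, and it assumes the usual rule that an unbracketed atom is filled up to its lowest normal valence (N: 3 or 5, O: 2) that is not below the sum of its explicit bond orders. It models only the "extra hydrogen" behaviour, not the readers that leave a radical instead.

```python
# Normal valences of the unbracketed ("organic subset") atoms used here.
NORMAL_VALENCES = {"C": (4,), "N": (3, 5), "O": (2,)}


def implicit_hydrogens(element, bond_order_sum):
    """Hydrogens a reader adds to an unbracketed, uncharged atom.

    The atom is filled up to the lowest normal valence that is not
    smaller than the sum of its explicit bond orders.
    """
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # already above the highest normal valence


# Nitro group attached to a carbon, written without charges: C-N(=O)O
# N: single bond to C + double bond to O + single bond to O = 4
# single-bonded O: one single bond to N = 1
# double-bonded O: one double bond to N = 2
print("H on N:", implicit_hydrogens("N", 4))
print("H on single-bonded O:", implicit_hydrogens("O", 1))
print("H on double-bonded O:", implicit_hydrogens("O", 2))
```

With the charges kept, the same group is written with bracket atoms, [N+](=O)[O-]. Bracket atoms state their hydrogen count explicitly (none when omitted), so the fill-up rule does not apply and no hydrogens are added.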

Dr. Wolf-D. Ihlenfeldt
Computer Chemistry Center, University of Erlangen-Nuernberg
Naegelsbachstrasse 25, D-91052 Erlangen (Germany)
Tel (+49)-(0)9131-85-6579 Fax (+49)-(0)9131-85-6566