Find the 10 nearest neighbors in a data set

This task uses the toolkit appropriate way of computing molecular similarities to find the 10 most similar structures in a data set.

A common task in cheminformatics is to find target structures in a data set which are similar to a query structure. The word "similar" is ill-defined. What we have instead are well-defined measures which are hopefully well correlated to what a chemist would call similar. One common measure is to generate a hash fingerprint then use the Tanimoto similarity as a proxy for molecular similarity. Other solutions work directly on the lexical token of the SMILES string, or on molecular conformations, or on the electrostatic surface.

Implementation
The data will come from the gzip'ed SD file of the benzodiazepine data set. Use the first structure as the query structure, and use the rest of the file as the targets to find the 10 most similar structures. The output is sorted by similarity, from most similar to least. Each target match is on its own line, and the line contains the similarity score in the first column in the range 0.00 to 1.00 (preferably to 2 decimal places), then a space, then the target ID, which is the title line from the SD file.

Indigo/C++
'Title' line is not stored in Indigo's Molecule class, that is because the molecule numbers in this example are taken from SDF field 'pubchem_compound_cid'.

Instructions: $ cd graph; make CONF=Release32; cd .. $ cd molecule; make CONF=Release32; cd .. $ cd utils $ gcc nearest_neighbors.cpp -o nearest_neighbors -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++ $ ./nearest_neighbors
 * 1) Unpack 'graph' and 'molecule' projects into some folder
 * 2) Create 'utils' folder nearby
 * 3) Paste the above code into utils/nearest_neighbors.cpp file
 * 4) Compile the file using the following commands:
 * 1) Run the program like that:

Cactvs/Tcl
prop setparam E_SCREEN extended 2 set fh [molfile open benzodiazepine.sdf.gz r hydrogens add] set ehcmp [molfile read $fh] set th [molfile scan $fh "structure ~>= $ehcmp 0" {table score E_NAME}] table sort $th {scores descending} {E_NAME ascending} table loop $th row 10 { puts [format "%.2f %s" [expr [lindex $row 0]/100.0] [lindex $row 1]] }

This script only stores the score and name in the table object, so this works with almost arbitrarily large files. It can be further optimized by using a value larger than 0 as the Tanimoto similarity threshold in the file scan.

The result is

1.00 450820 1.00 9862446 0.99 9928121 0.98 23332591 0.98 44353759 0.98 6452650 0.97 11033472 0.97 3016 0.97 450830 0.97 623112

Cactvs/Python
In Python, this is even terser:

Prop.Setparam('E_SCREEN',{'extended':2}) f=Molfile('benzodiazepine.sdf.gz','r',{'hydrogens':'add'}) ecmp=f.read t=f.scan('structure ~>= {} 0'.format(ecmp),('table','score','E_NAME')) t.sort(('score','descending'),('E_NAME','ascending')) t.loop(lambda row:print('{:.2f} {}'.format(row[0]/100.0,row[1])),maxrows=10)

The output is the same as for the Tcl version. If ultracompact lambda functions make you nervous, you can also use a more verbose loop construct similar to the Tcl statement.