Report the similarity between two structures

Use the toolkit's preferred comparison method to compare two different molecules for similarity. The result must be 0.0 if the molecules are not at all similar and 1.0 if they are completely similar.

A common task in cheminformatics is to find target structures in a data set that are similar to a query structure. The word "similar" is ill-defined. What we have instead are well-defined measures that, one hopes, correlate well with what a chemist would call similar.

Each toolkit has different methods for doing this. Some use hashed fingerprints and the Tanimoto coefficient; others use feature keys. OEChem's preferred comparison is with LINGOS, which is based on the uninterpreted SMILES string. Some toolkits might even use shape descriptors.
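As a minimal sketch of the fingerprint-plus-Tanimoto approach mentioned above, the following plain Python computes the Tanimoto coefficient for two fingerprints represented as sets of on-bit indices. The function name, the example bit sets, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration; they are not taken from any particular toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: treated as 0.0 here (an assumed convention).
        return 0.0
    # Shared on-bits divided by all on-bits seen in either fingerprint.
    return len(a & b) / union


# Hypothetical fingerprints, not derived from real molecules.
fp_query = {1, 4, 7, 9, 12}
fp_target = {1, 4, 8, 9, 15, 20}

print(tanimoto(fp_query, fp_target))  # 3 shared bits / 8 bits in the union = 0.375
```

The value is 1.0 for identical bit sets and 0.0 for sets with no bits in common, which matches the range required by the task.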

Implementation
Report the similarity between "C[N+]12CCC(CC1)C(C2)OC(=O)C(O)(c3ccccc3)c4ccccc4" (PubChem CID 1548943) and "OC(C(=O)O[C@H]1CN2CCC1CC2)(c3ccccc3)c4ccccc4" (PubChem CID 1183).

CDK/Groovy
Save as calcTanimoto.groovy and run with:

groovy calcTanimoto.groovy

Indigo/Python
This example calculates the similarity based on two kinds of fingerprints: similarity fingerprints and substructure fingerprints. Similarity fingerprints are shorter; substructure fingerprints are more descriptive. For each of the two kinds of fingerprints, both Tanimoto and Tversky similarity values are written to standard output.
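For readers unfamiliar with the Tversky index, here is a hedged plain-Python sketch of its usual definition on bit sets: shared bits divided by shared bits plus weighted counts of the bits unique to each side. This is not Indigo code. The function name, the default weights of 0.5, and the example bit sets are assumptions chosen for illustration. With both weights set to 1.0 the index reduces to Tanimoto, and with both set to 0.5 it reduces to the Dice coefficient.

```python
def tversky(bits_a, bits_b, alpha=0.5, beta=0.5):
    """Tversky index of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    # Bits unique to each fingerprint are weighted by alpha and beta.
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Two empty fingerprints: treated as 0.0 here (an assumed convention).
    return common / denom if denom else 0.0


# Hypothetical bit sets: 13 shared bits, 8 unique to each side.
fp_a = set(range(0, 21))
fp_b = set(range(8, 29))

print(round(tversky(fp_a, fp_b, alpha=1.0, beta=1.0), 6))  # Tanimoto: 13/29 = 0.448276
print(round(tversky(fp_a, fp_b), 6))                       # Dice:     13/21 = 0.619048
```

The bit counts were picked so that the two printed ratios equal the similarity-fingerprint values listed in the expected result of the Indigo/C++ example. Those reported Tversky values are therefore consistent with weights of 0.5 and 0.5, although the actual fingerprints certainly differ from these toy sets.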

Indigo/C++
The same calculations as in Indigo/Python, but using the Indigo core C++ library.

Instructions:

 * 1) Unpack the 'graph' and 'molecule' projects into some folder
 * 2) Create a 'utils' folder next to them
 * 3) Paste the above code into the utils/similarity.cpp file
 * 4) Compile the file using the following commands:

$ cd graph; make CONF=Release32; cd ..
$ cd molecule; make CONF=Release32; cd ..
$ cd utils
$ gcc similarity.cpp -o similarity -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++

 * 5) Run the program like this:

$ ./similarity

Expected result:

Similarity fingerprints:
  Tanimoto: 0.448276
  Tversky: 0.619048
Substructure fingerprints:
  Tanimoto: 0.436823
  Tversky: 0.608040

OpenBabel/Pybel
This reports a similarity of 0.360465116279.

OpenEye/Python
I think LINGOS is the preferred similarity measure in OEChem, but it gives a much lower similarity value for these two structures than I expected, so I also show how to use its path fingerprints.

The output is:

LINGOS similarity: 0.0425531901419
Hash similarity: 0.374125868082
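One possible reason for a low LINGOS value is that the method compares the SMILES text itself, so the way each string is written matters. The following is a simplified sketch of a LINGO-style comparison in plain Python, assuming the commonly described scheme of counting overlapping 4-character substrings and averaging a per-substring agreement score. It is not OEChem's implementation, it skips the usual SMILES normalization steps, and it is not expected to reproduce the value above. The function names are hypothetical.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_similarity(smiles_a, smiles_b, q=4):
    """Simplified LINGO-style similarity on raw SMILES text (an illustrative sketch)."""
    counts_a, counts_b = lingos(smiles_a, q), lingos(smiles_b, q)
    keys = set(counts_a) | set(counts_b)
    if not keys:
        # Strings shorter than q have no substrings: treated as 0.0 here (an assumption).
        return 0.0
    # Each distinct substring contributes 1.0 when the counts agree, less otherwise.
    total = sum(
        1 - abs(counts_a[k] - counts_b[k]) / (counts_a[k] + counts_b[k])
        for k in keys
    )
    return total / len(keys)


# The same molecule (1-butanol) written in two atom orders.
print(lingo_similarity("CCCCO", "CCCCO"))
print(lingo_similarity("CCCCO", "OCCCC"))
```

The second call scores well below 1.0 even though both strings describe the same molecule, because only one of the three distinct substrings is shared. This is why text-based measures of this kind are normally applied to canonical SMILES.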

RDKit/Python
The output is:

RDK fingerprint: 0.471502590674
Morgan fingerprint: 0.505494505495

Cactvs/Tcl
puts [expr [prop compare E_SCREEN \
    [ens get [ens create "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"] E_SCREEN] \
    [ens get [ens create "COC1=C(C=CC(=C1)C=O)O"] E_SCREEN] tanimoto]/100.0]

This computes the Tanimoto similarity on the standard pattern-based fingerprint E_SCREEN (the result is 0.68). The toolkit supports various other similarity measures (Cosine, Dice, Hamann, Tversky, Kulczynski, Pearson, Russell-Rao, Simpson, Yule) and alternative fingerprints (both fragment- and path-based).

There is no need to look up the SMILES strings; we can work directly with PubChem CIDs (or SIDs):

puts [expr [prop compare E_SCREEN \
    [ens get [ens create 1548943] E_SCREEN] \
    [ens get [ens create 1183] E_SCREEN] tanimoto]/100.0]

The example above uses the fingerprints delivered as part of the PubChem structure data. Since PubChem uses a longer fingerprint than the default, the result is slightly different (0.7). To arrive at identical results, either add a

prop setparam E_SCREEN extended 2

command to the first example, or implicitly force re-computation of the fingerprint bits by specifying a computation parameter in the second example as in

puts [expr [prop compare E_SCREEN \
    [ens get 1548943 E_SCREEN {} {extended 0}] \
    [ens get 1183 E_SCREEN {} {extended 0}] tanimoto]/100.0]

Here we also further simplify the ensemble creation by instantiating a transient structure directly from the PubChem CID.

Cactvs/Python
And here again the equivalent Python solutions:

With object creation from SMILES:

print(Prop.Compare('E_SCREEN',Ens('CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O').E_SCREEN, Ens('COC1=C(C=CC(=C1)C=O)O').E_SCREEN,'tanimoto')/100.0)

With structure creation from CID:

print(Prop.Compare('E_SCREEN',Ens(1548943).E_SCREEN, Ens(1183).E_SCREEN,'tanimoto')/100.0)

With transient structure objects and computation parameter check:

print(Prop.Compare('E_SCREEN',
    Ens.Get(1548943, 'E_SCREEN', parameters={'extended': 0}),
    Ens.Get(1183, 'E_SCREEN', parameters={'extended': 0}),
    'tanimoto') / 100.0)