Chemistry Toolkit Rosetta Wiki

Use the toolkit's preferred comparison method to compare two different molecules for similarity. The result must be 0.0 if the molecules are not at all similar and 1.0 if they are completely similar.

A common task in cheminformatics is to find target structures in a data set which are similar to a query structure. The word "similar" is ill-defined. What we have instead are well-defined measures which are hopefully well correlated to what a chemist would call similar.

Each toolkit has different methods for doing this. Some use hashed fingerprints and Tanimoto; others use feature keys. OEChem's preferred comparison is LINGOS, which is based on the uninterpreted SMILES string. Some toolkits might even use shape descriptors.
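To make the Tanimoto measure concrete, here is a minimal sketch in plain Python that is independent of any toolkit. It assumes a fingerprint is represented as a set of on-bit indices; the function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are choices made for this illustration only.

```python
def tanimoto(bits1, bits2):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    bits1, bits2 = set(bits1), set(bits2)
    union = len(bits1 | bits2)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    # Bits set in both, divided by bits set in either
    return len(bits1 & bits2) / union

print(tanimoto({1, 5, 9}, {1, 5, 9}))      # identical bit sets: 1.0
print(tanimoto({1, 5, 9}, {2, 6}))         # no bits in common: 0.0
print(tanimoto({1, 5, 9}, {1, 5, 7, 11}))  # 2 shared bits out of 5 distinct: 0.4
```

The result is always between 0.0 and 1.0, which matches the range the task asks for.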


Report the similarity between "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O" (PubChem CID 1548943) and "COC1=C(C=CC(=C1)C=O)O" (PubChem CID 1183).


Save as calcTanimoto.groovy and run with:

   groovy calcTanimoto.groovy

import org.openscience.cdk.fingerprint.*;
import org.openscience.cdk.smiles.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.similarity.*;

smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance())
smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"
smiles2 = "COC1=C(C=CC(=C1)C=O)O"
mol1 = smilesParser.parseSmiles(smiles1)
mol2 = smilesParser.parseSmiles(smiles2)
fingerprinter = new HybridizationFingerprinter()
bitset1 = fingerprinter.getFingerprint(mol1)
bitset2 = fingerprinter.getFingerprint(mol2)
tanimoto = Tanimoto.calculate(bitset1, bitset2)
println "Tanimoto: $tanimoto"


This example calculates the similarity based on two kinds of fingerprints: similarity fingerprints and substructure fingerprints. Similarity fingerprints are shorter; substructure fingerprints are more descriptive. For each of the two kinds of fingerprints, both Tanimoto and Tversky similarity values are written to standard output.

from indigo import *

indigo = Indigo()

m1 = indigo.loadMolecule("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
m2 = indigo.loadMolecule("COC1=C(C=CC(=C1)C=O)O")
# Aromatize molecules because second molecule is not in aromatic form
m1.aromatize()
m2.aromatize()

# Calculate similarity between "similarity" fingerprints
print("Similarity fingerprints:")
fp1 = m1.fingerprint("sim")
fp2 = m2.fingerprint("sim")

print("  Tanimoto: %s" % (indigo.similarity(fp1, fp2, "tanimoto")))
print("  Tversky: %s" % (indigo.similarity(fp1, fp2, "tversky")))

# Calculate similarity between "substructure" fingerprints
print("Substructure fingerprints:")
fp1 = m1.fingerprint("sub")
fp2 = m2.fingerprint("sub")

print("  Tanimoto: %s" % (indigo.similarity(fp1, fp2, "tanimoto")))
print("  Tversky: %s" % (indigo.similarity(fp1, fp2, "tversky")))
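For reference, both measures can be written in terms of bit counts. The sketch below is plain Python, not Indigo code; the function name `tversky` and the example counts are illustrative assumptions. With alpha = beta = 1 the Tversky index reduces to Tanimoto, and with alpha = beta = 0.5 it reduces to 2c/(a+b), which is the formula the C++ version of this example computes by hand.

```python
def tversky(ones1, ones2, common, alpha=0.5, beta=0.5):
    """Tversky index from bit counts.

    ones1, ones2 -- number of set bits in each fingerprint
    common       -- number of bits set in both fingerprints
    """
    only1 = ones1 - common   # bits set only in the first fingerprint
    only2 = ones2 - common   # bits set only in the second fingerprint
    denom = common + alpha * only1 + beta * only2
    return common / denom if denom else 0.0

# Hypothetical bit counts, for illustration only
a, b, c = 12, 10, 6
print("Tanimoto: %f" % tversky(a, b, c, 1.0, 1.0))            # same as c / (a + b - c)
print("Tversky (0.5, 0.5): %f" % tversky(a, b, c, 0.5, 0.5))  # same as 2c / (a + b)
```

Unequal alpha and beta make the measure asymmetric, which can be useful when one molecule is treated as the query and the other as the target.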


The same calculations as in the Indigo/Python example, but using the Indigo core C++ library.

#include <stdio.h>
#include <string.h>

#include "base_cpp/scanner.h"
#include "molecule/molecule.h"
#include "molecule/smiles_loader.h"
#include "molecule/molecule_arom.h"
#include "molecule/molecule_fingerprint.h"
#include "base_c/bitarray.h"

// Build fingerprints for both molecules and print their similarity
void _Fingerprints (Molecule &mol1, Molecule &mol2,
                    MoleculeFingerprintParameters &params)
{
   MoleculeFingerprintBuilder builder1(mol1, params);
   MoleculeFingerprintBuilder builder2(mol2, params);
   int fpsize = params.fingerprintSize();

   // Compute the fingerprint bits before counting them
   builder1.process();
   builder2.process();

   int ones1 = bitGetOnesCount(builder1.get(), fpsize);
   int ones2 = bitGetOnesCount(builder2.get(), fpsize);
   int common_ones = bitCommonOnes(builder1.get(), builder2.get(), fpsize);
   float tanimoto = 0, tversky = 0;

   if (common_ones > 0)
   {
      tanimoto = (float)common_ones / (ones1 + ones2 - common_ones);
      // Tversky with alpha = beta = 0.5
      tversky = 2.f * common_ones / (ones1 + ones2);
   }
   printf("  Tanimoto: %f\n  Tversky: %f\n", tanimoto, tversky);
}

int main (void)
{
   const char *smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O";
   const char *smiles2 = "COC1=C(C=CC(=C1)C=O)O";
   Molecule mol1, mol2;

   try
   {
      BufferScanner scanner1(smiles1);
      SmilesLoader loader1(scanner1);

      loader1.loadMolecule(mol1, false);
      // Aromatize, as in the Indigo/Python example
      // (call assumed from molecule_arom.h; the original line is missing)
      MoleculeAromatizer::aromatizeBonds(mol1);

      BufferScanner scanner2(smiles2);
      SmilesLoader loader2(scanner2);

      loader2.loadMolecule(mol2, false);
      MoleculeAromatizer::aromatizeBonds(mol2);

      MoleculeFingerprintParameters params1, params2;

      memset(&params1, 0, sizeof(params1));
      memset(&params2, 0, sizeof(params2));

      // 64 bytes -- default value in Bingo for similarity search
      params1.sim_qwords = 8;
      // 200 bytes -- default value in Bingo for substructure search
      params2.ord_qwords = 25;

      printf("Similarity fingerprints:\n");
      _Fingerprints(mol1, mol2, params1);
      printf("Substructure fingerprints:\n");
      _Fingerprints(mol1, mol2, params2);
   }
   catch (Exception &e)
   {
      fprintf(stderr, "error: %s\n", e.message());
      return -1;
   }

   return 0;
}


  1. Unpack the 'graph' and 'molecule' projects into a folder
  2. Create a 'utils' folder next to them
  3. Paste the above code into utils/similarity.cpp
  4. Compile the file using the following commands:
    $ cd graph; make CONF=Release32; cd ..
    $ cd molecule; make CONF=Release32; cd ..
    $ cd utils
    $ gcc similarity.cpp -o similarity -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++
  5. Run the program:
    $ ./similarity

Expected result:

Similarity fingerprints:
  Tanimoto: 0.448276
  Tversky: 0.619048
Substructure fingerprints:
  Tanimoto: 0.436823
  Tversky: 0.608040
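
As a quick consistency check on these numbers: when Tversky is computed as 2c/(a+b), as in the C++ code above, it depends only on the Tanimoto value T = c/(a+b-c), because substituting a+b = c(1+T)/T gives Tversky = 2T/(1+T). The sketch below (the helper name `dice_from_tanimoto` is made up for this illustration) checks that the reported values agree with that relation to within rounding.

```python
def dice_from_tanimoto(t):
    """Value of 2c/(a+b) implied by a Tanimoto value t = c/(a+b-c)."""
    return 2.0 * t / (1.0 + t)

# (label, reported Tanimoto, reported Tversky) taken from the expected result above
reported = [
    ("Similarity fingerprints", 0.448276, 0.619048),
    ("Substructure fingerprints", 0.436823, 0.608040),
]

for label, t, tv in reported:
    derived = dice_from_tanimoto(t)
    print("%s: derived %.6f, reported %.6f" % (label, derived, tv))
    # The reported values are rounded to six decimals, so allow a small tolerance
    assert abs(derived - tv) < 1e-5
```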


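import pybel  # with Open Babel 3.x use: from openbabel import pybel
mol1 = pybel.readstring("smi", "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
mol2 = pybel.readstring("smi", "COC1=C(C=CC(=C1)C=O)O")
# The | operator on two pybel fingerprints returns their Tanimoto coefficient
print(mol1.calcfp() | mol2.calcfp())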

This reports a similarity of 0.360465116279.


require 'rubabel' 
(mol1, mol2) = %w{CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O COC1=C(C=CC(=C1)C=O)O}.map{|sml| Rubabel[sml]}
puts mol1.tanimoto(mol2)


I think LINGOS is the preferred similarity measure in OEChem, but it gives a much lower similarity value for these two structures than I expected, so I also show how to use its path fingerprints.

from openeye.oechem import *
from openeye.oegraphsim import *

sim = OELingoSim("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
print("LINGOS similarity:", sim.Similarity("COC1=C(C=CC(=C1)C=O)O"))

def make_fp(smiles):
    mol = OEGraphMol()
    OEParseSmiles(mol, smiles)
    fp = OEFingerPrint()
    OEMakePathFP(fp, mol)
    return fp

fp1 = make_fp("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
fp2 = make_fp("COC1=C(C=CC(=C1)C=O)O")
print("Hash similarity:", OETanimoto(fp1, fp2))

The output is

LINGOS similarity: 0.0425531901419
Hash similarity: 0.374125868082
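
To illustrate why a text-based measure can score these two structures so low, here is a simplified sketch of a LINGO-style comparison in plain Python. It assumes the commonly described scheme of overlapping 4-character SMILES substrings compared by their counts; it omits any SMILES preprocessing, is not OpenEye's implementation, and is not expected to reproduce the value above. The function names are made up for this illustration.

```python
from collections import Counter

def lingos(smiles, q=4):
    """Count every overlapping substring of length q in the SMILES text."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))

def lingo_similarity(smiles1, smiles2, q=4):
    """Average per-substring agreement of counts; 0.0 to 1.0."""
    c1, c2 = lingos(smiles1, q), lingos(smiles2, q)
    keys = set(c1) | set(c2)
    if not keys:
        return 0.0  # both strings shorter than q
    # A Counter returns 0 for a missing key, so unshared substrings contribute 0
    total = sum(1.0 - abs(c1[k] - c2[k]) / (c1[k] + c2[k]) for k in keys)
    return total / len(keys)

s1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"
s2 = "COC1=C(C=CC(=C1)C=O)O"
print("Simplified LINGO-style similarity:", lingo_similarity(s1, s2))
```

One likely contributor to the low score is that the first SMILES writes the ring in aromatic (lowercase) form while the second uses the Kekulé form, so the two strings share few substrings even though both molecules contain the same substituted benzene ring.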


from rdkit import Chem,DataStructs
mol1 = Chem.MolFromSmiles("CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O")
mol2 = Chem.MolFromSmiles("COC1=C(C=CC(=C1)C=O)O")

# the default fingerprint is path-based:
fp1 = Chem.RDKFingerprint(mol1)
fp2 = Chem.RDKFingerprint(mol2)
print("RDK fingerprint: ", DataStructs.TanimotoSimilarity(fp1, fp2))

# the Morgan fingerprint (similar to ECFP) is also useful:
from rdkit.Chem import rdMolDescriptors
mfp1 = rdMolDescriptors.GetMorganFingerprint(mol1,2)
mfp2 = rdMolDescriptors.GetMorganFingerprint(mol2,2)
print("Morgan fingerprint: ", DataStructs.DiceSimilarity(mfp1, mfp2))

The output is

RDK fingerprint:  0.471502590674
Morgan fingerprint:  0.505494505495


puts [expr [prop compare E_SCREEN \
    [ens get [ens create "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"] E_SCREEN] \
    [ens get [ens create "COC1=C(C=CC(=C1)C=O)O"] E_SCREEN] tanimoto]/100.0]

This computes the Tanimoto similarity on the standard pattern-based fingerprint E_SCREEN (result is 0.68). The toolkit supports various other similarity measures (Cosine, Dice, Hamann, Tversky, Kulczynski, Pearson, Russell-Rao, Simpson, Yule) and alternative fingerprints (both fragment- and path-based).
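
For orientation, several of these measures can be written from the same four bit counts. The sketch below is plain Python using textbook definitions; it is not Cactvs code, and the toolkit may define or scale some measures differently (the Tcl example above divides by 100.0, for instance). The function name and the example counts are hypothetical.

```python
import math

def similarity_measures(a, b, c, n):
    """Textbook similarity measures from bit counts.

    a, b -- number of set bits in each fingerprint (both must be > 0)
    c    -- number of bits set in both fingerprints
    n    -- total fingerprint length in bits
    """
    return {
        "tanimoto": c / (a + b - c),
        "dice": 2.0 * c / (a + b),
        "cosine": c / math.sqrt(a * b),
        "kulczynski": 0.5 * (c / a + c / b),
        "simpson": c / min(a, b),
        "russell_rao": c / n,
    }

# Hypothetical counts, for illustration only
for name, value in similarity_measures(20, 22, 13, 512).items():
    print("%-12s %.4f" % (name, value))
```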

There is also no need to look up the SMILES strings; we can work directly with PubChem CIDs (or SIDs):

puts [expr [prop compare E_SCREEN [ens get [ens create 1548943] E_SCREEN] \
    [ens get [ens create 1183] E_SCREEN] tanimoto]/100.0]

The example above uses the fingerprints delivered as part of the PubChem structure data. Since PubChem uses a longer fingerprint than the default, the result is slightly different (0.7). To arrive at identical results, either add a

prop setparam E_SCREEN extended 2

command to the first example, or implicitly force re-computation of the fingerprint bits by specifying a computation parameter in the second example as in

puts [expr [prop compare E_SCREEN \
    [ens get 1548943 E_SCREEN {} {extended 0}] \
    [ens get 1183 E_SCREEN {} {extended 0}] tanimoto]/100.0]

Here we also further simplify the ensemble creation by instantiating a transient structure directly from the PubChem CID.


And here again the equivalent Python solutions:

With object creation from SMILES:


With structure creation from CID:


With transient structure objects and computation parameter check: