Changes: Find the 10 nearest neighbors in a data set

Revision as of 05:30, 8 February 2010

This task uses the toolkit appropriate way of computing molecular similarities to find the 10 most similar structures in a data set.

A common task in cheminformatics is to find target structures in a data set which are similar to a query structure. The word "similar" is ill-defined. What we have instead are well-defined measures which are hopefully well correlated to what a chemist would call similar. One common measure is to generate a hash fingerprint then use the Tanimoto similarity as a proxy for molecular similarity. Other solutions work directly on the lexical token of the SMILES string, or on molecular conformations, or on the electrostatic surface.

Implementation

The data will come from the gzip'ed SD file of the benzodiazepine data set. Use the first structure as the query structure, and use the rest of the file as the targets to find the 10 most similar structures. The output is sorted by similarity, from most similar to least. Each target match is on its own line, and the line contains the similarity score in the first column in the range 0.00 to 1.00 (preferably to 2 decimal places), then a space, then the target ID, which is the title line from the SD file.

OpenBabel/Pybel

# This solution uses OpenBabel's default path-based fingerprint (FP2)

import pybel

inputfile = pybel.readfile("sdf", "benzodiazepine.sdf")
query = inputfile.next()
queryfp = query.calcfp()

scores = sorted((-(mol.calcfp() | queryfp), mol.title) for mol in inputfile)
print "\n".join("%.2f %s" % (-sim, title) for sim, title in scores[:10])

1.00 450820
1.00 623918
0.99 20351792
0.99 9862446
0.98 398657
0.98 398658
0.98 11033472
0.98 19002495
0.98 19758676
0.98 20512505

OpenEye/Python

# This solution uses OpenEye's default path-based fingerprints

from openeye.oechem import *
from openeye.oegraphsim import *

ifs = oemolistream()
ifs.open("benzodiazepine.sdf.gz")
mol_iter = ifs.GetOEGraphMols()

# Get the first structure and compute its fingerprint
mol = next(mol_iter)
query_fp = OEFingerPrint()
OEMakePathFP(query_fp, mol)

# Will put each (similarity score, title) into this list.
scores = []

# Create this once and reuse
target_fp = OEFingerPrint()

for mol in mol_iter:
    OEMakePathFP(target_fp, mol)
    similarity = OETanimoto(query_fp, target_fp)
    scores.append( (similarity, mol.GetTitle()) )

# Use a special sort key, to order so most similar
# comes first. Use the title to break ties.
scores.sort(key=lambda (similarity, title): (-similarity, title))

for similarity, title in scores[:10]:
    # Report the similaity to two decimal places
    print "%.2f" % similarity, title

Here is the output

1.00 10913700
1.00 14221920
1.00 14221922
1.00 450820
1.00 9904514
0.92 20528394
0.92 3085818
0.91 19002495
0.91 19758676
0.91 3016

Different version

from openeye.oechem import *
from openeye.oegraphsim import *

def main():
    OEThrow.SetLevel(OEErrorLevel_Error)
    
    ifs = oemolistream()
    ifs.open('benzodiazepine.sdf.gz')

    # CREATE FINGERPRINT DATABASE
    molecules = []
    fpdb = OEFPDatabase(OEFPType_Path)
    
    for mol in ifs.GetOEGraphMols():
        fpdb.AddFP(mol)
        molecules.append(OEGraphMol(mol))
    
    for score in fpdb.GetSortedScores(molecules[0], 10):
        print 'Tanimoto score %.2f for %s' % (score.GetScore(), molecules[score.GetIdx()].GetTitle())

main()

Tanimoto score 1.00 for 9904514
Tanimoto score 1.00 for 14221922
Tanimoto score 1.00 for 10913700
Tanimoto score 1.00 for 1688
Tanimoto score 1.00 for 14221920
Tanimoto score 1.00 for 450820
Tanimoto score 0.92 for 20528394
Tanimoto score 0.92 for 3085818
Tanimoto score 0.91 for 19758676
Tanimoto score 0.91 for 450830

@@ Line 136: / Line 136: @@
 [[Category:OpenEye/Python]]
 [[Category:OpenBabel/Pybel]]
+[[Category:RDKit/Python]]