Chemistry Toolkit Rosetta Wiki
Register
No edit summary
 
 
(15 intermediate revisions by 8 users not shown)
Line 1: Line 1:
  +
Almost every cheminformatics toolkit uses a graph model based on the valance bond model of chemistry. The node and edges are labeled with properties like atomic number, bond type, and chirality to make them represent atoms and bonds. This is an incomplete model of chemistry, and doesn't even handle table salt well, much less clathrate or [http://www.google.com/search?hl=en&client=safari&rls=en&q=%22Cheminformatics+Tool%22+site%3Adepth-first.com&aq=f&aqi=&oq= more advanced chemistry]. But luckily for the pharmaceutical industry, most drugs are well described by graphs.
[[File:Placeholder|thumb|300px]]
 
  +
Replace this text by writing your article here!
 
  +
The mathematics of graphs are extremely well-studied, and some of those results can be applied to chemistry. When a toolkit doesn't implement a graph-based descriptor, you've got to work with the graph model directly.
  +
  +
The point of this task is to get some idea of what it's like to work with atoms and bonds by finding the graph diameter a the non-hydrogen atoms in a structure. Define the distance between two atoms as the shortest number of bonds to get from one atom to the other. The graph diameter is the longest such distance in the graph.
  +
  +
For example, the graph diameter of pentane is 4 and of benzene is 3.
  +
  +
'''NOTE:''' After implementing this I found out it doesn't give much insight into the toolkit. I think this should be changed to something else. Perhaps generate hash fingerprints of linear fragments of length 4, based on a few graph properties?
  +
  +
==Implementation==
  +
  +
Write a function or method which takes a molecule as input and which traverses the graph structure in order to find the graph diameter. The function should ignore hydrogens, may assume all hydrogens are implicit in the input structure.
  +
  +
Test the problem with PubChem CID 5487286, which has the SMILES structure "CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]". It should report a diameter of 24 bonds.
  +
  +
The easiest approach is to find the [http://en.wikipedia.org/wiki/Shortest_path_problem all-pairs shortest path] then find the largest number.
  +
  +
==OpenBabel/Rubabel==
  +
<source lang='ruby'>
  +
require 'rubabel'
  +
puts Rubabel["CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]"].graph_diameter
  +
</source>
  +
==OpenEye/Python==
  +
<source lang="python">
  +
from collections import defaultdict
  +
from itertools import chain
  +
  +
from openeye.oechem import *
  +
  +
INFINITY = float("inf")
  +
  +
class InfinityDict(dict):
  +
def __missing__(self, key):
  +
return INFINITY
  +
  +
def floyd_warshall(atoms, distances):
  +
for k in atoms:
  +
for i in atoms:
  +
for j in atoms:
  +
d = distances[i][k] + distances[k][j]
  +
distances[i][j] = min(distances[i][j], d)
  +
  +
  +
# Each atom has a unique index which is constant over the lifetime of
  +
# the molecule. These numbers should be small, but aren't necessarily
  +
# contiguous because of deletions.
  +
  +
def make_distance_matrix(mol):
  +
atoms = []
  +
distances = defaultdict(InfinityDict)
  +
for atom in mol.GetAtoms():
  +
idx = atom.GetIdx()
  +
distances[idx][idx] = 0
  +
atoms.append(idx)
  +
for other_atom in atom.GetAtoms():
  +
distances[idx][other_atom.GetIdx()] = 1
  +
return atoms, distances
  +
  +
  +
def all_pairs_shortest_distance(mol):
  +
atoms, distances = make_distance_matrix(mol)
  +
floyd_warshall(atoms, distances)
  +
return distances
  +
  +
def diameter(mol):
  +
distances = all_pairs_shortest_distance(mol)
  +
all_values = chain(*(row.values() for row in distances.values()))
  +
return max(d for d in all_values if d != INFINITY)
  +
  +
  +
mol = OEGraphMol()
  +
OEParseSmiles(mol,
  +
"CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]"
  +
)
  +
print "Diameter:\n", diameter(mol)
  +
</source>
  +
  +
  +
===Another version===
  +
  +
<source lang="python">
  +
from openeye.oechem import *
  +
  +
def get_diameter(mol):
  +
return max([OEGetPathLength(a,b) for a in mol.GetAtoms(OEIsHeavy()) for b in mol.GetAtoms(OEIsHeavy()) if a.GetIdx() < b.GetIdx()])
  +
</source>
  +
  +
==RDKit/Python==
  +
<source lang="python">
  +
from rdkit import Chem
  +
import numpy
  +
mol = Chem.MolFromSmiles('CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]')
  +
# GetDistanceMatrix returns the molecules 2D (topological) distance matrix:
  +
dm = Chem.GetDistanceMatrix(mol)
  +
diameter = int(numpy.max(dm))
  +
</source>
  +
  +
==Chemkit/C++==
  +
<source lang="cpp">#include <iostream>
  +
  +
#include <boost/bind.hpp>
  +
  +
#include <chemkit/atom.h>
  +
#include <chemkit/foreach.h>
  +
#include <chemkit/molecule.h>
  +
  +
using namespace chemkit;
  +
  +
int main(void)
  +
{
  +
Molecule molecule("CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)"
  +
"NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]", "smiles");
  +
  +
// remove terminal hydrogens
  +
molecule.removeAtomIf(boost::bind(&Atom::isTerminalHydrogen, _1));
  +
  +
std::cout << molecule.descriptor("graph-diameter").toInt() << std::endl;
  +
  +
return 0;
  +
}
  +
</source>
  +
  +
==Cactvs/Tcl==
  +
<pre lang="tcl">
  +
set eh [ens create {CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]} nohadd]
  +
puts [max {*}[join [ens get $eh A_TOPO_DISTANCE]]]
  +
</pre>
  +
  +
Since the question was to compute the graph diameter without hydrogens, we need to decode the SMILES without implicit hydrogens, otherwise the result is off by one (there are hydrogens on the methyl group).
  +
  +
==Cactvs/Python==
  +
<pre lang=python">
  +
from itertools import chain
  +
e=Ens('CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]','nohadd')
  +
print(max(list(chain(*e.A_TOPO_DISTANCE))))
  +
</pre>
  +
  +
This gives the same result, but it was generated with even more cryptic contortions for the list and tuple flattening ;-)
  +
[[Category:graph traversal]]
  +
[[Category:OpenEye/Python]]
  +
[[Category:RDKit/Python]]
  +
[[Category:Cactvs/Tcl]]
  +
[[Category:Cactvs/Python]]

Latest revision as of 16:54, 11 October 2013

Almost every cheminformatics toolkit uses a graph model based on the valance bond model of chemistry. The node and edges are labeled with properties like atomic number, bond type, and chirality to make them represent atoms and bonds. This is an incomplete model of chemistry, and doesn't even handle table salt well, much less clathrate or more advanced chemistry. But luckily for the pharmaceutical industry, most drugs are well described by graphs.

The mathematics of graphs are extremely well-studied, and some of those results can be applied to chemistry. When a toolkit doesn't implement a graph-based descriptor, you've got to work with the graph model directly.

The point of this task is to get some idea of what it's like to work with atoms and bonds by finding the graph diameter a the non-hydrogen atoms in a structure. Define the distance between two atoms as the shortest number of bonds to get from one atom to the other. The graph diameter is the longest such distance in the graph.

For example, the graph diameter of pentane is 4 and of benzene is 3.

NOTE: After implementing this I found out it doesn't give much insight into the toolkit. I think this should be changed to something else. Perhaps generate hash fingerprints of linear fragments of length 4, based on a few graph properties?

Implementation[]

Write a function or method which takes a molecule as input and which traverses the graph structure in order to find the graph diameter. The function should ignore hydrogens, may assume all hydrogens are implicit in the input structure.

Test the problem with PubChem CID 5487286, which has the SMILES structure "CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]". It should report a diameter of 24 bonds.

The easiest approach is to find the all-pairs shortest path then find the largest number.

OpenBabel/Rubabel[]

require 'rubabel'
puts Rubabel["CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]"].graph_diameter

OpenEye/Python[]

from collections import defaultdict
from itertools import chain

from openeye.oechem import *

INFINITY = float("inf")

class InfinityDict(dict):
    def __missing__(self, key):
        return INFINITY

def floyd_warshall(atoms, distances):
    for k in atoms:
        for i in atoms:
            for j in atoms:
                d = distances[i][k] + distances[k][j]
                distances[i][j] = min(distances[i][j], d)
                    

# Each atom has a unique index which is constant over the lifetime of
# the molecule. These numbers should be small, but aren't necessarily
# contiguous because of deletions.

def make_distance_matrix(mol):
    atoms = []
    distances = defaultdict(InfinityDict)
    for atom in mol.GetAtoms():
        idx = atom.GetIdx()
        distances[idx][idx] = 0
        atoms.append(idx)
        for other_atom in atom.GetAtoms():
            distances[idx][other_atom.GetIdx()] = 1
    return atoms, distances


def all_pairs_shortest_distance(mol):
    atoms, distances = make_distance_matrix(mol)
    floyd_warshall(atoms, distances)
    return distances

def diameter(mol):
    distances = all_pairs_shortest_distance(mol)
    all_values = chain(*(row.values() for row in distances.values()))
    return max(d for d in all_values if d != INFINITY)


mol = OEGraphMol()
OEParseSmiles(mol,
  "CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]"
)
print "Diameter:\n", diameter(mol)


Another version[]

from openeye.oechem import *

def get_diameter(mol):
    return max([OEGetPathLength(a,b) for a in mol.GetAtoms(OEIsHeavy()) for b in mol.GetAtoms(OEIsHeavy()) if a.GetIdx() < b.GetIdx()])

RDKit/Python[]

from rdkit import Chem
import numpy
mol = Chem.MolFromSmiles('CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]')
# GetDistanceMatrix returns the molecules 2D (topological) distance matrix:
dm = Chem.GetDistanceMatrix(mol)
diameter = int(numpy.max(dm))

Chemkit/C++[]

#include <iostream>

#include <boost/bind.hpp>

#include <chemkit/atom.h>
#include <chemkit/foreach.h>
#include <chemkit/molecule.h>

using namespace chemkit;

int main(void)
{
    Molecule molecule("CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)"
                      "NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]", "smiles");

    // remove terminal hydrogens
    molecule.removeAtomIf(boost::bind(&Atom::isTerminalHydrogen, _1));

    std::cout << molecule.descriptor("graph-diameter").toInt() << std::endl;

    return 0;
}

Cactvs/Tcl[]

set eh [ens create {CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]} nohadd]
puts [max {*}[join [ens get $eh A_TOPO_DISTANCE]]]

Since the question was to compute the graph diameter without hydrogens, we need to decode the SMILES without implicit hydrogens, otherwise the result is off by one (there are hydrogens on the methyl group).

Cactvs/Python[]

from itertools import chain
e=Ens('CC(C)C(C(=O)NC(=O)C1CCCN1C(=O)C(C)NC(=O)C(C)NC(=O)CCC(=O)OC)NC2=CC=C(C=C2)[N+](=O)[O-]','nohadd')
print(max(list(chain(*e.A_TOPO_DISTANCE))))

This gives the same result, but it was generated with even more cryptic contortions for the list and tuple flattening ;-)