Chemistry Toolkit Rosetta Wiki
(Added reference output, determined the input data set, links to reference results, added an OEChem/Python solution)
No edit summary
 
(17 intermediate revisions by 9 users not shown)
Line 1: Line 1:
  +
The goals of this task are to show how to read molecules from a SMILES file and how to get the number of cycles in a molecule. As is usual in chemistry software, the data structure called a "molecule" may contain more than one chemistry molecules, which in a graph theory terms is called a connected component.
Read the SMILES file for the [[benzodiazepine]] data set, available from [http://dalkescientific.com/writings/benzodiazepine.smi.gz]. For each input record, find the number of rings. This would be the minimal number of rings and not the total number of all possible rings.
 
  +
  +
Ring counts can be used as descriptors and as ways to classify structures. One way to compute the ring count is to compute the Euler characteristic: #Rings = #Bonds - #Atoms + #Components. Many chemistry toolkits, though not all, implement a ring finding algorithm which identifies the most chemically relevant rings. This is not a simple problem: consider that cubane has 12 bonds and 8 atoms, giving 5 rings, which implies that one of the faces is not in a ring.
  +
  +
There is a large number of ring finding algorithms. Many have the goal of finding a "smallest set of smallest rings", so are called an SSSR algorithm. Frequently the easiest way to get the ring count from a toolkit comes as a direct consequence of computing the SSSR. Note though that ring identification is not necessary for this task.
  +
  +
==Implementation==
  +
 
Use the SMILES file from the [[benzodiazepine]] data set, available from [http://dalkescientific.com/writings/benzodiazepine.smi.gz]. For each input record, find the number of rings. This would be the minimal number of rings and not the total number of all possible rings.
   
 
Many of the 12386 records contain multiple components, most often counter-ions but some are more complex. Please report the total ring count for all components in the SMILES, even though that does not make much pharmacological sense.
 
Many of the 12386 records contain multiple components, most often counter-ions but some are more complex. Please report the total ring count for all components in the SMILES, even though that does not make much pharmacological sense.
Line 10: Line 18:
 
for mol in pybel.readfile("smi", "benzodiazepine.smi"):
 
for mol in pybel.readfile("smi", "benzodiazepine.smi"):
 
print len(mol.OBMol.GetSSSR())
 
print len(mol.OBMol.GetSSSR())
  +
</pre>
  +
  +
==OpenBabel/Rubabel==
  +
<pre>require 'rubabel'
  +
Rubabel.foreach("benzodiazepine.sdf.gz") {|mol| puts mol.ob_sssr.size }
 
</pre>
 
</pre>
   
 
==OpenEye/Python==
 
==OpenEye/Python==
  +
  +
OpenEye is known for it's stance against SSSR (See [http://www.eyesopen.com/docs/toolkits/current/html/OEChem_TK-python/ring.html#smallest-set-of-smallest-rings-sssr-considered-harmful SSSR Considered Harmful].) They do not provide a function to determine this property directly. Instead, use Euler's characteristic to compute it given the number of atoms, bonds, and components: #Rings = #Bonds - #Atoms + #Components .
  +
 
<pre>from openeye.oechem import *
 
<pre>from openeye.oechem import *
   
Line 18: Line 34:
 
ifs.open("benzodiazepine.smi")
 
ifs.open("benzodiazepine.smi")
 
for mol in ifs.GetOEGraphMols():
 
for mol in ifs.GetOEGraphMols():
# OpenEye is known for it's stance against SSSR
 
# http://www.eyesopen.com/docs/html/pyprog/node64.html
 
# They do not have a function to determine this property.
 
# Instead, use Euler's characteristic to compute it
 
# given the number of atoms, bonds, and components
 
 
num_components, component_membership = OEDetermineComponents(mol)
 
num_components, component_membership = OEDetermineComponents(mol)
 
num_rings = mol.NumBonds() - mol.NumAtoms() + num_components
 
num_rings = mol.NumBonds() - mol.NumAtoms() + num_components
Line 28: Line 39:
 
print num_rings
 
print num_rings
 
</pre>
 
</pre>
  +
  +
==RDKit/Python==
  +
  +
As is pointed out so clearly by the OpenEye rant linked above, a true SSSR algorithm returns an arbitrary selection of rings in certain fused ring cases (e.g. adamantane and cubane). This makes little chemical sense and is confusing. Still, it is often useful to have the set of smallest rings in the system. The RDKit, by default, generates a symmetrized version of SSSR; this leads to cubane having the expected 6 rings and a SMARTS match for "[R3]" returning all 8 atoms.
  +
  +
For the purposes of this exercise, the code below shows the use of Euler's rule as well as outputting the SMILES of the molecules where there's a mismatch between symmetrized and standard SSSR algorithms:
  +
  +
<source lang="python">
  +
from rdkit import Chem
  +
  +
suppl = Chem.SmilesMolSupplier('benzodiazepine.smi',titleLine=False)
  +
counts = []
  +
for m in suppl:
  +
rc = m.GetRingInfo().NumRings()
  +
nfrags = len(Chem.GetMolFrags(m))
  +
ec = m.GetNumBonds()-m.GetNumAtoms()+nfrags
  +
if ec!=rc:
  +
print Chem.MolToSmiles(m),ec,rc
  +
counts.append((ec,rc,m.GetProp('_Name')))
  +
</source>
  +
  +
==Chemkit/C++==
  +
<source lang="cpp">#include <iostream>
  +
  +
#include <chemkit/foreach.h>
  +
#include <chemkit/molecule.h>
  +
#include <chemkit/moleculefile.h>
  +
  +
using namespace chemkit;
  +
  +
int main(void)
  +
{
  +
MoleculeFile file("benzodiazepine.smi.gz");
  +
bool ok = file.read();
  +
if(!ok){
  +
std::cerr << "Error: " << file.errorString() << std::endl;
  +
return -1;
  +
}
  +
  +
foreach(const Molecule *molecule, file.molecules()){
  +
std::cout << molecule->ringCount() << std::endl;
  +
}
  +
  +
return 0;
  +
}
  +
</source>
  +
  +
==Cactvs/Tcl==
  +
  +
  +
<pre lang="tcl">
  +
molfile loop benzodiazepine.smi eh {
  +
puts [ens get $eh E_NSSSR]
  +
}
  +
</pre>
  +
  +
For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR.
  +
  +
==Cactvs/Python==
  +
  +
<pre lang="python">
  +
for eh in Molfile('benzodiazepine.smi'):
  +
print(eh.E_NSSSR)
  +
</pre>
  +
  +
For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR.
  +
  +
== CDK/Groovy ==
  +
  +
<pre lang="groovy">
  +
import org.openscience.cdk.interfaces.*;
  +
import org.openscience.cdk.graph.*;
  +
import org.openscience.cdk.io.*;
  +
import org.openscience.cdk.io.IChemObjectReader.Mode;
  +
import org.openscience.cdk.io.iterator.*;
  +
import org.openscience.cdk.ringsearch.*;
  +
import org.openscience.cdk.silent.*;
  +
import org.openscience.cdk.tools.manipulator.*;
  +
import java.io.File;
  +
import java.util.zip.GZIPInputStream;
  +
  +
iterator = new IteratingSMILESReader(
  +
new GZIPInputStream(
  +
new File("ctr/benzodiazepine.smi.gz")
  +
.newInputStream()
  +
),
  +
SilentChemObjectBuilder.getInstance()
  +
)
  +
iterator.setReaderMode(
  +
IChemObjectReader.Mode.STRICT
  +
)
  +
while (iterator.hasNext()) {
  +
mol = iterator.next()
  +
components =
  +
ConnectivityChecker.partitionIntoMolecules(
  +
mol
  +
)
  +
totalRingCount = 0
  +
for (component in components.atomContainers()) {
  +
ringset = new SSSRFinder(component).findSSSR()
  +
totalRingCount += ringset.atomContainerCount
  +
}
  +
println totalRingCount
  +
}
  +
</pre>
  +
[[Category:OpenBabel/Pybel]]
 
[[Category:SMILES]]
 
[[Category:SMILES]]
 
[[Category:Ring count]]
 
[[Category:Ring count]]
  +
[[Category:OpenEye/Python]]
  +
[[Category:RDKit/Python]]
  +
[[Category:Cactvs/Tcl]]
  +
[[Category:CDK/Groovy]]
  +
[[Category:Cactvs/Python]]

Latest revision as of 10:11, 11 November 2013

The goals of this task are to show how to read molecules from a SMILES file and how to get the number of cycles in a molecule. As is usual in chemistry software, the data structure called a "molecule" may contain more than one chemistry molecules, which in a graph theory terms is called a connected component.

Ring counts can be used as descriptors and as ways to classify structures. One way to compute the ring count is to compute the Euler characteristic: #Rings = #Bonds - #Atoms + #Components. Many chemistry toolkits, though not all, implement a ring finding algorithm which identifies the most chemically relevant rings. This is not a simple problem: consider that cubane has 12 bonds and 8 atoms, giving 5 rings, which implies that one of the faces is not in a ring.

There is a large number of ring finding algorithms. Many have the goal of finding a "smallest set of smallest rings", so are called an SSSR algorithm. Frequently the easiest way to get the ring count from a toolkit comes as a direct consequence of computing the SSSR. Note though that ring identification is not necessary for this task.

Implementation[]

Use the SMILES file from the benzodiazepine data set, available from [1]. For each input record, find the number of rings. This would be the minimal number of rings and not the total number of all possible rings.

Many of the 12386 records contain multiple components, most often counter-ions but some are more complex. Please report the total ring count for all components in the SMILES, even though that does not make much pharmacological sense.

The output will have one integer count per line, in the same order as the input file. You should check your results against the reference output.

OpenBabel/Pybel[]

import pybel

for mol in pybel.readfile("smi", "benzodiazepine.smi"):
    print len(mol.OBMol.GetSSSR())

OpenBabel/Rubabel[]

require 'rubabel'
Rubabel.foreach("benzodiazepine.sdf.gz") {|mol| puts mol.ob_sssr.size }

OpenEye/Python[]

OpenEye is known for it's stance against SSSR (See SSSR Considered Harmful.) They do not provide a function to determine this property directly. Instead, use Euler's characteristic to compute it given the number of atoms, bonds, and components: #Rings = #Bonds - #Atoms + #Components .

from openeye.oechem import *

ifs = oemolistream()
ifs.open("benzodiazepine.smi")
for mol in ifs.GetOEGraphMols():
    num_components, component_membership = OEDetermineComponents(mol)
    num_rings = mol.NumBonds() - mol.NumAtoms() + num_components
    
    print num_rings

RDKit/Python[]

As is pointed out so clearly by the OpenEye rant linked above, a true SSSR algorithm returns an arbitrary selection of rings in certain fused ring cases (e.g. adamantane and cubane). This makes little chemical sense and is confusing. Still, it is often useful to have the set of smallest rings in the system. The RDKit, by default, generates a symmetrized version of SSSR; this leads to cubane having the expected 6 rings and a SMARTS match for "[R3]" returning all 8 atoms.

For the purposes of this exercise, the code below shows the use of Euler's rule as well as outputting the SMILES of the molecules where there's a mismatch between symmetrized and standard SSSR algorithms:

from rdkit import Chem

suppl = Chem.SmilesMolSupplier('benzodiazepine.smi',titleLine=False)
counts = []
for m in suppl:
    rc = m.GetRingInfo().NumRings()
    nfrags = len(Chem.GetMolFrags(m))
    ec = m.GetNumBonds()-m.GetNumAtoms()+nfrags
    if ec!=rc:
        print Chem.MolToSmiles(m),ec,rc
    counts.append((ec,rc,m.GetProp('_Name')))

Chemkit/C++[]

#include <iostream>

#include <chemkit/foreach.h>
#include <chemkit/molecule.h>
#include <chemkit/moleculefile.h>

using namespace chemkit;

int main(void)
{
    MoleculeFile file("benzodiazepine.smi.gz");
    bool ok = file.read();
    if(!ok){
        std::cerr << "Error: " << file.errorString() << std::endl;
        return -1;
    }

    foreach(const Molecule *molecule, file.molecules()){
        std::cout << molecule->ringCount() << std::endl;
    }

    return 0;
}

Cactvs/Tcl[]

molfile loop benzodiazepine.smi eh {
   puts [ens get $eh E_NSSSR]
}

For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR.

Cactvs/Python[]

for eh in Molfile('benzodiazepine.smi'):
   print(eh.E_NSSSR)

For the ESSSR count (contains rings covering all 3-ringatom-chain, e.g. 6 rings in cubane) substitute property E_NESSSR.

CDK/Groovy[]

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.graph.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.IChemObjectReader.Mode;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.ringsearch.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.tools.manipulator.*;
import java.io.File;
import java.util.zip.GZIPInputStream;

iterator = new IteratingSMILESReader(
  new GZIPInputStream(
    new File("ctr/benzodiazepine.smi.gz")
      .newInputStream()
  ),
  SilentChemObjectBuilder.getInstance()
)
iterator.setReaderMode(
  IChemObjectReader.Mode.STRICT
)
while (iterator.hasNext()) {
  mol = iterator.next()
  components =
    ConnectivityChecker.partitionIntoMolecules(
      mol
    )
  totalRingCount = 0
  for (component in components.atomContainers()) {
    ringset = new SSSRFinder(component).findSSSR()
    totalRingCount += ringset.atomContainerCount
  }
  println totalRingCount
}