Chemistry Toolkit Rosetta Wiki
Latest revision as of 22:04, 11 October 2013

Ertl, Rohde, and Selzer (J. Med. Chem., 43:3714-3717, 2000) published an algorithm for fast calculation of molecular polar surface area (PSA). Part of it involves summing up partial surface values based on fragment contributions. Each fragment corresponds to a SMARTS match.


The goal of this task is to get an idea of how to run a set of SMARTS matches when the patterns come from an external table. In this case it is a data table from TJ O'Donnell's CHORD chemistry extension for PostgreSQL, listed at http://www.gnova.com/book/tpsa.tab and available for use here with permission. Each line in the file contains three tab-separated fields. The first line is the header; every other line defines one fragment contribution. The first field is the partial surface area contribution for each match of the SMARTS pattern given in the second field. The last field is a comment. Note that the first SMARTS definition contains a typo: it should be "[N+0;H0;D1;v3]" instead of "[N0;H0;D1;v3]".
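The file layout described above can be read without any chemistry toolkit. The following is a minimal sketch using only the Python standard library; the names `Fragment`, `SMARTS_FIXES`, and `read_fragments` are hypothetical, and the sample rows are illustrative stand-ins written in the style of the file, not copied from tpsa.tab. The only correction applied is the typo fix mentioned in the task text.

```python
import io
from collections import namedtuple

# Hypothetical record type for one row of the table
Fragment = namedtuple("Fragment", ["value", "smarts", "comment"])

# The correction noted in the task description for the first SMARTS
SMARTS_FIXES = {"[N0;H0;D1;v3]": "[N+0;H0;D1;v3]"}

def read_fragments(stream):
    """Parse tab-separated (value, SMARTS, comment) rows, skipping the header."""
    fragments = []
    lines = iter(stream)
    next(lines, None)  # the first line is the header
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        value, smarts = fields[0], fields[1]
        comment = fields[2] if len(fields) > 2 else ""
        smarts = SMARTS_FIXES.get(smarts, smarts)
        fragments.append(Fragment(float(value), smarts, comment))
    return fragments

# Illustrative sample only; the real data lives in tpsa.tab
SAMPLE = (
    "psa\tSMARTS\tdescription\n"
    "23.79\t[N0;H0;D1;v3]\tN#\n"
    "17.07\t[O;H0;D1;v2]\tO=\n"
)

fragments = read_fragments(io.StringIO(SAMPLE))
for fragment in fragments:
    print(fragment)
```

With the real file, the same function could be called as `read_fragments(open("tpsa.tab"))`, and each `smarts` string would then be handed to the toolkit's SMARTS parser.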

To compute the topological polar surface area (for purposes of this task) of a given structure, take the sum over all fragment contributions, weighted by the number of times that fragment matches.
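Independent of any toolkit, the weighted sum described above can be sketched as follows. The function name `tpsa_from_counts`, the fragment labels, and the numbers are all hypothetical and chosen only to make the arithmetic easy to follow; in a real implementation the counts would come from SMARTS matching.

```python
def tpsa_from_counts(contributions, match_counts):
    """Sum each fragment's contribution times the number of times it matched."""
    return sum(contributions[name] * count
               for name, count in match_counts.items())

# Hypothetical fragment contributions and match counts
contributions = {"frag_a": 10.0, "frag_b": 2.5}
match_counts = {"frag_a": 2, "frag_b": 3}

total = tpsa_from_counts(contributions, match_counts)
print(total)  # 2 * 10.0 + 3 * 2.5 = 27.5
```

Fragments that do not match simply contribute a count of zero, so they can be left out of `match_counts` or included with a zero count.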

Implementation

Write a function or method named "TPSA" which gets its data from the file "tpsa.tab". The function should take a molecule record as input, and return the TPSA value as a float. Use the function to calculate the TPSA of "CN2C(=O)N(C)C(=O)C1=C2N=CN1C". The answer should be 61.82, which agrees exactly with Ertl's online TPSA tool but not with PubChem's value of 58.4.

Indigo/Python

import sys
import collections
import indigo
 
indigo = indigo.Indigo()

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []
 
# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")
 
    subsearch = indigo.loadSmarts(smarts)
 
    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )
 
# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return indigo.countSubstructureMatches(subsearch, mol)
 
def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)
 
# Test it with the reference structure
mol = indigo.loadMolecule("CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

OpenBabel/Rubabel

require 'rubabel'
lines = IO.readlines("tpsa.tab")
header = lines.shift
@patterns = lines.map {|line| line.chomp.split("\t") }

def TPSA(mol)
  @patterns.inject(0.0) {|s,p| s + p[0].to_f * mol.matches(p[1], false).size }
end

puts TPSA( Rubabel["CN2C(=O)N(C)C(=O)C1=C2N=CN1C"] )

OpenEye/Python

from openeye.oechem import *
import collections

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []

# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")

    # Use the SMARTS to define a subsearch object
    subsearch = OESubSearch(smarts)

    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )

# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return sum(1 for match in subsearch.Match(mol))

def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)

# Test it with the reference structure
mol = OEGraphMol()
OEParseSmiles(mol, "CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

RDKit/Python

from rdkit import Chem
import collections

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []

# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")

    # Use the SMARTS to define a subsearch object
    subsearch = Chem.MolFromSmarts(smarts)

    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )

# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return len(mol.GetSubstructMatches(subsearch))

def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)

# Test it with the reference structure
mol = Chem.MolFromSmiles("CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

Cactvs/Tcl

set cactvs(aromaticity_model) daylight
set eh [ens create CN2C(=O)N(C)C(=O)C1=C2N=CN1C]
set tpsa 0.0
table loop [table read tpsa.tab] row {
    lassign $row v smarts
    set tpsa [expr {$tpsa + [match ss -charge 1 -mode distinct $smarts $eh] * $v}]
}
puts $tpsa

The table reader needs no detailed instructions: it automatically and correctly analyzes the structure of the parameter file.

We need to switch the aromaticity model to the decidedly weird Daylight definition to get the requested result. Cactvs by default does not think that exocyclic keto groups are compatible with aromaticity. With its own model, the result is 58.44, which matches the PubChem value of 58.4 quoted in the task description (and that is no coincidence).
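The two totals can be reproduced by hand. The following worked arithmetic is a sketch that assumes the contribution values from Ertl's published table: 17.07 for a carbonyl oxygen, 12.89 for an aromatic nitrogen with two aromatic bonds, 4.93 for an aromatic nitrogen carrying a substituent, and 3.24 for a non-aromatic nitrogen with three single bonds. Under the Daylight model both rings of the test structure are aromatic, so all three N-methyl nitrogens count as substituted aromatic nitrogens. Under the Cactvs default, the six-membered ring is not aromatic, so its two nitrogens fall into the non-aromatic class instead.

```latex
\begin{align*}
\text{Daylight model:}\quad  & 2(17.07) + 12.89 + 3(4.93) = 61.82 \\
\text{Cactvs default:}\quad  & 2(17.07) + 12.89 + 4.93 + 2(3.24) = 58.44
\end{align*}
```

The difference of 3.38 is exactly two nitrogens moving from the 4.93 class to the 3.24 class.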

Cactvs/Python

cactvs['aromaticity_model']='daylight'
e=Ens('CN2C(=O)N(C)C(=O)C1=C2N=CN1C')
tpsa=0.0
for row in Table.Read('tpsa.tab'):
    tpsa += match('ss', row[1], e, charge=True, mode='distinct') * row[0]
print(tpsa)