Changes: Working with SD tag data

Revision as of 06:45, 23 December 2009

An SD file contains a set of records. Each record has a connection table and set of fields which together describe the molecule. This is call the 'molfile'. A record also contains a set of data fields often called "tag data" which in its simplest and most common form looks like

> <PUBCHEM_NIST_INCHIKEY>
PUMYFTJOWAJIKF-UHFFFAOYSA-N

> <PUBCHEM_XLOGP3>
3.8

In this case the tag PUBCHEM_XLOGP3 has the value 3.8.

The tag data is often used as a way to store extra information about the molecule. This can be supplier information, additional identifiers, an encoded fingerprint, or many other possibilities.

This task show how to use a toolkit to get and set tag data by implementing Lipinski's Rule of Five where the inputs come from the data data and the result goes into a new tag field named RULE5.

Lipinski's rule says that, in general, an orally active drug has no more than one violation of the following criteria:

Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
A molecular weight under 500 daltons
An octanol-water partition coefficient log P of less than 5

Implementation

The input file is SD file from the benzodiazepine data set, available directly from [1]. Every record contains the tags PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_HBOND_ACCEPTOR and PUBCHEM_MOLECULAR_WEIGHT, and most of the records contain the tag PUBCHEM_XLOGP3. Here is an example of those tag data fields:

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
4
> <PUBCHEM_CACTVS_HBOND_DONOR>
1
> <PUBCHEM_XLOGP3>
1.2
> <PUBCHEM_MOLECULAR_WEIGHT>
359.20938

The program must create a new SD file which is the same as the input file but with a new tag data field named "RULE5". This must be "1" if the record passes Lipinski's rule, "0" if it does not, and "no logP" if the PUBCHEM_XLOGP3 field is missing.

"The same" deserves some clarification. Most toolkits do not preserve the SD input precisely. For example, the "CTfile Formats" specification says that some fields are obsolete or unneeded, or are only appropriate to query and reaction formats. Different toolkits handle these differently, and it's okay for these fields to differ. There are also places where the amount of whitespace isn't fully specified, and different toolkits give different answers.

It's expected that a toolkit will modify those fields. The exception will be those solutions which manipulate the SD file directly as text and don't go through a molecule object and back again.

Line #2 of each record is supposed to include the name of the program which generated the output and include time stamp, although it may also be blank. Some solutions will keep the line as-is, some will put the toolkits's name and current timestamp in its place, and still others will force it to be blank. Please note what the solution does.

A solution must not reorder or modify the atom and bond information nor change the contents of any existing tag. It should only add a single new tag to the output. If it does anything else, please note the exception.

OpenBabel/Pybel

import pybel
output = pybel.Outputfile("sdf", "RULE5.sdf")
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    if "PUBCHEM_XLOGP3" not in mol.data:
        mol.data["RULE5"] = "no logP"
    else:
        count = ((int(mol.data["PUBCHEM_CACTVS_HBOND_DONOR"]) &lt;= 5) +
                 (int(mol.data["PUBCHEM_CACTVS_HBOND_ACCEPTOR"]) &lt;= 10) +
                 (float(mol.data["PUBCHEM_MOLECULAR_WEIGHT"]) &lt;= 500) +
                 (float(mol.data["PUBCHEM_XLOGP3"]) &lt;= 5))
        if count &gt;= 3:
            mol.data["RULE5"] = "1"
        else:
            mol.data["RULE5"] = "0"
    output.write(mol)
output.close()

This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp.

OpenEye/Python

from openeye.oechem import *

ifs = oemolistream()
ofs = oemolostream()
ifs.open("benzodiazepine.sdf.gz")
ofs.open("RULE5.sdf")

def num_Lipinski_failures(mol):
    failures = 0
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_DONOR")) &gt; 5:
        failures += 1
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_ACCEPTOR")) &gt; 10:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_MOLECULAR_WEIGHT")) &gt; 500.0:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_XLOGP3")) &gt; 5.0:
        failures += 1
    return failures

for mol in ifs.GetOEGraphMols():
    if not OEHasSDData(mol, "PUBCHEM_XLOGP3"):
        rule5 = "no logP"
    else:
        if num_Lipinski_failures(mol) &gt; 1:
            rule5 = "0"
        else:
            rule5 = "1"
    OESetSDData(mol, "RULE5", rule5)
    OEWriteMolecule(ofs, mol)

This updates the #2 line of each molfile to mention "OEChem" and the current timestamp.

Unix/Python

Because the tag data fields are line oriented, this is a problem that can be solved with standard Unix tools and without a chemistry toolkit. Here's one solution, but while it does work I don't recommend it as a general approach.

infile = open("benzodiazepine.sdf", "U")
outfile = open("RULE5.sdf", "w")

# Helper function. Use to copy and use
# the first data line of a tag I need.
def next_line():
    line = next(infile)
    outfile.write(line)
    return line

# Initialize for the first record
seen_xlogp = False
failures = 0

for line in infile:
    if line == "$$$$\n":
        # Write the RULE5 tag data before ending the record
        if not seen_xlogp:
            outfile.write("&gt; &lt;RULE5&gt;\nno logP\n\n")
        elif failures > 1:
            outfile.write("&gt; &lt;RULE5&gt;\n0\n\n")
        else:
            outfile.write("&gt; &lt;RULE5&gt;\n1\n\n")
        outfile.write(line)

        # reset for the next record
        seen_xlogp = False
        failures = 0
        continue

    # Copy the input line to the output line
    outfile.write(line)
    if line[:1] == "&gt;":
        # Check for tag data fields
        if "&lt;PUBCHEM_CACTVS_HBOND_DONOR&gt;" in line:
            failures += int(next_line()) &gt; 5
        elif "&lt;PUBCHEM_CACTVS_HBOND_ACCEPTOR&gt;" in line:
            failures += int(next_line()) &gt; 10
        elif "&lt;PUBCHEM_MOLECULAR_WEIGHT&gt;" in line:
            failures += float(next_line()) &gt; 500.0
        elif "&lt;PUBCHEM_XLOGP3&gt;" in line:
            failures += float(next_line()) &gt; 5.0
            seen_xlogp = True

@@ Line 1: / Line 1: @@
+An SD file contains a set of records. Each record has a connection table and set of fields which together describe the molecule. This is call the 'molfile'. A record also contains a set of data fields often called "tag data" which in its simplest and most common form looks like
-Use the benzodiazepine SD file to make an output file. Each output record is the input record plus a tag named "CANSMILES" which is the toolkit's canonical SMILES representation for the molecule in the record, based on the connection table. The record order must be preserved.
+<pre>
-A toolkit may change line #2 of each molfile record to include the program name and time stamp, or it may leave the field untouched, or make it a blank line. A toolkit may also omit some fields which the "CTfile Formats" specification says are obsolete or unneeded, or which are specific to non-molecular MDL formats like those for reactions and queries. It may also change some whitespace fields which are not fully specified, like the amount of spaces in the data header line (such as between the "&gt;" and the "&lt;tag&gt;").
+&gt; &lt;PUBCHEM_NIST_INCHIKEY&gt;
+PUMYFTJOWAJIKF-UHFFFAOYSA-N
+&gt; &lt;PUBCHEM_XLOGP3&gt;
-It should not reorder or modify the atoms or bonds, alter the coordinates, or change any other tag or property which was in the input file. It should only add a single data tag to the output. Please note if the toolkit does otherwise.
+.8
+</pre>
+In this case the tag PUBCHEM_XLOGP3 has the value 3.8.
+The tag data is often used as a way to store extra information about the molecule. This can be supplier information, additional identifiers, an encoded fingerprint, or many other possibilities.
-==OpenBabel/Pybel==
-<source lang="python">import pybel
+This task show how to use a toolkit to get and set tag data by implementing [http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five Lipinski's Rule of Five] where the inputs come from the data data and the result goes into a new tag field named RULE5.
-output = pybel.Outputfile("sdf", "withSMILES.sdf")
+Lipinski's rule says that, in general, an orally active drug has no more than one violation of the following criteria:
+* Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
+* Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
+* A molecular weight under 500 daltons
+* An octanol-water partition coefficient log P of less than 5
+==Implementation==
+The input file is SD file from the [[benzodiazepine]] data set, available directly from [http://dalkescientific.com/writings/benzodiazepine.sdf.gz]. Every record contains the tags PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_HBOND_ACCEPTOR and PUBCHEM_MOLECULAR_WEIGHT, and most of the records contain the tag PUBCHEM_XLOGP3. Here is an example of those tag data fields:
+<pre>
+&gt; &lt;PUBCHEM_CACTVS_HBOND_ACCEPTOR&gt;
+&gt; &lt;PUBCHEM_CACTVS_HBOND_DONOR&gt;
+&gt; &lt;PUBCHEM_XLOGP3&gt;
+.2
+&gt; &lt;PUBCHEM_MOLECULAR_WEIGHT&gt;
+.20938
+</pre>
+The program must create a new SD file which is the same as the input file but with a new tag data field named "RULE5". This must be "1" if the record passes Lipinski's rule, "0" if it does not, and "no logP" if the PUBCHEM_XLOGP3 field is missing.
+"The same" deserves some clarification. Most toolkits do not preserve the SD input precisely. For example, the "CTfile Formats" specification says that some fields are obsolete or unneeded, or are only appropriate to query and reaction formats. Different toolkits handle these differently, and it's okay for these fields to differ. There are also places where the amount of whitespace isn't fully specified, and different toolkits give different answers.
+It's expected that a toolkit will modify those fields. The exception will be those solutions which manipulate the SD file directly as text and don't go through a molecule object and back again.
+Line #2 of each record is supposed to include the name of the program which generated the output and include time stamp, although it may also be blank. Some solutions will keep the line as-is, some will put the toolkits's name and current timestamp in its place, and still others will force it to be blank. Please note what the solution does.
+A solution must not reorder or modify the atom and bond information nor change the contents of any existing tag. It should only add a single new tag to the output. If it does anything else, please note the exception.
+==OpenBabel/Pybel==
+<source lang="python">
+import pybel
+output = pybel.Outputfile("sdf", "RULE5.sdf")
 for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
+    if "PUBCHEM_XLOGP3" not in mol.data:
-    mol.data['CANSMILES'] = mol.write("can").split()[0]
+        mol.data["RULE5"] = "no logP"
+    else:
+        count = ((int(mol.data["PUBCHEM_CACTVS_HBOND_DONOR"]) &lt;= 5) +
+                 (int(mol.data["PUBCHEM_CACTVS_HBOND_ACCEPTOR"]) &lt;= 10) +
+                 (float(mol.data["PUBCHEM_MOLECULAR_WEIGHT"]) &lt;= 500) +
+                 (float(mol.data["PUBCHEM_XLOGP3"]) &lt;= 5))
+        if count &gt;= 3:
+            mol.data["RULE5"] = "1"
+        else:
+            mol.data["RULE5"] = "0"
     output.write(mol)
 output.close()
 </source>
-This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp. In addition to the "SMILES" tag it adds the "Canonical Atom Order" tag, which gives canonical labels to each atom.
+This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp.
 ==OpenEye/Python==
-<source lang="python">from openeye.oechem import *
+<source lang="python">
+from openeye.oechem import *
 ifs = oemolistream()
 ofs = oemolostream()
 ifs.open("benzodiazepine.sdf.gz")
-ofs.open("withSMILES.sdf")
+ofs.open("RULE5.sdf")
+def num_Lipinski_failures(mol):
-for mol in ifs.GetOEMols():
+    failures = 0
-    OESetSDData(mol, "CANSMILES", OECreateCanSmiString(mol))
+    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_DONOR")) &gt; 5:
+        failures += 1
+    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_ACCEPTOR")) &gt; 10:
+        failures += 1
+    if float(OEGetSDData(mol, "PUBCHEM_MOLECULAR_WEIGHT")) &gt; 500.0:
+        failures += 1
+    if float(OEGetSDData(mol, "PUBCHEM_XLOGP3")) &gt; 5.0:
+        failures += 1
+    return failures
+for mol in ifs.GetOEGraphMols():
+    if not OEHasSDData(mol, "PUBCHEM_XLOGP3"):
+        rule5 = "no logP"
+    else:
+        if num_Lipinski_failures(mol) &gt; 1:
+            rule5 = "0"
+        else:
+            rule5 = "1"
+    OESetSDData(mol, "RULE5", rule5)
     OEWriteMolecule(ofs, mol)
 </source>
-This updates the #2 line of each molfile to mention "OEChem" and the current timestamp. Because the input file was created with OEChem it's hard to tell how much OEChem will alter the output.
+This updates the #2 line of each molfile to mention "OEChem" and the current timestamp.
+==Unix/Python==
+Because the tag data fields are line oriented, this is a problem that can be solved with standard Unix tools and without a chemistry toolkit. Here's one solution, but while it does work I don't recommend it as a general approach.
+<source lang="python">
+infile = open("benzodiazepine.sdf", "U")
+outfile = open("RULE5.sdf", "w")
+# Helper function. Use to copy and use
+# the first data line of a tag I need.
+def next_line():
+    line = next(infile)
+    outfile.write(line)
+    return line
+# Initialize for the first record
+seen_xlogp = False
+failures = 0
+for line in infile:
+    if line == "$$$$\n":
+        # Write the RULE5 tag data before ending the record
+        if not seen_xlogp:
+            outfile.write("&gt; &lt;RULE5&gt;\nno logP\n\n")
+        elif failures > 1:
+            outfile.write("&gt; &lt;RULE5&gt;\n0\n\n")
+        else:
+            outfile.write("&gt; &lt;RULE5&gt;\n1\n\n")
+        outfile.write(line)
+        # reset for the next record
+        seen_xlogp = False
+        failures = 0
+        continue
+    # Copy the input line to the output line
+    outfile.write(line)
+    if line[:1] == "&gt;":
+        # Check for tag data fields
+        if "&lt;PUBCHEM_CACTVS_HBOND_DONOR&gt;" in line:
+            failures += int(next_line()) &gt; 5
+        elif "&lt;PUBCHEM_CACTVS_HBOND_ACCEPTOR&gt;" in line:
+            failures += int(next_line()) &gt; 10
+        elif "&lt;PUBCHEM_MOLECULAR_WEIGHT&gt;" in line:
+            failures += float(next_line()) &gt; 500.0
+        elif "&lt;PUBCHEM_XLOGP3&gt;" in line:
+            failures += float(next_line()) &gt; 5.0
+            seen_xlogp = True
+</source>
 [[Category:OpenBabel/Pybel]]
 [[Category:SDF]]