Working with SD tag data

An SD file contains a set of records. Each record has a connection table and set of fields which together describe the molecule. This is call the 'molfile'. A record also contains a set of data fields often called "tag data" which in its simplest and most common form looks like

> <PUBCHEM_NIST_INCHIKEY>
PUMYFTJOWAJIKF-UHFFFAOYSA-N

> <PUBCHEM_XLOGP3>
3.8

In this case the tag PUBCHEM_XLOGP3 has the value 3.8.

The tag data is often used as a way to store extra information about the molecule. This can be supplier information, additional identifiers, an encoded fingerprint, or many other possibilities.

This task show how to use a toolkit to get and set tag data by implementing Lipinski's Rule of Five where the inputs come from the data data and the result goes into a new tag field named RULE5.

Lipinski's rule says that, in general, an orally active drug has no more than one violation of the following criteria:

Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
A molecular weight under 500 daltons
An octanol-water partition coefficient log P of less than 5

Implementation

The input file is SD file from the benzodiazepine data set, available directly from [1]. Every record contains the tags PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_HBOND_ACCEPTOR and PUBCHEM_MOLECULAR_WEIGHT, and most of the records contain the tag PUBCHEM_XLOGP3. Here is an example of those tag data fields:

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
4
> <PUBCHEM_CACTVS_HBOND_DONOR>
1
> <PUBCHEM_XLOGP3>
1.2
> <PUBCHEM_MOLECULAR_WEIGHT>
359.20938

The program must create a new SD file which is the same as the input file but with a new tag data field named "RULE5". This must be "1" if the record passes Lipinski's rule, "0" if it does not, and "no logP" if the PUBCHEM_XLOGP3 field is missing.

"The same" deserves some clarification. Most toolkits do not preserve the SD input precisely. For example, the "CTfile Formats" specification says that some fields are obsolete or unneeded, or are only appropriate to query and reaction formats. Different toolkits handle these differently, and it's okay for these fields to differ. There are also places where the amount of whitespace isn't fully specified, and different toolkits give different answers.

It's expected that a toolkit will modify those fields. The exception will be those solutions which manipulate the SD file directly as text and don't go through a molecule object and back again.

Line #2 of each record is supposed to include the name of the program which generated the output and include time stamp, although it may also be blank. Some solutions will keep the line as-is, some will put the toolkits's name and current timestamp in its place, and still others will force it to be blank. Please note what the solution does.

A solution must not reorder or modify the atom and bond information nor change the contents of any existing tag. It should only add a single new tag to the output. If it does anything else, please note the exception.

OpenBabel/Pybel

import pybel
output = pybel.Outputfile("sdf", "RULE5.sdf")
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    if "PUBCHEM_XLOGP3" not in mol.data:
        mol.data["RULE5"] = "no logP"
    else:
        count = (int(mol.data["PUBCHEM_CACTVS_HBOND_DONOR"]) <= 5) + \
                (int(mol.data["PUBCHEM_CACTVS_HBOND_ACCEPTOR"]) <= 10) + \
                (float(mol.data["PUBCHEM_MOLECULAR_WEIGHT"]) <= 500) + \
                (float(mol.data["PUBCHEM_XLOGP3"]) <= 5)
        if count >= 3:
            mol.data["RULE5"] = "1"
        else:
            mol.data["RULE5"] = "0"
    output.write(mol)
output.close()

This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp.

OpenEye/Python

from openeye.oechem import *

ifs = oemolistream()
ofs = oemolostream()
ifs.open("benzodiazepine.sdf.gz")
ofs.open("RULE5.sdf")

def num_Lipinski_failures(mol):
    failures = 0
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_DONOR")) > 5:
        failures += 1
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_ACCEPTOR")) > 10:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_MOLECULAR_WEIGHT")) > 500.0:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_XLOGP3")) > 5.0:
        failures += 1
    return failures

for mol in ifs.GetOEGraphMols():
    if not OEHasSDData(mol, "PUBCHEM_XLOGP3"):
        rule5 = "no logP"
    elif num_Lipinski_failures(mol) > 1:
        rule5 = "0"
    else:
        rule5 = "1"
    OESetSDData(mol, "RULE5", rule5)
    OEWriteMolecule(ofs, mol)

This updates the #2 line of each molfile to mention "OEChem" and the current timestamp.

Unix/Python

Because the tag data fields are line oriented, this is a problem that can be solved with standard Unix tools and without a chemistry toolkit. Here's one solution, but while it does work I don't recommend it as a general approach.

infile = open("benzodiazepine.sdf", "U")
outfile = open("RULE5.sdf", "w")

# Helper function. Use to copy and use
# the first data line of a tag I need.
def next_line():
    line = next(infile)
    outfile.write(line)
    return line

# Initialize for the first record
seen_xlogp = False
failures = 0

for line in infile:
    if line == "$$$$\n":
        # Write the RULE5 tag data before ending the record
        if not seen_xlogp:
            outfile.write("> <RULE5>\nno logP\n\n")
        elif failures > 1:
            outfile.write("> <RULE5>\n0\n\n")
        else:
            outfile.write("> <RULE5>\n1\n\n")
        outfile.write(line)

        # reset for the next record
        seen_xlogp = False
        failures = 0
        continue

    # Copy the input line to the output line
    outfile.write(line)
    if line[:1] == ">":
        # Check for tag data fields
        if "<PUBCHEM_CACTVS_HBOND_DONOR>" in line:
            failures += int(next_line()) > 5
        elif "<PUBCHEM_CACTVS_HBOND_ACCEPTOR>" in line:
            failures += int(next_line()) > 10
        elif "<PUBCHEM_MOLECULAR_WEIGHT>" in line:
            failures += float(next_line()) > 500.0
        elif "<PUBCHEM_XLOGP3>" in line:
            failures += float(next_line()) > 5.0
            seen_xlogp = True

Working with SD tag data

Contents

Implementation

OpenBabel/Pybel

OpenEye/Python

Unix/Python

Fan Feed