Working with SD tag data

An SD file contains a set of records. Each record has a connection table and set of fields which together describe the molecule. This is call the 'molfile'. A record also contains a set of data fields often called "tag data" which in its simplest and most common form looks like

>  PUMYFTJOWAJIKF-UHFFFAOYSA-N

>  3.8

In this case the tag PUBCHEM_XLOGP3 has the value 3.8.

The tag data is often used as a way to store extra information about the molecule. This can be supplier information, additional identifiers, an encoded fingerprint, or many other possibilities.

This task show how to use a toolkit to get and set tag data by implementing Lipinski's Rule of Five where the inputs come from the data data and the result goes into a new tag field named RULE5.

Lipinski's rule says that, in general, an orally active drug has no more than one violation of the following criteria:
 * Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
 * Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
 * A molecular weight under 500 daltons
 * An octanol-water partition coefficient log P of less than 5

Implementation
The input file is SD file from the benzodiazepine data set, available directly from. Every record contains the tags PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_HBOND_ACCEPTOR and PUBCHEM_MOLECULAR_WEIGHT, and most of the records contain the tag PUBCHEM_XLOGP3. Here is an example of those tag data fields:

>  4 >  1 >  1.2 >  359.20938

The program must create a new SD file which is the same as the input file but with a new tag data field named "RULE5". This must be "1" if the record passes Lipinski's rule, "0" if it does not, and "no logP" if the PUBCHEM_XLOGP3 field is missing.

"The same" deserves some clarification. Most toolkits do not preserve the SD input precisely. For example, the "CTfile Formats" specification says that some fields are obsolete or unneeded, or are only appropriate to query and reaction formats. Different toolkits handle these differently, and it's okay for these fields to differ. There are also places where the amount of whitespace isn't fully specified, and different toolkits give different answers.

It's expected that a toolkit will modify those fields. The exception will be those solutions which manipulate the SD file directly as text and don't go through a molecule object and back again.

Line #2 of each record is supposed to include the name of the program which generated the output and include time stamp, although it may also be blank. Some solutions will keep the line as-is, some will put the toolkits's name and current timestamp in its place, and still others will force it to be blank. Please note what the solution does.

A solution must not reorder or modify the atom and bond information nor change the contents of any existing tag. It should only add a single new tag to the output. If it does anything else, please note the exception.

OpenBabel/Pybel
This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp.

OpenEye/Python
This updates the #2 line of each molfile to mention "OEChem" and the current timestamp.

Unix/Python
Because the tag data fields are line oriented, this is a problem that can be solved with standard Unix tools and without a chemistry toolkit. Here's one solution, but while it does work I don't recommend it as a general approach.

RDKit/Python
This updates the #2 line of each molfile to mention "RDKit".

SDF toolkit
The SDF toolkit has been programmed to handle the SD tag data. No changes are made to the connection table or the other SD tags.

Standard input and output are used.

require MDL_sdf ; while( my $mol = MDL_sdf_non_parsed_molecule->readFromInput ) { my $logP = $mol->get("PUBCHEM_XLOGP3"); my $rule5 = "no logp"; if(defined($logP)) { my $count = $mol->get("PUBCHEM_CACTVS_HBOND_DONOR") > 5; $count += $mol->get("PUBCHEM_CACTVS_HBOND_ACCEPTOR") > 10; $count += $mol->get("PUBCHEM_MOLECULAR_WEIGHT") > 500; $count += $logP > 5; $rule5 = ($count>=3 ? 1:0); } 	$mol->set("RULE5", $rule5); $mol->write }
 * 1) !/usr/bin/perl

Cactvs/Tcl
prop create E_MY_LIPINSKY datatype string originalname RULE5 set fh1 [molfile open benzodiazepine.sdf.gz] set fh2 [molfile open stdout w format sdf] molfile loop $fh1 eh { if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1 fields]} { ens set $eh E_MY_LIPINSKY "no logp" } else { ens set $eh E_MY_LIPINSKY [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+ \      (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+(PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}] } molfile set $fh2 writelist [concat [molfile get $fh1 fields] E_MY_LIPINSKY] molfile write $fh2 $eh }

Above solution re-encodes the atoms and bonds block, but has the advantage that it works directly on gzip'ed files.

Below is an alternative solution with a byte-exact copy of the input records minus the end-of-record marker, which are then added by direct writing of the additional field to the output channel. The disadvantage is that since the I/O module needs to backspace, this only works (with reasonable performance) on uncompressed input data. Nevertheless, the capability to output a byte-exact copy of the leading original input record is a major advantage not provided by other toolkits, given the many obscure SGROUP data fields etc. that could be present in the input.

This solution additionally takes advantage of property expressions. All required property data associated with the structure object is automatically looked up via its primary name, or an alias such as the SD file tag. If the property data is not yet present, an attempt is made to compute it.

The record copy and line output commands automatically use the EOL characters configured on the structure file output channel.

set fh1 [molfile open benzodiazepine.sdf] set fh2 [molfile open stdout w format sdf writeend none] molfile loop $fh1 eh { if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1 fields]} { set v "no logp" } else { set v [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\         (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\          (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}] }    molfile copy $fh1 $fh2 1 -1 molfile putline $fh2 "> " $v "" {$$$$} }

But nobody stops us from opening a file more than once. Using this technique, we can combine byte-exact header copy and reading from compressed input:

set fh1a [molfile open benzodiazepine.sdf.gz] set fh1b [molfile open benzodiazepine.sdf.gz] set fh2 [molfile open stdout w format sdf writeend none] molfile loop $fh1a eh { if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1a fields]} { set v "no logp" } else { set v [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\         (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\          (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}] }    molfile copy $fh1b $fh2 1 molfile putline $fh2 "> " $v "" {$$$$} }

Cactvs/Python
And here the Python variant of the last solution:

fh1a=Molfile('benzodiazepine.sdf.gz') fh1b=Molfile('benzodiazepine.sdf.gz') fh2=Molfile('stdout','w',{'format':'sdf','writeend':'none'}) for eh in fh1a: if ('E_PUBCHEM_XLOGP3' not in fh1a.fields): v='no logp' else: v=eh.expr('((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\ (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\ (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3') fh1b.copy(fh2,1) fh2.putline('> ',v,'','$$$$')