Take a SMILES query and an SDF file; compute fingerprints for the given query and molecules from the SDF file; perform a substructure search (using the fingerprints for screening); report the total amount of molecules in SDF file; report the number of substructure matches and the number of false positives (i.e. molecules that passed the screening phase but failed the substructure matching phase).
This task originated from a topic on BlueObelisk.
Implementation[]
The sample database will be gzip'ed SD file of the benzodiazepine data set. The query structure will be defined as "C1C=C(NC=O)C=CC=1". The output will look like
%d total %d matched %d false positives
Indigo/C++[]
#include "base_cpp/scanner.h"
#include "molecule/molecule.h"
#include "molecule/sdf_loader.h"
#include "molecule/molfile_loader.h"
#include "molecule/molecule_arom.h"
#include "molecule/molecule_fingerprint.h"
#include "molecule/smiles_loader.h"
#include "molecule/molecule_substructure_matcher.h"
#include "base_c/bitarray.h"
int main (void)
{
MoleculeFingerprintParameters parameters;
memset(¶meters, 0, sizeof(parameters));
parameters.ord_qwords = 25; // default value in Bingo
int fpsize = parameters.fingerprintSize();
Molecule query;
Array<byte> queryfp;
{
BufferScanner scanner("C1C=C(NC=O)C=CC=1");
SmilesLoader loader(scanner);
loader.loadMolecule(query, true);
QueryMoleculeAromatizer::aromatizeBonds(query);
query.findBondsInRings();
MoleculeFingerprintBuilder builder(query, parameters);
builder.process();
queryfp.copy(builder.get(), fpsize);
}
int total = 0;
int passed = 0;
int matched = 0;
try
{
FileScanner scanner("benzodiazepine.sdf.gz");
SdfLoader loader(scanner);
int cnt = 0;
while (!loader.isEOF())
{
loader.readNext();
Molecule mol;
BufferScanner molscan(loader.data);
try
{
MolfileLoader molload(molscan);
molload.loadMolecule(mol, false);
mol.calcImplicitHydrogens(true);
}
catch (Exception &e)
{
continue;
}
mol.findBondsInRings();
MoleculeAromatizer::aromatizeBonds(mol);
MoleculeFingerprintBuilder builder(mol, parameters);
builder.process();
total++;
if (bitTestOnes(queryfp.ptr(), builder.get(), fpsize))
{
passed++;
MoleculeSubstructureMatcher matcher(mol);
matcher.setQuery(query);
if (matcher.find())
matched++;
}
}
}
catch (Exception &e)
{
fprintf(stderr, "error: %s\n", e.message());
return -1;
}
printf("%d total\n%d matched\n%d false positives\n", total, matched, passed - matched);
return 0;
}
Instructions:
- Unpack 'graph' and 'molecule' projects into some folder
- Create 'utils' folder nearby
- Paste the above code into utils/false_positives.cpp file
- Compile the file using the following commands:
$ cd graph; make CONF=Release32; cd ..
$ cd molecule; make CONF=Release32; cd ..
$ cd utils
$ gcc false_positives.cpp -o false_positives -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++ - Run the program like that:
$ ./false_positives
Expected result:
12386 total 8836 matched 1 false positives
Cactvs/Tcl[]
set ss [ens create C1C=C(NC=O)C=CC=1 smarts]
set fh [molfile open benzodiazepine.sdf.gz]
prop setparam E_SCREEN extended 2
set nscreen_passed [molfile scan $fh "structure ~= $ss" count]
set nss_passed [molfile scan $fh "structure >= $ss" count]
puts [format "%d total" [molfile count $fh]]
puts [format "%d matched" $nscreen_passed]
puts [format "%d false positives" [expr $nscreen_passed-$nss_passed]]
This is a rather unrealistic example. Cactvs does not perform screening on files where the complete structure needs to be read first in order to compute these - in that case, a direct substructure match is in all likelihood faster. So in this example we perform a screen-only search and a full substructure search on the same file, and then compare the results. For query-optimized files, the screen statistics and other performance data can be read after a "molfile scan $fh" formatted as a dictionary with a "molfile get $fh scandata" - but in this case, the dictionary entry for the number of read structures is the same as the total record count, and the number of processed screens zero.
The result:
12386 total 9219 matched 382 false positives
Cactvs/Python[]
ss=Ens('C1C=C(NC=O)C=CC=1','smarts')
f=Molfile('benzodiazepine.sdf.gz')
Prop.Setparam('E_SCREEN','extended',2)
nscreen_passed=f.scan('structure ~= {}'.format(ss),'count')
nss_passed=f.scan('structure >= {}'.format(ss),'count')
print("{:d} total".format(f.count()))
print("{:d} matched".format(nscreen_passed))
print("{:d} false positives".format(nscreen_passed-nss_passed))
Once more, the output is identical to the Tcl version.