Perform a substructure search on SDF file and report the number of false positives

Take a SMILES query and an SDF file; compute fingerprints for the given query and molecules from the SDF file; perform a substructure search (using the fingerprints for screening); report the total amount of molecules in SDF file; report the number of substructure matches and the number of false positives (i.e. molecules that passed the screening phase but failed the substructure matching phase).

This task originated from a topic on BlueObelisk.

Implementation
The sample database will be gzip'ed SD file of the benzodiazepine data set. The query structure will be defined as "C1C=C(NC=O)C=CC=1". The output will look like %d total %d matched %d false positives

Indigo/C++
Instructions: $ cd graph; make CONF=Release32; cd .. $ cd molecule; make CONF=Release32; cd .. $ cd utils $ gcc false_positives.cpp -o false_positives -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++ $ ./false_positives
 * 1) Unpack 'graph' and 'molecule' projects into some folder
 * 2) Create 'utils' folder nearby
 * 3) Paste the above code into utils/false_positives.cpp file
 * 4) Compile the file using the following commands:
 * 1) Run the program like that:

Expected result: 12386 total 8836 matched 1 false positives

Cactvs/Tcl
set ss [ens create C1C=C(NC=O)C=CC=1 smarts] set fh [molfile open benzodiazepine.sdf.gz] prop setparam E_SCREEN extended 2 set nscreen_passed [molfile scan $fh "structure ~= $ss" count] set nss_passed [molfile scan $fh "structure >= $ss" count] puts [format "%d total" [molfile count $fh]] puts [format "%d matched" $nscreen_passed] puts [format "%d false positives" [expr $nscreen_passed-$nss_passed]]

This is a rather unrealistic example. Cactvs does not perform screening on files where the complete structure needs to be read first in order to compute these - in that case, a direct substructure match is in all likelihood faster. So in this example we perform a screen-only search and a full substructure search on the same file, and then compare the results. For query-optimized files, the screen statistics and other performance data can be read after a "molfile scan $fh" formatted as a dictionary with a "molfile get $fh scandata" - but in this case, the dictionary entry for the number of read structures is the same as the total record count, and the number of processed screens zero.

The result: 12386 total 9219 matched 382 false positives

Cactvs/Python
ss=Ens('C1C=C(NC=O)C=CC=1','smarts') f=Molfile('benzodiazepine.sdf.gz') Prop.Setparam('E_SCREEN','extended',2) nscreen_passed=f.scan('structure ~= {}'.format(ss),'count') nss_passed=f.scan('structure >= {}'.format(ss),'count') print("{:d} total".format(f.count)) print("{:d} matched".format(nscreen_passed)) print("{:d} false positives".format(nscreen_passed-nss_passed))

Once more, the output is identical to the Tcl version.