Changes: Convert a SMILES string to canonical SMILES

Revision as of 07:04, 4 April 2022

A SMILES string represents a 2D molecular graph as a 1D string. In most cases there are many possible SMILES strings for the same structure. Canonicalization determines which of all the possible SMILES is used as the reference SMILES for a molecular graph.
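To make the idea of picking one reference out of many equivalent spellings concrete, here is a minimal sketch in plain Python. It is not how any real toolkit canonicalizes: it simply tries every atom ordering of a tiny graph and keeps the smallest encoding, which is only feasible for a handful of atoms. The function name `canonical_key` and the (atoms, bonds) input format are assumptions made for this illustration.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Return the smallest (atoms, bonds) encoding over all atom orderings.

    atoms: list of element symbols, one per atom
    bonds: list of (i, j) atom-index pairs
    Brute force over n! orderings, so only usable for very small graphs.
    """
    best = None
    for order in permutations(range(len(atoms))):
        # order[new] = old; build the inverse map old -> new
        new_index = {old: new for new, old in enumerate(order)}
        relabeled_atoms = tuple(atoms[old] for old in order)
        relabeled_bonds = tuple(sorted(
            tuple(sorted((new_index[i], new_index[j]))) for i, j in bonds
        ))
        key = (relabeled_atoms, relabeled_bonds)
        if best is None or key < best:
            best = key
    return best


# Ethanol with its atoms listed in two different orders ("CCO" and "OCC").
ethanol_a = canonical_key(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_key(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether ("COC"): same atoms, different connectivity.
ether = canonical_key(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a == ethanol_b)  # True
print(ethanol_a == ether)      # False
```

Both orderings of ethanol collapse to the same key, while the ether, which has the same atoms but different bonds, does not. Real toolkits reach the same goal with far more efficient atom-ranking schemes and also handle bond orders, charges, aromaticity and stereochemistry.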

Suppose you want to find out whether a structure already exists in a data set. In graph theory this is the graph isomorphism problem. Using canonical SMILES instead of the graphs reduces it to simple text matching: keep track of the canonical SMILES for each compound in the database and convert the query structure to its canonical SMILES. If that SMILES isn't already in the database, it is a new structure.
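The lookup described above can be sketched as a dictionary keyed by canonical SMILES. The class `StructureRegistry` and the compound id are hypothetical names for this illustration, and `toy_canonicalize` is only a stand-in so the sketch runs without a toolkit: it handles nothing but unbranched, ring-free chains of one-letter atoms. In practice you would pass in one of the toolkit calls shown under Implementation.

```python
def toy_canonicalize(smiles):
    """Toy stand-in for a toolkit call (an assumption for this sketch).

    Accepts ONLY unbranched, ring-free chains of one-letter atoms with
    implicit single bonds, such as "CCO". Such a chain can be written
    starting from either end, so the smaller of the two spellings is used.
    """
    if not (smiles.isalpha() and smiles.isupper()):
        raise ValueError("toy canonicalizer only accepts chains such as 'CCO'")
    return min(smiles, smiles[::-1])


class StructureRegistry:
    """Maps canonical SMILES to compound ids for exact-structure lookup."""

    def __init__(self, canonicalize):
        self._canonicalize = canonicalize
        self._by_smiles = {}  # canonical SMILES -> compound id

    def register(self, compound_id, smiles):
        """Add a compound; return the id that owns this structure."""
        key = self._canonicalize(smiles)
        return self._by_smiles.setdefault(key, compound_id)

    def find(self, smiles):
        """Return the id of a matching structure, or None if it is new."""
        return self._by_smiles.get(self._canonicalize(smiles))


registry = StructureRegistry(toy_canonicalize)
registry.register("cmpd-1", "CCO")

print(registry.find("OCC"))  # cmpd-1 (same structure, different spelling)
print(registry.find("CCN"))  # None (not in the registry yet)
```

Passing the canonicalizer in as a function keeps the registry independent of any one toolkit, but every key in one registry has to come from the same canonicalizer for the text comparison to be meaningful.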

There is no universal canonical SMILES. Every toolkit uses a different algorithm, and the algorithm sometimes changes between versions of a toolkit. There are even different forms of canonical SMILES, depending on whether atomic properties such as isotope matter for the result.
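One practical consequence is that two canonical SMILES can only be compared as text when they come from the same toolkit, the same version, and the same form. A minimal sketch of recording that provenance next to each string follows; the class name, field names, toolkit labels and version numbers are all placeholders invented for this example, not values reported by any real toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalSmiles:
    text: str            # canonical SMILES exactly as the toolkit wrote it
    toolkit: str         # which toolkit produced it (placeholder labels below)
    version: str         # toolkit version that produced it
    with_isotopes: bool  # which form: isotope labels kept or dropped

    def source(self):
        """Everything that must match before text comparison is valid."""
        return (self.toolkit, self.version, self.with_isotopes)


def same_structure(a, b):
    """Compare two canonical SMILES, refusing to mix incompatible sources."""
    if a.source() != b.source():
        raise ValueError("different canonicalization sources; recanonicalize first")
    return a.text == b.text


a = CanonicalSmiles("CCO", "toolkit-a", "1.0", True)
b = CanonicalSmiles("CCO", "toolkit-a", "1.0", True)
c = CanonicalSmiles("OCC", "toolkit-b", "2.3", True)

print(same_structure(a, b))  # True
try:
    same_structure(a, c)
except ValueError as err:
    print("cannot compare:", err)
```

Raising an error instead of returning False is a deliberate choice in this sketch: a mismatch in text between two toolkits says nothing about whether the structures differ.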

Canonical SMILES matters mostly inside software tools. It isn't meant as an exchange format, and no one really types in canonical SMILES. That's why this task doesn't have any form of I/O. The point of this task is to show how to convert an in-memory SMILES string to a molecule and then generate the canonical SMILES for it.

Implementation

Parse two SMILES strings and convert them to canonical form. Check that the results give the same string. The input SMILES structures are:

[Cl-][Zr+4]123456789([Cl-])[CH]=%10C=%119C(=CC=CC%118[C-]7(C%101C)[Si](C)(C)[C-]%126C=%135C=CC=C(C=%14C=C(C=C(C%14)C(C)(C)C)C(C)(C)C)C%134[CH]3=C%122C)C=%15C=C(C=C(C%15)C(C)(C)C)C(C)(C)C

CDK/Groovy

import org.openscience.cdk.smiles.SmilesGenerator;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.nonotify.NoNotificationChemObjectBuilder;

parser = new SmilesParser(
  NoNotificationChemObjectBuilder.getInstance()
);
generator = new SmilesGenerator();

smi = [
  "CN2C(=O)N(C)C(=O)C1=C2N=CN1C",
  "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
]
can = [];

smi.each { smiles ->
  can.add(
    generator.createSMILES(
      parser.parseSmiles(smiles)
    )
  )
}

assert can[0] == can[1]

Indigo/C

#include <stdio.h>
#include <stdlib.h>   /* free */
#include <string.h>   /* strdup, strcmp */
#include "indigo.h"

int main (int argc, const char *argv[])
{
   int mol1 = indigoLoadMoleculeFromString("CN2C(=O)N(C)C(=O)C1=C2N=CN1C");
   int mol2 = indigoLoadMoleculeFromString("CN1C=NC2=C1C(=O)N(C)C(=O)N2C");
   char *smi1, *smi2;
   
   indigoAromatize(mol1);
   indigoAromatize(mol2);

   smi1 = strdup(indigoCanonicalSmiles(mol1));
   smi2 = strdup(indigoCanonicalSmiles(mol2));

   if (strcmp(smi1, smi2) != 0)
      fprintf(stderr, "canonical SMILES strings do not match\n");

   free(smi1);
   free(smi2);
   indigoFree(mol1);
   indigoFree(mol2);
}

Indigo/Java

package test;
import com.gga.indigo.*;
import java.io.*;
import java.util.*;
 
public class Main
{
   public static void main (String[] args) throws java.io.IOException
   {
      Indigo indigo = new Indigo();
      IndigoObject mol1 = indigo.loadMolecule("CN2C(=O)N(C)C(=O)C1=C2N=CN1C");
      IndigoObject mol2 = indigo.loadMolecule("CN1C=NC2=C1C(=O)N(C)C(=O)N2C");

      mol1.aromatize();
      mol2.aromatize();
      assert mol1.canonicalSmiles().equals(mol2.canonicalSmiles());
   }
}

Indigo/Python

from indigo import *
indigo = Indigo()
mol1 = indigo.loadMolecule("CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
mol2 = indigo.loadMolecule("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")
mol1.aromatize()
mol2.aromatize()
assert mol1.canonicalSmiles() == mol2.canonicalSmiles()

OpenBabel/Pybel

import pybel

smiles = ["CN2C(=O)N(C)C(=O)C1=C2N=CN1C",
          "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]

cans = [pybel.readstring("smi", smile).write("can") for smile in smiles]
assert cans[0] == cans[1]

OpenBabel/Rubabel

require 'rubabel'
smiles = %w{CN2C(=O)N(C)C(=O)C1=C2N=CN1C CN1C=NC2=C1C(=O)N(C)C(=O)N2C}
cans = smiles.map {|smile| Rubabel[smile].csmiles }
fail unless cans.reduce(:==)

OpenEye/Python

from openeye.oechem import *

def canonicalize(smiles):
    mol = OEGraphMol()
    OEParseSmiles(mol, smiles)
    return OECreateCanSmiString(mol)

assert (canonicalize("CN2C(=O)N(C)C(=O)C1=C2N=CN1C") ==
        canonicalize("CN1C=NC2=C1C(=O)N(C)C(=O)N2C"))

RDKit/Python

from rdkit import Chem

smis = ["CN2C(=O)N(C)C(=O)C1=C2N=CN1C",
        "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]

cans = [Chem.MolToSmiles(Chem.MolFromSmiles(smi), isomericSmiles=True) for smi in smis]
assert cans[0] == cans[1]

Cactvs/Tcl

prop setparam E_SMILES unique 1
set s1 [ens new [ens create CN2C(=O)N(C)C(=O)C1=C2N=CN1C] E_SMILES]
set s2 [ens new [ens create CN1C=NC2=C1C(=O)N(C)C(=O)N2C] E_SMILES]
if {$s1 ne $s2} {error "SMILES not equal"}

Cactvs/Python

Prop.Setparam('E_SMILES',{'unique':True})
s1=Ens('CN2C(=O)N(C)C(=O)C1=C2N=CN1C').new('E_SMILES')
s2=Ens('CN1C=NC2=C1C(=O)N(C)C(=O)N2C').new('E_SMILES')
if (s1!=s2): raise RuntimeError('SMILES not equal')