@egonw
Created October 29, 2017 16:09
Bioclipse code to check chemical compounds with isomeric SMILES and no InChI
// Copyright (C) 2017 Egon Willighagen
// MIT license
// see https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Tools#Chemical_compound_without_InChI
/* results (2017-10-29):
Number of missing InChIs: 2098
InChI too long: 1736
With undefined stereo: 348
Bad SMILES: 2
Compound classes: 12
*/
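// Select chemical compounds (P31 = instance of, Q11173 = chemical compound) that
// have an isomeric SMILES (P2017) but no InChI (P234).
query = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wdt:P2017 ?smiles .
  MINUS { ?compound wdt:P234 ?d }
}
"""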
service = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
results = rdf.sparqlRemote(service, query)
qsFile = "/Wikidata/curation1.quickstatements"
def renewFile(file) {
  if (ui.fileExists(file)) ui.remove(file)
  ui.newFile(file)
  return file
}
renewFile(qsFile)
tooLong = 0
undefinedStereo = 0
badSmiles = 0
compoundClass = 0
for (i=1;i<=results.rowCount;i++) {
// for (i=1;i<=2;i++) {
  rowVals = results.getRow(i)
  item = rowVals[0].substring(3) // drop the three-character prefix (presumably "wd:") to keep the Q identifier
  smiles = rowVals[1]
  if (!smiles.contains("*")) { // skip compound classes
    try {
      mol = cdk.fromSMILES(smiles)
      inchiObj = inchi.generate(mol)
      inchiShort = inchiObj.value.substring(6) // drop the leading "InChI="
      key = inchiObj.key
      if (!inchiShort.contains("?")) { // skip undefined stereochemistry
        if (inchiShort.length() <= 400) { // skip InChIs longer than 400 characters
          // QuickStatements for the existing item: P234 = InChI, P235 = InChIKey
          statement = """
$item\tP234\t\"$inchiShort\"
$item\tP235\t\"$key\"
"""
          ui.append(qsFile, statement + "\n")
        } else {
          // println "$item has an InChI that is too long: $inchiShort"
          tooLong++
        }
      } else {
        // println "$item has undefined stereochemistry(?): $inchiShort"
        undefinedStereo++
      }
    } catch (Exception e) {
      // println "$item has bad SMILES: " + e.message
      badSmiles++
    } // skip bad SMILES
  } else {
    // println "$item is a compound class (has a '*'): $smiles"
    compoundClass++
  }
}
println "Number of missing InChIs: " + results.rowCount
println "InChI too long: $tooLong"
println "With undefined stereo: $undefinedStereo"
println "Bad SMILES: $badSmiles"
println "Compound classes: $compoundClass"
ui.open(qsFile)