How to convert SMILES to IUPAC name:
bash-3.2$ python pubchem_convert_SMILES_to_IUPAC.py "CC=O"
IUPAC name (Allowed): acetaldehyde
IUPAC name (CAS-like Style): acetaldehyde
IUPAC name (Preferred): acetaldehyde
IUPAC name (Systematic): ethanal
IUPAC name (Traditional): acetaldehyde
bash-3.2$ python pubchem_convert_SMILES_to_IUPAC.py "Oc2ccc(C=Cc1cc(O)cc(O)c1)cc2"
IUPAC name (Allowed): 5-[2-(4-hydroxyphenyl)vinyl]benzene-1,3-diol
IUPAC name (CAS-like Style): 5-[2-(4-hydroxyphenyl)ethenyl]benzene-1,3-diol
IUPAC name (Preferred): 5-[2-(4-hydroxyphenyl)ethenyl]benzene-1,3-diol
IUPAC name (Systematic): 5-[2-(4-hydroxyphenyl)ethenyl]benzene-1,3-diol
IUPAC name (Traditional): 5-[2-(4-hydroxyphenyl)vinyl]resorcinol
See https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html for more details and examples.
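The script itself is not reproduced here. As a rough sketch only, a lookup like this could start by turning the SMILES string into a PUG REST request URL. The helper name `build_record_url` is hypothetical, and the use of the `compound/smiles/.../record/XML` endpoint is an assumption based on the PUG REST URL scheme described in the tutorial, not necessarily what the script does.

```python
from urllib.parse import quote

# Base URL of the PubChem PUG REST service
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_record_url(smiles):
    """Return a PUG REST URL asking for the full compound record as XML.

    Hypothetical helper: the endpoint choice is an assumption.
    """
    # Percent-encode every reserved character ('=', '(', '#', ...) so the
    # SMILES string survives as a single URL path segment.
    return f"{PUG_REST}/compound/smiles/{quote(smiles, safe='')}/record/XML"


print(build_record_url("CC=O"))
```

The XML returned for such a request is what the find/findall expressions discussed next would be applied to.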
The "weird" find/findall expressions are there because of how lxml (or some C lib it uses) handles XML namespaces. Every element in this document belongs to the default namespace http://www.ncbi.nlm.nih.gov, so its tag is reported with that namespace as a prefix in braces, for example {http://www.ncbi.nlm.nih.gov}PC-Urn_label. The problem is that a plain XPath expression using only the bare tag name then no longer matches anything; there seem to be 2 solutions to this problem:
- Strip the namespace prefix from every tag in the XML doc, so you can use plain XPath expressions on it:

      it = ET.iterparse('somefile.xml')
      for _, el in it:
          el.tag = el.tag.split('}', 1)[-1]  # strip the namespace, if the tag has one
      root = it.root

- Give up on using XPath expressions, and use the lxml find/findall functions, which do support expressions like root.findall(".//{*}PC-Urn_label"), where the * is a simple wildcard (in this case it makes the code slightly more readable than using the actual namespace string, as in root.findall(".//{http://www.ncbi.nlm.nih.gov}PC-Urn_label")).
Given the structure of the XML data in this (PubChem) case, I chose the second approach.
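To make the two options concrete, here is a minimal sketch of both on a small, hand-written XML fragment. The fragment is only modeled on a PubChem record and is not real output. The function names `iupac_names` and `strip_namespaces` are made up for illustration. The sketch uses the standard library's xml.etree.ElementTree, which understands the `{*}` wildcard from Python 3.8 on, rather than lxml, so it runs without extra packages.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written fragment modeled on a PubChem record (assumption, not real output).
SAMPLE = """<PC-Compounds xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-InfoData>
    <PC-Urn>
      <PC-Urn_label>IUPAC Name</PC-Urn_label>
      <PC-Urn_name>Preferred</PC-Urn_name>
    </PC-Urn>
    <PC-InfoData_value_sval>acetaldehyde</PC-InfoData_value_sval>
  </PC-InfoData>
  <PC-InfoData>
    <PC-Urn>
      <PC-Urn_label>IUPAC Name</PC-Urn_label>
      <PC-Urn_name>Systematic</PC-Urn_name>
    </PC-Urn>
    <PC-InfoData_value_sval>ethanal</PC-InfoData_value_sval>
  </PC-InfoData>
  <PC-InfoData>
    <PC-Urn>
      <PC-Urn_label>Molecular Formula</PC-Urn_label>
    </PC-Urn>
    <PC-InfoData_value_sval>C2H4O</PC-InfoData_value_sval>
  </PC-InfoData>
</PC-Compounds>"""


def iupac_names(xml_text):
    """Wildcard approach: match each tag in any namespace via {*}."""
    root = ET.fromstring(xml_text)
    names = {}
    for info in root.findall(".//{*}PC-InfoData"):
        # Skip entries that are not IUPAC names (e.g. the molecular formula)
        if info.findtext(".//{*}PC-Urn_label") != "IUPAC Name":
            continue
        style = info.findtext(".//{*}PC-Urn_name")
        names[style] = info.findtext(".//{*}PC-InfoData_value_sval")
    return names


def strip_namespaces(xml_text):
    """Stripping approach: drop the {namespace} prefix from every tag."""
    it = ET.iterparse(io.BytesIO(xml_text.encode("utf-8")))
    for _, el in it:
        el.tag = el.tag.split("}", 1)[-1]
    return it.root  # set once the iterator is exhausted


for style, name in iupac_names(SAMPLE).items():
    print(f"IUPAC name ({style}): {name}")

# After stripping, bare tag names match again.
print(len(strip_namespaces(SAMPLE).findall(".//PC-Urn_label")))
```

The wildcard version leaves the parsed tree untouched, while the stripping version rewrites every tag once and then allows shorter expressions afterwards.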