@kenwebb
Last active August 29, 2015 14:03
SMILES
<?xml version="1.0" encoding="UTF-8"?>
<!--Xholon Workbook http://www.primordion.com/Xholon/gwt/ MIT License, Copyright (C) Ken Webb, Sat Jun 28 2014 08:10:36 GMT-0400 (EDT)-->
<XholonWorkbook>
<Notes><![CDATA[
Xholon
------
Title: SMILES
Description: Simplified Molecular Input Line Entry System
Url: http://www.primordion.com/Xholon/gwt/
InternalName: 6c71b62ab83af820939a
Keywords:
My Notes
--------
According to Wikipedia (1):
The Simplified Molecular-Input Line-Entry System or SMILES is a specification in form of a line notation for describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.
In this workbook I explore how SMILES can be integrated with Xholon.
In a chemical graph, the nodes are atoms, and the edges are semi-rigid bonds that can be single, double, or triple according to the rules of valence bond theory.[3]
The Xholon containment hierarchy doesn't map naturally onto molecules and SMILES, but I am using it to represent branches in the chemical structure.
TODO
- use explicit ports between siblings
- the bonds are the active objects; all ports are from the bonds to the atoms
- the bonds are any of: Sngl Dobl Trpl Rmtc
- possibly include explicit Sngl bonds between all otherwise unbonded siblings
- handle cycles and cross-branch bonds
- add an extra bond to represent the final part of the cycle ?
- `In a SMILES string such as "C1CCCCC1", the first occurrence of a ring-closure number (an "rnum") creates an "open bond" to the atom that precedes the ring-closure number (the "rnum"). When that same rnum is encountered later in the string, a bond is made between the two atoms, which typically forms a cyclic structure.`[3]
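
The ring-closure rule quoted above can be sketched in a few lines of JavaScript. This is only an illustration, not the workbook's implementation: findRingClosures is a hypothetical helper, and it assumes single-letter organic-subset atoms and single-digit rnums (no bracket atoms, no "Cl" or "Br", no two-digit closures).

```javascript
// Hypothetical helper, not part of the workbook.
// Assumes every letter is one atom and every digit is a single-digit rnum.
function findRingClosures(smiles) {
  var atomCount = 0;   // number of atoms seen so far
  var openBonds = {};  // rnum digit -> index of the atom that opened it
  var closures = [];   // completed ring-closure bonds as [fromAtom, toAtom]
  for (var i = 0; i < smiles.length; i++) {
    var ch = smiles.charAt(i);
    if (/[A-Za-z]/.test(ch)) {
      atomCount++;
    } else if (/[0-9]/.test(ch)) {
      var current = atomCount - 1; // the atom that precedes the rnum
      if (openBonds.hasOwnProperty(ch)) {
        // second occurrence: bond the opening atom to the current atom
        closures.push([openBonds[ch], current]);
        delete openBonds[ch]; // the rnum may be reused later in the string
      } else {
        // first occurrence: remember the "open bond"
        openBonds[ch] = current;
      }
    }
    // bond symbols, parentheses and everything else are ignored here
  }
  return closures;
}
```

With atoms numbered from 0, findRingClosures("C1CCCCC1") returns [[0, 5]], the single extra bond that closes the six-membered ring. For cubane, "C12C3C4C1C5C4C3C25", it returns five closure bonds.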
Tentative Conclusions June 28, 2014
---------------------
My exploration of SMILES is incomplete, but I do have some tentative conclusions.
- SMILES chemical branches are analogous to Xholon hierarchy
- SMILES branch chains are effectively contained within a SMILES main chain
- SMILES siblings are connected with single bonds by default,
while Xholon siblings are not connected by default (the equivalent of the SMILES "." disconnection symbol)
- SMILES siblings are ordered, while Xholon siblings are unordered
- I don't think SMILES has a way of naming main and branch chains
- if there's only a main chain, then its name is the same as the molecule name
- SMILES branches are specified using unnamed ( and )
- every SMILES branch has an ASCII-specified structure which functions as an implicit name
- all Xholon subtrees have names which are separate from the details of their structure
- SMILES allows any atom to bond with any other atom,
which is analogous to Xholon ports
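
As a rough illustration of the first conclusion, here is a minimal sketch of how SMILES parentheses can be turned into nested XML, with each branch becoming a Brch subtree. branchesToXml is a hypothetical name used only for this note. It assumes single-letter aliphatic atoms and ignores bonds, ring-closure numbers and charges, so it is much narrower than the parse() function in this workbook.

```javascript
// Hypothetical helper, not part of the workbook.
// Maps "(" and ")" to <Brch> and </Brch>, and each single-letter atom to an empty element.
function branchesToXml(smiles) {
  var xml = "";
  var depth = 0; // current branch nesting level
  for (var i = 0; i < smiles.length; i++) {
    var ch = smiles.charAt(i);
    if (ch === "(") {
      xml += "<Brch>";
      depth++;
    } else if (ch === ")") {
      if (depth === 0) {
        throw new Error("unbalanced ) at position " + i);
      }
      xml += "</Brch>";
      depth--;
    } else if (/[BCNOSPFI]/.test(ch)) {
      xml += "<" + ch + "/>";
    }
    // bond symbols, digits and charges are skipped in this sketch
  }
  if (depth !== 0) {
    throw new Error("unclosed ( in " + smiles);
  }
  return xml;
}
```

For triethylamine, branchesToXml("CCN(CC)CC") returns <C/><C/><N/><Brch><C/><C/></Brch><C/><C/>. The branch chain CC ends up as a child subtree of the main chain, which is the containment analogy described above.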
References
----------
(1) http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
(2) http://www.daylight.com/dayhtml/doc/theory/
(3) http://www.opensmiles.org/opensmiles.html
(4) http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/chemical/structure/aspirin/smiles
converts chemical names to SMILES and other formats
(5) http://pubchem.ncbi.nlm.nih.gov/edit2/index.html
converts SMILES to SVG
]]></Notes>
<_-.XholonClass>
<MoleculeSystem/>
<Mlcl superClass="Attribute_String"/> <!-- molecule -->
<Molecule/>
<Atom>
<!-- Organic Subset, in SMILES, these atoms don't require square brackets around them -->
<Rgnc>
<Rgnl>
<!--
aliphatic_organic ::= 'B' | 'C' | 'N' | 'O' | 'S' | 'P' | 'F' | 'Cl' | 'Br' | 'I'
-->
<B/>
<C/>
<N/>
<O/>
<S/>
<P/>
<F/>
<Cl/>
<Br/>
<I/>
</Rgnl>
<Rgnr>
<!--
aromatic_organic ::= 'b' | 'c' | 'n' | 'o' | 's' | 'p'
-->
<b/>
<c/>
<n/>
<o/>
<s/>
<p/>
</Rgnr>
</Rgnc>
<!--
BRACKET ATOMS in SMILES, these atoms DO require square brackets around them
element_symbols ::= 'H' | 'He' | 'Li' | 'Be' | 'B' | 'C' | 'N' | 'O' | 'F' | 'Ne' | 'Na' | 'Mg' | 'Al' | 'Si' | 'P' | 'S' | 'Cl' | 'Ar' | 'K' | 'Ca' | 'Sc' | 'Ti' | 'V' | 'Cr' | 'Mn' | 'Fe' | 'Co' | 'Ni' | 'Cu' | 'Zn' | 'Ga' | 'Ge' | 'As' | 'Se' | 'Br' | 'Kr' | 'Rb' | 'Sr' | 'Y' | 'Zr' | 'Nb' | 'Mo' | 'Tc' | 'Ru' | 'Rh' | 'Pd' | 'Ag' | 'Cd' | 'In' | 'Sn' | 'Sb' | 'Te' | 'I' | 'Xe' | 'Cs' | 'Ba' | 'Hf' | 'Ta' | 'W' | 'Re' | 'Os' | 'Ir' | 'Pt' | 'Au' | 'Hg' | 'Tl' | 'Pb' | 'Bi' | 'Po' | 'At' | 'Rn' | 'Fr' | 'Ra' | 'Rf' | 'Db' | 'Sg' | 'Bh' | 'Hs' | 'Mt' | 'Ds' | 'Rg' | 'Cn' | 'Fl' | 'Lv' | 'La' | 'Ce' | 'Pr' | 'Nd' | 'Pm' | 'Sm' | 'Eu' | 'Gd' | 'Tb' | 'Dy' | 'Ho' | 'Er' | 'Tm' | 'Yb' | 'Lu' | 'Ac' | 'Th' | 'Pa' | 'U' | 'Np' | 'Pu' | 'Am' | 'Cm' | 'Bk' | 'Cf' | 'Es' | 'Fm' | 'Md' | 'No' | 'Lr'
aromatic_symbols ::= 'b' | 'c' | 'n' | 'o' | 'p' | 's' | 'se' | 'as'
-->
<Brkt>
<Brkl>
<_H/>
<_He/>
<_Li/>
<_Be/>
<_B/>
<_C/>
<_N/>
<_O/>
<_F/>
<_Ne/>
<_Na/>
<_Mg/>
<_Al/>
<_Si/>
<_P/>
<_S/>
<_Cl/>
<_Ar/>
<_K/>
<_Ca/>
<_Sc/>
<_Ti/>
<_V/>
<_Cr/>
<_Mn/>
<_Fe/>
<_Co/>
<_Ni/>
<_Cu/>
<_Zn/>
<_Ga/>
<_Ge/>
<_As/>
<_Se/>
<_Br/>
<_Kr/>
<_Rb/>
<_Sr/>
<_Y/>
<_Zr/>
<_Nb/>
<_Mo/>
<_Tc/>
<_Ru/>
<_Rh/>
<_Pd/>
<_Ag/>
<_Cd/>
<_In/>
<_Sn/>
<_Sb/>
<_Te/>
<_I/>
<_Xe/>
<_Cs/>
<_Ba/>
<_Hf/>
<_Ta/>
<_W/>
<_Re/>
<_Os/>
<_Ir/>
<_Pt/>
<_Au/>
<_Hg/>
<_Tl/>
<_Pb/>
<_Bi/>
<_Po/>
<_At/>
<_Rn/>
<_Fr/>
<_Ra/>
<_Rf/>
<_Db/>
<_Sg/>
<_Bh/>
<_Hs/>
<_Mt/>
<_Ds/>
<_Rg/>
<_Cn/>
<_Fl/>
<_Lv/>
<_La/>
<_Ce/>
<_Pr/>
<_Nd/>
<_Pm/>
<_Sm/>
<_Eu/>
<_Gd/>
<_Tb/>
<_Dy/>
<_Ho/>
<_Er/>
<_Tm/>
<_Yb/>
<_Lu/>
<_Ac/>
<_Th/>
<_Pa/>
<_U/>
<_Np/>
<_Pu/>
<_Am/>
<_Cm/>
<_Bk/>
<_Cf/>
<_Es/>
<_Fm/>
<_Md/>
<_No/>
<_Lr/>
</Brkl>
<Brkr>
<_b/>
<_c/>
<_n/>
<_o/>
<_p/>
<_s/>
<_se/>
<_as/>
</Brkr>
</Brkt>
</Atom>
<!-- chemical bonds -->
<Bond>
<!-- - single bond -->
<Sngl/>
<!-- = double bond -->
<Dobl/>
<!-- # triple bond -->
<Trpl/>
<!-- $ quadruple bond OpenSMILES -->
<Qdpl/>
<!-- : aromatic bond -->
<Rmtc/>
<!-- / directional bonds -->
<!-- \ directional bonds -->
</Bond>
<!-- disconnected structures; indicates that adjacent atoms are not bonded to each other -->
<Dscn/>
<Brch/>
</_-.XholonClass>
<xholonClassDetails>
<Sngl xhType="XhtypePureActiveObject"/>
<Dobl xhType="XhtypePureActiveObject"/>
<Trpl xhType="XhtypePureActiveObject"/>
<Rmtc xhType="XhtypePureActiveObject"/>
</xholonClassDetails>
<MoleculeSystem>
<Mlcl roleName="ethane">CC</Mlcl>
<Mlcl roleName="carbon dioxide">O=C=O</Mlcl>
<Mlcl roleName="triethylamine">CCN(CC)CC</Mlcl>
<Mlcl roleName="pentane">CCCCC</Mlcl>
<Mlcl roleName="aspirin">C1=CC=CC(=C1C(O)=O)OC(C)=O</Mlcl>
<Mlcl roleName="thiosulfate">OS(=O)(=S)O</Mlcl>
<!-- TODO + and - are not yet handled -->
<Mlcl roleName="sodium chloride">[Na+].[Cl-]</Mlcl>
<Mlcl roleName="ring">C1CCCCC1</Mlcl>
<Mlcl roleName="cubane">C12C3C4C1C5C4C3C25</Mlcl>
<Mlcl roleName="ring-closure number test">C0123456789C0C1C2C3C4C5C6C7C8C9</Mlcl>
<Mlcl roleName="syntax test">BCNOSPFIbcnospBrCl-=#:()[]XYZxyz</Mlcl>
<!-- TODO "arbitrary atom names" needs more work -->
<!--<Mlcl roleName="arbitrary atom names">[one]2[two]([three][three]3[three])[four]2[five]3</Mlcl>-->
</MoleculeSystem>
<MoleculeSystembehavior implName="org.primordion.xholon.base.Behavior_gwtjs"><![CDATA[
var me;
var allowArbitraryAtomNames = false;
var beh = {
postConfigure: function() {
me = this.cnode.parent();
$wnd.xh.param("MaxPorts","2");
var service = $wnd.xh.service("XholonHelperService");
var mlcl = me.first();
while (mlcl) {
if (mlcl.xhc().name() != "Mlcl") {break;}
var txt = mlcl.text().trim();
me.println(txt);
var xml = this.parse(txt, mlcl.role());
service.call(-2013, xml, me);
var mlclNext = mlcl.next();
mlcl.remove();
mlcl = mlclNext;
}
this.cnode.remove();
}, // end postConfigure()
parse: function(txt, role) {
var xml = '<Molecule roleName="' + role + '">';
xml += "<Annotation>" + txt + "</Annotation>";
var i = 0;
while (i < txt.length) {
var token = txt.charAt(i);
switch (token) {
case 'B':
if (txt.charAt(i+1) == "r") {
i++;
token = "Br";
}
xml += this.makeXmlNode(token);
break;
case 'C':
if (txt.charAt(i+1) == "l") {
i++;
token = "Cl";
}
xml += this.makeXmlNode(token);
break;
case 'N':
case 'O':
case 'S':
case 'P':
case 'F':
case 'I':
case 'b':
case 'c':
case 'n':
case 'o':
case 's':
case 'p':
xml += this.makeXmlNode(token);
break;
// bond
case '-':
xml += this.makeXmlNode("Sngl");
break;
case '=':
xml += this.makeXmlNode("Dobl");
break;
case '#':
xml += this.makeXmlNode("Trpl");
break;
case ':':
xml += this.makeXmlNode("Rmtc");
break;
// branch
case '(':
xml += "<Brch>";
break;
case ')':
xml += "</Brch>";
break;
// bracketed atom
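case '[':
var bracketedXml = this.parseBracketed(txt, i); // [H] becomes <_H/>
if (bracketedXml) {
xml += bracketedXml;
}
// skip to the closing bracket, so that a charge such as the '-' in [Cl-] is not parsed as a single bond
var closeIndex = txt.indexOf("]", i);
if (closeIndex != -1) {i = closeIndex;}
break;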
case ']':
// no need to do anything
break;
// ring-closure number (an "rnum")
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
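// encode the rnum as a non-zero val (e.g. "1" becomes 101.0), so that Snglbehavior can tell a ring-closure bond from an ordinary single bond, whose val is 0
xml += '<Sngl val="10' + token + '.0"/>';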
break;
// charge
// a '-' outside of square brackets is matched by the single-bond case above, so only '+' is listed here
case '+':
// TODO
break;
// disconnection
case '.':
xml += this.makeXmlNode("Dscn");
break;
default: break;
} // end switch
i++;
} // end while
xml += "</Molecule>\n";
me.println(xml);
return xml;
}, // end parse()
makeXmlNode: function(tagName) {
return "<" + tagName + "/>";
},
// txt.charAt(i) equals '['
parseBracketed: function(txt, i) {
var start = ++i;
var end = txt.indexOf("]", i);
var token = txt.substring(start, end);
$wnd.console.log(token);
if (token) {
// the token may end with + or -
var lastChar = token.charAt(token.length-1);
if (lastChar == "+" || lastChar == "-") {
token = token.substring(0, token.length-1);
}
$wnd.console.log(token);
switch (token) {
case 'H':
case 'He':
case 'Li':
case 'Be':
case 'B':
case 'C':
case 'N':
case 'O':
case 'F':
case 'Ne':
case 'Na':
case 'Mg':
case 'Al':
case 'Si':
case 'P':
case 'S':
case 'Cl':
case 'Ar':
case 'K':
case 'Ca':
case 'Sc':
case 'Ti':
case 'V':
case 'Cr':
case 'Mn':
case 'Fe':
case 'Co':
case 'Ni':
case 'Cu':
case 'Zn':
case 'Ga':
case 'Ge':
case 'As':
case 'Se':
case 'Br':
case 'Kr':
case 'Rb':
case 'Sr':
case 'Y':
case 'Zr':
case 'Nb':
case 'Mo':
case 'Tc':
case 'Ru':
case 'Rh':
case 'Pd':
case 'Ag':
case 'Cd':
case 'In':
case 'Sn':
case 'Sb':
case 'Te':
case 'I':
case 'Xe':
case 'Cs':
case 'Ba':
case 'Hf':
case 'Ta':
case 'W':
case 'Re':
case 'Os':
case 'Ir':
case 'Pt':
case 'Au':
case 'Hg':
case 'Tl':
case 'Pb':
case 'Bi':
case 'Po':
case 'At':
case 'Rn':
case 'Fr':
case 'Ra':
case 'Rf':
case 'Db':
case 'Sg':
case 'Bh':
case 'Hs':
case 'Mt':
case 'Ds':
case 'Rg':
case 'Cn':
case 'Fl':
case 'Lv':
case 'La':
case 'Ce':
case 'Pr':
case 'Nd':
case 'Pm':
case 'Sm':
case 'Eu':
case 'Gd':
case 'Tb':
case 'Dy':
case 'Ho':
case 'Er':
case 'Tm':
case 'Yb':
case 'Lu':
case 'Ac':
case 'Th':
case 'Pa':
case 'U':
case 'Np':
case 'Pu':
case 'Am':
case 'Cm':
case 'Bk':
case 'Cf':
case 'Es':
case 'Fm':
case 'Md':
case 'No':
case 'Lr':
return this.makeXmlNode("_" + token);
default:
if (allowArbitraryAtomNames) {
return this.makeXmlNode(token);
}
return "";
} // end switch
} // end if
return "";
} // end parseBracketed()
} // end beh
]]></MoleculeSystembehavior>
<Snglbehavior implName="org.primordion.xholon.base.Behavior_gwtjs"><![CDATA[
var bond;
var beh = {
postConfigure: function() {
bond = this.cnode.parent();
//bond.println(bond.toString());
var rnum = bond.val();
if (rnum == 0) {
bond.port(0, this.findPreviousAtom(bond.prev()));
bond.port(1, bond.next());
}
else {
var resultNode = this.findMatchingNode(rnum, bond.next());
if (resultNode) {
//bond.println(" resultNode: " + resultNode.toString());
bond.port(0, this.findPreviousAtom(bond.prev()));
bond.port(1, this.findPreviousAtom(resultNode.prev()));
resultNode.remove();
}
}
this.cnode.remove();
},
// find the next node with the same rnum; search the node's children first, then its following siblings
findMatchingNode: function(rnum, node) {
if (node == null) {return null;}
if (rnum == node.val()) {return node;}
var resultNode = this.findMatchingNode(rnum, node.first());
if (resultNode) {return resultNode;}
return this.findMatchingNode(rnum, node.next());
},
// find a node's previous sibling that's an atom; skip over bond nodes
findPreviousAtom: function(node) {
while (node != null) {
if (node.xhc().parent() && node.xhc().parent().name() == "Bond") {
node = node.prev();
}
else {
return node;
}
}
return null;
}
}
]]></Snglbehavior>
<SvgClient><Attribute_String roleName="svgUri"><![CDATA[data:image/svg+xml,
<svg width="100" height="50" xmlns="http://www.w3.org/2000/svg">
<g>
<title>SMILES</title>
<rect id="MoleculeSystem" fill="#98FB98" height="50" width="50" x="25" y="0"/>
<g>
<title>Carbon</title>
<rect id="MoleculeSystem/Molecule/C" fill="#6AB06A" height="50" width="10" x="80" y="0"/>
</g>
</g>
</svg>
]]></Attribute_String><Attribute_String roleName="setup">${MODELNAME_DEFAULT},${SVGURI_DEFAULT}</Attribute_String></SvgClient>
</XholonWorkbook>