Skip to content

Instantly share code, notes, and snippets.

@captn3m0
Created August 11, 2014 17:00
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save captn3m0/97a672e5dbc69e7d2015 to your computer and use it in GitHub Desktop.
Save captn3m0/97a672e5dbc69e7d2015 to your computer and use it in GitHub Desktop.
ENML Scrubber in ruby using Loofah

The problem is converting a valid html document to a valid ENML (Evernote Markup Language) schema. Validation is performed, but performing validation over and over is not a good solution. The idea is to transform the doc by scrubbing out invalid tags and attributes, which come under the following categories:

  1. allowed_tags : These are the only valid tags in ENML, and have a set of valid attributes for each
  2. disallowed_tags: Remove these tags completely from the doc

If a tag other than these is found, strip it (replace it with its children). To "know which tags to remove and which to keep", I have a dtd file (enml2.dtd) and want to parse it.

<!--
Evernote Markup Language (ENML) 2.0 DTD
This expresses the structure of an XML document that can be used as the
'content' of a Note within Evernote's data model.
The Evernote service will reject attempts to create or update notes if
their contents do not validate against this DTD.
This is based on a subset of XHTML which is intentionally broadened to
reject less real-world HTML, to reduce the likelihood of synchronization
failures. This means that all attributes are defined as CDATA instead of
more-restrictive types, and every HTML element may embed every other
HTML element.
Copyright (c) 2007-2009 Evernote Corp.
$Date: 2007/10/15 18:00:00 $
-->
<!--=================== Generic Attributes ===============================-->
<!ENTITY % coreattrs
"style CDATA #IMPLIED
title CDATA #IMPLIED"
>
<!ENTITY % i18n
"lang CDATA #IMPLIED
xml:lang CDATA #IMPLIED
dir CDATA #IMPLIED"
>
<!ENTITY % focus
"accesskey CDATA #IMPLIED
tabindex CDATA #IMPLIED"
>
<!ENTITY % attrs
"%coreattrs;
%i18n;"
>
<!ENTITY % TextAlign
"align CDATA #IMPLIED"
>
<!ENTITY % cellhalign
"align CDATA #IMPLIED
char CDATA #IMPLIED
charoff CDATA #IMPLIED"
>
<!ENTITY % cellvalign
"valign CDATA #IMPLIED"
>
<!ENTITY % AnyContent
"( #PCDATA |
a |
abbr |
acronym |
address |
area |
b |
bdo |
big |
blockquote |
br |
caption |
center |
cite |
code |
col |
colgroup |
dd |
del |
dfn |
div |
dl |
dt |
em |
en-crypt |
en-media |
en-todo |
font |
h1 |
h2 |
h3 |
h4 |
h5 |
h6 |
hr |
i |
img |
ins |
kbd |
li |
map |
ol |
p |
pre |
q |
s |
samp |
small |
span |
strike |
strong |
sub |
sup |
table |
tbody |
td |
tfoot |
th |
thead |
tr |
tt |
u |
ul |
var )*"
>
<!--=========== Evernote-specific Elements and Attributes ===============-->
<!ELEMENT en-note %AnyContent;>
<!ATTLIST en-note
%attrs;
bgcolor CDATA #IMPLIED
text CDATA #IMPLIED
xmlns CDATA #FIXED 'http://xml.evernote.com/pub/enml2.dtd'
>
<!ELEMENT en-crypt (#PCDATA)>
<!ATTLIST en-crypt
hint CDATA #IMPLIED
cipher CDATA "RC2"
length CDATA "64"
>
<!ELEMENT en-todo EMPTY>
<!ATTLIST en-todo
checked (true|false) "false"
>
<!ELEMENT en-media EMPTY>
<!ATTLIST en-media
%attrs;
type CDATA #REQUIRED
hash CDATA #REQUIRED
height CDATA #IMPLIED
width CDATA #IMPLIED
usemap CDATA #IMPLIED
align CDATA #IMPLIED
border CDATA #IMPLIED
hspace CDATA #IMPLIED
vspace CDATA #IMPLIED
longdesc CDATA #IMPLIED
alt CDATA #IMPLIED
>
<!--=========== Simplified HTML Elements and Attributes ===============-->
<!ELEMENT a %AnyContent;>
<!ATTLIST a
%attrs;
%focus;
charset CDATA #IMPLIED
type CDATA #IMPLIED
name CDATA #IMPLIED
href CDATA #IMPLIED
hreflang CDATA #IMPLIED
rel CDATA #IMPLIED
rev CDATA #IMPLIED
shape CDATA #IMPLIED
coords CDATA #IMPLIED
target CDATA #IMPLIED
>
<!ELEMENT abbr %AnyContent;>
<!ATTLIST abbr
%attrs;
>
<!ELEMENT acronym %AnyContent;>
<!ATTLIST acronym
%attrs;
>
<!ELEMENT address %AnyContent;>
<!ATTLIST address
%attrs;
>
<!ELEMENT area %AnyContent;>
<!ATTLIST area
%attrs;
%focus;
shape CDATA #IMPLIED
coords CDATA #IMPLIED
href CDATA #IMPLIED
nohref CDATA #IMPLIED
alt CDATA #IMPLIED
target CDATA #IMPLIED
>
<!ELEMENT b %AnyContent;>
<!ATTLIST b
%attrs;
>
<!ELEMENT bdo %AnyContent;>
<!ATTLIST bdo
%coreattrs;
lang CDATA #IMPLIED
xml:lang CDATA #IMPLIED
dir CDATA #IMPLIED
>
<!ELEMENT big %AnyContent;>
<!ATTLIST big
%attrs;
>
<!ELEMENT blockquote %AnyContent;>
<!ATTLIST blockquote
%attrs;
cite CDATA #IMPLIED
>
<!ELEMENT br %AnyContent;>
<!ATTLIST br
%coreattrs;
clear CDATA #IMPLIED
>
<!ELEMENT caption %AnyContent;>
<!ATTLIST caption
%attrs;
align CDATA #IMPLIED
>
<!ELEMENT center %AnyContent;>
<!ATTLIST center
%attrs;
>
<!ELEMENT cite %AnyContent;>
<!ATTLIST cite
%attrs;
>
<!ELEMENT code %AnyContent;>
<!ATTLIST code
%attrs;
>
<!ELEMENT col %AnyContent;>
<!ATTLIST col
%attrs;
%cellhalign;
%cellvalign;
span CDATA #IMPLIED
width CDATA #IMPLIED
>
<!ELEMENT colgroup %AnyContent;>
<!ATTLIST colgroup
%attrs;
%cellhalign;
%cellvalign;
span CDATA #IMPLIED
width CDATA #IMPLIED
>
<!ELEMENT dd %AnyContent;>
<!ATTLIST dd
%attrs;
>
<!ELEMENT del %AnyContent;>
<!ATTLIST del
%attrs;
cite CDATA #IMPLIED
datetime CDATA #IMPLIED
>
<!ELEMENT dfn %AnyContent;>
<!ATTLIST dfn
%attrs;
>
<!ELEMENT div %AnyContent;>
<!ATTLIST div
%attrs;
%TextAlign;
>
<!ELEMENT dl %AnyContent;>
<!ATTLIST dl
%attrs;
compact CDATA #IMPLIED
>
<!ELEMENT dt %AnyContent;>
<!ATTLIST dt
%attrs;
>
<!ELEMENT em %AnyContent;>
<!ATTLIST em
%attrs;
>
<!ELEMENT font %AnyContent;>
<!ATTLIST font
%coreattrs;
%i18n;
size CDATA #IMPLIED
color CDATA #IMPLIED
face CDATA #IMPLIED
>
<!ELEMENT h1 %AnyContent;>
<!ATTLIST h1
%attrs;
%TextAlign;
>
<!ELEMENT h2 %AnyContent;>
<!ATTLIST h2
%attrs;
%TextAlign;
>
<!ELEMENT h3 %AnyContent;>
<!ATTLIST h3
%attrs;
%TextAlign;
>
<!ELEMENT h4 %AnyContent;>
<!ATTLIST h4
%attrs;
%TextAlign;
>
<!ELEMENT h5 %AnyContent;>
<!ATTLIST h5
%attrs;
%TextAlign;
>
<!ELEMENT h6 %AnyContent;>
<!ATTLIST h6
%attrs;
%TextAlign;
>
<!ELEMENT hr %AnyContent;>
<!ATTLIST hr
%attrs;
align CDATA #IMPLIED
noshade CDATA #IMPLIED
size CDATA #IMPLIED
width CDATA #IMPLIED
>
<!ELEMENT i %AnyContent;>
<!ATTLIST i
%attrs;
>
<!ELEMENT img %AnyContent;>
<!ATTLIST img
%attrs;
src CDATA #IMPLIED
alt CDATA #IMPLIED
name CDATA #IMPLIED
longdesc CDATA #IMPLIED
height CDATA #IMPLIED
width CDATA #IMPLIED
usemap CDATA #IMPLIED
ismap CDATA #IMPLIED
align CDATA #IMPLIED
border CDATA #IMPLIED
hspace CDATA #IMPLIED
vspace CDATA #IMPLIED
>
<!ELEMENT ins %AnyContent;>
<!ATTLIST ins
%attrs;
cite CDATA #IMPLIED
datetime CDATA #IMPLIED
>
<!ELEMENT kbd %AnyContent;>
<!ATTLIST kbd
%attrs;
>
<!ELEMENT li %AnyContent;>
<!ATTLIST li
%attrs;
type CDATA #IMPLIED
value CDATA #IMPLIED
>
<!ELEMENT map %AnyContent;>
<!ATTLIST map
%i18n;
title CDATA #IMPLIED
name CDATA #IMPLIED
>
<!ELEMENT ol %AnyContent;>
<!ATTLIST ol
%attrs;
type CDATA #IMPLIED
compact CDATA #IMPLIED
start CDATA #IMPLIED
>
<!ELEMENT p %AnyContent;>
<!ATTLIST p
%attrs;
%TextAlign;
>
<!ELEMENT pre %AnyContent;>
<!ATTLIST pre
%attrs;
width CDATA #IMPLIED
xml:space (preserve) #FIXED 'preserve'
>
<!ELEMENT q %AnyContent;>
<!ATTLIST q
%attrs;
cite CDATA #IMPLIED
>
<!ELEMENT s %AnyContent;>
<!ATTLIST s
%attrs;
>
<!ELEMENT samp %AnyContent;>
<!ATTLIST samp
%attrs;
>
<!ELEMENT small %AnyContent;>
<!ATTLIST small
%attrs;
>
<!ELEMENT span %AnyContent;>
<!ATTLIST span
%attrs;
>
<!ELEMENT strike %AnyContent;>
<!ATTLIST strike
%attrs;
>
<!ELEMENT strong %AnyContent;>
<!ATTLIST strong
%attrs;
>
<!ELEMENT sub %AnyContent;>
<!ATTLIST sub
%attrs;
>
<!ELEMENT sup %AnyContent;>
<!ATTLIST sup
%attrs;
>
<!ELEMENT table %AnyContent;>
<!ATTLIST table
%attrs;
summary CDATA #IMPLIED
width CDATA #IMPLIED
border CDATA #IMPLIED
cellspacing CDATA #IMPLIED
cellpadding CDATA #IMPLIED
align CDATA #IMPLIED
bgcolor CDATA #IMPLIED
>
<!ELEMENT tbody %AnyContent;>
<!ATTLIST tbody
%attrs;
%cellhalign;
%cellvalign;
>
<!ELEMENT td %AnyContent;>
<!ATTLIST td
%attrs;
%cellhalign;
%cellvalign;
abbr CDATA #IMPLIED
rowspan CDATA #IMPLIED
colspan CDATA #IMPLIED
nowrap CDATA #IMPLIED
bgcolor CDATA #IMPLIED
width CDATA #IMPLIED
height CDATA #IMPLIED
>
<!ELEMENT tfoot %AnyContent;>
<!ATTLIST tfoot
%attrs;
%cellhalign;
%cellvalign;
>
<!ELEMENT th %AnyContent;>
<!ATTLIST th
%attrs;
%cellhalign;
%cellvalign;
abbr CDATA #IMPLIED
rowspan CDATA #IMPLIED
colspan CDATA #IMPLIED
nowrap CDATA #IMPLIED
bgcolor CDATA #IMPLIED
width CDATA #IMPLIED
height CDATA #IMPLIED
>
<!ELEMENT thead %AnyContent;>
<!ATTLIST thead
%attrs;
%cellhalign;
%cellvalign;
>
<!ELEMENT tr %AnyContent;>
<!ATTLIST tr
%attrs;
%cellhalign;
%cellvalign;
bgcolor CDATA #IMPLIED
>
<!ELEMENT tt %AnyContent;>
<!ATTLIST tt
%attrs;
>
<!ELEMENT u %AnyContent;>
<!ATTLIST u
%attrs;
>
<!ELEMENT ul %AnyContent;>
<!ATTLIST ul
%attrs;
type CDATA #IMPLIED
compact CDATA #IMPLIED
>
<!ELEMENT var %AnyContent;>
<!ATTLIST var
%attrs;
>
allowed_tags:
en-note:
- style
- title
- lang
- "xml:lang"
- dir
- bgcolor
- text
a:
- style
- title
- lang
- "xml:lang"
- dir
- accesskey
- tabindex
- charset
- type
- name
- href
- hreflang
- rel
- rev
- shape
- coords
- target
abbr:
- style
- title
- lang
- "xml:lang"
- dir
acronym:
- style
- title
- lang
- "xml:lang"
- dir
address:
- style
- title
- lang
- "xml:lang"
- dir
area:
- style
- title
- lang
- "xml:lang"
- dir
- accesskey
- tabindex
- shape
- coords
- href
- nohref
- alt
- target
b:
- style
- title
- lang
- "xml:lang"
- dir
bdo:
- style
- title
- lang
- "xml:lang"
- dir
big:
- style
- title
- lang
- "xml:lang"
- dir
blockquote:
- style
- title
- lang
- "xml:lang"
- dir
- cite
br:
- style
- title
- clear
caption:
- style
- title
- lang
- "xml:lang"
- dir
- align
center:
- style
- title
- lang
- "xml:lang"
- dir
cite:
- style
- title
- lang
- "xml:lang"
- dir
code:
- style
- title
- lang
- "xml:lang"
- dir
col:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
- span
- width
colgroup:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
- span
- width
dd:
- style
- title
- lang
- "xml:lang"
- dir
del:
- style
- title
- lang
- "xml:lang"
- dir
- cite
- datetime
dfn:
- style
- title
- lang
- "xml:lang"
- dir
div:
- style
- title
- lang
- "xml:lang"
- dir
- align
dl:
- style
- title
- lang
- "xml:lang"
- dir
- compact
dt:
- style
- title
- lang
- "xml:lang"
- dir
em:
- style
- title
- lang
- "xml:lang"
- dir
en-crypt:
- hint
- cipher
- length
en-media:
- style
- title
- lang
- "xml:lang"
- dir
- type
- hash
- height
- width
- usemap
- align
- border
- hspace
- vspace
- longdesc
- alt
en-todo:
- checked
font:
- style
- title
- lang
- "xml:lang"
- dir
- size
- color
- face
h1:
- style
- title
- lang
- "xml:lang"
- dir
- align
h2:
- style
- title
- lang
- "xml:lang"
- dir
- align
h3:
- style
- title
- lang
- "xml:lang"
- dir
- align
h4:
- style
- title
- lang
- "xml:lang"
- dir
- align
h5:
- style
- title
- lang
- "xml:lang"
- dir
- align
h6:
- style
- title
- lang
- "xml:lang"
- dir
- align
hr:
- style
- title
- lang
- "xml:lang"
- dir
- align
- noshade
- size
- width
i:
- style
- title
- lang
- "xml:lang"
- dir
img:
- style
- title
- lang
- "xml:lang"
- dir
- src
- alt
- name
- longdesc
- height
- width
- usemap
- ismap
- align
- border
- hspace
- vspace
ins:
- style
- title
- lang
- "xml:lang"
- dir
- cite
- datetime
kbd:
- style
- title
- lang
- "xml:lang"
- dir
li:
- style
- title
- lang
- "xml:lang"
- dir
- type
- value
map:
- lang
- "xml:lang"
- dir
- title
- name
ol:
- style
- title
- lang
- "xml:lang"
- dir
- type
- compact
- start
p:
- style
- title
- lang
- "xml:lang"
- dir
- align
pre:
- style
- title
- lang
- "xml:lang"
- dir
- width
q:
- style
- title
- lang
- "xml:lang"
- dir
- cite
s:
- style
- title
- lang
- "xml:lang"
- dir
samp:
- style
- title
- lang
- "xml:lang"
- dir
small:
- style
- title
- lang
- "xml:lang"
- dir
span:
- style
- title
- lang
- "xml:lang"
- dir
strike:
- style
- title
- lang
- "xml:lang"
- dir
strong:
- style
- title
- lang
- "xml:lang"
- dir
sub:
- style
- title
- lang
- "xml:lang"
- dir
sup:
- style
- title
- lang
- "xml:lang"
- dir
table:
- style
- title
- lang
- "xml:lang"
- dir
- summary
- width
- border
- cellspacing
- cellpadding
- align
- bgcolor
tbody:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
td:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
- abbr
- rowspan
- colspan
- nowrap
- bgcolor
- width
- height
tfoot:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
th:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
- abbr
- rowspan
- colspan
- nowrap
- bgcolor
- width
- height
thead:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
tr:
- style
- title
- lang
- "xml:lang"
- dir
- align
- char
- charoff
- valign
- bgcolor
tt:
- style
- title
- lang
- "xml:lang"
- dir
u:
- style
- title
- lang
- "xml:lang"
- dir
ul:
- style
- title
- lang
- "xml:lang"
- dir
- type
- compact
var:
- style
- title
- lang
- "xml:lang"
- dir
text:
disallowed_tags:
- applet
- base
- basefont
- bgsound
- blink
- body
- button
- dir
- embed
- fieldset
- form
- frame
- frameset
- head
- html
- iframe
- ilayer
- input
- isindex
- label
- layer
- legend
- link
- marquee
- menu
- meta
- noframes
- noscript
- object
- optgroup
- option
- param
- plaintext
- script
- select
- style
- textarea
- xml
valid_void_tags:
- area
- br
- col
- hr
- img
class EvernoteScrubber < Loofah::Scrubber
@@dtd = YAML.load_file Rails.root.join('lib','assets','enml2.yml')
def initialize
@direction = :bottom_up
end
def scrub(node)
# Remove all disallowed tags
if @@dtd['disallowed_tags'].include? node.name
node.remove
Loofah::Scrubber::STOP # don't bother with the rest of the subtree
end
# This is lifted from the strip scrubber in
# Loofah. It replaces the node with its contents
unless @@dtd['allowed_tags'].keys.include? node.name
node.before node.children
node.remove
stripped = true
end
# Remove all disallowed attributes
unless stripped
valid_attributes = @@dtd['allowed_tags'][node.name]
node.attribute_nodes.each do |attr_node|
# Get the attribute name if we have a namespace
attr_name = if attr_node.namespace
"#{attr_node.namespace.prefix}:#{attr_node.node_name}"
else
attr_node.node_name
end
unless valid_attributes.include? attr_name
attr_node.remove
end
end
end
# Remove any empty tags
# This basically makes sure that a tag is considered empty iff
# 1. Its inner text is blank
# 2. It has no children
# 3. Its not a valid void tag (like br,hr)
if node.inner_text == '' and @@dtd['valid_void_tags'].exclude? node.name and node.children.size == 0
node.remove
Loofah::Scrubber::STOP
end
end
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment