@jackrusher
Last active August 29, 2015 14:00
This drives me crazy, and I see something like it in almost every literary document corpus:

<blockquote>
  <i>
    <q data-eblatype="startmarker" data-eblasegid="38073">[</q>
      Act I, Scene 3
    <q data-eblatype="endmarker" data-eblasegid="38073">]</q>
  </i>
</blockquote>

The intent here is obvious: they're trying to mark parts of the text with an ID, but instead of just using XML — which is already hylomorphic — to wrap a single tag around the intended inclusion, like this:

<blockquote>
  <i data-eblasegid="38073">
    <q>[</q>Act I, Scene 3<q>]</q>
  </i>
</blockquote>

... they insert pairs of start/end tags, which means the consumer of the data must implement a little parser to put the text into buckets by ID. Also, why the square brackets at all?
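For illustration, here's a minimal sketch of the little parser those start/end markers force every consumer to write: walk a flat token stream, track which segment IDs are currently open, and append each text run to every open bucket. The token shapes are hypothetical stand-ins for the parsed "q" markers and text nodes, not enlive output:

```clojure
;; Sketch only: tokens are hypothetical, pre-parsed stand-ins for the
;; <q data-eblatype="startmarker"> / "endmarker" elements and text nodes.
(defn bucket-by-id [tokens]
  (:buckets
   (reduce (fn [{:keys [open] :as state} tok]
             (case (:type tok)
               :start (update state :open conj (:id tok))
               :end   (update state :open disj (:id tok))
               :text  (update state :buckets
                              (fn [b]
                                ;; append this text run to every open segment
                                (reduce #(update %1 %2 (fnil str "") (:text tok))
                                        b open)))))
           {:open #{} :buckets {}}
           tokens)))

(bucket-by-id
 [{:type :start :id 38073}
  {:type :text :text "Act I, Scene 3"}
  {:type :end :id 38073}])
;; => {38073 "Act I, Scene 3"}
```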

(def othello
  (enlive/xml-resource (java.net.URL. "file:/Users/jack/tmp/othello.xml")))
;; select every document, count how many documents are in there
(count (enlive/select othello [:eblacorpus :documents :document]))
;; => 39
;; get the names of the documents
(map #(-> % :attrs :name)
     (enlive/select othello [:eblacorpus :documents :document]))
;; => ("" "Baudissin (edited by Bab and Levy)" "Baudissin (edited by Brunner)" "Baudissin (edited by Mommsen)" "Baudissin (edited by Wenig)" "Baudissin (edited by Wolff)" "Benda (1826)" "Bodenstedt" "Boito (translated by Felsenstein and Stueber)" "Bolte and Hamblock" "Buhss" "Bärfuß" "Engel" "Engler" "Eschenburg (edited by Eckert)" "Flatter" "Fried" "Gildemeister" "Gundolf" "Günther" "Karbus" "Laube" "Lauterbach and Gleisberg" "Leonard" "Motschach" "Ortlepp" "Rothe" "Rüdiger" "Schaller" "Schiller and Voss" "Schröder" "Schwarz" "Swaczynna" "Vischer" "Wachsmann" "Wieland" "Zaimoglu and Senkel" "Zeynek" "Zimmer")
;; how many of these labeled chunks are there?
(count (map #(-> % :attrs :data-eblasegid)
            (enlive/select othello [(enlive/attr? :data-eblasegid)])))
;; => 11520
;; hm, they occur in start/end pairs, how many distinct ones?
(count (distinct (map #(-> % :attrs :data-eblasegid)
                      (enlive/select othello [(enlive/attr? :data-eblasegid)]))))
;; => 5760 over the whole corpus
;; in the first document?
(def first-document (first (enlive/select othello [:eblacorpus :documents :document])))
(count (distinct (enlive/select first-document [(enlive/attr? :data-eblasegid)])))
;; => 320
;; just pull out all the text and use the positional segment data to
;; look up chunks
(def text-content
  (clojure.string/replace
   (apply str (map enlive/text (enlive/select first-document [:doccontent])))
   #"[\[\]]" ""))
;; get the english segment definitions
(def english-segments
  (map
   #(merge (reduce (fn [a [k v]] (assoc a k (read-string v))) {} (% :attrs))
           (-> % :content second :content second :attrs))
   (enlive/select first-document [:segmentdefinition])))
(take 5 (drop 5 (sort-by :startpos english-segments)))
;; => ({:attribname "type", :attribval "Speech", :id 38077, :startpos 179, :length 76} {:attribname "type", :attribval "Speech", :id 38078, :startpos 269, :length 30} {:attribname "type", :attribval "Speech", :id 38079, :startpos 313, :length 200} {:attribname "type", :attribval "Speech", :id 38561, :startpos 471, :length 42} {:attribname "type", :attribval "Speech", :id 38080, :startpos 527, :length 124})
;; combine segment and text data
(map #(vector (subs text-content (% :startpos) (+ (% :startpos) (% :length)))
              (keyword (.toLowerCase (% :attribval))))
     (take 15 (filter :attribval (sort-by :startpos english-segments))))
;; => (["Act I, Scene 3" :s.d.] ["A council-chamber." :s.d.] ["The DUKE and Senators sitting at a table; Officers attending" :s.d.] ["There is no composition in these newsThat gives them credit." :speech] ["Indeed, they are disproportioned;My letters say a hundred and seven galleys." :speech] ["And mine, a hundred and forty." :speech] ["And mine, two hundred:But though they jump not on a just account, -As in these cases, where the aim reports,'Tis oft with difference - yet do they all confirmA Turkish fleet, and bearing up to Cyprus." :speech] ["A Turkish fleet, and bearing up to Cyprus." :speech] ["Nay, it is possible enough to judgment:I do not so secure me in the error,But the main article I do approveIn fearful sense." :speech] ["Within " :s.d.] ["What, ho! what, ho! what, ho!" :speech] ["A messenger from the galleys. " :speech] ["Enter a Sailor" :s.d.] ["Now, what's the business?" :speech] ["The Turkish preparation makes for Rhodes;So was I bid report here to the stateBy signior Angelo." :speech])
@stphnthiel

the markers are def. not necessary. some other decisions on the format are outlined here. most importantly, segments (the entities mapped by id) can potentially self-overlap, which is in conflict with XML’s specs afaik.

@jackrusher
Author

If they need overlap they'd be better off numbering all the words in the corpus and expressing segments as ranges.
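A hypothetical Clojure sketch of that range scheme: number the words once, and a segment is just a [start end] pair of indices over the corpus, so overlapping segments cost nothing and require no inline markup. The sample sentence is the first speech from the excerpt above:

```clojure
(require '[clojure.string :as str])

;; Word-numbered corpus: each word's index is its stable address.
(def words (vec (str/split "There is no composition in these news" #"\s+")))

;; A segment is an inclusive [start end] range of word indices.
(defn segment-text [[start end]]
  (str/join " " (subvec words start (inc end))))

(segment-text [0 3]) ;; => "There is no composition"
(segment-text [2 6]) ;; => "no composition in these news" (overlaps [0 3] freely)
```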

@stphnthiel

which would be WordHoard’s way of doing this, but sub-word segments need to be possible as well, hence the decision to define segments on a character level within the html/xml doc. it’s tricky

@ftrain

ftrain commented May 5, 2014

Some total guesses, just for kicks.

  • The Text Encoding Initiative formats got locked down in the 1990s based on 1980s ideas/hardware/software/programming.
  • A lot of the output is/was from automated super-custom editing tools that, like, used Motif interfaces. This kind of granularity has a weird 90s flavor to me.
  • Prior to, like, now, the typical digital etext use case has been "transformation via pipeline into a singular, often non-digital entity" like a critical edition, and given the sorts of people who work on critical editions, redundancy is seen as a feature.
  • Thinking back to the early SGML parsers, this sort of thing could maybe allow for you to build less of a stack as you do a parsing run. One-to-one rules per tag--useful if your tooling was some proprietary typesetting framework (FWIW I remember nroff was a target in this world as well). I remember the output from the nsgmls parser, it was all a stream of events and up to you to build a tree. With SGML and the early SAX parsers, it was kind of in your best interest to avoid building a tree, labor-wise. XSL was actually kind of revolutionary giving you the whole tree to transform (I can hear you shaking your head from Berlin).
  • I don't remember anyone ever talking about dynamically exploring historical texts in the way that we work now. Filtering them into databases, yes. Running analysis over them. Exploring them in memory as sexps, no. So it's sort of like, in the 1980s they decided to mate SGML encoding to Unix pipeline-style development somehow assuming the child would be LISP? They got the lists, just not the processing.

@jackrusher
Author

To be clear, the character-by-character aspect doesn't bother me at all, especially in cases where the data comes from OCR and one wants to be able to trace the character back to a source with a bounding box in an original raster image. It was the practice of inserting those inline "q" markers and square brackets, which only make interacting with the data more difficult. Not terribly difficult, of course, as I've just included an excerpt of my livecoding exploration of this document in Clojure to demonstrate.

@kftrans

kftrans commented May 12, 2014

So, 'wrap a single tag around the intended inclusion' ... I'd quite agree if the text itself has no tags, but what if you're trying to demarcate a span of text in a document that is, say, HTML, like in your example? Suppose the span to be demarcated starts in the middle of one 'blockquote' element, and ends in the next? Depending on the purpose of the demarcation, you could encode the whole HTML document (&amp;, &lt;, etc.) so its structure is completely opaque (though that definitely wasn't an option for the project you're commenting on) - but what if you need to define spans that overlap? XML won't allow that with a single tag pair for each.

You've talked about numbering all the words in the corpus and expressing the data as ranges. That's exactly what the app in question does, except it does so for characters rather than words, and that information is in the corpus export file, as you've found. Internally, everything's manipulated like that, not by adding tags.

But sometimes, offline users want to edit the texts, add missed words, correct spellings, etc., without losing the valuable segmentation and alignment information that's been accrued. How to enable them to do that? By providing a document with visible segment 'markers', so they can make whatever changes they want in an XML or even WYSIWYG HTML editor, then upload the revised document for the app to parse, so it can update the segment position info as well as the document content.

Of course, if someone got in touch and asked us (nicely) for an export without those markers, we'd be happy to provide one, though as you say, they're really not that much of a problem ...
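That round trip (strip the visible markers while recording where they fell in the marker-free text) can be sketched in Clojure. This is a hedged sketch, not the app's actual code, and the ⟨id⟩ … ⟨/id⟩ marker syntax is invented for the example in place of the "q" elements shown above:

```clojure
;; Scan a marked-up string, recording the character offset (in the clean,
;; marker-free text) at which each marker pair opens and closes, and emit
;; segments as {:id n :startpos p :length l}. Marker syntax is hypothetical.
(defn reparse-segments [s]
  (loop [chunks (re-seq #"⟨/?\d+⟩|[^⟨]+" s)
         pos 0, text "", open {}, segs []]
    (if-let [c (first chunks)]
      (if-let [[_ id] (re-matches #"⟨/(\d+)⟩" c)]           ; closing marker
        (let [id (Long/parseLong id), start (get open id)]
          (recur (rest chunks) pos text (dissoc open id)
                 (conj segs {:id id :startpos start :length (- pos start)})))
        (if-let [[_ id] (re-matches #"⟨(\d+)⟩" c)]          ; opening marker
          (recur (rest chunks) pos text
                 (assoc open (Long/parseLong id) pos) segs)
          (recur (rest chunks) (+ pos (count c))            ; plain text run
                 (str text c) open segs)))
      {:text text :segments segs})))

(reparse-segments "⟨38073⟩Act I, Scene 3⟨/38073⟩")
;; => {:text "Act I, Scene 3", :segments [{:id 38073, :startpos 0, :length 14}]}
```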
