@bocajnotnef
Last active March 28, 2024 08:17
scripts for converting the 1993 annotated RTF files of "A Fire Upon The Deep" to HTML

source material

You'll need the download of the Hugo 1993 files, available from here. If that link dies, scour the original Hacker News post to see if anyone has posted a mirror or other helpful files.

deps

mac

  • libreoffice installed to the Applications/ dir
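  • xmllint (should already be there; macOS ships it)
  • coreutils from brew (e.g. brew install coreutils), for ghead
  • wget, which pipeline.sh uses to fetch the stylesheet (e.g. brew install wget)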
  • python3
    • bs4 from pip (e.g. pip3 install bs4)
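
A quick way to check the Mac list before running anything, sketched in Python. The tool names (xmllint, ghead, wget) and the soffice path come from pipeline.sh; missing_tools is a name made up for this example.

```python
import os
import shutil

# Command-line tools that pipeline.sh calls on a Mac.
REQUIRED_TOOLS = ["xmllint", "ghead", "wget"]
# Default value of LIBRE_OFFICE_APPLICATION_PATH in pipeline.sh.
SOFFICE_PATH = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that `which` cannot find on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing: {tool}")
    if not os.path.isfile(SOFFICE_PATH):
        print(f"missing: {SOFFICE_PATH}")
```

If it prints nothing, the command-line side of the list is covered; bs4 still has to be installed separately.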

linux

  • modify the LIBRE_OFFICE_APPLICATION_PATH in pipeline.sh to point to wherever your soffice binary is
  • modify the ghead call in pipeline.sh to plain head
  • xmllint usually comes with your distro's libxml2 packages (libxml2-utils on Debian/Ubuntu); it's not strictly necessary, since you could modify my beautifulsoup stuff to do what I use xmllint for. I'm just lazy and don't feel like doing that myself
  • python3 and bs4, as above
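
For anyone who'd rather skip xmllint: a minimal sketch of doing the body-stripping step with BeautifulSoup instead. body_contents is a hypothetical helper, not something in process_html.py, and its output has not been compared against what the xmllint pipeline produces for the real chapter files.

```python
from bs4 import BeautifulSoup


def body_contents(html_text):
    """Return the markup inside <body>...</body>, without the body tag itself."""
    soup = BeautifulSoup(html_text, "html.parser")
    if soup.body is None:
        # No <body> tag: treat the input as already stripped.
        return html_text
    return soup.body.decode_contents()


if __name__ == "__main__":
    page = "<html><head><title>t</title></head><body><p>hi</p></body></html>"
    print(body_contents(page))
```

Calling this on each converted chapter and writing the result back would stand in for the xmllint | sed | tail | ghead line in pipeline.sh.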

how do

  • download and unzip the hugo stuff
  • navigate to hugo-nebula anthology 1993/hugo/novel/vinge
  • download pipeline.sh and process_html.py to that directory
    • wget whatever-the-raw-link-of-this-gist-is
  • make 'em executable
    • chmod +x pipeline.sh && chmod +x process_html.py
  • run pipeline.sh
  • open a_fire_upon_the_deep_annotated.html
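
pipeline.sh expects 43 chapter files, c00b through c42b (that's the seq -w 0 42 loop). A sketch for checking they're all present before running it; the .rtf input names are an assumption based on the converted names the script looks for, and missing_chapters is made up for this example.

```python
from pathlib import Path


def chapter_names(suffix):
    """Names pipeline.sh loops over: c00b ... c42b, zero-padded like `seq -w 0 42`."""
    return [f"c{num:02d}b.{suffix}" for num in range(43)]


def missing_chapters(directory, suffix="rtf"):
    """Return the expected chapter files that are absent from `directory`."""
    return [name for name in chapter_names(suffix)
            if not (Path(directory) / name).is_file()]


if __name__ == "__main__":
    for name in missing_chapters("."):
        print(f"missing: {name}")
```

Run it from the vinge directory; passing suffix="html" and the rtf_to_html directory checks the converted files instead.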

notes

  • I personally recommend modifying the css file that's downloaded just a teensy bit. Make the body section look like this:
body {
    width: 87.5%;
    /* margin-left: 12.5%; */
    /* margin-right: auto; */
    padding-left: 12.5%;
    /* padding-right: 12.5%; */
    font-family: et-book, Palatino, "Palatino Linotype", "Palatino LT STD", "Book Antiqua", Georgia, serif;
    background-color: #fffff8;
    color: #111;
    max-width: 50%;
    counter-reset: sidenote-counter;
}

and make the .sidenote, .marginnote section look like this:

.sidenote,
.marginnote {
    float: right;
    clear: right;
    margin-right: -60%;
    width: 50%;
    margin-top: 0.3rem;
    margin-bottom: 0;
    padding: 1rem;
    font-size: 1.1rem;
    line-height: 1.0;
    vertical-align: baseline;
    position: relative;
    background: lightgray;
}
pipeline.sh

#!/usr/bin/env bash
# this probably should have been a makefile. oh well.

# consts
STYLESHEET_LOCAL_PATH="tufte.css"
STYLESHEET_DOWNLOAD_PATH="https://raw.githubusercontent.com/edwardtufte/tufte-css/gh-pages/tufte.css"
# used to automate conversion from RTF to HTML
LIBRE_OFFICE_APPLICATION_PATH="/Applications/LibreOffice.app/Contents/MacOS/soffice"

# will delete and remake the RTF conversion directory
function convertRtfFilesToHtml {
    rm -rf ./rtf_to_html
    # convert RTF files to HTML
    mkdir -p ./rtf_to_html
    "${LIBRE_OFFICE_APPLICATION_PATH}" --convert-to html --outdir ./rtf_to_html *.rtf
}

# if CSS stylesheet doesn't exist, get it
if test -f "${STYLESHEET_LOCAL_PATH}"; then
    echo "CSS stylesheet already downloaded."
else
    echo "Downloading stylesheet to ${STYLESHEET_LOCAL_PATH}"
    wget "${STYLESHEET_DOWNLOAD_PATH}" -O "${STYLESHEET_LOCAL_PATH}"
fi

if [ ! -d "./rtf_to_html" ]; then
    echo "RTF conversion directory does not exist. Will convert RTF files."
    echo "This may take a minute."
    convertRtfFilesToHtml
fi

echo "Validating RTF conversion (checking all files exist)"
for chapter_num in $(seq -w 0 42); do
    if ! test -f "./rtf_to_html/c${chapter_num}b.html"; then
        echo "Could not find chapter 'c${chapter_num}b.html'. Aborting check, reconverting."
        convertRtfFilesToHtml
        # one reconversion regenerates every chapter, so stop checking
        break
    fi
done

echo "Stripping everything but the body **contents** in each file."
for chapter_num in $(seq -w 0 42); do
    ## COMPATIBILITY NOTES:
    # If on a mac, `brew install coreutils` to get ghead
    # if on something sane, just edit ghead to be head
    # I have no idea where xmllint comes from. You could probably hack around all this with beautifulsoup, but I was knee deep in this and lazy, so I stuck with what I had
    echo "cat //html/body" | xmllint --html --shell "./rtf_to_html/c${chapter_num}b.html" | sed '/^\/ >/d' | tail -n +2 | ghead -n -1 > ./rtf_to_html/working.txt
    mv ./rtf_to_html/working.txt "./rtf_to_html/c${chapter_num}b.html"
done

echo "Reformatting HTML to something sane..."
rm -rf assembly
mkdir -p ./assembly
echo '<html><head><link rel="stylesheet" href="tufte.css"/></head><body>' > a_fire_upon_the_deep_annotated.html
for chapter_num in $(seq -w 0 42); do
    ./process_html.py --prefix "${chapter_num}" --inputFile "./rtf_to_html/c${chapter_num}b.html" --outputFile "./assembly/c${chapter_num}.html"
    cat "./assembly/c${chapter_num}.html" >> a_fire_upon_the_deep_annotated.html
done
echo "</body></html>" >> a_fire_upon_the_deep_annotated.html

process_html.py

#!/usr/bin/env python3
import argparse

from bs4 import BeautifulSoup
from bs4.element import NavigableString

parser = argparse.ArgumentParser()
parser.add_argument("--inputFile", required=True)
parser.add_argument("--outputFile", required=True)
parser.add_argument("--prefix", required=True, help="prefix to add to footnote IDs to avoid collisions")
args = parser.parse_args()

page_flush_header = BeautifulSoup('<p class="P1"><span class="T1">.</span><span class="T2">Delete this paragraph to shift page flush </span></p>', "html.parser")

cleaned_body_text = []
sidenotes = dict()
curr_sidenote = None

# clean up junk & parse out footnotes
with open(args.inputFile, "r") as the_file:
    soup = BeautifulSoup(the_file.read(), "html.parser")

for child in soup.children:
    if type(child) is NavigableString:
        # newline from file
        continue
    elif child.get_text() == page_flush_header.get_text():
        # drop the thing
        continue
    elif "Footnote" in child["class"]:
        # drop
        continue
    elif "Standard" in child["class"]:
        if "footnodeNumber" in list(child.children)[0]["class"]:
            # this is the start of a footnote
            anchor_id = f"{list(list(child.children)[0].children)[0]['id']}"
            new_sidenote_tag = soup.new_tag("span")
            new_sidenote_tag["class"] = "sidenote"
            s = ""
            for sub_child in list(child.children)[1:]:
                s += sub_child.get_text()
            new_sidenote_tag.append(s)
            new_sidenote_tag.append(soup.new_tag("br"))
            sidenotes[anchor_id] = new_sidenote_tag
            curr_sidenote = new_sidenote_tag
        else:
            # continuation paragraph of the footnote currently being built
            s = ""
            for sub_child in child.children:
                s += sub_child.get_text()
            curr_sidenote.append(s)
            curr_sidenote.append(soup.new_tag("br"))
    else:
        cleaned_body_text.append(child)

# iterate over cleaned stuff, filtering out anchors & replacing with sidenote spans instead
new_soup = BeautifulSoup('', "html.parser")
for element in cleaned_body_text:
    new_soup.append(element)

# unwrap spans whose class attribute is empty
for span in new_soup.find_all('span'):
    if span["class"] == []:
        span.replace_with(list(span.children)[0])

for anchor in new_soup.find_all('a'):
    # new span to contain goodies
    new_span = new_soup.new_tag('span')
    new_label = new_soup.new_tag("label")
    new_label["for"] = f"{args.prefix}{anchor['href'][1:]}"
    new_label["class"] = "margin-toggle sidenote-number"
    new_input = new_soup.new_tag("input")
    new_input["type"] = "checkbox"
    new_input["id"] = f"{args.prefix}{anchor['href'][1:]}"
    new_input["class"] = "margin-toggle"
    new_span.append(new_label)
    new_span.append(new_input)
    new_span.append(sidenotes[anchor["href"][1:]])
    anchor.parent.replace_with(new_span)

for div in new_soup.find_all('div'):
    # TODO: more directly translate the formatting here, rather than just stripping it
    # div P4s (and P7s?) seem to be section breaks--consider replacing with hrules
    # p P6s may be chapter breaks (chapter headers don't seem to be formatted at all)
    # div (P8s, P9s, P12s) and p (P8s, P9s, P12s) look to be message blocks, but end with some P3s of normal text (and P3 is shared by normal book text too)
    # (looks like they contain span T7s that are unique, though--possible to key off those? use one font for message header and another for message body?)
    div_classes_to_strip = ["P2", "P3", "P5", "P8", "P9", "P12"]
    divs_that_may_be_hrules = ["P4", "P7"]
    if any([x in div["class"] for x in div_classes_to_strip]):
        div.name = "p"
    elif any([x in div["class"] for x in divs_that_may_be_hrules]):
        div.append(new_soup.new_tag("hr"))

with open(args.outputFile, "w") as ofile:
    ofile.write(str(new_soup))
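
For reference, this is roughly the markup each footnote anchor gets turned into, following the label / checkbox / span pattern that Tufte CSS uses for sidenotes. A standalone sketch using only the standard library; sidenote_markup and the ftn1 id are illustrative, not taken from the chapter files.

```python
from html import escape


def sidenote_markup(prefix, anchor_href, note_text):
    """Build the label + checkbox + sidenote span for one footnote anchor."""
    # Drop the leading '#' and prepend the chapter prefix so ids stay unique
    # once all chapters are concatenated into one page.
    note_id = escape(f"{prefix}{anchor_href[1:]}", quote=True)
    return (
        f'<span><label class="margin-toggle sidenote-number" for="{note_id}"></label>'
        f'<input class="margin-toggle" id="{note_id}" type="checkbox"/>'
        f'<span class="sidenote">{escape(note_text)}<br/></span></span>'
    )


if __name__ == "__main__":
    print(sidenote_markup("07", "#ftn1", "A hypothetical annotation."))
```

The prefix is what keeps two chapters that both contain an anchor with the same href from producing the same id in the assembled page.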

tufte.css

@charset "UTF-8";
/* Import ET Book styles
adapted from https://github.com/edwardtufte/et-book/blob/gh-pages/et-book.css */
@font-face {
font-family: "et-book";
src: url("et-book/et-book-roman-line-figures/et-book-roman-line-figures.eot");
src: url("et-book/et-book-roman-line-figures/et-book-roman-line-figures.eot?#iefix") format("embedded-opentype"), url("et-book/et-book-roman-line-figures/et-book-roman-line-figures.woff") format("woff"), url("et-book/et-book-roman-line-figures/et-book-roman-line-figures.ttf") format("truetype"), url("et-book/et-book-roman-line-figures/et-book-roman-line-figures.svg#etbookromanosf") format("svg");
font-weight: normal;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: "et-book";
src: url("et-book/et-book-display-italic-old-style-figures/et-book-display-italic-old-style-figures.eot");
src: url("et-book/et-book-display-italic-old-style-figures/et-book-display-italic-old-style-figures.eot?#iefix") format("embedded-opentype"), url("et-book/et-book-display-italic-old-style-figures/et-book-display-italic-old-style-figures.woff") format("woff"), url("et-book/et-book-display-italic-old-style-figures/et-book-display-italic-old-style-figures.ttf") format("truetype"), url("et-book/et-book-display-italic-old-style-figures/et-book-display-italic-old-style-figures.svg#etbookromanosf") format("svg");
font-weight: normal;
font-style: italic;
font-display: swap;
}
@font-face {
font-family: "et-book";
src: url("et-book/et-book-bold-line-figures/et-book-bold-line-figures.eot");
src: url("et-book/et-book-bold-line-figures/et-book-bold-line-figures.eot?#iefix") format("embedded-opentype"), url("et-book/et-book-bold-line-figures/et-book-bold-line-figures.woff") format("woff"), url("et-book/et-book-bold-line-figures/et-book-bold-line-figures.ttf") format("truetype"), url("et-book/et-book-bold-line-figures/et-book-bold-line-figures.svg#etbookromanosf") format("svg");
font-weight: bold;
font-style: normal;
font-display: swap;
}
@font-face {
font-family: "et-book-roman-old-style";
src: url("et-book/et-book-roman-old-style-figures/et-book-roman-old-style-figures.eot");
src: url("et-book/et-book-roman-old-style-figures/et-book-roman-old-style-figures.eot?#iefix") format("embedded-opentype"), url("et-book/et-book-roman-old-style-figures/et-book-roman-old-style-figures.woff") format("woff"), url("et-book/et-book-roman-old-style-figures/et-book-roman-old-style-figures.ttf") format("truetype"), url("et-book/et-book-roman-old-style-figures/et-book-roman-old-style-figures.svg#etbookromanosf") format("svg");
font-weight: normal;
font-style: normal;
font-display: swap;
}
/* Styles specific to the libreoffice conversion */
.T7 {
font-family: monospace;
}
/* Tufte CSS styles */
html {
font-size: 15px;
}
body {
width: 87.5%;
/* margin-left: 12.5%; */
/* margin-right: auto; */
padding-left: 12.5%;
/* padding-right: 12.5%; */
font-family: et-book, Palatino, "Palatino Linotype", "Palatino LT STD", "Book Antiqua", Georgia, serif;
background-color: #fffff8;
color: #111;
max-width: 50%;
counter-reset: sidenote-counter;
}
h1 {
font-weight: 400;
margin-top: 4rem;
margin-bottom: 1.5rem;
font-size: 3.2rem;
line-height: 1;
}
h2 {
font-style: italic;
font-weight: 400;
margin-top: 2.1rem;
margin-bottom: 1.4rem;
font-size: 2.2rem;
line-height: 1;
}
h3 {
font-style: italic;
font-weight: 400;
font-size: 1.7rem;
margin-top: 2rem;
margin-bottom: 1.4rem;
line-height: 1;
}
hr {
display: block;
height: 1px;
width: 55%;
border: 0;
border-top: 1px solid #ccc;
margin: 1em 0;
padding: 0;
}
p.subtitle {
font-style: italic;
margin-top: 1rem;
margin-bottom: 1rem;
font-size: 1.8rem;
display: block;
line-height: 1;
}
.numeral {
font-family: et-book-roman-old-style;
}
.danger {
color: red;
}
article {
padding: 5rem 0rem;
}
section {
padding-top: 1rem;
padding-bottom: 1rem;
}
p,
dl,
ol,
ul {
font-size: 1.4rem;
line-height: 2rem;
}
p {
margin-top: 1.4rem;
margin-bottom: 1.4rem;
padding-right: 0;
vertical-align: baseline;
}
/* Chapter Epigraphs */
div.epigraph {
margin: 5em 0;
}
div.epigraph > blockquote {
margin-top: 3em;
margin-bottom: 3em;
}
div.epigraph > blockquote,
div.epigraph > blockquote > p {
font-style: italic;
}
div.epigraph > blockquote > footer {
font-style: normal;
}
div.epigraph > blockquote > footer > cite {
font-style: italic;
}
/* end chapter epigraphs styles */
blockquote {
font-size: 1.4rem;
}
blockquote p {
width: 55%;
margin-right: 40px;
}
blockquote footer {
width: 55%;
font-size: 1.1rem;
text-align: right;
}
section > p,
section > footer,
section > table {
width: 55%;
}
/* 50 + 5 == 55, to be the same width as paragraph */
section > dl,
section > ol,
section > ul {
width: 50%;
-webkit-padding-start: 5%;
}
dt:not(:first-child),
li:not(:first-child) {
margin-top: 0.25rem;
}
figure {
padding: 0;
border: 0;
font-size: 100%;
font: inherit;
vertical-align: baseline;
max-width: 55%;
-webkit-margin-start: 0;
-webkit-margin-end: 0;
margin: 0 0 3em 0;
}
figcaption {
float: right;
clear: right;
margin-top: 0;
margin-bottom: 0;
font-size: 1.1rem;
line-height: 1.6;
vertical-align: baseline;
position: relative;
max-width: 40%;
}
figure.fullwidth figcaption {
margin-right: 24%;
}
/* Links: replicate underline that clears descenders */
a:link,
a:visited {
color: inherit;
}
.no-tufte-underline:link {
background: unset;
text-shadow: unset;
}
a:link, .tufte-underline, .hover-tufte-underline:hover {
text-decoration: none;
background: -webkit-linear-gradient(#fffff8, #fffff8), -webkit-linear-gradient(#fffff8, #fffff8), -webkit-linear-gradient(currentColor, currentColor);
background: linear-gradient(#fffff8, #fffff8), linear-gradient(#fffff8, #fffff8), linear-gradient(currentColor, currentColor);
-webkit-background-size: 0.05em 1px, 0.05em 1px, 1px 1px;
-moz-background-size: 0.05em 1px, 0.05em 1px, 1px 1px;
background-size: 0.05em 1px, 0.05em 1px, 1px 1px;
background-repeat: no-repeat, no-repeat, repeat-x;
text-shadow: 0.03em 0 #fffff8, -0.03em 0 #fffff8, 0 0.03em #fffff8, 0 -0.03em #fffff8, 0.06em 0 #fffff8, -0.06em 0 #fffff8, 0.09em 0 #fffff8, -0.09em 0 #fffff8, 0.12em 0 #fffff8, -0.12em 0 #fffff8, 0.15em 0 #fffff8, -0.15em 0 #fffff8;
background-position: 0% 93%, 100% 93%, 0% 93%;
}
@media screen and (-webkit-min-device-pixel-ratio: 0) {
a:link, .tufte-underline, .hover-tufte-underline:hover {
background-position-y: 87%, 87%, 87%;
}
}
a:link::selection,
a:link::-moz-selection {
text-shadow: 0.03em 0 #b4d5fe, -0.03em 0 #b4d5fe, 0 0.03em #b4d5fe, 0 -0.03em #b4d5fe, 0.06em 0 #b4d5fe, -0.06em 0 #b4d5fe, 0.09em 0 #b4d5fe, -0.09em 0 #b4d5fe, 0.12em 0 #b4d5fe, -0.12em 0 #b4d5fe, 0.15em 0 #b4d5fe, -0.15em 0 #b4d5fe;
background: #b4d5fe;
}
/* Sidenotes, margin notes, figures, captions */
img {
max-width: 100%;
}
.sidenote,
.marginnote {
float: right;
clear: right;
margin-right: -60%;
width: 50%;
margin-top: 0.3rem;
margin-bottom: 0;
padding: 1rem;
font-size: 1.1rem;
line-height: 1.0;
vertical-align: baseline;
position: relative;
background: lightgray;
}
.sidenote-number {
counter-increment: sidenote-counter;
}
.sidenote-number:after,
.sidenote:before {
font-family: et-book-roman-old-style;
position: relative;
vertical-align: baseline;
}
.sidenote-number:after {
content: counter(sidenote-counter);
font-size: 1rem;
top: -0.5rem;
left: 0.1rem;
}
.sidenote:before {
content: counter(sidenote-counter) " ";
font-size: 1rem;
top: -0.5rem;
}
blockquote .sidenote,
blockquote .marginnote {
margin-right: -82%;
min-width: 59%;
text-align: left;
}
div.fullwidth,
table.fullwidth {
width: 100%;
}
div.table-wrapper {
overflow-x: auto;
font-family: "Trebuchet MS", "Gill Sans", "Gill Sans MT", sans-serif;
}
.sans {
font-family: "Gill Sans", "Gill Sans MT", Calibri, sans-serif;
letter-spacing: .03em;
}
code, pre > code {
font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace;
font-size: 1.0rem;
line-height: 1.42;
-webkit-text-size-adjust: 100%; /* Prevent adjustments of font size after orientation changes in iOS. See https://github.com/edwardtufte/tufte-css/issues/81#issuecomment-261953409 */
}
.sans > code {
font-size: 1.2rem;
}
h1 > code,
h2 > code,
h3 > code {
font-size: 0.80em;
}
.marginnote > code,
.sidenote > code {
font-size: 1rem;
}
pre > code {
font-size: 0.9rem;
width: 52.5%;
margin-left: 2.5%;
overflow-x: auto;
display: block;
}
pre.fullwidth > code {
width: 90%;
}
.fullwidth {
max-width: 90%;
clear:both;
}
span.newthought {
font-variant: small-caps;
font-size: 1.2em;
}
input.margin-toggle {
display: none;
}
label.sidenote-number {
display: inline;
}
label.margin-toggle:not(.sidenote-number) {
display: none;
}
.iframe-wrapper {
position: relative;
padding-bottom: 56.25%; /* 16:9 */
padding-top: 25px;
height: 0;
}
.iframe-wrapper iframe {
position: absolute;
top: 0;
left: 0;
width: 100%;
height: 100%;
}
@media (max-width: 760px) {
body {
width: 84%;
padding-left: 8%;
padding-right: 8%;
}
hr,
section > p,
section > footer,
section > table {
width: 100%;
}
pre > code {
width: 97%;
}
section > dl,
section > ol,
section > ul {
width: 90%;
}
figure {
max-width: 90%;
}
figcaption,
figure.fullwidth figcaption {
margin-right: 0%;
max-width: none;
}
blockquote {
margin-left: 1.5em;
margin-right: 0em;
}
blockquote p,
blockquote footer {
width: 100%;
}
label.margin-toggle:not(.sidenote-number) {
display: inline;
}
.sidenote,
.marginnote {
display: none;
}
.margin-toggle:checked + .sidenote,
.margin-toggle:checked + .marginnote {
display: block;
float: left;
left: 1rem;
clear: both;
width: 95%;
margin: 1rem 2.5%;
vertical-align: baseline;
position: relative;
}
label {
cursor: pointer;
}
div.table-wrapper,
table {
width: 85%;
}
img {
width: 100%;
}
}