@ThiagoLira
Created April 5, 2021 23:35

Have you ever thought about all the relations between all the books you've read? Whether the authors of your favorite books have read one another, or even whether they have read the same people from before their time? Which books were read by basically everyone?

Being able to do projects like this is, for me, what justifies some of the hype around Deep Learning over the last couple of years. I had an idea about something I find interesting and care about, and I could prototype it and quickly have something to show for it. It is not perfect, but it works.

*I've trained a fine-tuned version of RoBERTa to detect free-form citations (i.e. not just academic citations) between books and generated a nice big directed graph with hundreds of books and their relations. All my Kindle library and more!*

You can check the graph of citations made with my Kindle library and augmented with hundreds more books from Goodreads here!

So, what am I calling citations in "free-form"? Well, any time an author mentions another book in the text. For example, from Neil Gaiman's The View from the Cheap Seats:

"Children listened to them and enjoyed them, but children were not the primary audience, no more than they were the intended audience of Beowulf, or The Odyssey."

Why citations in "free-form"? Because bibliographical/academic citations are absent from older books and even from some new ones. Depending on the type of book, the author doesn't need to build a giant list of citations with precise bibliographical information at the end. And with Deep Learning and enough data we might very well build a model that works on both kinds of citations and be done with it (at the end of the day it's all text/data).

How this idea came to be

I love to read books about books, e.g. Bertrand Russell's History of Western Philosophy and Italo Calvino's Why Read the Classics, and I'm always curious to know the books that were formative in the lives of my favourite authors (and, on a bigger scale, even of society as a whole).

The idea, then, was to make something that would receive the text of some non-fiction book as input and output all the books cited inside it. Ideally the output from Why Read The Classics would be ['The Odyssey', 'Candide', '100 Years of Solitude', ...] and from History of Western Philosophy ["Basically everything ever written by a Greek, French or German guy (or gal) with way too much time on their hands"].

The process then would be to:

  1. Get the text from all my books.
  2. Get metadata for all my books: authors, original publication date, etc. (+)
  3. Search every book for citations of every other book I have.
  4. Build a kickass-looking graph illustrating all these citations, with the books categorized by publication date.
  5. EXTRA: Search for citations from my books to books I don't own! (*)(**)

(+) This was a somewhat convoluted process. It is hard to get the original publication date because it is sometimes historical/uncertain, and there are dozens (or even hundreds!) of editions of many books. So I settled on importing all my books into Calibre and then using a Goodreads plugin to get the original publication date for each of them! Fun fact: it is pure pain and suffering to support very, very distant dates in code (like B.C. dates).

(*) I did this using a dataset scraped from Goodreads that has metadata for thousands of books, old and new. For my purposes I just filtered the dataset by removing books with no ratings/reviews and modern Fantasy/Fiction books that would pollute the analysis and probably aren't cited many times (which doesn't mean I don't like Fantasy books, I LOVE Fantasy).

(**) Unfortunately, these extra citations are just one-way: I don't have the text of these books to check which books they cite in turn. So... this part of the graph can only be the target of citations from the books I do have!

The naive solution (and why this is more complex than it seems)

One simple solution is to build a list with a bunch of book titles and just search for them in the text of each book you own. This actually works pretty well for most books (technical considerations on how to do this efficiently later). Problems start with shorter book titles like "The Prince" or "The Republic", since such a string might very well appear in a text with absolutely no relation to that particular book by Machiavelli. See the following examples:

"The Prince reads Marcus Aurelius' Meditations to relax."

"Marcus Aurelius reads The Prince to relax."

We can't always expect that citations in this form will be, well, quoted or in italics or whatever. So how do we detect the string "The Prince" as a citation in the latter sentence but not in the former? There are many ways to go about this, and after iterating for a while I settled on an NLP technique called NER, or named entity recognition.

Creating an NER dataset for the task

NER models consume some text and return the locations (character indexes) of any labeled entities they find:

> "The Prince reads Marcus Aurelius' Meditations to relax."

{
	'labels' : [["Meditations", 34, 45, 'BOOK']]
}


> "Marcus Aurelius reads The Prince to relax."

{
	'labels' : [["The Prince", 22, 32, 'BOOK']]
}

The model has to be trained on a specific set of labels using text annotated beforehand. In my case there is a single label, [BOOK]. The relevant information is the pair of character indexes where the tagged text begins and ends. NER models work by associating a tag with each word ('no tag' is also a tag!) and assigning to each word in the text a probability of belonging to each tag seen in the training data, e.g. ["The Prince": (0.8, BOOK, 0.2, NO_TAG), "Marcus Aurelius": (0.1, BOOK, 0.9, NO_TAG)].
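Conceptually, the training signal is a tag per token. A simplified illustration (real token-classification models work on subword pieces and typically use a BIO-style tag scheme):

tokens = ["Marcus", "Aurelius", "reads", "The", "Prince", "to", "relax", "."]
tags   = ["O", "O", "O", "B-BOOK", "I-BOOK", "O", "O", "O"]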

Of course, I had to manually annotate a full dataset (about a thousand examples!) so that the model could learn, from context, what is and what isn't a book. It's quite magical, really, when you see it working. Here are some examples from my dataset.

# example 1
{
	"text": "Just prior to the publication of Crime and Punishment, Dostoevsky(...)",
	"entities": [{"start": 33, "end": 53, "text": "Crime and Punishment", "label": "BOOK"}]
}

# example 2
{
	"text": "In King Lear, for example, Shakespeare cast one of his most complex villains, Edmund.",
	"entities": [{"start": 3, "end": 12, "text": "King Lear", "label": "BOOK"}]
}

To create my dataset I searched all my books for some ambiguous book titles such as "Ulysses" (the character from The Odyssey or the book by James Joyce?), "The Prince" (the book by Machiavelli or literally just some prince being referenced?) and "The Republic" (Plato's book or... you get the idea). Then I used Doccano to manually annotate each passage from my books to say whether it is a citation or not.
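Roughly, the passage-mining step looked like this (a sketch, not my exact code; the title list and function name are illustrative):

import re

# Grep every book for a few deliberately ambiguous titles and keep the
# whole line, so Doccano has enough context for annotation.
AMBIGUOUS_TITLES = ["Ulysses", "The Prince", "The Republic"]
pattern = re.compile("|".join(re.escape(t) for t in AMBIGUOUS_TITLES))

def candidate_passages(book_lines):
    for line in book_lines:
        if pattern.search(line):
            yield line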

About Augmenting the Dataset

After some failed attempts to make my models work well on new validation data, I started to augment my training dataset by swapping book titles around to create new training examples, as sketched below. With some probability, the code swaps the citation in a training example for another random book title from my dataset and stores the result as a new training example (while still keeping the old one in the dataset). Doing this dramatically improved my model's performance on validation data. Before this augmentation my model was overfitting: it would only label a title as a book if that specific title appeared in many training examples. This was probably happening because (by Deep Learning standards) my ~1000-annotation dataset is quite small.
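The core of the augmentation fits in a few lines (a sketch, not my exact code; all_titles is the list of every book title in the dataset):

import random

def augment(example, all_titles, p=0.5):
    # With probability p, build an extra training example by swapping the
    # annotated title for a random one; the offsets are recomputed so the
    # new annotation still matches. `example` follows the
    # {"text", "entities"} format shown above.
    if random.random() > p or not example["entities"]:
        return None
    ent = example["entities"][0]
    new_title = random.choice(all_titles)
    text = example["text"][:ent["start"]] + new_title + example["text"][ent["end"]:]
    new_ent = {"start": ent["start"], "end": ent["start"] + len(new_title),
               "text": new_title, "label": "BOOK"}
    return {"text": text, "entities": [new_ent]}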

Fine Tuning RoBERTa

The Huggingface library makes it child's play to download a SOTA model and then fine-tune it on a specific task. You can even load a model with new layers initialized specifically to be fine-tuned on a new task.
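For reference, loading RoBERTa with a freshly initialized token-classification head takes just a few lines (a sketch assuming a BIO-style label set for my single BOOK label):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# B-BOOK/I-BOOK mark the beginning/inside of a title span; O means "no tag".
labels = ["O", "B-BOOK", "I-BOOK"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)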

The hardest part was converting my annotations from Doccano's format into something PyTorch would understand, and I copied the code from here almost verbatim for that. And, of course, playing with the hyper-parameters until the training yields good results. There is a lot going on in this part, so I'm just going to link my notebook to fine-tune RoBERTa here.

Finally, here are some test strings and the outputs from my model after fine-tuning! Look at these beautiful citations being automatically detected.

check_if_citation_ner(model,tokenizer,"When Odysseus returns to Ithaca in Book 13 of The Odyssey, Athena disguises him as an old beggar")
> 'The Odyssey'

check_if_citation_ner(model,tokenizer,"the commencement of war a herald might be called upon to recite the causes of the conflict; in effect, to provide the motivation. In Shakespeare's Henry IV")
> 'Henry IV'

check_if_citation_ner(model,tokenizer,"is impersonal in the Meditations agrees closely with Epictetus. Marcus Aurelius is doubtful about immortality, but says, as a Christian might: 'Since it is possible th")

> 'Meditations'
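In essence, check_if_citation_ner runs the line through the model, takes the most likely tag per token, and stitches consecutive BOOK tokens back into titles. A simplified sketch of what it does (not my exact code; no batching, CPU only):

import torch

def check_if_citation_ner(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
    pred_ids = logits.argmax(-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    spans, current = [], []
    for tok, pid in zip(tokens, pred_ids):
        if model.config.id2label[pid].endswith("BOOK"):
            current.append(tok)  # token belongs to a title span
        elif current:  # a span just ended, flush it
            spans.append(tokenizer.convert_tokens_to_string(current).strip())
            current = []
    if current:
        spans.append(tokenizer.convert_tokens_to_string(current).strip())
    return spans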

Gluing all this together

So far I've written about collecting the dataset, annotating and preprocessing it, and, finally, fine-tuning RoBERTa.

Now, how to actually process the text of all my books? The "algorithm" I wrote goes something like this:

  1. Load all books as lists of strings in memory (using python's readlines() function).
  2. For each line, use a regex(+) to detect any book titles.
  3. If the detected title is too short (e.g. fewer than 3 words)(*), run the whole line through RoBERTa and see if the NER model "validates" the title as a book title.
  4. For each match, create a new node on the graph (if needed) and add a link between the book whose text we are currently reading and the book that was cited.
  5. Convert this graph to D3.js format, add some metadata for each book and save it.

(+) I ended up using a giant pre-compiled regex of the form r'Book1|Book2|...' to look for matches, because it was the fastest method I found. More on this in this StackOverflow answer.

(*) I chose this heuristic because I'm assuming that if a longer title like "The Hound of the Baskervilles" or "The Ballad of Reading Gaol" shows up in some book's text, it has a VERY high probability of being a citation, so I don't need to waste CPU clocks confirming it with RoBERTa.
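Put together, steps 2 and 3 above look roughly like this (a sketch; all_titles and the short-title threshold are illustrative, and check_if_citation_ner is the NER helper shown earlier):

import re

# One giant pre-compiled alternation, longest titles first so that e.g.
# "The Republic of Thieves" matches before "The Republic".
titles = sorted(all_titles, key=len, reverse=True)
title_regex = re.compile("|".join(re.escape(t) for t in titles))

def detect_citations(line, model, tokenizer):
    hits = []
    for match in title_regex.finditer(line):
        title = match.group(0)
        if len(title.split()) >= 3:
            hits.append(title)  # long titles: trust the regex match
        elif title in check_if_citation_ner(model, tokenizer, line):
            hits.append(title)  # short titles: confirm with the NER model
    return hits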

How to take less than 8 hours to process all my books with RoBERTa

By using multiprocessing! My book-processing function receives as arguments the list of lines from a book, its metadata, and my model and tokenizer. By creating a Pool of workers we can give each of them one book at a time to process asynchronously; when a worker finishes a book it starts processing another from the list. Just don't overdo it, since each RoBERTa instance takes ~1GB of RAM! My gaming machine with 16GB of RAM and a modern-ish i7 absolutely MELTS with 4 RoBERTa processes.

import multiprocessing as mp

from tqdm import tqdm

# 'spawn' gives each worker a fresh interpreter (required on Windows/macOS,
# and generally safer with PyTorch than 'fork').
ctx = mp.get_context('spawn')

num_workers = 4
p = ctx.Pool(num_workers)

f = book_processer_func

# iterator is a list with all book lines and metadata for each book
jobs = []
for i in range(len(iterator)):
    job = p.apply_async(f, [iterator[i], model, tokenizer])
    jobs.append(job)

# collect results as workers finish, with a progress bar
results = []
for job in tqdm(jobs):
    results.append(job.get())

p.close()
p.join()

Processing the Graph and converting it to D3.js

The graph is built with networkx on the Python side of the code, just so I could plot it and run the PageRank algorithm to see which books are the most "linked" in my data. Turns out it's The Prince!

[('The Prince', 0.01081877301303925),
 ('Hamlet', 0.01061937268317937),
 ('The Divine Comedy', 0.007740738803800382),
 ('The Art of War', 0.006072207782683749),
 ('Romeo and Juliet', 0.0051880591489493295),
 ('Phaedrus', 0.004890520471208026),
 ('Poetics', 0.003865239838704173),
 ('The Deep', 0.0037173564227710025),
 ('Alice in Wonderland', 0.0037039373900383415)]
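For reference, the ranking above comes from a couple of networkx calls (assuming the citation graph is the DiGraph G):

import networkx as nx

# G: one node per book, a directed edge from the citing book to the cited one.
ranks = nx.pagerank(G)
top_books = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:9]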
 

We just need some simple code to create a JSON with all these links and nodes and load it into D3.js. Here is the format of each node and each link in the final graph. One thing to note is the metadata we can save on each node, to be used on the javascript side of the code later.

Again, you can see the end result here!

# for each node
{
	"name": v,
	"label": v,
	"id": node_id,
	"px": px,
	"py": py,
	"category_date": pubdate,
	"authors": author,
	"nodesize": ndsize,
	"is_goodreads": is_goodreads
}

# for each link
{
	"source": some_id_1,
	"target": some_id_2
}
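A sketch of the export step (node_payload and ids are hypothetical helpers here: one builds the node dict shown above, the other maps each book to an integer id):

import json

nodes = [node_payload(book) for book in G.nodes]  # dicts shaped like the node above
links = [{"source": ids[u], "target": ids[v]} for u, v in G.edges]

with open("graph.json", "w") as f:
    json.dump({"nodes": nodes, "links": links}, f)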

Closing Remarks

Just some notes that might be useful to someone working with a similar problem:

  1. I never dramatically improved performance by toying with RoBERTa's fine-tuning hyper-parameters. Once you've chosen something that converges and are still unsatisfied, I'll wager that getting more/better data, or augmenting it like I did, will bring more substantial gains in performance.
  2. Something that wasn't obvious to me: when you are using multiprocessing, you should see close to 100% CPU use if your code is running as fast as it possibly can (in retrospect it sounds kind of obvious now). If CPU use is not close to 100%, your code is probably doing I/O, or some processes are stuck (or deadlocked!) and can't proceed with the computation, or you could spawn more processes to speed everything up.