\chapter{Synthesis Database}
\section{Building the Corpus}
A corpus of articles was constructed by first creating search queries using information from the Materials Project (MP) article database \cite{Jain2013}, and then running these queries against published Application Programming Interfaces (APIs) \cite{ElsevierAPI}, or by directly downloading articles where no API was available. The MP database consisted of a table of approximately 30 000 articles, with each row containing an article's title, author list, abstract, and DOI.
For each article title in the MP database, common words (e.g. `a', `the') were omitted and the remaining words were used to form a Boolean search query (e.g. query=synthesis+zeolite\&count=5). Here, the `count' parameter specified how many related articles to retrieve for each query. These queries were used to download article identification numbers (either DOIs or PIIs) from various publishers, with the majority of articles retrieved from Elsevier via the text-mining API \cite{ElsevierAPI}. Table \ref{tab:publishers} summarizes the sources of the articles in our constructed corpus.
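A minimal sketch of this query-construction step is given below; the stopword list and the \texttt{build\_query} helper are illustrative placeholders rather than the exact implementation used.
\begin{verbatim}
# Illustrative sketch of query construction from an article title
# (not the production code); the stopword list is a small placeholder.
STOPWORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "by"}

def build_query(title, count=5):
    """Form a Boolean search query from a title, dropping common words."""
    keywords = [w for w in title.lower().split() if w not in STOPWORDS]
    return "query=" + "+".join(keywords) + "&count=%d" % count

print(build_query("Synthesis of a zeolite"))  # query=synthesis+zeolite&count=5
\end{verbatim}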
A narrowed-subject corpus was also constructed using only search terms for Li-battery materials (e.g. ``LiCoO$_2$ synthesis''); this corpus was used as a preliminary set of articles for exploring different data mining approaches on the extracted data. These articles were retrieved exclusively via the Elsevier API.
\begin{table}
\centering
\begin{tabular}{lrl}
{\bf Publisher} & {\bf Articles Retrieved} & {\bf Retrieval Method}\\
Elsevier & 10 000 & Text-mining API \\
Royal Society of Chemistry & 300 & Direct download \\
\end{tabular}
\caption{Sources of articles for corpus construction.}
\label{tab:publishers}
\end{table}
After building the list of article identification numbers, articles were either downloaded from publisher websites via web scraping (i.e. direct downloading) or, where the publisher provided an API, retrieved programmatically through that API.
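The retrieval step can be sketched as below; the endpoint URL, header names, and response handling are assumptions for illustration only, and the actual publisher-specific details are documented in \cite{ElsevierAPI}.
\begin{verbatim}
# Illustrative sketch of programmatic article retrieval; the endpoint and
# authentication header are hypothetical placeholders, not a real API.
import requests

API_ROOT = "https://api.example-publisher.com/article"   # placeholder URL

def fetch_article(identifier, api_key):
    """Retrieve full-text XML for one article identifier (DOI or PII)."""
    resp = requests.get(API_ROOT + "/" + identifier,
                        headers={"X-API-Key": api_key, "Accept": "text/xml"})
    resp.raise_for_status()
    return resp.text
\end{verbatim}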
\section{Extracting Synthesis Sections as Plain Text}
Using tools built by the Information Extraction and Synthesis Laboratory (IESL) at the University of Massachusetts Amherst [Andrew: citation for pstotext/metatagger?], the PostScript text contained in the article PDFs was extracted into XML plain-text format. This text was then programmatically separated into labelled paragraphs and sections, indicating where in each paper the title, abstract, and body text paragraphs were located.
In order to identify which paragraphs corresponded to materials synthesis sections, a logistic regression classifier was employed, using the {\it scikit-learn} machine-learning Python package \cite{scikit-learn}. A set of $\sim3000$ paragraphs was manually labelled to train the classifier. Each paragraph was given a label of 0 or 1, with the positive label indicating that the paragraph described the synthesis of a material.
Paragraphs were represented as feature vectors using a bag-of-words approach, where each dimension of a feature vector corresponded to a particular word (e.g. `heat'), and the value of each coordinate was the number of times that word occurred in the paragraph (e.g. 5). The dimensions (i.e. the keywords of interest) were chosen by scanning all of the manually labelled positive paragraphs (i.e. the confirmed synthesis paragraphs) and selecting the most common words. The logistic regression classifier was set up in binary classification mode, where each feature vector, corresponding to a paragraph in a paper, was assigned a label of 0 (not a synthesis section) or 1 (synthesis section).
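The featurization can be sketched with {\it scikit-learn}'s \texttt{CountVectorizer} restricted to a fixed vocabulary; the keyword list below is a small placeholder for the actual list of common synthesis-paragraph words.
\begin{verbatim}
# Illustrative bag-of-words featurization over a fixed keyword vocabulary;
# the keyword list is a placeholder for the mined list of common words.
from sklearn.feature_extraction.text import CountVectorizer

keywords = ["heat", "stir", "calcine", "dissolve", "powder", "solution"]
vectorizer = CountVectorizer(vocabulary=keywords)

X = vectorizer.transform(
    ["Dissolve the powder in solution, then heat and stir."])
print(X.toarray())   # [[1 1 0 1 1 1]]: one count per keyword dimension
\end{verbatim}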
The manually labelled set was split 60/40 into training and testing sets, respectively. Roughly $11\%$ of the manually labelled set consisted of positive labels, meaning that the corpus was heavily skewed towards negative samples. The measured accuracy was $92\%$, with a precision of $70\%$ and a recall of $42\%$ [Andrew: Should we do some bootstrapping / cross-validation for these (and other) metrics, or is this fine for prelim results?].
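For completeness, a sketch of the split, training, and evaluation is given below, assuming the feature matrix \texttt{X} and label vector \texttt{y} from the manually labelled set already exist; this is an illustrative reconstruction, not the exact script used.
\begin{verbatim}
# Illustrative 60/40 split, training, and evaluation of the paragraph
# classifier; X and y (bag-of-words features and 0/1 labels) are assumed.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
\end{verbatim}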
\section{Information Extraction from Paragraphs}
A combination of rule-based and machine learning methods was used to extract data from the synthesis sections of the articles. A schema was created for the extracted data, as explained in Table \ref{tab:extraction_nomenclature}. In this schema, materials synthesis `recipes' are represented as a sequence of operations (each belonging to an action type) applied to entities at specified conditions.
\begin{table}
\centering
\begin{tabular}{ll}
{\bf Data Type} & {\bf Examples} \\
entity & alumina, KCl \\
action & heat, add \\
operation & calcine, react \\
condition & 500~$^{\circ}$C, argon \\
\end{tabular}
\caption{Extraction nomenclature for materials synthesis. Actions refer to general functions that act on materials, while operations are specific functions on materials which belong to an action category (e.g. calcination is a type of heating).}
\label{tab:extraction_nomenclature}
\end{table}
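As an illustration, one extracted recipe step under this schema could be represented as follows; the field names are placeholders and do not reflect the exact internal data format.
\begin{verbatim}
# Illustrative representation of one extracted recipe step under the schema
# described above; field names are placeholders, not the internal format.
step = {
    "action":     "heat",               # general action category
    "operation":  "calcine",            # specific operation found in the text
    "entities":   ["alumina"],          # materials acted upon
    "conditions": ["500 C", "argon"],   # conditions attached to the operation
}
recipe = [step]   # a recipe is an ordered sequence of such steps
\end{verbatim}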
Using the {\em Factorie} software package \cite{McCallum2009}, each paragraph was automatically transformed into a parse tree and each word was labelled with its part of speech (e.g. noun, verb). To extract operations, each word in a sentence was lemmatized using the {\em nltk} Python package \cite{BirdKleinLoper09} and compared against a list of known operations. Only a small number of distinct operations appear in materials synthesis articles, so direct matching against `dictionaries' of known operations proved highly effective. Using a manually labelled test set of materials synthesis sentences, a precision of $90\%$ and a recall of $79\%$ were achieved with this method.
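A sketch of the dictionary-matching step is shown below; the operation list is a small placeholder, and the use of {\em nltk}'s WordNet lemmatizer is an assumption about which lemmatizer was used.
\begin{verbatim}
# Illustrative operation extraction by verb lemmatization and dictionary
# lookup; the operation dictionary is a small placeholder for the full list.
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

KNOWN_OPERATIONS = {"calcine", "sinter", "anneal", "mix", "heat", "dissolve"}
lemmatizer = WordNetLemmatizer()

def find_operations(words):
    """Return the words whose verb lemma matches a known operation."""
    return [w for w in words
            if lemmatizer.lemmatize(w.lower(), pos="v") in KNOWN_OPERATIONS]

print(find_operations(["The", "powder", "was", "heated", "and", "mixed"]))
# ['heated', 'mixed']
\end{verbatim}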
Once an operation had been identified, entities were extracted by satisfying at least one of the following criteria: matching against a chemical database, or being a noun (as tagged by {\em Factorie}) nested under the parent operation in the parse tree. The chemical database consisted of approximately 20 000 chemical formulae retrieved from the Materials Project. Using this method, a precision of $41\%$ and a recall of $23\%$ were attained. Precision and recall for entities that were also paired to their correct operations were measured; these metrics were $34\%$ and $22\%$, respectively. The majority of errors stemmed from misidentifying non-entity nouns as entities, incorrect tokenization of words, or assigning the wrong entities to operations.
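The two entity criteria can be sketched as below; here \texttt{candidates} stands for the (word, part-of-speech) pairs nested under a parent operation in the {\em Factorie} parse tree, and the formula set is a tiny placeholder for the $\sim$20 000 Materials Project formulae.
\begin{verbatim}
# Illustrative entity extraction: keep tokens that match a known chemical
# formula or that are nouns nested under the parent operation. The formula
# set is a tiny placeholder for the Materials Project formula database.
KNOWN_FORMULAE = {"KCl", "LiCoO2", "Al2O3", "TiO2"}

def extract_entities(candidates):
    """candidates: (token, POS tag) pairs nested under the operation."""
    return [token for token, pos in candidates
            if token in KNOWN_FORMULAE or pos.startswith("NN")]

print(extract_entities([("KCl", "NNP"), ("dissolved", "VBD"),
                        ("water", "NN"), ("slowly", "RB")]))
# ['KCl', 'water']
\end{verbatim}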
Conditions were extracted by matching words nested under the parent operation in the parse tree against a list of known conditions. Some of these known conditions were single words describing atmospheric conditions (e.g. argon), while others were numeric conditions composed of multiple words (e.g. 500~$^{\circ}$C). As with entity extraction, overall extraction metrics were measured, along with metrics for conditions that were both properly extracted and matched to their respective operations. The precision for condition extraction was $95\%$ and the recall was $72\%$. Precision and recall for conditions which were also paired to their correct operations were measured as $60\%$ and $46\%$, respectively.
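A simplified sketch of the condition-matching step follows; the atmosphere list and the temperature pattern are placeholders for the full condition dictionaries.
\begin{verbatim}
# Illustrative condition extraction: match single atmosphere words and a
# simple multi-word numeric temperature pattern under a parent operation.
import re

ATMOSPHERES = {"argon", "nitrogen", "air", "vacuum"}
TEMPERATURE = re.compile(r"\d+\s*(C|K)\b")

def extract_conditions(text_under_operation):
    """Return atmosphere words and numeric temperature conditions."""
    conditions = [w for w in text_under_operation.split()
                  if w.lower() in ATMOSPHERES]
    conditions += [m.group(0)
                   for m in TEMPERATURE.finditer(text_under_operation)]
    return conditions

print(extract_conditions("calcined at 500 C under argon"))  # ['argon', '500 C']
\end{verbatim}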