@soeffing
Created June 14, 2017 15:39
Potential feeds:
http://feeds.reuters.com/reuters/companyNews
http://feeds.reuters.com/reuters/businessNews
http://www.economist.com/feeds/print-sections/75/europe.xml
http://feeds.nytimes.com/nyt/rss/Business
http://feeds.bbci.co.uk/news/business/rss.xml
http://www.telegraph.co.uk/finance/rssfeeds/ (a whole list of potential feeds)
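A minimal stdlib sketch of pulling headlines out of one of the feeds above. It parses a sample RSS 2.0 string offline; in practice you'd fetch e.g. http://feeds.bbci.co.uk/news/business/rss.xml with `urllib.request` first. The sample items are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the body of a fetched feed such as the BBC business RSS.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Business News</title>
    <item>
      <title>Acme Corp beats earnings estimates</title>
      <link>http://example.com/acme</link>
      <description>Shares rose after the report.</description>
    </item>
    <item>
      <title>Markets dip on rate fears</title>
      <link>http://example.com/markets</link>
      <description>Indices fell broadly.</description>
    </item>
  </channel>
</rss>"""

def parse_items(rss_text):
    """Return (title, link, description) tuples from an RSS 2.0 string."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items

headlines = [title for title, _, _ in parse_items(SAMPLE_RSS)]
print(headlines)
```

Titles plus descriptions are roughly what the feeds give you for free, which fits the title/abstract-only approach some of the papers below take.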
Useful tutorials:
Scraping/Spidering
http://francescopochetti.com/scrapying-around-web/
Useful Datasets:
http://www.sananalytics.com/lab/twitter-sentiment/
RCV1 (Reuters Corpus Volume 1)
Reuters-21578 (Test Collection)
Wall Street Journal articles (costs money!): https://catalog.ldc.upenn.edu/LDC93S6A
Reuters Key Developments Corpus
General Inquirer dictionary
Compustat tool in the Wharton Research Data Services website (assets, liabilities, share volume/outstanding for listed companies)
LexisNexis (company that collects heaps of articles...probably costs $$$)
Interesting Papers:
http://people.csail.mit.edu/azar/wp-content/uploads/2011/09/thesis.pdf
1. Uses the Reuters Key Developments Corpus -> seems like we have to write to Reuters to get it
2. Presents some neat feature extraction approaches
http://ceur-ws.org/Vol-862/FEOSWp4.pdf
Takeaways:
1. Uses titles and abstracts because they already condense polarity
2. Very simplistic hard-coded semantic and sentiment annotation (not a fit for us)
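A toy illustration of what "hard-coded sentiment annotation" on a title/abstract amounts to: count hits against fixed polarity word lists. The word lists here are invented for the example, not taken from the paper or from the General Inquirer dictionary.

```python
# Hand-made polarity lexicons (illustrative only).
POSITIVE = {"gain", "beat", "rise", "growth", "profit", "surge"}
NEGATIVE = {"loss", "fall", "drop", "miss", "decline", "plunge"}

def polarity(text):
    """Return (#positive hits, #negative hits, net score) for a text."""
    words = [w.strip(".,:;!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg, pos - neg

print(polarity("Profit surge: Acme shares rise despite sector decline"))
```

This works tolerably on headlines precisely because they condense polarity, but it has no notion of negation, targets, or context, which is why it's not a fit for us.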
http://www.sciencedirect.com/science/article/pii/S0950705116302271?via%3Dihub#fn0056
1. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news
2. Seems very interesting but too much for the alpha version, I think
3. creates knowledge graphs from text (RDF format)
4. It determines what happened, who was involved, and where and when it took place
5. Because we anchor events to time, we can extract longer sequences of events over time and discover the role of participants in history. It allows us to find networks of actors and implications of events on a large global scale and over longer periods of time. It provides the means to generalize from the level of individuals (people, companies, incidents) to classes and types (management, governors, industries and event types), discovering trends and patterns, or vice versa to specialize from general trends to personal stories.
6. The deep-reading technology developed by NewsReader is unique in its kind. It combines the most advanced NLP technology in four different languages to obtain interoperable semantic interpretations of text. Our four NLP pipelines perform named-entity detection and linking, event and semantic role detection, temporal expression normalization and temporal relation detection.
7. We also processed news across four different languages, resulting in unified interoperable data across these languages. The knowledge resulting from the processing is stored in a dedicated KnowledgeStore that supports various APIs for semantic querying and exploitation of the data.
8. Within the field of information extraction, there are two main directions: closed information extraction, where the system is required to fill slots in a predefined template and open information extraction, where the concepts or types of relations that the system is required to extract are not predefined.
9. NewsReader employs deep NLP as well as state-of-the-art Semantic Web technology, resulting in much more fine-grained analyses than projects that employ only shallow natural language processing or focus on a single NLP task
10. We divide the task of interpreting text and representing it in event-centric RDF in three main steps illustrated in Fig. 1. The first step applies various advanced linguistic analyses to single documents, the second translates the output of these analyses to RDF resolving mentions of information to an instance representation and the third and final step aggregates the RDF instance representations across different documents into a joined RDF representation.
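A sketch of what step 2's "event-centric RDF" instance representation boils down to: one event node linked by triples to its action, participants, place, and time. The URIs and predicate names are made up for illustration; NewsReader's actual model (SEM plus GAF provenance links) is far richer.

```python
# Namespace for the illustrative URIs (not NewsReader's real vocabulary).
EX = "http://example.org/"

def event_triples(event_id, action, actor, place, time):
    """Build subject-predicate-object triples for one event instance:
    what happened, who was involved, and where/when it took place."""
    e = EX + event_id
    return [
        (e, EX + "action", action),
        (e, EX + "hasActor", EX + actor),
        (e, EX + "hasPlace", EX + place),
        (e, EX + "hasTime", time),
    ]

triples = event_triples("ev1", "acquisition", "AcmeCorp", "London", "2017-06-14")
for s, p, o in triples:
    print(s, p, o)
```

Because every event is anchored to a time and to participant URIs, aggregating these triples across documents (step 3) is just merging graphs on shared instance URIs, which is what enables the cross-document timelines described above.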
Papers to Checkout:
1. Tetlock, Saar-Tsechansky and Macskassy (2007)
2. Toward an architecture for never-ending language learning
Interesting Projects:
1. https://github.com/jasti/Stock-Predictor
2. http://sentdex.com/financial-analysis/ (Nice dashboard)
3. http://www.newsreader-project.eu/ (https://github.com/newsreader; all Java, meh, but great since it works for English, Italian, Spanish and Dutch)
- all code components http://www.newsreader-project.eu/results/software/