@soeffing
Created June 14, 2017 15:39
Potential feeds:
http://feeds.reuters.com/reuters/companyNews
http://feeds.reuters.com/reuters/businessNews
http://www.economist.com/feeds/print-sections/75/europe.xml
http://feeds.nytimes.com/nyt/rss/Business
http://feeds.bbci.co.uk/news/business/rss.xml
http://www.telegraph.co.uk/finance/rssfeeds/ (a whole list of potential feeds)
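A minimal stdlib sketch of pulling headlines out of one of the feeds above. It parses a sample RSS 2.0 string offline; in practice you'd fetch e.g. http://feeds.bbci.co.uk/news/business/rss.xml with `urllib.request` first. The sample items are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the body of a fetched feed such as the BBC business RSS.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Business News</title>
    <item>
      <title>Acme Corp beats earnings estimates</title>
      <link>http://example.com/acme</link>
      <description>Shares rose after the report.</description>
    </item>
    <item>
      <title>Markets dip on rate fears</title>
      <link>http://example.com/markets</link>
      <description>Indices fell broadly.</description>
    </item>
  </channel>
</rss>"""

def parse_items(rss_text):
    """Return (title, link, description) tuples from an RSS 2.0 string."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items

headlines = [title for title, _, _ in parse_items(SAMPLE_RSS)]
print(headlines)
```

Titles plus descriptions are roughly what the feeds give you for free, which fits the title/abstract-only approach some of the papers below take.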
Useful tutorials:
Scraping/Spidering
http://francescopochetti.com/scrapying-around-web/
Useful Datasets:
http://www.sananalytics.com/lab/twitter-sentiment/
RCV1 (Reuters Corpus Volume 1)
Reuters-21578 (Test Collection)
Wall Street Journal articles (costs money!): https://catalog.ldc.upenn.edu/LDC93S6A
Reuters Key Developments Corpus
General Inquirer dictionary
Compustat tool in the Wharton Research Data Services website (assets, liabilities, share volume/outstanding for listed companies)
LexisNexis (company that collects heaps of articles...probably costs $$$)
Interesting Papers:
http://people.csail.mit.edu/azar/wp-content/uploads/2011/09/thesis.pdf
1. Uses the Reuters Key Developments Corpus -> seems like we have to write to Reuters to get it
2. Presents some neat feature extraction approaches
http://ceur-ws.org/Vol-862/FEOSWp4.pdf
Takeaways:
1. Uses titles and abstracts because they already condense polarity
2. Very simplistic hard-coded semantic and sentiment annotation (not a fit for us)
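A toy illustration of what "hard-coded sentiment annotation" on a title/abstract amounts to: count hits against fixed polarity word lists. The word lists here are invented for the example, not taken from the paper or from the General Inquirer dictionary.

```python
# Hand-made polarity lexicons (illustrative only).
POSITIVE = {"gain", "beat", "rise", "growth", "profit", "surge"}
NEGATIVE = {"loss", "fall", "drop", "miss", "decline", "plunge"}

def polarity(text):
    """Return (#positive hits, #negative hits, net score) for a text."""
    words = [w.strip(".,:;!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg, pos - neg

print(polarity("Profit surge: Acme shares rise despite sector decline"))
```

This works tolerably on headlines precisely because they condense polarity, but it has no notion of negation, targets, or context, which is why it's not a fit for us.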
http://www.sciencedirect.com/science/article/pii/S0950705116302271?via%3Dihub#fn0056
1. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news
2. Seems very interesting but too much for the alpha version, I think
3. creates knowledge graphs from text (RDF format)
4. It determines what happened, who was involved, and where and when it took place
5. Because we anchor events to time, we can extract longer sequences of events over time and discover the role of participants in history. It allows us to find networks of actors and implications of events on a large global scale and over longer periods of time. It provides the means to generalize from the level of individuals (people, companies, incidents) to classes and types (management, governors, industries and event types), discovering trends and patterns, or vice versa to specialize from general trends to personal stories.
6. The deep-reading technology developed by NewsReader is unique in its kind. It combines the most advanced NLP technology in four different languages to obtain interoperable semantic interpretations of text. Our four NLP pipelines perform named-entity detection and linking, event and semantic role detection, temporal expression normalization and temporal relation detection.
7. We also processed news across four different languages, resulting in unified interoperable data across these languages. The knowledge resulting from the processing is stored in a dedicated KnowledgeStore that supports various APIs for semantic querying and exploitation of the data.
8. Within the field of information extraction, there are two main directions: closed information extraction, where the system is required to fill slots in a predefined template and open information extraction, where the concepts or types of relations that the system is required to extract are not predefined.
9. NewsReader employs deep NLP as well as state-of-the-art Semantic Web technology, resulting in much more fine-grained analyses than projects that employ only shallow natural language processing or focus on a single NLP task
10. We divide the task of interpreting text and representing it in event-centric RDF in three main steps illustrated in Fig. 1. The first step applies various advanced linguistic analyses to single documents, the second translates the output of these analyses to RDF resolving mentions of information to an instance representation and the third and final step aggregates the RDF instance representations across different documents into a joined RDF representation.
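A sketch of what step 2's "event-centric RDF" instance representation boils down to: one event node linked by triples to its action, participants, place, and time. The URIs and predicate names are made up for illustration; NewsReader's actual model (SEM plus GAF provenance links) is far richer.

```python
# Namespace for the illustrative URIs (not NewsReader's real vocabulary).
EX = "http://example.org/"

def event_triples(event_id, action, actor, place, time):
    """Build subject-predicate-object triples for one event instance:
    what happened, who was involved, and where/when it took place."""
    e = EX + event_id
    return [
        (e, EX + "action", action),
        (e, EX + "hasActor", EX + actor),
        (e, EX + "hasPlace", EX + place),
        (e, EX + "hasTime", time),
    ]

triples = event_triples("ev1", "acquisition", "AcmeCorp", "London", "2017-06-14")
for s, p, o in triples:
    print(s, p, o)
```

Because every event is anchored to a time and to participant URIs, aggregating these triples across documents (step 3) is just merging graphs on shared instance URIs, which is what enables the cross-document timelines described above.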
Papers to Checkout:
1. Tetlock, Saar-Tsechansky and Macskassy (2007)
2. Toward an architecture for never-ending language learning
Interesting Projects:
1. https://github.com/jasti/Stock-Predictor
2. http://sentdex.com/financial-analysis/ (Nice dashboard)
3. http://www.newsreader-project.eu/ (https://github.com/newsreader; all Java, meh, but great since it works for English, Italian, Spanish and Dutch)
- all code components http://www.newsreader-project.eu/results/software/