Core Sentiment Analysis Package
- Documentation, specifically docstrings. Set up Sphinx to organize the documentation.
- The data loader needs to be refactored to respond gracefully to faults in the provided datasets.
- Helper function for concatenating train, dev and test data and providing the CV params (important/useful for learning_curve or other methods that only accept a single set of data with cross-validation parameters).
- Turn the miscellaneous scripts into either package scripts or helper methods
- Clean up the IPython Notebooks so they can all be executed with "run all", and remove old/irrelevant ones
- Provide Cookbook with useful snippets
- Fix the package installer (setup.py) to support one-click install, so that when the package is distributed, it installs everything required, including e.g. the NLTK corpora.
- Refactor code so all components are consistent. So far, data.py and feature_extraction.py are fully up-to-date and compliant with the design philosophy, but modules like analyzer.py are not.
- Twitter conformance tests for tokenization and preprocessing
- Tests for feature calculation methods
- Consistent logging
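
For the train/dev/test concatenation helper listed above, the following is a minimal sketch rather than the package's actual implementation. The name concat_splits is hypothetical, and the choice of folds (fit on train plus dev, validate on test) is an assumption. It works on plain lists; NumPy arrays or sparse matrices would need the corresponding stacking functions instead of list.extend.

```python
def concat_splits(train, dev, test):
    """Concatenate three (samples, labels) splits into one dataset.

    Returns (samples, labels, cv), where cv is a list holding a single
    (train_indices, validation_indices) pair: fit on train + dev,
    validate on test. The fold layout is an assumption for illustration.
    """
    samples, labels, index_blocks = [], [], []
    for split_samples, split_labels in (train, dev, test):
        if len(split_samples) != len(split_labels):
            raise ValueError("samples and labels differ in length")
        start = len(samples)
        samples.extend(split_samples)
        labels.extend(split_labels)
        # Remember which positions in the concatenated data belong to this split.
        index_blocks.append(list(range(start, len(samples))))
    train_idx, dev_idx, test_idx = index_blocks
    cv = [(train_idx + dev_idx, test_idx)]
    return samples, labels, cv
```

scikit-learn functions that take a cv argument accept an iterable of (train, test) index pairs, so the returned cv should be usable with learning_curve as-is.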
Sentiment Analysis Web Service
- Deployment documentation
- More extensive exception handling
- Additional views useful for analysing text (feature weights, etc)
Research
- SemEval evaluation script
  - If we wish to partake in SemEval-2014 Task 9, we need to implement a helper method or snippet to output the result of classification on the test data in a way that conforms to the specifications of their scorer.
- Cascade Classifier
  - As a first step, we need to simply take the instances with gold-standard annotation of positive/negative and train a polarity classifier. We also normalize these instances (positive/negative) into the equivalence class subjective and train a subjectivity classifier with the objective/neutral instances. See if we obtain promising results from this.
  - If so, there is a bit of implementation required to support this kind of cascaded classifier (http://stackoverflow.com/questions/21151346/cascade-classifiers-for-multiclass-problems-in-scikit-learn/21159014#21159014)
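
The two-stage setup described under Cascade Classifier could be sketched roughly as follows. This is an illustration under assumptions, not the package's design: the class and helper names are hypothetical, the label strings are taken from the text above, and both stages are assumed to be objects with scikit-learn-style fit/predict methods that accept lists of samples (for example, a pipeline of vectorizer and classifier fed raw texts).

```python
SUBJECTIVE = "subjective"
OBJECTIVE = "objective/neutral"
POLAR = ("positive", "negative")


def to_subjectivity_labels(labels):
    """Collapse positive/negative into one 'subjective' class."""
    return [SUBJECTIVE if label in POLAR else OBJECTIVE for label in labels]


def polar_subset(samples, labels):
    """Keep only the instances whose gold label is positive or negative."""
    pairs = [(x, y) for x, y in zip(samples, labels) if y in POLAR]
    return [x for x, _ in pairs], [y for _, y in pairs]


class CascadeClassifier:
    """Stage 1 decides subjective vs. objective/neutral; stage 2 decides polarity."""

    def __init__(self, subjectivity_clf, polarity_clf):
        self.subjectivity_clf = subjectivity_clf
        self.polarity_clf = polarity_clf

    def fit(self, samples, labels):
        # Stage 1 sees all instances with normalized labels.
        self.subjectivity_clf.fit(samples, to_subjectivity_labels(labels))
        # Stage 2 sees only the polar instances with their original labels.
        polar_samples, polar_labels = polar_subset(samples, labels)
        self.polarity_clf.fit(polar_samples, polar_labels)
        return self

    def predict(self, samples):
        first_stage = list(self.subjectivity_clf.predict(samples))
        subjective = [x for x, s in zip(samples, first_stage) if s == SUBJECTIVE]
        # Only instances judged subjective are passed on to the polarity stage.
        polarity = iter(self.polarity_clf.predict(subjective)) if subjective else iter(())
        return [next(polarity) if s == SUBJECTIVE else s for s in first_stage]
```

Evaluating each stage separately before chaining them matches the "first step" above: if neither stage looks promising on its own, the cascade wrapper is not worth building.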
- Sentiment Lexicon
  - Incorporate the sentiment lexicons used by (Mohammad et al. 2013) into the TweetFeaturesaExtractor so features can be computed based on these lexicons. All of these lexicons have been obtained and reside in the resources directory.
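
As a rough idea of what lexicon-based features could look like, here is a small sketch. It assumes a lexicon has already been loaded into a dict mapping token to a real-valued score; the function name, feature names and the particular aggregates are illustrative assumptions and are not claimed to reproduce the feature set of Mohammad et al. 2013 or the extractor's actual interface.

```python
def lexicon_features(tokens, lexicon, prefix="lex"):
    """Aggregate per-token lexicon scores into a small feature dict.

    tokens:  list of (preprocessed) tokens for one tweet
    lexicon: dict mapping token -> score (positive or negative float)
    """
    # Tokens missing from the lexicon contribute nothing.
    scores = [lexicon[token] for token in tokens if token in lexicon]
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    return {
        prefix + "_pos_count": len(positive),
        prefix + "_neg_count": len(negative),
        prefix + "_total_score": sum(scores),
        prefix + "_max_score": max(scores, default=0.0),
        prefix + "_last_score": scores[-1] if scores else 0.0,
    }
```

Using a distinct prefix per lexicon keeps the features from different lexicons apart, and a dict of named features is the input format expected by scikit-learn's DictVectorizer.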
- Negation
  - Potts 2011 (http://sentiment.christopherpotts.net/lingstruc.html) describes an interesting way to handle words under the semantic scope of negation. This method has been implemented as an option in the preprocessing step of the pipeline but not yet evaluated. N.B. Mohammad et al. 2013 also adopt this approach.
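
For reference, the negation-marking idea can be sketched as below: every token between a negation word and the next clause-level punctuation mark gets a suffix. This is a simplified illustration, not the pipeline's implementation. The negation word list is abbreviated, and both it and the punctuation set are assumptions.

```python
import re

# Abbreviated negation cues (assumption); also matches contractions ending in n't.
NEGATION = re.compile(r"^(?:never|no|nothing|nowhere|none|not|cannot)$|n't$", re.IGNORECASE)
# Tokens made up solely of clause-level punctuation end the negation scope.
CLAUSE_PUNCT = re.compile(r"^[.:;!?,]+$")


def mark_negation(tokens, suffix="_NEG"):
    """Append `suffix` to tokens that fall inside the scope of a negation."""
    marked, in_scope = [], False
    for token in tokens:
        if CLAUSE_PUNCT.match(token):
            in_scope = False          # punctuation closes the scope
            marked.append(token)
        elif in_scope:
            marked.append(token + suffix)
        else:
            marked.append(token)
            if NEGATION.search(token):
                in_scope = True       # scope starts after the negation cue
    return marked
```

Because the marked tokens are distinct strings, a bag-of-words model treats "like" and "like_NEG" as separate features, which is what allows it to learn opposite weights for them.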
- Part-of-Speech tagging
  - Many works on sentiment analysis use part-of-speech tagging in one way or another, either by appending the tags to the tokens or by adding integer-valued features that count the occurrences of a particular part of speech. Both of these have been implemented as options in various components of the pipeline but, again, are yet to be evaluated.
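
The two part-of-speech options mentioned above can be illustrated with a short sketch. It assumes the tweet has already been tagged into (token, tag) pairs, which is the shape returned by taggers such as nltk.pos_tag; the function names and the feature-name prefix are hypothetical.

```python
from collections import Counter


def append_tags(tagged_tokens, sep="_"):
    """Option 1: fuse each token with its tag, e.g. ('phone', 'NN') -> 'phone_NN'."""
    return [token + sep + tag for token, tag in tagged_tokens]


def tag_counts(tagged_tokens, prefix="pos_count_"):
    """Option 2: one integer-valued feature per tag, counting its occurrences."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return {prefix + tag: n for tag, n in counts.items()}
```

The first option enlarges the vocabulary (the same word with different tags becomes different features), while the second adds only one feature per tag, so the two are worth evaluating separately.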
- Feature Selection
  The Recursive Feature Elimination (RFE) method introduced by (Guyon et al. 2002) has shown promising improvements. We're able to distill the minimal set of features required to obtain comparable or even better accuracy. However, as with all feature selection methods, we are wary of overfitting: the increase in performance could be attributed to the fact that we've removed sufficient noise from our feature space and distilled the truly general sentiment-bearing features with excellent discriminative capabilities, but it could also be attributed to extreme overfitting to our training set. This is the toughest counter-argument to the soundness of this method, and to address it we need to:
  - Reimplement the RFE method so the scoring is done against the held-out test set, rather than a cross-validation on all available data. The RFECV method provided by scikit-learn is broken beyond repair (scikit-learn/scikit-learn#2403 (comment)).
  - Bias-Variance Analysis: learning curves of error committed on training and test sets. Use http://see.stanford.edu/materials/aimlcs229/ML-advice.pdf to guide this analysis.
  - If we take the subset of features which yielded the best performance on the held-out test set (assuming the performance does peak as before with the CV results), we need to know how it will perform on out-of-domain data. Immediately, we can apply this to the SemEval SMS dataset but perhaps data from another social media platform is more desirable. As a baseline, we take the trained model using all the features and test this against out-of-domain data. Then we can test this with the minimal subset of features.
  - Another question is how this method compares to simply taking the top n features and dropping the rest. This is equivalent to a single iteration of RFE. To do this, we should first run RFE to find the optimal number of features, say m. Then, we train another model just once, keep only the top m features, and analyse the difference not just in performance: do the same features appear in both? What characterizes the symmetric difference between these two sets of m features? This will test whether it is necessary to use the computationally intensive iterative feature elimination approach, which trains a new model on each iteration, or whether we can just train the model once and simply keep the top n.
  - We need to address why we chose this particular method over the myriad of other popular methods. The results from the above should already be compelling but we should try other methods simply for completeness:
    - Principal Component Analysis.
    - Univariate feature selection with information gain or chi-squared.
    - tf-idf, min/max df. If we specify something like min df or even perform tf-idf weighting, in some sense, we are removing or down-weighting some features we suspect won't be very discriminative even before we fit a model. We should compare these to the methods which perform feature elimination after the data has been fitted to a model.
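
To make the proposed reimplementation and the top-m comparison concrete, here is a library-agnostic sketch. It is not the project's code: rfe_holdout, top_m and the fit_and_score callable are hypothetical names, and the sketch assumes a model that yields a single weight per feature (multi-class models would first need their per-class weights reduced to one importance value).

```python
def rfe_holdout(features, fit_and_score, step=1):
    """Recursive feature elimination scored on a held-out set.

    fit_and_score(feature_list) must train a fresh model on the training
    data restricted to those features and return (weights, score), where
    weights maps feature -> weight and score is measured on held-out data.
    Returns (best_features, best_score, history).
    """
    remaining = list(features)
    if not remaining:
        raise ValueError("need at least one feature")
    history = []
    best_features, best_score = None, float("-inf")
    while True:
        weights, score = fit_and_score(remaining)
        history.append((len(remaining), score))
        if score >= best_score:  # on ties, prefer the smaller feature set
            best_features, best_score = list(remaining), score
        if len(remaining) <= 1:
            break
        # Drop the `step` features with the smallest absolute weight.
        ranked = sorted(remaining, key=lambda f: abs(weights[f]), reverse=True)
        remaining = ranked[:max(1, len(ranked) - step)]
    return best_features, best_score, history


def top_m(features, fit_and_score, m):
    """Single-shot alternative: train once on everything, keep the m largest weights."""
    features = list(features)
    weights, _ = fit_and_score(features)
    return sorted(features, key=lambda f: abs(weights[f]), reverse=True)[:m]
```

With best, _, _ = rfe_holdout(...) and m = len(best), the expression set(best) ^ set(top_m(features, fit_and_score, m)) gives the symmetric difference between the two selections discussed above.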
- Kernels
  - String Kernel
  - Tree Kernel
- Report
  - A thorough and comprehensive write-up describing every detail of work conducted on the sentiment analysis problem.