@bbzzzz
bbzzzz / Working with World Bank API
Last active August 29, 2015 14:11
This work extracts open data from the World Bank API to study the relationship between fertility rate, public expenditure on education, and GDP. Data is stored in a MySQL database for convenient interoperation between Python and R; visualization is done with the ggplot2 package in R.
# -*- coding: utf-8 -*-
#### Author: Bohan Zhang | The Business Analytics Program of the George Washington University
#### Python Part
import wbdata
import pandas as pd
import datetime
import MySQLdb as myDB
#### test if data for certain indicator, country and year is available
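The preview cuts off before the availability check itself. A minimal sketch of such a test, assuming the response shape wbdata returns (a list of dicts with 'date' and 'value' keys; the `sample` data below is hypothetical), might look like:

```python
def has_data(records):
    """Return True if any observation in a wbdata-style response has a non-null value.

    `records` is assumed to be a list of dicts like those returned by
    wbdata.get_data, each with a 'date' and 'value' key.
    """
    return any(r.get("value") is not None for r in records)

# Hypothetical response for one indicator/country pair:
sample = [
    {"date": "2010", "value": 1.9},
    {"date": "2011", "value": None},
]
print(has_data(sample))  # True: at least one year has a value
```

Checking availability on the already-fetched records avoids a second round trip to the API for each indicator.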
@bbzzzz
bbzzzz / Job Aggregator
Last active February 22, 2022 14:17
This work collects job listings from four major job-search websites via web scraping and APIs, and aggregates the search results into a single output. Python modules used include BeautifulSoup, urllib2, and xmltodict.
# -*- coding: utf-8 -*-
# Contributors: Lucas Laviolet, Nisha Iyer, Mikhail Flom, and Bohan Zhang
# Part 0 Preparation
#-------------------------------------------------------------------------------------------------
import urllib2
from bs4 import BeautifulSoup
import pandas as pd
# Set up server on user's computer for OAuth 2.0 based authentication and authorization
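The aggregation step itself is not shown in the preview. A sketch of merging per-site results into one output, assuming each site's scraper returns a list of job dicts (the 'title'/'url' schema and the site names below are hypothetical), could be:

```python
def aggregate_listings(results_by_site):
    """Flatten per-site job results into one combined list,
    tagging each row with the site it came from.

    `results_by_site` maps a site name to a list of job dicts
    (hypothetical schema: 'title' and 'url' keys).
    """
    combined = []
    for site, jobs in results_by_site.items():
        for job in jobs:
            row = dict(job)          # copy so the source lists are untouched
            row["source"] = site
            combined.append(row)
    return combined

listings = aggregate_listings({
    "site_a": [{"title": "Data Analyst", "url": "http://example.com/1"}],
    "site_b": [{"title": "Data Scientist", "url": "http://example.com/2"}],
})
print(len(listings))  # 2
```

Tagging each row with its source keeps the combined output traceable back to the originating site, which the gist's single aggregated result would need.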
@bbzzzz
bbzzzz / WordNet Interface
Created March 2, 2015 16:21
Natural Language Processing - Word Meaning and Word Similarity
{
"metadata": {
"name": "Wordnet Interface"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
@bbzzzz
bbzzzz / Word Similarity
Last active August 29, 2015 14:16
Cosine Similarity, for NLP class presentation - ipython notebook version: http://nbviewer.ipython.org/gist/bozhang0504/5f67575d1397416b0f3d
import nltk
from nltk.corpus import wordnet as wn
### Synsets and lemmas
# An arbitrary word, e.g. 'dog', may have several senses; wn.synsets lists them.
wn.synsets('dog')
# Once you have a synset, there are functions to retrieve information about it;
# we will start with lemma_names, lemmas, definition and examples.
# For the first synset, 'dog.n.01' (the first noun sense of 'dog'), we can list all of its words/lemma names.
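The gist's cosine-similarity step can be illustrated without NLTK: a minimal sketch over bag-of-words vectors (the example sentences are made up; NLTK's own path- or Wu-Palmer-based WordNet similarities work differently):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bags of words (mappings of term -> count)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

v1 = Counter("the dog barks at the dog".split())
v2 = Counter("the dog sleeps".split())
print(round(cosine_similarity(v1, v2), 3))  # 0.73
```

Because only shared terms contribute to the dot product, two texts with no words in common score 0, and identical texts score 1.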
@bbzzzz
bbzzzz / README
Last active August 29, 2015 14:18 — forked from larsmans/README
Sentiment analysis experiment using scikit-learn
================================================
The script sentiment.py reproduces the sentiment analysis approach from Pang,
Lee and Vaithyanathan (2002), who tried to classify movie reviews as positive
or negative, with three differences:
* tf-idf weighting is applied to terms
* the three-fold cross validation split is different
* regularization is tuned by cross validation
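The tf-idf weighting mentioned in the first bullet can be sketched in plain Python (a bare-bones variant with tf = raw count and idf = log(N / df); scikit-learn's TfidfVectorizer uses a smoothed formula, so the numbers differ):

```python
import math

def tfidf(docs):
    """Assign tf-idf weights to each term of each document.

    `docs` is a list of token lists; returns one dict of term -> weight
    per document, with idf = log(N / document frequency).
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        weighted.append({t: c * math.log(n / df[t]) for t, c in counts.items()})
    return weighted

docs = [["good", "movie"], ["bad", "movie"]]
weights = tfidf(docs)
print(weights[0]["movie"])  # 0.0: 'movie' appears in every document
```

Terms that occur in every review get weight zero, which is exactly why tf-idf helps here: uninformative words stop dominating the classifier.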
@bbzzzz
bbzzzz / README
Last active August 29, 2015 14:19
IMDB review Sentiment Analysis based on Support Vector Machine
Sentiment Analysis using sklearn
=================================
* sklearn LinearSVC
* 10-fold cross validation
* accuracy 88.45%
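The 10-fold cross validation in the second bullet amounts to partitioning the data into ten test folds. A sketch of the index split (a plain sequential split; sklearn's KFold additionally supports shuffling):

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross validation
    over n examples, distributing any remainder across the first folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(100, k=10))
print(len(folds))  # 10 folds, each holding out 10 test examples
```

Averaging the per-fold accuracies of a classifier such as LinearSVC over these ten splits gives the reported cross-validated accuracy.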
@bbzzzz
bbzzzz / 1.0 README
Last active May 4, 2021 21:00
IMDB Sentiment Analysis using Naive Bayes
Sentiment Analysis using Naive Bayes
====================================
* Naive Bayes
* Add-1 smoothing
* 10-fold cross validation
* regular expression detecting negation words
Besides the regular method, the code also implements:
* Boolean Naive Bayes
* Naive Bayes with stop words
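The listed ingredients can be combined in a toy sketch (not the gist's code; the tiny training set and the NOT_-prefix negation trick are illustrative assumptions):

```python
import math
import re

NEGATION = re.compile(r"\b(?:not|no|never|n't)\b")

def mark_negation(text):
    """Prefix every token after a negation word with NOT_,
    so 'not great' and 'great' become distinct features."""
    out, negated = [], False
    for tok in text.split():
        out.append("NOT_" + tok if negated else tok)
        if NEGATION.search(tok):
            negated = True
    return out

def train_nb(docs_by_label):
    """Multinomial Naive Bayes with add-1 (Laplace) smoothing.
    `docs_by_label` maps a label to a list of token lists."""
    vocab, counts, totals, priors = set(), {}, {}, {}
    n_docs = sum(len(d) for d in docs_by_label.values())
    for label, docs in docs_by_label.items():
        priors[label] = math.log(len(docs) / n_docs)
        counts[label] = {}
        for doc in docs:
            for tok in doc:
                vocab.add(tok)
                counts[label][tok] = counts[label].get(tok, 0) + 1
        totals[label] = sum(counts[label].values())
    return priors, counts, vocab, totals

def classify(tokens, model):
    priors, counts, vocab, totals = model
    best, best_score = None, float("-inf")
    for label in priors:
        # add-1 smoothing: every count is incremented by 1, the
        # denominator grows by the vocabulary size
        score = priors[label] + sum(
            math.log((counts[label].get(t, 0) + 1) / (totals[label] + len(vocab)))
            for t in tokens)
        if score > best_score:
            best, best_score = label, score
    return best

model = train_nb({
    "pos": [mark_negation("a great movie"), mark_negation("really great")],
    "neg": [mark_negation("not great at all"), mark_negation("terrible movie")],
})
print(classify(mark_negation("not great"), model))  # neg
```

The negation marking is what lets the model separate "great" from "not great": after preprocessing they share no features, so the smoothed counts push them toward opposite labels.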
@bbzzzz
bbzzzz / download_report
Created April 16, 2015 23:59
Webscrape all XBRL files given stock ticker
import urllib2
from bs4 import BeautifulSoup
def get_list(ticker):
base_url_part1 = "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK="
base_url_part2 = "&type=&dateb=&owner=&start="
base_url_part3 = "&count=100&output=xml"
href = []
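The preview ends before the URL pieces are joined. A sketch of the assembly step, reusing the three parts defined above (the function name and the `start` paging parameter default are assumptions):

```python
def edgar_filing_list_url(ticker, start=0):
    """Assemble the EDGAR company-browse URL from the pieces in the gist:
    CIK/ticker, a paging offset, and count=100 with XML output."""
    base_url_part1 = "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK="
    base_url_part2 = "&type=&dateb=&owner=&start="
    base_url_part3 = "&count=100&output=xml"
    return base_url_part1 + ticker + base_url_part2 + str(start) + base_url_part3

url = edgar_filing_list_url("AAPL")
print("CIK=AAPL" in url)  # True
```

Requesting XML output (`output=xml`) makes the listing straightforward to walk with BeautifulSoup before downloading each XBRL file.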
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This code is for ZestFinance modeling team interview homework assisgnment. ML algorithms including Regularized Logistic Regression, Elastic Net, Random Fores and Gradient Boosting (xgboost) are applied."
]
},
{