@davidlenz
davidlenz / textblob_de_lemmatizer.py
Created May 25, 2018 09:55
Usage of the German TextBlob lemmatizer. Takes a list of strings and returns their lemmatized versions.
from textblob_de import TextBlobDE as TextBlob

def textblob_lemmatizer(doclist):
    """Takes a list of strings as input and returns a list of lemmatized strings"""
    docs = []
    for doc in doclist:
        blob = TextBlob(doc)
        docs.append(' '.join(blob.words.lemmatize()))
    return docs
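A brief usage sketch for textblob_lemmatizer from the snippet above; the example sentence is illustrative and assumes the textblob-de package is installed.
# illustrative call of textblob_lemmatizer() defined above
sentences = ["Die Katzen saßen auf den alten Bänken."]
print(textblob_lemmatizer(sentences))  # one lemmatized string per input document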
@davidlenz
davidlenz / spacy_lemmatizer.py
Last active May 25, 2018 09:57
Usage of the spaCy lemmatizer. Converts a list of strings to their lemmatized versions.
import spacy

# batch size and thread count used for lemmatization; `settings` is assumed to be a project config module
settings.LEMMATIZER_BATCH_SIZE = 250
settings.LEMMATIZER_N_THREADS = -1

nlp = spacy.load('de')
nlp.disable_pipes('tagger', 'ner')

def spacy_lemmatizer(text, nlp):
    """text is a list of strings; nlp is a spaCy nlp object. Use nlp.disable_pipes('tagger', 'ner') to speed up lemmatization."""
@davidlenz
davidlenz / stopwords.py
Last active June 5, 2018 10:26
Function to generate a list of stopwords from different sources.
import stop_words
from langdetect import detect
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import ast
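The preview only shows the imports. A minimal sketch of what such a stopword collector could look like, assuming the stop_words and NLTK resources imported above; the language mapping and the use of langdetect are illustrative choices, not the gist's actual code.
import stop_words
import nltk
from nltk.corpus import stopwords
from langdetect import detect

nltk.download('stopwords', quiet=True)

# map langdetect ISO codes to NLTK corpus names (only the languages used here)
NLTK_NAMES = {'de': 'german', 'en': 'english'}

def get_stopwords(sample_text):
    """Detect the language of sample_text and merge stopwords from stop_words and NLTK."""
    lang = detect(sample_text)                      # e.g. 'de' or 'en'
    words = set(stop_words.get_stop_words(lang))    # stop_words accepts ISO codes
    words |= set(stopwords.words(NLTK_NAMES.get(lang, 'english')))
    return sorted(words)

if __name__ == '__main__':
    print(get_stopwords("Das ist ein kurzer deutscher Beispieltext.")[:10])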
@davidlenz
davidlenz / jensen-shannon-divergence.py
Last active December 5, 2020 07:05
Implementation of the Jensen-Shannon divergence, based on https://github.com/scipy/scipy/issues/8244
import numpy as np
from scipy.stats import entropy

def js(p, q):
    # cast to float so the in-place normalization below also works for integer inputs
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # normalize to probability distributions
    p /= p.sum()
    q /= q.sum()
    m = (p + q) / 2
    # JSD(p, q) = (KL(p || m) + KL(q || m)) / 2
    return (entropy(p, m) + entropy(q, m)) / 2
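An illustrative sanity check of js(): the Jensen-Shannon divergence is symmetric and zero for identical distributions.
p, q = [0.1, 0.4, 0.5], [0.3, 0.3, 0.4]
print(js(p, q), js(q, p))  # the two values are equal
print(js(p, p))            # 0.0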
@davidlenz
davidlenz / attention_lstm.py
Created April 27, 2018 10:07 — forked from mbollmann/attention_lstm.py
My attempt at creating an LSTM with attention in Keras
class AttentionLSTM(LSTM):
    """LSTM with attention mechanism

    This is an LSTM incorporating an attention mechanism into its hidden states.
    Currently, the context vector calculated from the attended vector is fed
    into the model's internal states, closely following the model by Xu et al.
    (2016, Sec. 3.1.2), using a soft attention model following
    Bahdanau et al. (2014).

    The layer expects two inputs instead of the usual one:
@davidlenz
davidlenz / selenium_google_scrape.py
Created April 26, 2018 21:17
Search on Google and return a list of results with URLs. Tweaked from https://gist.github.com/azam-a/32b89944b98a3fd79d44ebfdac16b63d
# https://gist.github.com/azam-a/32b89944b98a3fd79d44ebfdac16b63d
import pandas as pd
import selenium
print('selenium.__version__: ', selenium.__version__)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
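The preview stops after the imports. A minimal sketch of a search helper built on those imports; the Google result selector ('div#search a') and the Chrome driver are assumptions, since Google's markup changes frequently and the gist's real selectors are not shown.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def google_search(query, driver, timeout=10):
    """Return a dataframe of (title, url) pairs for a Google search; the CSS selectors are assumptions."""
    driver.get('https://www.google.com/search?q=' + query.replace(' ', '+'))
    # wait until the (assumed) result container is present
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div#search')))
    anchors = driver.find_elements(By.CSS_SELECTOR, 'div#search a')
    rows = [(a.text, a.get_attribute('href')) for a in anchors if a.get_attribute('href')]
    return pd.DataFrame(rows, columns=['title', 'url'])

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        print(google_search('jensen shannon divergence', driver).head())
    finally:
        driver.quit()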
@davidlenz
davidlenz / twitter_scraper.py
Last active April 25, 2018 17:39
Scrape data from Twitter and extract sentiment using VaderSentiment. Code is from https://www.pythonprogramming.net/
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
import sqlite3
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from unidecode import unidecode
import time
analyzer = SentimentIntensityAnalyzer()
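The preview ends after the analyzer is created. A minimal sketch of the rest of a VADER-scoring stream listener, assuming the tweepy 3.x StreamListener API, placeholder credentials, and an illustrative SQLite schema; the original pythonprogramming.net code differs in its details.
import json
import sqlite3
import time
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
from unidecode import unidecode
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
conn = sqlite3.connect('twitter.db')
conn.execute('CREATE TABLE IF NOT EXISTS sentiment (unix REAL, tweet TEXT, compound REAL)')

class Listener(StreamListener):
    def on_data(self, data):
        msg = json.loads(data)
        if 'text' not in msg:          # skip keep-alives and delete notices
            return True
        tweet = unidecode(msg['text'])
        score = analyzer.polarity_scores(tweet)['compound']   # compound score in [-1, 1]
        conn.execute('INSERT INTO sentiment VALUES (?, ?, ?)', (time.time(), tweet, score))
        conn.commit()
        return True

    def on_error(self, status_code):
        print(status_code)

if __name__ == '__main__':
    # placeholder credentials; real keys come from the Twitter developer console
    auth = OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
    auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
    Stream(auth, Listener()).filter(track=['python'])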
@davidlenz
davidlenz / reddit_submissions_stream.py
Last active April 25, 2018 16:20
Stream Reddit submissions with PRAW. Additionally finds URLs in submissions and extracts their text.
import newsapi_v2
import findurls
import praw
import pandas as pd
import utils_func
import os
import subreddit
import requests
from newspaper import fulltext
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
url matching regex
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
"""
"""
The regex patterns in this gist are intended to match any URLs,
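The preview shows the imports and the start of a URL-matching helper. A minimal sketch of the submission stream itself, assuming standard PRAW credentials and newspaper's fulltext() for article extraction; the simplified regex stands in for the full Gruber pattern referenced above, and the gist's own helper modules (findurls, utils_func, subreddit, newsapi_v2) are not reproduced here.
import re
import praw
import requests
from newspaper import fulltext

URL_RE = re.compile(r'https?://\S+')   # simplified stand-in for the Gruber pattern referenced above

def stream_submissions(subreddit_name='all'):
    # placeholder credentials; real values come from a Reddit app registration
    reddit = praw.Reddit(client_id='CLIENT_ID',
                         client_secret='CLIENT_SECRET',
                         user_agent='submission-stream-example')
    for submission in reddit.subreddit(subreddit_name).stream.submissions():
        urls = URL_RE.findall(submission.selftext or '') + [submission.url]
        for url in urls:
            try:
                text = fulltext(requests.get(url, timeout=10).text)
            except Exception:
                continue   # skip non-article pages and download errors
            print(submission.id, url, text[:80])

if __name__ == '__main__':
    stream_submissions('news')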
@davidlenz
davidlenz / scrape_newsapi.py
Last active April 26, 2018 13:26
Scrape the sources from the newsapi headlines every 12 hours. https://newsapi.org/
import justext, time
import pandas as pd
import requests, urllib
import utils_func
def get_sources(key):
    """
    Retrieve all sources from newsapi, filter the German and English speaking
    ones and return them as a dataframe.
    """
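The preview cuts off inside get_sources. A minimal sketch of how the body might continue, assuming the newsapi /v2/sources endpoint, which returns a JSON object with a 'sources' list carrying a 'language' field; the column names used for filtering rest on that assumption.
import pandas as pd
import requests

def get_sources(key):
    """
    Retrieve all sources from newsapi, keep the German and English speaking
    ones and return them as a dataframe.
    """
    response = requests.get('https://newsapi.org/v2/sources', params={'apiKey': key})
    sources = pd.DataFrame(response.json()['sources'])
    return sources[sources['language'].isin(['de', 'en'])]

if __name__ == '__main__':
    df = get_sources('YOUR_API_KEY')   # placeholder key
    print(df[['id', 'name', 'language']].head())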