Python and Data Science

Empowering Quants, Traders & Asset Managers

Dr. Yves J. Hilpisch | The Python Quants & The AI Machine

Texas State University, April 2022

(short link to this Gist: http://bit.ly/py_ds_gist)

Slides

You can find the slides at http://certificate.tpq.io/python_data_science.pdf

Resources

This Gist contains selected resources used during the lecture.

Disclaimer

All the content, Python code, Jupyter Notebooks and other materials (the “Material”) come without warranties or representations, to the extent permitted by applicable law.

None of the Material represents any kind of recommendation or investment advice.

The Material is only meant as a technical illustration.

Leveraged and unleveraged trading of financial instruments, and of contracts for difference (CFDs) in particular, involves a number of risks (for example, losses in excess of deposits). Make sure to understand and manage these risks.

The raw files from this Gist follow: the NLP demo notebook (shown as .ipynb JSON) and two Python helper files. (Some further files in the Gist could not be displayed.)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"<img src=\"http://hilpisch.com/tpq_logo.png\" alt=\"The Python Quants\" width=\"35%\" align=\"right\" border=\"0\"><br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python and Data Science"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Natural Langauge Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH\n",
"\n",
"http://tpq.io | [@dyjh](http://twitter.com/dyjh) | [team@tpq.io](mailto:team@tpq.io) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nlp\n",
"import requests\n",
"import numpy as np\n",
"import pandas as pd\n",
"%config InlineBackend.figure_format = 'svg'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.simplefilter('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving Text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# url = 'https://platinum.tpq.io'\n",
"url = 'https://fin-eco.mccoy.txstate.edu/degrees-programs/msqfe.html'\n",
"# url = 'https://nr.apple.com/d2I7H340u2'\n",
"url = 'https://finance.yahoo.com/'\n",
"url = 'https://hilpisch.com/walden.txt'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html = requests.get(url).text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html[1000:2000]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing Text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw = nlp.clean_up_html(html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(raw)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw[1000:2000]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokens = nlp.tokenize(raw.lower())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokens = set(tokens)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokens = [t for t in tokens if len(t) > 5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokens = sorted(tokens)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokens[100:120]"
]
},
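{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Editor's sketch, not part of the original notebook: a plain frequency count of the lemmatized tokens -- the `set()` step above discards frequencies, so this recomputes them from `raw`.)*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.Series(nlp.tokenize(raw.lower())).value_counts().head(10)"
]
},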
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.generate_key_words(raw, 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.generate_key_words(' '.join(tokens), 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nlp.generate_word_cloud(raw, 35)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://hilpisch.com/tpq_logo.png\" alt=\"The Python Quants\" width=\"30%\" align=\"right\" border=\"0\"><br>\n",
"\n",
"<a href=\"http://tpq.io\" target=\"_blank\">http://tpq.io</a> | <a href=\"http://twitter.com/dyjh\" target=\"_blank\">@dyjh</a> | <a href=\"mailto:training@tpq.io\">training@tpq.io</a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
#
# Mean-Variance Portfolio Class
# Markowitz (1952)
#
# Python for Asset Management
# (c) Dr. Yves J. Hilpisch
# The Python Quants GmbH
#
import math
import numpy as np
import pandas as pd


def portfolio_return(weights, rets):
    ''' Annualized portfolio return (252 trading days). '''
    return np.dot(weights.T, rets.mean()) * 252


def portfolio_variance(weights, rets):
    ''' Annualized portfolio variance. '''
    return np.dot(weights.T, np.dot(rets.cov(), weights)) * 252


def portfolio_volatility(weights, rets):
    ''' Annualized portfolio volatility
        (standard deviation). '''
    return math.sqrt(portfolio_variance(weights, rets))
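
The functions above can be combined with a numerical optimizer to derive Markowitz-style optimal portfolios. The following is a minimal sketch (an editor's illustration, not part of the original lecture files): it generates synthetic daily returns and uses scipy.optimize.minimize with the SLSQP method to find the minimum-volatility, long-only, fully invested portfolio.

#
# Example (editor's sketch, not part of the original material):
# minimum-volatility portfolio via SciPy, using the functions above.
#
import numpy as np
import pandas as pd
from scipy.optimize import minimize

np.random.seed(100)
# synthetic daily returns for three hypothetical assets
rets = pd.DataFrame(np.random.normal(0.0005, 0.01, (252, 3)),
                    columns=['A', 'B', 'C'])

n = rets.shape[1]
cons = {'type': 'eq', 'fun': lambda w: w.sum() - 1}  # fully invested
bnds = n * [(0, 1)]  # long-only

res = minimize(portfolio_volatility, n * [1 / n], args=(rets,),
               method='SLSQP', bounds=bnds, constraints=cons)

print('weights:    ', res.x.round(3))
print('volatility:  %.4f' % portfolio_volatility(res.x, rets))
print('return:      %.4f' % portfolio_return(res.x, rets))

Replacing portfolio_volatility with a negated Sharpe-ratio objective yields the maximum-Sharpe portfolio in the same way.
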
#
# NLP Helper Functions
#
# Artificial Intelligence in Finance
# (c) Dr Yves J Hilpisch
# The Python Quants GmbH
#
import re
import nltk
import string
import pandas as pd
from pylab import plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from lxml.html.clean import Cleaner  # separate 'lxml-html-clean' package with lxml >= 5.2
from sklearn.feature_extraction.text import TfidfVectorizer

plt.style.use('seaborn')  # 'seaborn-v0_8' with matplotlib >= 3.6

cleaner = Cleaner(style=True, links=True, allow_tags=[''],
                  remove_unknown_tags=False)

stop_words = stopwords.words('english')
stop_words.extend(['new', 'old', 'pro', 'open', 'menu', 'close'])


def remove_non_ascii(s):
    ''' Removes all non-ASCII characters.
    '''
    return ''.join(i for i in s if ord(i) < 128)


def clean_up_html(t):
    ''' Strips tags and whitespace artifacts
        from an HTML document.
    '''
    t = cleaner.clean_html(t)
    t = re.sub('[\n\t\r]', ' ', t)
    t = re.sub(' +', ' ', t)
    t = re.sub('<.*?>', '', t)
    t = remove_non_ascii(t)
    return t


def clean_up_text(t, numbers=False, punctuation=False):
    ''' Cleans up a text, e.g. an HTML document,
        by removing HTML tags and normalizing
        the text body (contractions, quotes,
        whitespace).
    '''
    try:
        t = clean_up_html(t)
    except Exception:
        pass
    t = t.lower()
    t = re.sub(r"what's", "what is ", t)
    t = t.replace('(ap)', '')
    t = re.sub(r"\'ve", " have ", t)
    t = re.sub(r"can't", "cannot ", t)
    t = re.sub(r"n't", " not ", t)
    t = re.sub(r"i'm", "i am ", t)
    t = re.sub(r"\'s", "", t)
    t = re.sub(r"\'re", " are ", t)
    t = re.sub(r"\'d", " would ", t)
    t = re.sub(r"\'ll", " will ", t)
    t = re.sub(r'\s+', ' ', t)
    t = re.sub(r"\\", "", t)
    t = re.sub(r"\'", "", t)
    t = re.sub(r"\"", "", t)
    if numbers:
        t = re.sub('[^a-zA-Z ?!]+', '', t)
    if punctuation:
        t = re.sub(r'\W+', ' ', t)
    t = remove_non_ascii(t)
    t = t.strip()
    return t


def nltk_lemma(word):
    ''' If one exists, returns the lemma of a word,
        i.e. its base or dictionary form.
    '''
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma


def tokenize(text, min_char=3, lemma=True, stop=True,
             numbers=False):
    ''' Tokenizes a text and implements some
        transformations (minimum token length,
        stop word removal, lemmatization).
    '''
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if len(t) >= min_char]
    if numbers:
        tokens = [t for t in tokens if t[0].lower()
                  in string.ascii_lowercase]
    if stop:
        tokens = [t for t in tokens if t not in stop_words]
    if lemma:
        tokens = [nltk_lemma(t) for t in tokens]
    return tokens


def generate_word_cloud(text, no, name=None, show=True):
    ''' Generates a word cloud bitmap given a
        text document (string).
        It uses the Term Frequency (TF) and
        Inverse Document Frequency (IDF)
        vectorization approach to derive the
        importance of a word -- represented
        by the size of the word in the word cloud.

        Parameters
        ==========
        text: str
            text as the basis
        no: int
            number of words to be included
        name: str
            path to save the image
        show: bool
            whether to show the generated image or not
    '''
    tokens = tokenize(text)
    vec = TfidfVectorizer(min_df=2,
                          analyzer='word',
                          ngram_range=(1, 2),
                          stop_words='english')
    vec.fit_transform(tokens)
    wc = pd.DataFrame({'words': vec.get_feature_names(),  # .get_feature_names_out() with scikit-learn >= 1.0
                       'tfidf': vec.idf_})
    words = ' '.join(wc.sort_values('tfidf', ascending=True)['words'].head(no))
    wordcloud = WordCloud(max_font_size=110,
                          background_color='white',
                          width=1024, height=768,
                          margin=10, max_words=150).generate(words)
    if show:
        plt.figure(figsize=(10, 10))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()
    if name is not None:
        wordcloud.to_file(name)


def generate_key_words(text, no):
    ''' Returns the no most relevant key words
        (1- and 2-grams) of a text based on
        TF-IDF scores; returns an empty list
        if the vectorization fails.
    '''
    try:
        tokens = tokenize(text)
        vec = TfidfVectorizer(min_df=2,
                              analyzer='word',
                              ngram_range=(1, 2),
                              stop_words='english')
        vec.fit_transform(tokens)
        wc = pd.DataFrame({'words': vec.get_feature_names(),
                           'tfidf': vec.idf_})
        words = wc.sort_values('tfidf', ascending=False)['words'].values
        words = [a for a in words if not a.isnumeric()][:no]
    except Exception:
        words = list()
    return words
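
To close, a short usage sketch for the helper module (an editor's illustration, not part of the original files). It assumes the module is saved as nlp.py, which is consistent with the notebook's import nlp, and that the required NLTK corpora have been downloaded once.

#
# Example (editor's sketch): using the NLP helpers above as a module.
# Assumes the file is saved as nlp.py and the NLTK corpora are present,
# e.g. via:
#   import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
#
import requests
import nlp  # the helper module above (filename assumed)

html = requests.get('https://hilpisch.com/walden.txt').text
raw = nlp.clean_up_html(html)

tokens = nlp.tokenize(raw.lower())
print(tokens[:10])  # first lemmatized tokens

print(nlp.generate_key_words(raw, 10))  # TF-IDF-ranked key words

The same three calls reproduce the notebook's pipeline on any of the other URLs listed there.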