@ArthurWangNet
ArthurWangNet / gist:5216339
Created March 21, 2013 20:18
HTML: Basic Template
<!DOCTYPE html>
<html>
<head>
  <title></title>
  <script src=""></script>
</head>
<body>
</body>
</html>
@ArthurWangNet
ArthurWangNet / get_column_index.py
Created September 5, 2020 04:03
Get the index number of a column in a DataFrame by its label.
# Position of the column labelled "TCP" within df.columns
df.columns.get_loc("TCP")
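For reference, a minimal self-contained example (the column labels here are made up):
import pandas as pd

df = pd.DataFrame(columns=["Open", "High", "Low", "TCP"])
df.columns.get_loc("TCP")  # -> 3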
@ArthurWangNet
ArthurWangNet / get_moving_average_with_rolling.py
Created September 5, 2020 04:06
Try to get n-item averages by using rolling. Useful when calculating an SMA.
# Use rolling to get the moving average of the column at position 10
df['10Day-TCP-Avg'] = df.iloc[:, 10].rolling(window=10).mean()
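Selecting by label avoids hard-coding the position; a sketch assuming the column is labelled "TCP" as in the previous snippet:
# Same 10-period moving average, selecting the column by label
df['10Day-TCP-Avg'] = df['TCP'].rolling(window=10).mean()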
@ArthurWangNet
ArthurWangNet / yfinance_override.py
Created September 5, 2020 04:14
When using yfinance with pandas-datareader, declare the override first.
import yfinance as yf
yf.pdr_override()  # route pandas_datareader's Yahoo downloads through yfinance
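The usage this enables, sketched with a placeholder ticker and date range:
from pandas_datareader import data as pdr
import yfinance as yf

yf.pdr_override()
# After the override, get_data_yahoo is served by yfinance behind the scenes
df = pdr.get_data_yahoo("AAPL", start="2020-01-01", end="2020-09-01")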
@ArthurWangNet
ArthurWangNet / epoch_converter.py
Created September 5, 2020 04:30
Converting between milliseconds-since-epoch format and readable date and time format
# Some tests on how to play with epoch dates and times. End date is expressed
# as milliseconds since the epoch.
import datetime
import time
import pytz

def convert_datetime_to_mill_epoch(dt):
    epoch = datetime.datetime.utcfromtimestamp(0)
    return (dt - epoch).total_seconds() * 1000.0

def convert_mill_epoch_to_datetime(e_time):
    # The body was cut off in the gist; a minimal completion returning naive UTC
    return datetime.datetime.utcfromtimestamp(e_time / 1000.0)
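A quick round-trip check of the two helpers (illustrative, not from the gist):
import datetime

now = datetime.datetime.utcnow()
millis = convert_datetime_to_mill_epoch(now)
# Float math may lose sub-millisecond precision, hence the tolerance
assert abs(convert_mill_epoch_to_datetime(millis) - now) <= datetime.timedelta(milliseconds=1)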
@ArthurWangNet
ArthurWangNet / adding_new_columns.py
Created September 5, 2020 05:07
Adding new columns by performing operations on existing columns
'''
In this case, 'Date' and 'Time' are two new columns that need to be added by converting the existing
column 'datetime', which is in unix milliseconds format.
epoch_converter is a file containing functions that help with the conversion.
'''
stock['Date'] = stock.apply(lambda row: epoch_converter.get_date_from_mill_epoch(row['datetime']), axis=1)
stock['Time'] = stock.apply(lambda row: epoch_converter.get_time_from_mill_epoch(row['datetime']), axis=1)
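The two helpers are not shown in the gist; a plausible sketch of what epoch_converter might contain, assuming New York local time via pytz:
import datetime
import pytz

NY_TZ = pytz.timezone('America/New_York')

def get_date_from_mill_epoch(e_time):
    # Hypothetical implementation: milliseconds since epoch -> local date
    return datetime.datetime.fromtimestamp(e_time / 1000.0, tz=NY_TZ).date()

def get_time_from_mill_epoch(e_time):
    # Hypothetical implementation: milliseconds since epoch -> local time
    return datetime.datetime.fromtimestamp(e_time / 1000.0, tz=NY_TZ).time()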
@ArthurWangNet
ArthurWangNet / pandas_to_datetime.py
Last active September 6, 2020 06:31
Using pandas' built-in to_datetime function to convert epoch timestamps to readable dates and times
"""
'datetime' contains unix epoch timestamps in milliseconds. to_datetime first converts them to
timezone-aware, readable dates and times, then .dt.tz_convert shifts them to the specified timezone.
Lastly, .dt.date and .dt.time pull the parts out of the data.
Somehow this computes much faster than the pytz and datetime functions: in a test over 20 identical
csv files, pytz and datetime took 21.1 seconds, while pandas to_datetime took 3.74 seconds.
"""
stock['Date'] = pd.to_datetime(stock['datetime'],unit='ms',utc=True).dt.tz_convert('America/New_York').dt.date
stock['Time'] = pd.to_datetime(stock['datetime'],unit='ms',utc=True).dt.tz_convert('America/New_York').dt.time
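Both lines repeat the same conversion; computing it once and reusing the result halves the work (a refinement, not from the original gist):
converted = pd.to_datetime(stock['datetime'], unit='ms', utc=True).dt.tz_convert('America/New_York')
stock['Date'] = converted.dt.date
stock['Time'] = converted.dt.time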
@ArthurWangNet
ArthurWangNet / multiprocessing_files_process.py
Last active September 6, 2020 07:30
Using multiprocessing to handle a large number of files
"""
When dealing with large number of files, In this case around 5000, using for loop will takes a lot of time.
The idea is using python's multiprocessing features to utilize mulit-core CPU to speed it up.
In simple term, just write the operation willing to perform into a single function instead of using for loop.
The iteration of for loop before will be replaced with pool.map()
There are still some improvements might want to consider
1. If I need some value returns from each process and get a final list fo return value, what to do?
2. Is this the best way to use multiprocessing?
"""
@ArthurWangNet
ArthurWangNet / apply_df_by_multiprocessing.py
Created September 9, 2020 19:51 — forked from yong27/apply_df_by_multiprocessing.py
pandas DataFrame apply multiprocessing
import multiprocessing
import pandas as pd
import numpy as np

def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    # Truncated in the gist; the usual completion splits the frame into one
    # chunk per worker, applies in parallel, and concatenates the results
    pool = multiprocessing.Pool(processes=workers)
    result = pool.map(_apply_df, [(chunk, func, kwargs) for chunk in np.array_split(df, workers)])
    pool.close()
    return pd.concat(list(result))
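A usage sketch for the function above (square and the frame are invented for illustration; func must be a named top-level function so it can be pickled for the worker processes):
def square(x):
    return x ** 2

if __name__ == '__main__':
    df = pd.DataFrame({'a': range(10), 'b': range(10)})
    result = apply_by_multiprocessing(df, square, workers=4)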
@ArthurWangNet
ArthurWangNet / remove_duplicates.py
Created September 17, 2020 05:38
Remove duplicates while combining two DataFrames
# Stack the two frames, drop rows present in both, and rebuild a clean index
combine = pd.concat([aapl, aapl_overlap]).drop_duplicates().reset_index(drop=True)