stephenleo/00_Python.md

## 00_Python.md

      
    Python

A collection of useful Python snippets

  
## imports.md

      
To import a module in one directory into another directory

Directory structure
└─ heroku_apps
├─ README.md
└─ src
    └─ boyorgirl
      ├─ train
      │  ├─ model.py
      │  └─ train.py
      └─ utils
          └─ preprocess.py


Imports in src/boyorgirl/train/train.py
from utils import preprocess

To run train.py
cd heroku_apps/src/boyorgirl
python -m train.train
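
The layout above works because `python -m` puts the current working directory at the front of `sys.path`, so `utils` resolves as a top-level package when the command is run from `heroku_apps/src/boyorgirl`. A minimal sketch that recreates a reduced version of the layout in a temporary directory and runs it. The `clean` helper and the empty `__init__.py` files are assumptions added for illustration and are not part of the original project:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "boyorgirl"
    (root / "train").mkdir(parents=True)
    (root / "utils").mkdir()

    # Assumption: empty __init__.py files make both folders regular packages,
    # so they cannot be shadowed by an installed package of the same name.
    (root / "train" / "__init__.py").write_text("")
    (root / "utils" / "__init__.py").write_text("")

    # Hypothetical helper standing in for the real preprocess.py
    (root / "utils" / "preprocess.py").write_text(
        "def clean(name):\n    return name.strip().lower()\n"
    )
    # Same import style as src/boyorgirl/train/train.py
    (root / "train" / "train.py").write_text(
        "from utils import preprocess\nprint(preprocess.clean('  Alice '))\n"
    )

    # Equivalent to: cd boyorgirl && python -m train.train
    result = subprocess.run(
        [sys.executable, "-m", "train.train"],
        cwd=root, capture_output=True, text=True, check=True,
    )
    output = result.stdout.strip()

print(output)
```

Running `python train/train.py` directly would instead put `train/` on `sys.path`, and the `utils` import would likely fail.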


To reload an already imported module (mainly in Jupyter)
import custom_module
from importlib import reload
reload(custom_module)
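
To see what `reload` does, here is a self-contained sketch that writes a throwaway module to a temporary directory, imports it, edits the file, and reloads it. The module name `custom_module_demo` and its `VALUE` attribute are made up for this example:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so the reload always re-reads the edited source
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "custom_module_demo.py"  # hypothetical module
    module_path.write_text("VALUE = 1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import custom_module_demo
    before = custom_module_demo.VALUE

    # Simulate editing the module on disk, then pick up the change
    module_path.write_text("VALUE = 200\n")
    importlib.reload(custom_module_demo)
    after = custom_module_demo.VALUE

    sys.path.remove(tmp)

print(before, after)
```

Note that names imported with `from custom_module import something` keep pointing at the old objects after a reload; they have to be re-imported.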


## lists.md

      
    Python Lists

A collection of common operations on Python lists


De-duplicate
# Remove duplicates while preserving order
items = [1, 2, 0, 1, 3, 2]
items = list(dict.fromkeys(items))
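
`dict.fromkeys` only works when the items are hashable. For unhashable items such as lists, one possible variation is to de-duplicate on a hashable key instead. The helper name `dedupe_by_key` below is hypothetical, a sketch rather than a standard function:

```python
def dedupe_by_key(items, key):
    """Keep the first item seen for each key, preserving order."""
    seen = {}
    for item in items:
        # setdefault only stores the item if the key is new
        seen.setdefault(key(item), item)
    return list(seen.values())

rows = [[1, "a"], [2, "b"], [1, "a"], [3, "c"]]
unique_rows = dedupe_by_key(rows, key=tuple)
print(unique_rows)
```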


Merge overlapping lists
# Merge overlapping sublists in a list of lists
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import networkx 
from networkx.algorithms.components.connected import connected_components

def merge_overlapping_list_of_lists(l):
    def to_graph(l):
        G = networkx.Graph()
        for part in l:
            # each sublist is a bunch of nodes
            G.add_nodes_from(part)
            # it also implies a number of edges:
            G.add_edges_from(to_edges(part))
        return G

    def to_edges(l):
        """
        Treat `l` as a path graph and yield its edges:
        to_edges(['a','b','c','d']) -> (a,b), (b,c), (c,d)
        """
        it = iter(l)
        last = next(it)

        for current in it:
            yield last, current
            last = current

    G = to_graph(l)
    # connected_components yields one set of nodes per merged group
    return list(connected_components(G))

l = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
print(merge_overlapping_list_of_lists(l))
# prints a list of sets (element order within each set may vary), e.g.
# [{'a', 'b', 'c', 'd', 'e', 'f', 'g', 'o', 'p'}, {'k'}]
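
If adding a networkx dependency is not desirable, the same grouping can be sketched with a small union-find (disjoint set) structure from the standard library only. This is a swapped-in technique, not the approach from the linked answer, and the function name `merge_overlapping` is made up for this example:

```python
def merge_overlapping(groups):
    """Merge sublists that share at least one element (union-find sketch)."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in groups:
        for item in group:
            find(item)
        # Link every item in the sublist to the first one
        for other in group[1:]:
            union(group[0], other)

    merged = {}
    for item in parent:
        merged.setdefault(find(item), []).append(item)
    return [sorted(members) for members in merged.values()]

sample = [['a','b','c'],['b','d','e'],['k'],['o','p'],['e','f'],['p','a'],['d','g']]
merged_groups = merge_overlapping(sample)
print(merged_groups)
```

Unlike the networkx version, this returns sorted lists rather than sets, which makes the output deterministic.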


Merge overlapping tuples
# Merge tuples with overlapping elements
# https://stackoverflow.com/questions/22680116/merging-overlapping-items-in-a-list
# After sorting, if an item starts within the range of the previous one, the two tuples are merged.
# The resulting tuple covers the range of both items (minimum to maximum values).

mylist = [(1, 1), (1, 6), (2, 5), (4, 4), (9, 10)]
desired_output = [(1, 6), (9, 10)]

result = []
for item in sorted(mylist):
    result = result or [item]
    if item[0] > result[-1][1]:
        result.append(item)
    else:
        old = result[-1]
        result[-1] = (old[0], max(old[1], item[1]))

print(result) #[(1, 6), (9, 10)]
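
The same loop can be wrapped in a reusable function. A minimal sketch, assuming closed integer ranges given as `(start, end)` tuples; the name `merge_intervals` is chosen here for illustration:

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) tuples into the ranges that cover them."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend its end if needed
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new range
            merged.append((start, end))
    return merged

print(merge_intervals([(1, 1), (1, 6), (2, 5), (4, 4), (9, 10)]))
```

Sorting first means each item only ever needs to be compared with the last merged range.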


## pandas.md

      
    Useful Pandas tips


Joins

When merging two tables, always use `how='inner'` first to ensure that your keys match.
An empty result could mean your keys have different datatypes. Align the datatypes and try again.
Once you get results from `how='inner'`, you can then change to `how='left'` or `how='right'`.
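
A small sketch of that workflow. The `orders` and `customers` tables and their columns are invented for this example. Depending on the pandas version and the dtypes involved, mismatched key types may produce an empty result or raise an error, so the keys are aligned before merging:

```python
import pandas as pd

# Hypothetical tables: the key is an integer on one side and a string on the other
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
customers = pd.DataFrame({"customer_id": ["1", "2", "4"], "name": ["Ann", "Bob", "Dee"]})

# Align the key datatypes first
customers["customer_id"] = customers["customer_id"].astype("int64")

# Step 1: inner join to confirm that the keys actually match
inner = orders.merge(customers, on="customer_id", how="inner")
print(len(inner))

# Step 2: once the inner join returns rows, switch to the join you need
left = orders.merge(customers, on="customer_id", how="left")
print(left)
```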


Renaming

To rename all columns, you can directly use
df.columns = ['col1', 'col2']

To rename a single column in a table with many columns, you can use
df.rename(columns={'col100': 'column100'}, inplace=True)


Explode

Converting a column of lists to multiple rows
import pandas as pd

df = pd.DataFrame({'brands': [['Apple', 'Xiaomi'], ['Huawei'], ['Samsung', 'Apple']],
            'tweet_id': [1234, 1235, 1236]})
print(df.head())
df.explode('brands')


Time Series

To create a time series, you can use
pd.date_range('2019-01-01', periods=10, freq='D')

To shift a time series, you can use
df['value'].shift(1)

To calculate the difference between consecutive values of a time series, you can use
df['value'].diff()

To calculate the cumulative sum of a time series, you can use
df['value'].cumsum()

To calculate the 3 day moving average of a time series, you can use
df['value'].rolling(window=3).mean()

To forward-fill missing values in a time series, you can use
df['value'].ffill()  # fillna(method='ffill') is deprecated in recent pandas versions

To resample a time series to monthly means (requires a DatetimeIndex), you can use
df['value'].resample('M').mean()  # pandas >= 2.2 prefers the alias 'ME' for month end
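
The operations above can be tried together on a tiny series. A minimal sketch with made-up values; it resamples to 2-day bins rather than monthly only because the sample covers six days:

```python
import pandas as pd

idx = pd.date_range("2019-01-01", periods=6, freq="D")
df = pd.DataFrame({"value": [1.0, 2.0, None, 4.0, 5.0, 6.0]}, index=idx)

filled = df["value"].ffill()                   # carry the last known value forward
shifted = filled.shift(1)                      # previous day's value
delta = filled.diff()                          # change from the previous day
running_total = filled.cumsum()                # cumulative sum
moving_avg = filled.rolling(window=3).mean()   # 3-day moving average
two_day_mean = filled.resample("2D").mean()    # mean per 2-day bin

print(pd.DataFrame({
    "filled": filled,
    "shifted": shifted,
    "delta": delta,
    "running_total": running_total,
    "moving_avg": moving_avg,
}))
print(two_day_mean)
```

The first rows of `shifted`, `delta`, and `moving_avg` are NaN because there is no earlier data to compare against.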


## pdf.md

      
    Snippets to work with PDF files

Merge multiple pdfs

from PyPDF2 import PdfMerger
merger = PdfMerger()

# File 1: all pages
merger.append("file_1.pdf")

# File 2: first page only (pages is a (start, stop) range, stop excluded)
merger.append("file_2.pdf", pages=(0,1))

# File 3: second page only
merger.append("file_3.pdf", pages=(1,2))

merger.write("Combined_file.pdf")
merger.close()

  
## plotly.md

      
    Plotly and Dash tips and tricks


If plots are not showing in Jupyter
import plotly.io as pio
pio.renderers.default = "iframe"

Select anywhere on the row

Select a Row
Highlight Entire Row


Server Side Caching

If data extraction is too expensive, we can extract the data once during app startup and then cache a copy on the server itself: Link


## progress_bars.md

      
    Python Progress Bars

General code snippets about Python tqdm progress bar usage


Imports
from tqdm import tqdm
import pandas


Normal usage
for i in tqdm(iterable):
  print(i)


Pandas itertuples usage
for row in tqdm(df.itertuples(), total=df.shape[0]):
  print(getattr(row, 'col_name'))


Pandas apply usage
tqdm.pandas()
df['new_col'] = df['col'].progress_apply(fn)


## Pypi_steps.md

      
    Most instructions from: https://github.com/bast/pypi-howto
Test PyPI


pip install twine
python setup.py sdist
python -m twine upload --repository testpypi dist/*

name: `__token__`
pwd: TestPyPI API_TOKEN


pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple stripnet

PyPI


python setup.py sdist bdist_wheel
twine upload dist/* -r pypi

name: `__token__`
pwd: PyPI API_TOKEN


git tag -a v0.0.7 -m "Update support for Py3.7"
git push origin --tags