Ha Dinh hadinh1306
hadinh1306 / gist:b2209235cfcfcfdec8c2aefe0f630214
Created August 29, 2018 21:56 — forked from conormm/r-to-python-data-wrangling-basics.md
R to Python: Data wrangling with dplyr and pandas
R to Python: useful data wrangling snippets
The dplyr package in R makes data wrangling significantly easier.
The beauty of dplyr is that, by design, the options available are limited.
Specifically, a set of key verbs form the core of the package.
Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe.
While transitioning to Python, I have greatly missed the ease with which I can think through and solve problems using dplyr in R.
The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).
dplyr is organised around six key verbs
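As a rough illustration of that goal, the sketch below maps the verbs usually named as the core of dplyr (`filter`, `select`, `arrange`, `mutate`, `summarise`, `group_by`) onto pandas. The verb list is an assumption, since the text here is cut off before naming them, and the toy DataFrame and its column names are invented for the example.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 4.0],
    "width": [0.5, 0.7, 0.6, 0.9],
})

# filter(): keep rows that satisfy a condition
filtered = df[df["length"] > 1.5]

# select(): keep a subset of columns
selected = df[["species", "length"]]

# arrange(desc(length)): sort rows
arranged = df.sort_values("length", ascending=False)

# mutate(): add a derived column
mutated = df.assign(area=df["length"] * df["width"])

# group_by() + summarise(): aggregate per group
summary = df.groupby("species", as_index=False).agg(mean_length=("length", "mean"))
```

Each step returns a new DataFrame and leaves `df` unchanged, which is the closest pandas analogue to chaining dplyr verbs with a pipe.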
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
warnings.filterwarnings("ignore")
# Load Data
def _visualize_customer_behavior(AccountCode):
    """This function visualizes customer behavior using subscription, login and republish events of a customer.
    Args:
        AccountCode (str): Account unique identification.
    Returns:
        matplotlib.figure.Figure: a visualization with subscription, login and republish events of a customer.
    """
    sample_subscription, sample_republished, sample_login = _get_sample_data(AccountCode)
def _get_sample_data(AccountCode):
    """This function gets subscription info, login events and republish events for the AccountCode input.
    Args:
        AccountCode (str): Account unique identification.
    Returns:
        pandas.core.frame.DataFrame: 3 dataframes with subscription info, login and republish events.
    """
    sample_subscription = subscription_info_df[subscription_info_df['AccountCode'] == AccountCode]
    sample_republished = republished_df[republished_df['AccountCode'] == AccountCode]
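string_star_mask = df['Stars'].isin(['Unrated', 'NR', '1/4', '1/2', '1/3',
                                     '3.5/2.5', '4/4', '5/5', '4.5/5',
                                     '5/2.5', '5/4', '4.25/5'])
df_length = len(df)
# Split the message across two adjacent string literals so it stays valid Python,
# and use a fixed-point format (two decimal places) for the percentage.
print("Percentage of rows with `Unrated`, `NR` or mixing rates in the dataset is "
      f"{np.sum(string_star_mask)*100/df_length:.2f}%.")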
# Remove string ratings from the dataset
df = df[~string_star_mask]
rating_bins = [0, 1, 2, 3, 4, 5]
rating_bin_labels = ['0-1', '1-2', '2-3', '3-4', '4-5']
df['RatingGroups'] = pd.cut(df['Stars'], rating_bins, include_lowest=True, right=True, labels=rating_bin_labels)
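To make the effect of `include_lowest=True` and `right=True` concrete, here is a minimal sketch that applies the same bins and labels to a few made-up ratings sitting on bin edges. The `stars` series is invented for illustration and is not part of the original dataset.

```python
import pandas as pd

rating_bins = [0, 1, 2, 3, 4, 5]
rating_bin_labels = ['0-1', '1-2', '2-3', '3-4', '4-5']

# Toy ratings (made up), chosen to land on bin edges.
stars = pd.Series([0.0, 1.0, 1.5, 4.0, 5.0])

# right=True closes each bin on the right, so 1.0 falls in '0-1';
# include_lowest=True makes the first bin also include its left edge, 0.
groups = pd.cut(stars, rating_bins, include_lowest=True, right=True,
                labels=rating_bin_labels)
print(groups.tolist())  # ['0-1', '0-1', '1-2', '3-4', '4-5']
```

If the `Stars` column still holds strings after the string ratings are filtered out, it would presumably need converting first (for example with `pd.to_numeric`), since `pd.cut` expects numeric input.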
from collections.abc import Sequence
def identify_geometric_progression(sequence):
    """
    Determine if a sequence is a geometric progression.
    """
    # Use the boolean operator `and` rather than bitwise `&` when combining conditions.
    assert isinstance(sequence, Sequence) and not isinstance(sequence, str), "Expect input to be a sequence that's not string"
    assert len(sequence) > 2, "Expect a sequence with more than 2 items"
    try:
        ratio = sequence[1]/sequence[0]
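The preview above is cut off inside the `try` block, so the original body is unknown. The following is one possible self-contained sketch of the same check, not the author's implementation: the function name `is_geometric_progression`, the use of exceptions instead of `assert`, the `math.isclose` comparison, and the choice to return `False` when the first term is zero are all assumptions made for this example.

```python
import math
from collections.abc import Sequence


def is_geometric_progression(sequence):
    """Return True if every term equals the previous term times a fixed ratio."""
    if not isinstance(sequence, Sequence) or isinstance(sequence, str):
        raise TypeError("Expect input to be a sequence that's not string")
    if len(sequence) <= 2:
        raise ValueError("Expect a sequence with more than 2 items")
    # Assumption: with a zero first term the ratio is undefined, so report False.
    if sequence[0] == 0:
        return False
    ratio = sequence[1] / sequence[0]
    # Compare each term with the previous term scaled by the ratio,
    # using a tolerance so floating-point rounding does not cause false negatives.
    return all(
        math.isclose(current, previous * ratio)
        for previous, current in zip(sequence, sequence[1:])
    )
```

For example, `is_geometric_progression([2, 4, 8, 16])` returns `True`, while `is_geometric_progression([1, 2, 4, 9])` returns `False`.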
import logging
import os
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
from utils import CLIENT_ID
from utils import CLIENT_SECRET
hadinh1306 / peer_review_fake_data.py
Created August 6, 2023 21:57
import pandas as pd
# Create a fake DataFrame
data = {
    'employee': ['Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Mai', 'Mai', 'Mai', 'Mai', 'Mai'],
    'collaborator_employee': ['Minh', 'Mai', 'Lam', 'Nguyen', 'Chau', 'Giang', 'Minh', 'Ha', 'Lam', 'Nguyen', 'Chau'],
    'collaboration_days': [30, 25, 10, 60, 50, 5, 60, 25, 12, 15, 1]
}
df = pd.DataFrame(data)
import pandas as pd
MEANINGFUL_COLLABORATION_DAYS = 15
def _rank_collaborators(df, keep_meaningful_collaborations=True):
    df_copy = df.copy()
    if keep_meaningful_collaborations:
        meaningful_collaborations_mask = df_copy["collaboration_days"] >= MEANINGFUL_COLLABORATION_DAYS
        df_copy = df_copy.loc[meaningful_collaborations_mask]
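The preview stops after the filtering step, so how the original function ranks collaborators is not shown. As a hedged sketch of one way to finish the idea, the example below ranks each employee's collaborators by `collaboration_days`, longest first. The function name `rank_collaborators_sketch`, the `collaborator_rank` column, and the tie-breaking rule (`method="first"`) are assumptions for illustration; the data is the fake DataFrame from `peer_review_fake_data.py`.

```python
import pandas as pd

MEANINGFUL_COLLABORATION_DAYS = 15

data = {
    'employee': ['Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Mai', 'Mai', 'Mai', 'Mai', 'Mai'],
    'collaborator_employee': ['Minh', 'Mai', 'Lam', 'Nguyen', 'Chau', 'Giang', 'Minh', 'Ha', 'Lam', 'Nguyen', 'Chau'],
    'collaboration_days': [30, 25, 10, 60, 50, 5, 60, 25, 12, 15, 1]
}
df = pd.DataFrame(data)


def rank_collaborators_sketch(df, keep_meaningful_collaborations=True):
    """Hypothetical completion: rank collaborators per employee by days worked together."""
    df_copy = df.copy()
    if keep_meaningful_collaborations:
        mask = df_copy["collaboration_days"] >= MEANINGFUL_COLLABORATION_DAYS
        df_copy = df_copy.loc[mask]
    # Rank within each employee, longest collaboration first (rank 1).
    ranks = (
        df_copy.groupby("employee")["collaboration_days"]
        .rank(method="first", ascending=False)
        .astype(int)
    )
    return (
        df_copy.assign(collaborator_rank=ranks)
        .sort_values(["employee", "collaborator_rank"])
        .reset_index(drop=True)
    )


ranked = rank_collaborators_sketch(df)
```

With the threshold of 15 days, Ha keeps four collaborators and Mai keeps three; the shorter collaborations are dropped before ranking.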