| R to Python: useful data wrangling snippets | |
| The dplyr package in R makes data wrangling significantly easier. | |
| The beauty of dplyr is that, by design, the options available are limited. | |
| Specifically, a set of key verbs form the core of the package. | |
| Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe. | |
| While transitioning to Python, I have greatly missed the ease with which I can think through and solve problems using dplyr in R. | |
| The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package). | |
| dplyr is organised around six key verbs |
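As a rough illustration of the mapping this document sets out to demonstrate, the sketch below pairs the commonly cited dplyr verbs (`filter`, `select`, `arrange`, `mutate`, `group_by`, `summarise`) with pandas calls. The `flights` DataFrame and its columns are made up for the example, and the choice of verbs is an assumption, since the list in the preview above is cut off.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the verb mapping.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL"],
    "distance": [500, 1500, 700, 2500, 900],
    "air_time": [80, 200, 110, 330, 130],
})

# filter(): keep rows that match a condition
long_flights = flights.query("distance > 600")

# select(): keep a subset of columns
selected = flights[["carrier", "distance"]]

# arrange(): sort rows
arranged = flights.sort_values("distance", ascending=False)

# mutate(): add a derived column (miles per hour, air_time is in minutes)
mutated = flights.assign(speed=lambda d: d["distance"] / d["air_time"] * 60)

# group_by() + summarise(): aggregate per group
summary = (
    flights.groupby("carrier", as_index=False)
    .agg(mean_distance=("distance", "mean"), n_flights=("distance", "size"))
)
```

Method chaining with `.query`, `.assign` and `.agg` is probably the closest pandas analogue to a dplyr pipeline, although plain bracket indexing works just as well.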
| import pandas as pd | |
| import numpy as np | |
| import matplotlib.pyplot as plt | |
| import matplotlib.dates as mdates | |
| import warnings | |
| warnings.filterwarnings("ignore") | |
| # Load Data |
| def _visualize_customer_behavior(AccountCode): | |
| """This function visualizes customer behavior using subscription, login and republish events of a customer. | |
| Args: | |
| AccountCode (str): Account unique identification. | |
| Returns: | |
| matplotlib.figure.Figure: a visualization with subscription, login and republish events of a customer. | |
| """ | |
| sample_subscription, sample_republished, sample_login = _get_sample_data(AccountCode) |
| def _get_sample_data(AccountCode): | |
| """This function gets subscription info, login events and republish events for the AccountCode input. | |
| Args: | |
| AccountCode (str): Account unique identification. | |
| Returns: | |
| tuple of pandas.DataFrame: three DataFrames with subscription info, republish events and login events, in that order. | |
| """ | |
| sample_subscription = subscription_info_df[subscription_info_df['AccountCode'] == AccountCode] | |
| sample_republished = republished_df[republished_df['AccountCode'] == AccountCode] |
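The preview above filters module-level DataFrames with a boolean mask on `AccountCode`. A minimal sketch of the same idea, with the DataFrames passed in explicitly rather than read from globals, might look like this. The helper name `filter_by_account`, the `key` parameter and the toy data are assumptions made for illustration and are not part of the original gist.

```python
import pandas as pd


def filter_by_account(frames, account_code, key="AccountCode"):
    """Return the rows for one account from each DataFrame in `frames`.

    Hypothetical helper: each frame is filtered with a boolean mask on `key`.
    """
    return tuple(frame.loc[frame[key] == account_code] for frame in frames)


# Toy stand-ins for the real subscription and republish tables.
subscription_info_df = pd.DataFrame(
    {"AccountCode": ["A1", "B2"], "Plan": ["basic", "pro"]}
)
republished_df = pd.DataFrame(
    {
        "AccountCode": ["A1", "A1", "B2"],
        "EventDate": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-01-20"]),
    }
)

sample_subscription, sample_republished = filter_by_account(
    [subscription_info_df, republished_df], "A1"
)
```

Passing the frames as arguments keeps the function testable on small fixtures, at the cost of a slightly longer call site.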
| string_star_mask = df['Stars'].isin(['Unrated', 'NR', '1/4', '1/2', '1/3', | |
| '3.5/2.5', '4/4', '5/5', '4.5/5', | |
| '5/2.5', '5/4', '4.25/5']) | |
| df_length = len(df) | |
| string_rating_pct = np.sum(string_star_mask) * 100 / df_length | |
| print(f"Percentage of rows with `Unrated`, `NR` or mixed ratings in the dataset is {string_rating_pct:.2f}%.") | |
| # Remove string ratings from the dataset | |
| df = df[~string_star_mask] |
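The snippet above lists the non-numeric rating strings by hand. As an alternative sketch (not the gist author's method), `pd.to_numeric` with `errors="coerce"` can flag every value that fails to parse as a number, which avoids maintaining the list. The sample `Stars` values here are invented for the example.

```python
import pandas as pd

# Hypothetical sample of a 'Stars' column stored as strings.
df = pd.DataFrame({"Stars": ["3.5", "NR", "4", "1/2", "Unrated"]})

# Values that cannot be parsed as numbers become NaN.
stars_numeric = pd.to_numeric(df["Stars"], errors="coerce")
string_star_mask = stars_numeric.isna()

# Share of rows holding a non-numeric rating, as a percentage.
string_rating_pct = string_star_mask.mean() * 100
print(f"Percentage of rows with non-numeric ratings: {string_rating_pct:.2f}%")

# Keep only the parseable rows and store the ratings as floats.
df_clean = df.loc[~string_star_mask].assign(Stars=stars_numeric[~string_star_mask])
```

One caveat with this approach: it also drops real missing values, so it suits cases where any unparseable rating should be excluded.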
| rating_bins = [0, 1, 2, 3, 4, 5] | |
| rating_bin_labels = ['0-1', '1-2', '2-3', '3-4', '4-5'] | |
| df['RatingGroups'] = pd.cut(df['Stars'], rating_bins, include_lowest=True, right=True, labels=rating_bin_labels) |
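To make the binning behaviour concrete, the sketch below runs the same `pd.cut` call on a handful of invented ratings. It assumes `Stars` is already numeric, since `pd.cut` does not accept string values. With `right=True` each bin includes its right edge, and `include_lowest=True` additionally puts a rating of exactly 0 in the first bin.

```python
import pandas as pd

rating_bins = [0, 1, 2, 3, 4, 5]
rating_bin_labels = ["0-1", "1-2", "2-3", "3-4", "4-5"]

# Hypothetical numeric ratings, including values that sit on bin edges.
stars = pd.Series([0.0, 1.0, 1.5, 3.0, 4.5, 5.0])

rating_groups = pd.cut(
    stars,
    rating_bins,
    include_lowest=True,  # a rating of exactly 0 falls into '0-1'
    right=True,           # bins are closed on the right: (1, 2], (2, 3], ...
    labels=rating_bin_labels,
)
print(rating_groups.tolist())
# ['0-1', '0-1', '1-2', '2-3', '4-5', '4-5']
```

Note that an edge value such as 3.0 lands in `'2-3'`, not `'3-4'`, which matters when reading the group labels.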
| from collections.abc import Sequence | |
| def identify_geometric_progression(sequence): | |
| """ | |
| Determine if a sequence is a geometric progression. | |
| """ | |
| assert isinstance(sequence, Sequence) and not isinstance(sequence, str), "Expect input to be a sequence that's not a string" | |
| assert len(sequence) > 2, "Expect a sequence with more than 2 items" | |
| try: | |
| ratio = sequence[1]/sequence[0] |
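The preview above stops right after the first ratio is computed. One way the check could be completed is sketched here: compute the ratio of the first two terms and verify that every consecutive pair shares it. The function name `is_geometric_progression`, the use of exceptions instead of `assert`, the `rel_tol` parameter and the decision to treat a zero first term as "not geometric" are all assumptions, not the gist's actual implementation.

```python
from collections.abc import Sequence
from math import isclose


def is_geometric_progression(sequence, rel_tol=1e-9):
    """Return True if every term equals the previous term times a common ratio."""
    if not isinstance(sequence, Sequence) or isinstance(sequence, str):
        raise TypeError("Expect input to be a sequence that's not a string")
    if len(sequence) <= 2:
        raise ValueError("Expect a sequence with more than 2 items")

    # Assumption: with a zero first term the ratio is undefined, so report False.
    if sequence[0] == 0:
        return False

    ratio = sequence[1] / sequence[0]
    # Compare by multiplication to avoid dividing by later zero terms.
    return all(
        isclose(current, previous * ratio, rel_tol=rel_tol)
        for previous, current in zip(sequence, sequence[1:])
    )


print(is_geometric_progression([2, 4, 8, 16]))  # True
print(is_geometric_progression([1, 2, 4, 9]))   # False
```

Using `math.isclose` rather than `==` tolerates floating-point rounding for non-integer ratios.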
| import logging | |
| import os | |
| import spotipy | |
| from spotipy.oauth2 import SpotifyOAuth | |
| import pandas as pd | |
| from utils import CLIENT_ID | |
| from utils import CLIENT_SECRET |
| import pandas as pd | |
| # Create a fake DataFrame | |
| data = { | |
| 'employee': ['Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Ha', 'Mai', 'Mai', 'Mai', 'Mai', 'Mai'], | |
| 'collaborator_employee': ['Minh', 'Mai', 'Lam', 'Nguyen', 'Chau', 'Giang', 'Minh', 'Ha', 'Lam', 'Nguyen', 'Chau'], | |
| 'collaboration_days': [30, 25, 10, 60, 50, 5, 60, 25, 12, 15, 1] | |
| } | |
| df = pd.DataFrame(data) |
| import pandas as pd | |
| MEANINGFUL_COLLABORATION_DAYS = 15 | |
| def _rank_collaborators(df, keep_meaningful_collaborations=True): | |
| df_copy = df.copy() | |
| if keep_meaningful_collaborations: | |
| meaningful_collaborations_mask = df_copy["collaboration_days"] >= MEANINGFUL_COLLABORATION_DAYS | |
| df_copy = df_copy.loc[meaningful_collaborations_mask] |
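The function preview above is truncated after the filtering step. A possible continuation, sketched under assumptions, ranks each employee's collaborators by `collaboration_days` with `groupby(...).rank(...)`. The `rank` column name, the dense ranking method and the sort order are guesses made for illustration; the data reuses the fake DataFrame from the earlier snippet.

```python
import pandas as pd

MEANINGFUL_COLLABORATION_DAYS = 15

data = {
    "employee": ["Ha", "Ha", "Ha", "Ha", "Ha", "Ha", "Mai", "Mai", "Mai", "Mai", "Mai"],
    "collaborator_employee": ["Minh", "Mai", "Lam", "Nguyen", "Chau", "Giang",
                              "Minh", "Ha", "Lam", "Nguyen", "Chau"],
    "collaboration_days": [30, 25, 10, 60, 50, 5, 60, 25, 12, 15, 1],
}
df = pd.DataFrame(data)


def rank_collaborators(df, keep_meaningful_collaborations=True):
    """Rank collaborators per employee, most collaboration days first."""
    ranked = df
    if keep_meaningful_collaborations:
        mask = ranked["collaboration_days"] >= MEANINGFUL_COLLABORATION_DAYS
        ranked = ranked.loc[mask]

    # Dense rank within each employee: 1 = most days worked together.
    rank = (
        ranked.groupby("employee")["collaboration_days"]
        .rank(method="dense", ascending=False)
        .astype(int)
    )
    # assign() returns a new DataFrame, so the input is left untouched.
    return (
        ranked.assign(rank=rank)
        .sort_values(["employee", "rank"])
        .reset_index(drop=True)
    )


ranked_df = rank_collaborators(df)
```

With the threshold applied, Ha keeps four collaborators and Mai keeps three; passing `keep_meaningful_collaborations=False` ranks all eleven rows.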