@minesh1291
Forked from codeKgu/data_loading.py
Created August 14, 2023 17:22
Tutorial for multimodal_transformers
import pandas as pd
from multimodal_transformers.data import load_data
from transformers import AutoTokenizer
data_df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
text_cols = ['Title', 'Review Text']
# The label col is expected to contain integers from 0 to N_classes - 1
label_col = 'Recommended IND'
categorical_cols = ['Clothing ID', 'Division Name', 'Department Name', 'Class Name']
numerical_cols = ['Rating', 'Age', 'Positive Feedback Count']
label_list = ['Not Recommended', 'Recommended'] # what each label class represents
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Fill NaN values in the categorical columns before passing them to load_data.
# fillna must run before astype(str); otherwise NaN has already become the string "nan".
for c in categorical_cols:
    data_df[c] = data_df[c].fillna("-9999999").astype(str)
torch_dataset = load_data(
    data_df,
    text_cols,
    tokenizer,
    label_col=label_col,
    label_list=label_list,
    categorical_cols=categorical_cols,
    numerical_cols=numerical_cols,
    sep_text_token_str=tokenizer.sep_token,
)
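
The comment on the label column says it should hold integers from 0 to N_classes - 1. A minimal sketch of a sanity check for that expectation follows; `check_labels` is a hypothetical helper written for illustration and is not part of multimodal_transformers.

```python
import numbers


def check_labels(labels, label_list):
    """Return the distinct label values outside 0 .. len(label_list) - 1.

    Hypothetical helper for illustration; not part of multimodal_transformers.
    """
    n_classes = len(label_list)
    bad = set()
    for value in labels:
        # Accept plain and NumPy integers, but not bools, floats, or strings.
        is_int = isinstance(value, numbers.Integral) and not isinstance(value, bool)
        if not is_int or not 0 <= value < n_classes:
            bad.add(value)
    # Sort by repr so mixed types can be ordered deterministically.
    return sorted(bad, key=repr)


label_list = ['Not Recommended', 'Recommended']
print(check_labels([0, 1, 1, 0], label_list))         # no invalid labels
print(check_labels([0, 1, 2, -1, 0.5], label_list))   # lists the invalid values
```

The helper accepts any iterable, so in this tutorial it could be called as check_labels(data_df[label_col], label_list) before load_data; an empty result means every label is in range.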