Sadman Kabir Soumik sksoumik

@sksoumik
sksoumik / map_two_list.py
Created August 4, 2020 00:25
mapping one list to another
"""
categories: List[str]
category_ids: List[int]
"""
label_details = list(map(lambda x, y: x+ ':' +str(y), categories, category_ids))
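The same pairing can also be written with zip, which some readers find easier to follow than map with a two-argument lambda. This is a minimal sketch; the sample values and the label_to_id name are made up for illustration and are not part of the gist.

```python
from typing import List

# Hypothetical sample data standing in for the gist's two lists.
categories: List[str] = ["cat", "dog", "bird"]
category_ids: List[int] = [0, 1, 2]

# zip pairs items by position; the f-string handles the int-to-str conversion.
label_details = [f"{name}:{idx}" for name, idx in zip(categories, category_ids)]

# A dict is often handier when the goal is lookup rather than display.
label_to_id = dict(zip(categories, category_ids))
```

Note that both zip and map stop at the shorter list, so a length mismatch is silently truncated.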
@sksoumik
sksoumik / simpletransformer.py
Last active July 29, 2020 10:51
simpletransformers default arguments
"""
updated full list:
https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
"""
self.args = {
"output_dir": "outputs/",
"cache_dir": "cache_dir/",
@sksoumik
sksoumik / nan.py
Last active July 22, 2020 08:59
NaN explore
# see the total number of nan values in the dataset
df.isnull().sum().sum()
# see the rows which contain NaN values
nan_rows = df[df.isnull().any(axis=1)]
print(nan_rows)
# remove the rows where 'column_name' is NaN
df = df[df['column_name'].notnull()]
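To see what each of these calls returns, here is a small sketch on a made-up three-row DataFrame. The column names and values are assumptions chosen only to make the NaN positions easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column, on different rows.
df = pd.DataFrame({
    "column_name": ["a", None, "c"],
    "score": [1.0, 2.0, np.nan],
})

# Total number of missing cells across the whole frame.
total_nan = int(df.isnull().sum().sum())

# Rows that have at least one missing cell (any(axis=1) checks across columns).
nan_rows = df[df.isnull().any(axis=1)]

# Keep only the rows where 'column_name' is present.
cleaned = df[df["column_name"].notnull()]
```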
@sksoumik
sksoumik / list_to_dataframe.py
Created July 21, 2020 15:37
combining multiple lists to make a DataFrame
import pandas as pd
# comments = [......]
# true_label = [......]
# predictions = [......]
df = pd.DataFrame(
{'text': comments,
'true labels': true_label,
'predicted labels': predictions
})
@sksoumik
sksoumik / balanced_train_test_split.py
Last active July 29, 2020 10:55
class-balanced (stratified) train/test split using scikit-learn
# data is our DataFrame
# data['class_id'] is our target column
# train 80% and test 20%
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data,
                                     stratify=data['class_id'],
                                     test_size=0.20)
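A quick way to confirm that stratification worked is to compare the class proportions in both splits. The sketch below uses a made-up 80/20 imbalanced DataFrame; the data and the random_state value are assumptions added so the result is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class 0 and 20 rows of class 1.
data = pd.DataFrame({
    "text": [f"sample {i}" for i in range(100)],
    "class_id": [0] * 80 + [1] * 20,
})

train_df, test_df = train_test_split(
    data,
    stratify=data["class_id"],  # preserve the class ratio in both splits
    test_size=0.20,
    random_state=42,            # assumption: fixed seed for repeatability
)

# With 0/1 labels, the mean is the share of class 1 in each split.
train_ratio = train_df["class_id"].mean()
test_ratio = test_df["class_id"].mean()
```

Both ratios should match the share of class 1 in the full dataset.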
@sksoumik
sksoumik / undersampling_using_pandas.py
Created July 20, 2020 10:40
undersampling: randomly pick k rows from each class
# This function randomly picks k rows from each class.
# Classes with fewer than k rows are returned unchanged.
def sampling_k_elements(group, k=3):
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
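An alternative that avoids groupby().apply() is to shuffle the whole frame once and then take the first k rows of each group. This is a sketch of a swapped-in technique rather than the gist's method; the sample data and random_state are assumptions.

```python
import pandas as pd

# Hypothetical data: class sizes 5, 2 and 4.
df = pd.DataFrame({
    "class": ["a"] * 5 + ["b"] * 2 + ["c"] * 4,
    "value": range(11),
})

k = 3
balanced = (
    df.sample(frac=1, random_state=0)  # shuffle every row once
      .groupby("class")
      .head(k)                         # first k shuffled rows per class
      .reset_index(drop=True)
)

counts = balanced["class"].value_counts().to_dict()
```

Because head(k) simply returns fewer rows when a group is smaller than k, small classes are kept whole without a separate length check.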
@sksoumik
sksoumik / class_to_numeric_id.py
Last active July 21, 2020 03:38
converting text columns to numeric IDs in a pandas DataFrame
# column_name's feature values will be converted into 0 and 1
# value_1 class will be converted to 0
# value_2 class will be converted to 1
# Binary Classification
data.loc[data['column_name'] == 'value_1', 'class_id'] = 0
data.loc[data['column_name'] == 'value_2', 'class_id'] = 1
# Multiclass Classification
for i in range(len(data['column_name'].unique())):
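The loop above is cut off in this listing, so its body is not reproduced here. As a separate sketch of one way to assign an integer ID to every class, pandas.factorize can be used; the sample data and the id_to_label name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with three distinct classes.
data = pd.DataFrame({"column_name": ["cat", "dog", "cat", "bird"]})

# factorize returns integer codes (in order of first appearance)
# together with the unique values those codes refer to.
codes, uniques = pd.factorize(data["column_name"])
data["class_id"] = codes

# Reverse lookup, useful for turning predictions back into labels.
id_to_label = dict(enumerate(uniques))
```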
@sksoumik
sksoumik / simpletransformer.py
Last active July 19, 2020 01:27
simpletransformers required environment dependencies (including apex)
# Anaconda version
# create the conda environment with Python 3.6.9
conda create -n envname python=3.6.9
# then install the following packages at the specified versions
!pip install torch===1.2.0 torchvision===0.4.0 -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers==2.11.0
!pip install simpletransformers==0.41.1
!git clone --recursive https://github.com/NVIDIA/apex.git
!cd apex && pip install .
@sksoumik
sksoumik / drop_nan_and_duplicate.py
Last active July 8, 2020 08:20
Drop NaN values and duplicate rows from dataframe
# data is a pandas DataFrame
data = data[data['ColumnName'].notnull()]
# keep=False drops every copy of a duplicated row, including the first one
data = data.drop_duplicates(keep=False)
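The choice of keep changes the result noticeably, which the following sketch illustrates on a made-up frame. The sample rows and the no_copies / first_kept names are assumptions.

```python
import pandas as pd

# Hypothetical data: the "x" row appears twice, one row has a missing value.
data = pd.DataFrame({
    "ColumnName": ["x", "x", "y", None],
    "n": [1, 1, 2, 3],
})

# Drop rows where 'ColumnName' is missing.
data = data[data["ColumnName"].notnull()]

# keep=False removes every row that has a duplicate, so "x" disappears entirely.
no_copies = data.drop_duplicates(keep=False)

# keep="first" retains one copy of each duplicated row.
first_kept = data.drop_duplicates(keep="first")
```

If the intent is to keep one representative of each row, keep="first" (the default) is usually the better fit.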
@sksoumik
sksoumik / under_sampling.py
Created July 7, 2020 06:24
under-sampling for class-imbalanced binary classification problems
# Assuming we have only two classes
# data is a pandas DataFrame
print(data.class_id.value_counts())
# assuming 1 is the minority class and 0 is the majority class
# minority class length
minority_class_length = len(data[data['target_column'] == 1])
print(minority_class_length)
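The listing stops after computing the minority class length. One common way to finish the under-sampling step, sketched here under assumptions and not taken from the gist, is to sample the majority class down to that length and concatenate the two parts. The sample data, variable names and random_state are hypothetical.

```python
import pandas as pd

# Hypothetical data: 7 majority rows (0) and 3 minority rows (1).
data = pd.DataFrame({
    "feature": range(10),
    "target_column": [0] * 7 + [1] * 3,
})

minority = data[data["target_column"] == 1]
majority = data[data["target_column"] == 0]

# Randomly keep as many majority rows as there are minority rows.
majority_sampled = majority.sample(n=len(minority), random_state=42)

# Combine, shuffle, and renumber the rows.
balanced = (
    pd.concat([majority_sampled, minority])
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)

balanced_counts = balanced["target_column"].value_counts().to_dict()
```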