@RayWilliam46
RayWilliam46 / batch_encode.py
Last active September 3, 2023 14:02
Batch encodes text data using a Hugging Face tokenizer
# Define the maximum number of tokens per sequence (DistilBERT accepts up to 512)
MAX_LENGTH = 128
# Define function to encode text data in batches
def batch_encode(tokenizer, texts, batch_size=256, max_length=MAX_LENGTH):
    """
    A function that encodes a batch of texts and returns the texts'
    corresponding encodings and attention masks that are ready to be fed
    into a pre-trained transformer model.
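The preview above cuts off before the function body. As a rough illustration of what the docstring describes, here is a minimal sketch, not the gist author's implementation. It assumes `tokenizer` is callable in the style of a Hugging Face tokenizer (accepting `max_length`, `padding`, `truncation`, and `return_attention_mask`, and returning a mapping with `input_ids` and `attention_mask`). The name `batch_encode_sketch` and the choice to return plain Python lists are assumptions made for illustration.

```python
def batch_encode_sketch(tokenizer, texts, batch_size=256, max_length=128):
    """Hypothetical sketch: encode `texts` in slices of `batch_size`.

    Assumes `tokenizer(batch, ...)` returns a mapping with "input_ids"
    and "attention_mask", each a list with one entry per input text.
    """
    input_ids = []
    attention_mask = []

    for start in range(0, len(texts), batch_size):
        # Take the next slice of texts; the last slice may be shorter.
        batch = list(texts[start:start + batch_size])
        encoded = tokenizer(
            batch,
            max_length=max_length,
            padding="max_length",   # pad every sequence to max_length
            truncation=True,        # cut sequences longer than max_length
            return_attention_mask=True,
        )
        input_ids.extend(encoded["input_ids"])
        attention_mask.extend(encoded["attention_mask"])

    return input_ids, attention_mask
```

With a real tokenizer, the two returned lists could then be converted to framework tensors before being passed to a model; that conversion is left out here because the preview does not show which framework the gist targets.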
@kelvintaywl
kelvintaywl / split.py
Last active May 9, 2024 11:39
Python Script to split CSV files into smaller files based on number of lines
import csv
import sys
import os
# Example usage: python split.py example.csv 200
# The command above splits `example.csv` into smaller CSV files of 200 rows each (with the header included).
# For instance, if example.csv has 401 rows, this creates 3 files in the same directory:
# - `example_1.csv` (rows 1 - 200)
# - `example_2.csv` (rows 201 - 400)
# - `example_3.csv` (row 401)
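The preview ends before the splitting logic itself. The behaviour described in the comments can be sketched as follows. This is an illustrative version under stated assumptions rather than the gist's actual code: the function name `split_csv_sketch` is hypothetical, the row counts are taken to mean data rows (excluding the header), and the header is assumed to be repeated at the top of every output file.

```python
import csv
import os


def split_csv_sketch(path, rows_per_file):
    """Hypothetical sketch: split a CSV into numbered files next to the source.

    Each output file gets the header row plus at most `rows_per_file`
    data rows. Returns the list of paths written, in order.
    """
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")

    base, ext = os.path.splitext(path)  # "example.csv" -> ("example", ".csv")
    written = []
    out = None
    writer = None

    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return written  # empty input: nothing to split

        try:
            for index, row in enumerate(reader):
                if index % rows_per_file == 0:
                    # Start a new chunk: close the previous file, open the next.
                    if out is not None:
                        out.close()
                    out_path = f"{base}_{len(written) + 1}{ext}"
                    out = open(out_path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    written.append(out_path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()

    return written
```

Called as `split_csv_sketch("example.csv", 200)`, this streams the input one row at a time, so the whole file never has to fit in memory. A command-line wrapper reading the two arguments from `sys.argv` could be layered on top.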