@Akash-Rawat
Last active July 2, 2021 10:37
Building Vocabulary
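import pandas as pd
from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data to be downloaded
from sklearn.model_selection import train_test_split


def build_datasets_vocab(root_file, captions_file, transform, split=0.15):
    # My_Flickr1k is not defined in this snippet; it must be importable here.
    df = pd.read_csv(captions_file)
    vocab = {}

    def create_vocab(caption):
        # Assign each unseen lowercased token the next free integer index.
        tokens = [token.lower() for token in word_tokenize(caption)]
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)

    # The vocabulary covers all captions; it is built before the split.
    df["caption"].apply(create_vocab)
    train, valid = train_test_split(df, test_size=split, random_state=42)
    return (
        My_Flickr1k(root_file, train.values, transform),
        My_Flickr1k(root_file, valid.values, transform),
        vocab,
    )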
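
The token-to-index mapping can be tried without the image dataset. The following is a minimal sketch, not part of the original gist: `build_vocab` and `encode` are hypothetical helper names, and `str.split` stands in for `word_tokenize` (an assumption made so the example runs without NLTK data; it does not separate punctuation the way NLTK does).

```python
def build_vocab(captions, tokenize=str.split):
    """Map each distinct lowercased token to the next free integer index."""
    vocab = {}
    for caption in captions:
        for token in tokenize(caption):
            token = token.lower()
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(caption, vocab, tokenize=str.split):
    """Turn a caption into a list of indices; raises KeyError on unseen tokens."""
    return [vocab[token.lower()] for token in tokenize(caption)]


captions = ["A dog runs", "a cat sleeps"]
vocab = build_vocab(captions)
print(vocab)                        # {'a': 0, 'dog': 1, 'runs': 2, 'cat': 3, 'sleeps': 4}
print(encode("A cat runs", vocab))  # [0, 3, 2]
```

Indices follow first-seen order, so the mapping depends on the row order of the captions file. The mapping reserves no index for padding or unknown words, so a caption containing a token outside the vocabulary cannot be encoded as written.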