@mbednarski
Last active May 28, 2022 15:48
import numpy as np

# tokenized_corpus (list of token lists) and word2idx (word -> index dict)
# are expected to be defined before this snippet runs
window_size = 2
idx_pairs = []
# for each sentence
for sentence in tokenized_corpus:
    indices = [word2idx[word] for word in sentence]
    # for each word, treated as the center word
    for center_word_pos in range(len(indices)):
        # for each window position
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            # make sure we do not jump outside the sentence or pair a word with itself
            if context_word_pos < 0 or context_word_pos >= len(indices) or center_word_pos == context_word_pos:
                continue
            context_word_idx = indices[context_word_pos]
            idx_pairs.append((indices[center_word_pos], context_word_idx))

idx_pairs = np.array(idx_pairs)  # it will be useful to have this as a numpy array
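
The snippet relies on `tokenized_corpus` and `word2idx` being defined elsewhere. As a minimal, self-contained sketch of the same center/context pairing, the version below wraps the loop in a function and runs it on a toy corpus. The function name `build_skipgram_pairs`, the toy sentences, and the way `vocabulary` is built are assumptions made for illustration, not part of the original gist.

```python
import numpy as np


def build_skipgram_pairs(tokenized_corpus, word2idx, window_size=2):
    """Return an array of (center_idx, context_idx) rows (hypothetical helper)."""
    idx_pairs = []
    for sentence in tokenized_corpus:
        indices = [word2idx[word] for word in sentence]
        for center_pos, center_idx in enumerate(indices):
            # clamp the window to the sentence boundaries
            lo = max(0, center_pos - window_size)
            hi = min(len(indices), center_pos + window_size + 1)
            for context_pos in range(lo, hi):
                # a word is never its own context
                if context_pos != center_pos:
                    idx_pairs.append((center_idx, indices[context_pos]))
    return np.array(idx_pairs)


# Toy corpus, made up for this example
tokenized_corpus = [["he", "is", "a", "king"], ["she", "is", "a", "queen"]]

# Vocabulary in order of first appearance (an assumption; any stable order works)
vocabulary = []
for sentence in tokenized_corpus:
    for word in sentence:
        if word not in vocabulary:
            vocabulary.append(word)
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

pairs = build_skipgram_pairs(tokenized_corpus, word2idx, window_size=2)
print(pairs.shape)  # (20, 2): 10 pairs per four-word sentence
print(pairs[:2])    # first center word "he" paired with "is" and "a"
```

Clamping the range up front replaces the boundary check inside the inner loop; the pairs and their order should match the loop above.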
@mikefarmer01

I did:

# slices clip at the sentence end, so the last word is kept as context
pairs = [(vocabulary.index(c), vocabulary.index(o))
         for s in tokenized_corpus
         for i, c in enumerate(s)
         for o in s[max(0, i - window_size):i] + s[i + 1:i + 1 + window_size]]
pairs = np.array(pairs)