@yashbonde
Created May 6, 2021 22:06
This quick script tackles a sharding problem: for very large datasets it is
often impractical to tokenize everything up front and store the result.
For CLM (causal language modelling) datasets we have a fixed dataset in which
each row has a variable number of tokens. A dummy dataset looks as follows:
[j] [ n] sequence (shown without the trailing EOT token, EOT = 42)
[0] [15] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[1] [13] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[2] [11] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[3] [13] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[4] [15] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[5] [ 8] [0, 1, 2, 3, 4, 5, 6],
[6] [14] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[7] [ 8] [0, 1, 2, 3, 4, 5, 6],
[8] [11] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[9] [10] [0, 1, 2, 3, 4, 5, 6, 7, 8]
j: index of the row in ds
n: number of tokens in this sequence + 1 (the extra one is the EOT token)
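The table above can be rebuilt in a few lines, which also confirms the n column. This is only a sketch: the names `row_lengths`, `n` and `total_tokens` are made up for illustration and are not taken from the original script.

```python
EOT = 42  # end-of-text token id used in the dummy example

# Number of tokens per row (without EOT), read off the table above.
row_lengths = [14, 12, 10, 12, 14, 7, 13, 7, 10, 9]
ds = [list(range(k)) for k in row_lengths]

# n = tokens in the row + 1 for the EOT appended when rows are joined.
n = [len(seq) + 1 for seq in ds]

# Length of the virtual token stream: row0 + EOT + row1 + EOT + ...
total_tokens = sum(n)

print(n)             # [15, 13, 11, 13, 15, 8, 14, 8, 11, 10]
print(total_tokens)  # 118
```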
During initialisation we provide
a) seqlen: number of tokens in each output sample
b) stride: offset, in tokens, between the starts of two consecutive samples
   (the same idea as the stride of a convolution)
When training the model we train on contiguous spans (size = seqlen).
A span may lie entirely inside one sequence, or it may cross one or more
EOT boundaries and merge tokens from several consecutive sequences.
- for seqlen = 10 and stride = 10
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 42, 0, 1, 2, 3, 4], ...
stride = seqlen ensures there is no overlap between consecutive samples
- for seqlen = 10 and stride = 5
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10, 11, 12, 13, 42],
[10, 11, 12, 13, 42, 0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ...
notice how consecutive samples overlap by seqlen - stride = 5 tokens
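The two examples above can be reproduced with a brute-force reference that materialises the whole token stream first. This is exactly what does not scale to large datasets, but it is handy for checking a smarter solution. The function name `naive_spans` is hypothetical, and dropping a trailing window shorter than seqlen is an assumption; the text above does not say how the tail is handled.

```python
EOT = 42  # end-of-text token id from the dummy example


def naive_spans(ds, seqlen, stride):
    """Flatten ds (appending EOT after every row) and slice fixed-size windows."""
    flat = []
    for seq in ds:
        flat.extend(seq)
        flat.append(EOT)
    # Assumption: a trailing window shorter than seqlen is dropped.
    last_start = len(flat) - seqlen
    return [flat[s:s + seqlen] for s in range(0, last_start + 1, stride)]


# First two rows of the dummy dataset are enough to reproduce the examples.
ds = [list(range(14)), list(range(12))]

for span in naive_spans(ds, seqlen=10, stride=5):
    print(span)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# [5, 6, 7, 8, 9, 10, 11, 12, 13, 42]
# [10, 11, 12, 13, 42, 0, 1, 2, 3, 4]
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```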
TASK: given a list of lists called `ds` (see above), `seqlen` and `stride`
a) can you find the total number of samples (l) in the dataset?
b) given any index i with 0 <= i < l, can you return the correct sample
   without materialising the whole token stream?
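One possible way to approach both parts, offered as a sketch rather than as the gist's actual solution: keep only the cumulative row lengths (each row counted with its EOT), derive l from the total length, and use binary search to find the row in which sample i starts. The class name `SpanIndex` is hypothetical, and as an assumption a trailing window shorter than seqlen is dropped.

```python
from bisect import bisect_right
from itertools import accumulate

EOT = 42  # end-of-text token id from the dummy example


class SpanIndex:
    """Random access to fixed-size spans over ds without flattening it."""

    def __init__(self, ds, seqlen, stride):
        self.ds, self.seqlen, self.stride = ds, seqlen, stride
        # ends[j] = stream position one past the EOT that follows row j.
        self.ends = list(accumulate(len(seq) + 1 for seq in ds))
        total = self.ends[-1] if self.ends else 0
        # Assumption: windows that would run past the end are dropped.
        self.l = 0 if total < seqlen else (total - seqlen) // stride + 1

    def __len__(self):
        return self.l

    def __getitem__(self, i):
        if not 0 <= i < self.l:
            raise IndexError(i)
        start = i * self.stride
        j = bisect_right(self.ends, start)  # row containing position `start`
        offset = start - (self.ends[j - 1] if j else 0)
        out = []
        while len(out) < self.seqlen:
            need = self.seqlen - len(out)
            out.extend(self.ds[j][offset:offset + need])
            if len(out) < self.seqlen:
                # Row exhausted: emit its EOT and continue with the next row.
                out.append(EOT)
                j, offset = j + 1, 0
        return out


row_lengths = [14, 12, 10, 12, 14, 7, 13, 7, 10, 9]  # from the dummy table
ds = [list(range(k)) for k in row_lengths]

idx = SpanIndex(ds, seqlen=10, stride=10)
print(len(idx))  # 11
print(idx[1])    # [10, 11, 12, 13, 42, 0, 1, 2, 3, 4]
```

Building the index costs one pass over the row lengths, and each lookup costs a binary search plus the rows the span touches, so tokens only need to be available for the rows actually read.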