Skip to content

Instantly share code, notes, and snippets.

@Chubek
Last active March 31, 2023 04:56
Show Gist options
  • Save Chubek/efa7f74dda28507ffafd395bff5a56fd to your computer and use it in GitHub Desktop.
Save Chubek/efa7f74dda28507ffafd395bff5a56fd to your computer and use it in GitHub Desktop.
Lazy-splitting in Python

Lazy-splitting means getting the first n token-separated substrings without splitting the entire string first. I wrote a heuristic that does this in Python.

The method I provide has its ups, and downs. It is faster on larger streams, and slower on smaller streams. For smaller stream, just the calling convention, initializations, etc takes much more time than CPython's implementation. But with larger streams, of course, it's much faster, as it does not split the entire stirng all at once.

First let's go through how I prototyped this: Let's generate a file of size 1MB, using Python's string mode (-c flag):

touch rand && python3 -c "from random import choices;from pathlib import Path;Path('rand').write_bytes(bytearray(choices(list(range(65, 122)),k=1000000)))" && wc -c rand

touch rand let's us create a file called rand, this works on all UNIX-like systems and POSIX-compliant terminals. We then create a file containing random characters between ASCII byte 65 (A) and ASCII byte 122 (I think it's small-case z, not sure! But they are all printable). Now let's run Python split on it, and time it using time utility:

$ python3 -c "from time import time_ns;t1=time_ns();f=open('rand','r').read();a,b,c=f.split('A')[:3];t2=time_ns();print(t2-t1,a,b,c,sep='\n')"

1819675
mbMdsYdSJORISOuPPrwdbBnL[y]_F
VJxWK`tKGXyYGpvlsDbOpWYIpT`__oTsbpx`bN[j_PL\CdHVJEgWiXbv
^IsHOiojsacmsj`P]F_Z^vvNCpUuDbWCQIi`lUiktW`Hs\`rcFwWssY\VnkyIoWpqEUNKqwBOgjPYG[eIniy`jIhyCZlTGCUaySNTU\rpSMaukfDvn[GbCBkeFMogq\ICoZ_jBjRklFtDXW\`gmieEmn\lBPmqZc\DjpnII]Rwbwd]GJVhSpbSEs\_BCQeC

It takes 1819675ns for Python to split by A (65) and get the first 3 substrings.

Now lazysplit.py's heuristic in string mode:

$ python3 -c "from time import time_ns;f=open('rand','rb').read();p1=0;p2=0;a='';b='';c='';t1=time_ns();list(map(lambda c: exec(f'p2+=f[p1 + p2:].index(65) + p1;{c}=f[p1:p2];p1=p2+1',globals()),['a','b','c']));t2=time_ns();print(t2 - t1,a,b,c,sep='\n')"

582256
b'mbMdsYdSJORISOuPPrwdbBnL[y]_F'
b'VJxWK`tKGXyYGpvlsDbOpWYIpT`__oTsbpx`bN[j_PL\\CdHVJEgWiXbv'
b'^IsHOiojsacmsj`P]F_Z^vvNCpUuDbWCQIi`lUiktW`Hs\\`rcFwWssY\\VnkyIoWpqEUNKqwBOgjPYG[eIniy`jIhyCZlTGCUaySNTU\\rpSMaukfDvn[GbCBkeFMogq\\ICoZ_jBjRklFtDXW\\`gmieEmn\\lBPmqZc\\DjpnII]Rwbwd]GJVhSpbSEs\\_BCQeC'

Much faster right?

Now take a look at how lazysplit.py itself handles it:

Python result:  [b'FwdKaHBKgBXX\\C]YKTs_l_womMKgNytRYEGlIXW_tmLo`aXQelimtYe^hgbCcaij[uBfZS', b'hTeG\\Uh^DKE\\HfVo]Z]ognryOWwXVPvLEowrYMmKQPOLdmpgrjWhbjgpdRuDvhoOIMveIo', b'bsGWKeyTaBWQNOjcpDdkScoQhpeShNROvp\\`WuT`W`^[ocE^uw]xUL\\hlFVFRVV_rGRhCn]COmUXgsgNmmy[iRSxeHnjISVUMoyqQeLSHRgtDMlut_XmJdvDqQSw[lHjatDmSCajQwxby']
Python time:  1010817
---
Custom result:  [b'FwdKaHBKgBXX\\C]YKTs_l_womMKgNytRYEGlIXW_tmLo`aXQelimtYe^hgbCcaij[uBfZS', b'hTeG\\Uh^DKE\\HfVo]Z]ognryOWwXVPvLEowrYMmKQPOLdmpgrjWhbjgpdRuDvhoOIMveIo', b'bsGWKeyTaBWQNOjcpDdkScoQhpeShNROvp\\`WuT`W`^[ocE^uw]xUL\\hlFVFRVV_rGRhCn]COmUXgsgNmmy[iRSxeHnjISVUMoyqQeLSHRgtDMlut_XmJdvDqQSw[lHjatDmSCajQwxby']
Custom time:  602078
Diff pytime - custtime:  408739

Disclaimer: I am not to be held responsible if this code ends up with issues if you use this in-production. I did this as part of a job, but I would like to be held responsible only if the said project faces issues because of this, not someone else's.

Notice: This function does NOT handle errors. Do not pass it more ns than it can handle.

The file has a shebang so you can do sudo chmod +x ./lazysplit.py && ./lazysplit.py.

Contact me on Discord Chubak#7400. I am currently working on another piece of code that I shall publish via Gist, implementation of djb2 with Aarch64 and x86-64 Assembly. I am also working on my own implementation of Zlib + DEFLATE in Rust. Please stay tuned, and follow my profile.

Thank you.

#!/bin/python3
## Fair License
## 2023 Chubak Bidpaa
## Usage of the works is permitted provided that this instrument is retained with the works, so that any entity that uses the works is notified of this instrument.
## DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.
def lazysplit(bytestring: bytearray,
char: int,
n: int,
offset=0) -> list[bytearray]:
"""
Splits a bytearray by `char` and returns the first `n` substrings lazily.
For example, if you wish to lazily split by `\t` and return the first three do:
a, b, c = split_and_extract(bytearray(s.encode()), 9, 3)
If you set `offset` to a value more than 1 it will start looking from that index
Notice that `bytestring` must necessarily be a `bytearray` object, otherwise it will fail.
This function has no error handling for the sake of speed. Do NOT pass it more `n` than possible!
"""
bytestring.append(char)
p_arr = [offset, 0]
result = [b''] * n
def split_inner(i: int, p_arr=p_arr, result=result, char=char):
p_arr[1] += bytestring[p_arr[0] + p_arr[1]:].index(char) + p_arr[0]
result[i] = bytestring[p_arr[0]:p_arr[1]]
p_arr[0] = p_arr[1] + 1
list(map(split_inner, range(n)))
return result if offset == 0 else list(filter(lambda x: len(x) > 0,
result))
def benchmark_split(string: bytearray, char: int, n: int):
from time import time_ns
char_enc = chr(char).encode()
t1py = time_ns()
respy = string.split(char_enc)[:n]
t2py = time_ns()
t1lz = time_ns()
reslz = lazysplit(string, char, n)
t2lz = time_ns()
print("Python result: ", respy)
print("Python time: ", t2py - t1py, end="\n---\n")
print("Lazy result: ", reslz)
print("Lazy time: ", t2lz - t1lz)
print("Diff pytime - custtime: ", (t2py - t1py) - (t2lz - t1lz))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment