I mocked up a 60 MB XML by taking all the small samples in your original ZIP archive and copying them 200 times, which resulted in over 425k tok elements.
I then profiled your code and found a really bad culprit for chewing up time.
To process that XML took about 36 seconds:
Thu Jun 29 10:50:59 2023    profile.stats

         1521023 function calls (1520459 primitive calls) in 36.464 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      9/1    0.000    0.000   36.464   36.464 {built-in method builtins.exec}
        1    0.323    0.323   36.464   36.464 /Users/zyoung/develop/StackOverflow/main.py:1(<module>)
        1    0.030    0.030   36.130   36.130 /Users/zyoung/develop/StackOverflow/main.py:232(run)
        1    0.482    0.482   36.100   36.100 /Users/zyoung/develop/StackOverflow/main.py:114(op_extract)
    18600   35.098    0.002   35.098    0.002 {method 'index' of 'list' objects}
You can see that the call to index took 96% of the total runtime, yikes!
You call index to get the original position in the list of toks/dtoks:
...
for el in matching_toks:
    ...
    pos = all_toks.index(el)  # <-- right here!
    RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]
    ...
Why does the call to index make this so slow? As the matching_toks loop progresses, the call to all_toks.index(...) must look through more and more of all_toks to find the matching element. In Big-O notation, that takes your for el in matching_toks loop from O(n) to O(n*n), "O of n-squared".
We can see the same behavior in this simple example:
def test_index(size: int):
    list_ = [x for x in range(size)]
    for x in list_:
        list_.index(x)
When I time this for sizes 10_000, 50_000, and 100_000, I see the following run times:
size 10000: 0.29s
size 50000: 7.2s
size 100000: 29s
50_000 takes 25 times as long as 10_000 (despite being only 5 times bigger, because 5*5 = 25), and 100_000 takes 100 times as long (despite being only 10 times bigger, because 10*10 = 100).
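If you want to reproduce numbers like these yourself, here's a minimal timing sketch using time.perf_counter (the exact figures will vary by machine):

import time

# test_index as defined above
def test_index(size: int):
    list_ = [x for x in range(size)]
    for x in list_:
        list_.index(x)

for size in (10_000, 50_000, 100_000):
    start = time.perf_counter()
    test_index(size)
    print(f"size {size}: {time.perf_counter() - start:.2g}s")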
We can easily fix this by giving index a meaningful start position:
def test_index_with_start(size: int):
    list_ = [x for x in range(size)]
    last_x = 0
    for x in list_:
        list_.index(x, last_x)
        last_x = x
Now, for, say, x = 99_999, index doesn't have to start looking at position 0 to find it; it can start looking at position 99_998 (the previous x). The times now look linear:
size 10000: 0.00053s
size 50000: 0.0025s
size 100000: 0.0051s
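One thing worth knowing about list.index in general (not specific to your code): the optional start argument means it only scans from that position onward, and it raises ValueError if the value only appears before start. So this trick assumes each element you look up sits at or after the previous one's position, which the fix below relies on. A tiny demonstration:

letters = ["a", "b", "c", "b"]
print(letters.index("b"))     # 1 -- scans from the beginning
print(letters.index("b", 2))  # 3 -- scans from position 2 onward
# letters.index("a", 1)       # ValueError: "a" only appears before position 1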
The same can be applied to your loop and index(...) call:
prev_pos = 0
for el in matching_toks:
    ...
    pos = all_toks.index(el, prev_pos)
    prev_pos = pos
    ...
With those changes, processing that big XML only took 0.7 seconds (down from about 36).
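As an aside: if for some reason the elements in matching_toks didn't come back in document order, the start-position trick wouldn't apply, but you could still avoid the quadratic scan by precomputing every element's position in one pass. This is just a sketch against your variable names, assuming the tok elements are distinct objects so they can be keyed by id():

# Hypothetical alternative: one O(n) pass to map each element's identity
# to its position in all_toks...
pos_by_id = {id(el): i for i, el in enumerate(all_toks)}

for el in matching_toks:
    pos = pos_by_id[id(el)]  # ...then every lookup is O(1)
    RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]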
You can view my complete code below in main.py. I made some other changes to appease the type hinter, mostly around not mixing None and str... if a variable should end up a string, always treat it as a string, even if empty (not None).
I also changed both your functions and mine to return the (CSV) rows for each document/file. This allowed me to compare the results and make sure that my changes didn't affect the accuracy.
concat.py takes all the small XMLs from the ZIP archive you shared and creates big.xml, if you want to see what I tested against.