@rahulrajaram · Last active April 2, 2023 15:47

Python: Write to a file from multiple threads

I recently came across the need to spawn multiple threads, each of which needs to write to the same file. Since the file will experience contention from multiple threads, we need to guarantee thread safety.

NOTE: The following examples work with Python 3.x. To run them under Python 2.7, replace threading.get_ident() with thread.get_ident(); that function lives in the thread module, which you would need to import in addition to threading.
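For code that must run under both versions, a small compatibility shim keeps the rest of the examples unchanged (a minimal sketch; on Python 3, get_ident lives in threading, while on Python 2 it lives in the thread module):

try:
    from threading import get_ident  # Python 3
except ImportError:
    from thread import get_ident     # Python 2

The examples below would then call get_ident() directly instead of threading.get_ident().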

  1. (The following example takes a very long time.) It creates 200 threads, each of which busy-waits until a global lock is available for acquisition.
# threading_lock.py
import threading

global_lock = threading.Lock()

def write_to_file():
    while global_lock.locked():
        continue

    global_lock.acquire()

    with open("thread_writes", "a+") as file:
        file.write(str(threading.get_ident()))
        file.write("\n")
        file.close()

    global_lock.release()

# Create 200 threads, invoke write_to_file() through each of them,
# and wait for all of them to finish.
threads = []
for i in range(1, 201):
    t = threading.Thread(target=write_to_file)
    threads.append(t)
    t.start()
for thread in threads:
    thread.join()

As mentioned earlier, the above program takes an unacceptable 125s:

python threading_lock.py  125.56s user 0.34s system 103% cpu 2:01.57 total

(Addendum: @agiletelescope points out that the following minor change, which also requires an import time, drastically reduces the busy-waiting.

...
while global_lock.locked():
    time.sleep(0.01)
    continue

)

  2. A simple modification is to store what the threads want to write in an in-memory data structure, such as a Python list, and to write the list's contents to the file once all the threads have joined.
# threading_lock_2.py
import threading

# Global lock
global_lock = threading.Lock()
file_contents = []
def write_to_file():
    while global_lock.locked():
        continue

    global_lock.acquire()
    file_contents.append(threading.get_ident())
    global_lock.release()

# Create 200 threads, invoke write_to_file() through each of them,
# and wait for all of them to finish.
threads = []
for i in range(1, 201):
    t = threading.Thread(target=write_to_file)
    threads.append(t)
    t.start()
for thread in threads:
    thread.join()

with open("thread_writes", "a+") as file:
    file.write('\n'.join([str(content) for content in file_contents]))
    file.close()

The above program takes a significantly shorter, and almost negligible, time:

python threading_lock_2.py  0.04s user 0.00s system 76% cpu 0.052 total

With thread count = 2000:

python threading_lock_2.py  0.10s user 0.06s system 77% cpu 0.206 total

With thread count = 20000:

python threading_lock_2.py  0.10s user 0.06s system 77% cpu 0.206 total
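A related pattern worth noting (a minimal sketch; the filename threading_queue_writer.py is hypothetical and this was not part of the timing runs above): instead of sharing a lock, hand every write to a single dedicated writer thread through a queue.Queue, which is itself thread-safe. Worker threads only enqueue strings; the writer thread owns the file, so no explicit lock is needed.

# threading_queue_writer.py (hypothetical)
import queue
import threading

write_queue = queue.Queue()
SENTINEL = None  # signals the writer to stop

def worker():
    # Workers never touch the file; they only enqueue.
    write_queue.put(str(threading.get_ident()))

def writer():
    # The only thread that writes to the file.
    with open("thread_writes", "a") as file:
        while True:
            item = write_queue.get()
            if item is SENTINEL:
                break
            file.write(item + "\n")

writer_thread = threading.Thread(target=writer)
writer_thread.start()

threads = [threading.Thread(target=worker) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

write_queue.put(SENTINEL)  # all workers are done; stop the writer
writer_thread.join()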
@mbanders

Your second to last line has file.write('\n'.join()) but doesn't join need an argument?

>>> '\n'.join()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: join() takes exactly one argument (0 given)

@loganathanengrr

You just pass an argument; the join function expects one, like this: .join("some value")

@tonykwok

tonykwok commented Aug 10, 2019

It seems like file.write('\n'.join()) should be file.write('\n'.join(file_contents))

@rahulrajaram (Author)

rahulrajaram commented Aug 12, 2019

Hi everyone, yes, you are right. I meant to .join stringified file_contents.
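That is, the write in the second script becomes:

file.write('\n'.join([str(content) for content in file_contents]))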

@agiletelescope

Hey, thanks a lot for the first script.
The execution speed of the first script can be greatly increased by introducing a sleep while waiting for the lock to become free; this prevents the threads from polling the lock too frequently. I was able to get a time of around 0.06s. The script is attached below.
Thank you.

import threading
from time import sleep
from datetime import datetime

global_lock = threading.Lock()

def write_to_file():
    while global_lock.locked():
        sleep(0.01)
        continue

    global_lock.acquire()

    with open("thread_writes", "a+") as file:
        file.write(str(threading.get_ident()))
        file.write("\n")
        file.close()

    global_lock.release()

# Create 200 threads, invoke write_to_file() through each of them,
# and wait for all of them to finish.
threads = []
st = datetime.now()

for i in range(1, 201):
    print(i)
    t = threading.Thread(target=write_to_file)
    threads.append(t)
    t.start()
for thread in threads:
    thread.join()

nd = datetime.now()
print ("Ex time: ", (nd - st).total_seconds())

@rahulrajaram (Author)

@agiletelescope, awesome! Thanks.

@mgirard772

How do you get the output for user, system, cpu and total time like displayed above?

@rahulrajaram (Author)

@mgirard772

Use the Linux/BSD time facility:

time python <python script>

@S0Ulle33

S0Ulle33 commented Jul 16, 2020

@agiletelescope, it's even better if you use global_lock as a context manager (with), and don't call file.close() inside a with block:

import threading
from time import sleep
from datetime import datetime

global_lock = threading.Lock()


def write_to_file():
    with global_lock:
        with open("thread_writes", "a") as file:
            file.write(str(threading.get_ident()))
            file.write("\n")


# Create 200 threads, invoke write_to_file() through each of them,
# and wait for all of them to finish.
threads = []
st = datetime.now()

for i in range(200):
    t = threading.Thread(target=write_to_file)
    threads.append(t)
    t.start()
for thread in threads:
    thread.join()

nd = datetime.now()
print("Ex time: ", (nd - st).total_seconds())

If you only need to append, there's no point in a+; just use a. This also speeds up overall performance.

@jin09

jin09 commented Apr 30, 2021

Hey @rahulrajaram, I think the entire problem with your 1st script is that you are continuously polling the global lock, so your threads never go to sleep while the lock is unavailable, hence the 100% CPU strain. You can avoid this simply by using the lock as a context manager, which automatically takes care of putting your threads to sleep and waking them up.
@agiletelescope this should give even better performance than the 0.01s sleep you added, which may be entirely unnecessary.
Also, you are continuously polling the lock in the list example (2nd script) as well. It takes less time than the 1st script because list operations are fast compared to the file operations in the critical section, but the lock-contention problem persists, which is why you still see high CPU strain.
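For concreteness, a minimal sketch of what @jin09 describes, applied to the second script: the blocking with global_lock: replaces the polling loop, so waiting threads sleep inside acquire() instead of spinning.

import threading

global_lock = threading.Lock()
file_contents = []

def write_to_file():
    # acquire() blocks efficiently; no busy-wait loop is needed.
    with global_lock:
        file_contents.append(threading.get_ident())

threads = [threading.Thread(target=write_to_file) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open("thread_writes", "a") as file:
    file.write('\n'.join(str(content) for content in file_contents))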
