I wanted to compare the solutions from JonSG and chepner, both to see whether either ran noticeably faster (chepner's in particular) and to check that they only add the BOM and don't mutate the text along the way.
Both failed, but for different reasons; JonSG's is easily fixed.
My comparator:
- runs and times both functions against a 10MB UTF-8 encoded file of random text that runs the full spectrum of Unicode, minus the surrogate code points (U+D800 to U+DFFF), which cannot be encoded as UTF-8
- reads each output file, asserts that it starts with the UTF-8 BOM, then strips the BOM, leaving what should be the original UTF-8 bytes
- prints the timing and whether the remaining bytes match the input

    def compare():
        import time

        for name_out, func in [
            ("output-stream.txt", convert_stream),  # JonSG
            ("output-copy.txt", convert_copy),  # chepner
        ]:
            # Time only the conversion itself.
            beg = time.monotonic()
            func(FNAME_TXT, name_out)
            delta = time.monotonic() - beg

            with open(FNAME_TXT, "rb") as f:
                input_bytes = f.read()

            with open(name_out, "rb") as f:
                # The output must start with the UTF-8 BOM...
                first_three = f.read(3)
                assert first_three == b"\xEF\xBB\xBF", f"first three bytes of '{name_out}'={first_three}; want BOM (b'\\xEF\\xBB\\xBF')"  # fmt: skip
                # ...and everything after it must equal the input bytes.
                output_sans_bom = f.read()

            print(
                f"{func.__name__} ran in {delta:.4f} s; output==input = {output_sans_bom==input_bytes}"
            )

JonSG's ran in about 0.04 seconds, but mutated the text along the way:

    convert_stream ran in 0.0363 s; output==input = False

Adding newline="" to the open() call for the input file fixes that by stopping the reader from normalizing line endings:

    ...
    with open(path_in, "r", encoding="utf-8", newline="") as f_in:
    ...

    convert_stream ran in 0.0340 s; output==input = True

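The effect of the newline argument can be seen in isolation with a few bytes. This is only a small sketch, separate from the comparison script; read_text is a helper name made up for the demonstration. In text mode with the default newline=None, Python translates \r\n and a lone \r to \n while reading, whereas newline="" hands the line endings back untranslated:

```python
import os
import tempfile


def read_text(path, newline):
    # Hypothetical helper: read a whole file in text mode with the given newline setting.
    with open(path, "r", encoding="utf-8", newline=newline) as f:
        return f.read()


# Three different line endings: \r\n, a lone \r, and \n.
raw = b"a\r\nb\rc\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
path = tmp.name

default_text = read_text(path, None)  # universal newlines: endings become \n
exact_text = read_text(path, "")      # endings are returned untranslated
os.remove(path)

print(repr(default_text))  # 'a\nb\nc\n'
print(repr(exact_text))    # 'a\r\nb\rc\n'
```

Random text that covers all of Unicode will contain \r characters, which is why the default setting changed the bytes. Note that text-mode writes also translate \n to os.linesep by default, so on Windows the output file would presumably need newline="" as well.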
chepner's fails with an AttributeError raised inside shutil.copyfileobj:

    Traceback (most recent call last):
      File "/Users/zyoung/develop/StackOverflow/main.py", line 100, in <module>
        main()
      File "/Users/zyoung/develop/StackOverflow/main.py", line 20, in main
        compare()
      File "/Users/zyoung/develop/StackOverflow/main.py", line 33, in compare
        func(FNAME_TXT, name_out)
      File "/Users/zyoung/develop/StackOverflow/main.py", line 64, in convert_copy
        copyfileobj(chain(codecs.BOM_UTF8, in_), out)
      File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/shutil.py", line 201, in copyfileobj
        fsrc_read = fsrc.read
                    ^^^^^^^^^
    AttributeError: 'itertools.chain' object has no attribute 'read'

I don't know itertools and/or chaining very well, so I cannot say what's wrong or what the fix might be.
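For what it's worth, the traceback does point at the immediate cause: copyfileobj calls .read() on its source, and itertools.chain returns a plain iterator, not a file-like object. One possible way to keep the byte-copy idea is to write the BOM first and then let copyfileobj stream the rest. The following is only a sketch: convert_copy_fixed is a made-up name, it assumes both files can be opened in binary mode, and it has not been run through the timing comparison above.

```python
import codecs
import os
import tempfile
from shutil import copyfileobj


def convert_copy_fixed(path_in, path_out):
    """Hypothetical variant of convert_copy: write the BOM, then copy the raw bytes."""
    with open(path_in, "rb") as in_, open(path_out, "wb") as out:
        out.write(codecs.BOM_UTF8)  # b"\xef\xbb\xbf"
        copyfileobj(in_, out)       # binary copy, so no decoding or newline translation


# Small self-check on a throwaway file with mixed line endings and non-ASCII text.
payload = "line one\r\nline two \u00e9\u4e2d\n".encode("utf-8")
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "in.txt")
    dst = os.path.join(tmp_dir, "out.txt")
    with open(src, "wb") as f:
        f.write(payload)
    convert_copy_fixed(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result == codecs.BOM_UTF8 + payload)
```

Because the copy never decodes the data, it cannot alter line endings, but it also does not check that the input is valid UTF-8 or whether it already starts with a BOM.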
The complete script that generates and runs the comparison can be found below.