@seasonedgeek
Created January 31, 2020 07:20
Solutions for {bound method; utf-8 decode} MultiReader.read & test.bz2: {UnicodeDecodeError} invalid start byte

MultiReader Issues/Solutions

While trying out the reader project code presented in Chapter 1 of The Python Journeyman, I ran into a couple of issues and worked out fixes for them. I suspect both stem from my being a Java defector and a Python newbie.
I'm running Python 3.8.1 on macOS 10.15.

My Issues:

  • UnicodeDecodeError: 'utf-8' codec can't decode byte ...: invalid start byte

    I assume the invalid start byte is related to how test.bz2 was written with the authors' bz2.open(..., mode='wt'). I switched to mode='wb' and encode the text myself before writing.

  • bound method MultiReader.read of <reader.multireader.MultiReader object at 0x10f4a8a90>

    I remember <object>.toString() from my Java days. I googled, found, and added a __str__ method to the MultiReader class definition.

  • See my Results below...
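
To make the two modes mentioned above concrete, here is a minimal sketch rather than the book's code. The temporary file name and the sample text are assumptions chosen for illustration. In text mode bz2.open converts between str and bytes for you; in binary mode the caller does the encoding and decoding.

```python
import bz2
import os
import tempfile

sample = "the rain in spain"  # hypothetical sample text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bz2")  # hypothetical file name

    # Text mode: bz2.open handles the str <-> bytes conversion.
    with bz2.open(path, mode="wt", encoding="utf-8") as f:
        f.write(sample)
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        text_round_trip = f.read()

    # Binary mode: encode before writing, decode after reading.
    with bz2.open(path, mode="wb") as f:
        f.write(sample.encode("utf-8"))
    with bz2.open(path, mode="rb") as f:
        binary_round_trip = f.read().decode("utf-8")

print(text_round_trip == binary_round_trip == sample)
```

Either pairing works as long as the writer and the reader agree on the mode and the encoding.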
# reader/compressed/bzipped.py

import bz2
import sys

opener = bz2.open

if __name__ == '__main__':
    # capture raw text from the command line
    text = ' '.join(sys.argv[2:])

    # prepare string (text): encode to bytes for binary-mode writing
    encoded_text = text.encode(encoding="utf-8", errors="backslashreplace")

    # the with-block closes the file even if the write fails
    with bz2.open(sys.argv[1], mode='wb') as f:
        f.write(encoded_text)
# reader/multireader.py

import os
import re

from reader.compressed import bzipped, gzipped

# This maps file extensions to the corresponding open methods.
extension_map = {
    '.bz2': bzipped.opener,
    '.gz': gzipped.opener,
}


class MultiReader:
    """This class reads the contents of a compressed file."""
    
    def __init__(self, filename):
        """Opens a compressed file for reading."""
        self.extension = os.path.splitext(filename)[1]
        opener = extension_map.get(self.extension, open)
        
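        # determine the reader's mode: extensions containing "b" (here '.bz2')
        # are opened in binary mode, everything else in text mode
        read_mode = 'rb' if re.search("b", self.extension) else 'rt'
        self.f = opener(filename, read_mode)

        # start with empty text so __str__ works before read() is called
        self.text = ''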
        
    def __str__(self):
        """Returns the content captured by read()."""
        return self.text
    
    def close(self):
        self.f.close()
        
    def read(self):
        """Reads the file, decoding bytes to str if it was opened in binary mode."""
        if re.search("b", self.extension):
            self.text = self.f.read().decode(encoding="utf-8", errors="ignore")
        else:
            self.text = self.f.read()
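
As a possible alternative to choosing the mode by searching the extension for "b", the sketch below opens every file in text mode with an explicit encoding. This is an assumption on my part, not the book's approach or the code above: read_text, OPENERS, and the demo file names are hypothetical names introduced for illustration.

```python
import bz2
import gzip
import os
import tempfile

# Map extensions to opener callables; plain open() is the fallback.
OPENERS = {".bz2": bz2.open, ".gz": gzip.open}


def read_text(filename, encoding="utf-8"):
    """Hypothetical helper: read any supported file as text."""
    extension = os.path.splitext(filename)[1]
    opener = OPENERS.get(extension, open)
    # bz2.open, gzip.open and open all accept mode="rt" plus an encoding.
    with opener(filename, mode="rt", encoding=encoding) as f:
        return f.read()


sample = "the rain in spain"  # hypothetical sample text
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("demo.bz2", "demo.gz", "demo.txt"):
        path = os.path.join(tmp, name)
        extension = os.path.splitext(name)[1]
        # Write each demo file in text mode with the matching opener.
        with OPENERS.get(extension, open)(path, mode="wt", encoding="utf-8") as f:
            f.write(sample)
        results[name] = read_text(path)

print(results)
```

With this shape the reader never has to decide whether to decode, since every opener returns str in text mode.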

Results

lessons/pyjourney/chap1 took 2m 54s
➜ python3
Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from reader.multireader import MultiReader
>>> r = MultiReader('test.bz2')
>>> r.read()
>>> r.__str__()
'the rain in spain rains mainly on the plane'
>>> 
>>> q = MultiReader('test.gz')
>>> q.read()
>>> q.__str__()
'the rain in spain rains mainly on the plane'
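
The transcript above calls __str__() directly. A small sketch of the related behaviour follows, using a hypothetical stand-in class (Reader) with no file I/O rather than MultiReader itself: referencing a method without parentheses produces the "bound method" message, while str() and print() invoke __str__ for you.

```python
class Reader:
    """Hypothetical stand-in for MultiReader, with no file I/O."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

    def read(self):
        return self.text


r = Reader("the rain in spain")

# No parentheses: this is the method object itself, not its result.
method_repr = repr(r.read)

# With parentheses the method is actually called.
called = r.read()

# str() (and print()) call __str__ behind the scenes.
as_string = str(r)

print(method_repr)
print(called)
print(as_string)
```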