Skip to content

Instantly share code, notes, and snippets.

@irwincong
Created August 8, 2020 14:28
Show Gist options
  • Save irwincong/f191b9a827fd0efa0143718ed0aafeda to your computer and use it in GitHub Desktop.
Save irwincong/f191b9a827fd0efa0143718ed0aafeda to your computer and use it in GitHub Desktop.
#!/usr/bin/env python3
# DATE: 2020.08.08
# My initial solution for:
# https://web.archive.org/web/20200618185144/https://techdevguide.withgoogle.com/resources/compress-decompression/
#
# Time complexity: O(length of input string).
# Storage complexity: I think my storage complexity is related to recursion (stack) depth and string length.
# I don't recall how python passes strings, but I hope it is passed by reference instead of by value.
# If by reference, then I think I have a O(length of input string)
# If not, my solution may have a storage requirement of O(length of input string * stack depth)
import sys
def decompress(deflated: str, pos: int=0) -> (int, str):
inflated = ""
N = ""
k = pos
# NOTE: This is not pythonic, but it works. I considered using enumerate() but
# I wouldn't be able to control the string index to "jump forward" on recursion return.
while True:
try:
c = deflated[k]
except IndexError:
return k, inflated
# What follows is probably too heavy on string copy operations because python strings are immutable.
#
# EDIT: Google's solution used nested generators, and they cited code clarity as the primary rationale
# with copy-operations as a secondary reason.
if "0" <= c <= "9":
N += c
elif c == "[":
k, s = decompress(deflated, k+1)
inflated += int(N) * s # NOTE: Could avoid the mult. operation if we checked for N == 1.
N = ""
elif c == "]":
return k, inflated
else:
inflated += c
k += 1
if __name__ == "__main__":
_, t = decompress("10[a]")
assert(t == 10*"a")
_, t = decompress("2[3[a]b]")
assert(t == 2*(3*"a"+"b"))
# EDIT: Google's worst-case test case
_, t = decompress("1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[xx]]]]]]]]]]]]]]]]]]]]")
assert(t == "xx")
@irwincong
Copy link
Author

irwincong commented Aug 8, 2020

I didn't use any of the hints. Taking a look at google's solution (below). Some observations:

  • I had not considered nested generators.
    • That's neat and more memory efficient.
    • TODO: I know how to use generators, but I am not as familiar with nesting generators.
  • Personal preference here, but I think my implementation is easier to read/comprehend even though it is less memory efficient. Maybe because I am not as well versed with nested generators.
  • Google's solution for finding sub_times (their version of my N), like my solution, assumes a well-formed decimal value. The problem description says the input is always valid according to the other rules; however, this is a little ambiguous because in Google Hint #4, it says to consider edge cases such as a[]b and 0[abc]. These, by the grammar, are all valid. The grammar, however, also allows for "weird" numbers such as 00[abc] and 01[abc]. Nothing in Google's provided solution says that these edge cases were considered, but, to be frank, I also did not consider these "weird" numbers until after I implemented my solution.
  • TODO: I should have used isdigit() instead of "0" <= c <= "9". I guess I am not familiar enough with builtin methods to the Python str class
  • Google's solution used an instance check against basestring. I was not familiar with basestring, but by its name I can guess what it is. basestring is not really used in Python3 - https://docs.python.org/3.0/whatsnew/3.0.html.

The builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have functionality enough in common to warrant a shared base class. The 2to3 tool (see below) replaces every occurrence of basestring with str.

#!/usr/bin/env python
import sys
def decomp(text, start=0, times=1):
    """
    Iterate over and decompress the compressed string.
    This approach makes use of nested Python iterators, which is a very clean way
    of expressing expansion of nested iterated items.
    Args:
        text: The entire string to decompress.  It's unobvious, but nice
          to have the whole string plus an index; this allows any error
          detection code to report the absolute index of a problematic
          character.
        start: The starting index within 'text'.  We decompress from
          'start' up through the matching close-brace or end-of-string.
        times: The number of times to repeat decompression.
    """
# Repeat iteration over our sub-chunk N times.
for _ in xrange(times):
    i = start
    # Until we hit the end of the string, or end of our chunk...
    while i < len(text) and text[i] != ']':
        # Emit letters as themselves.
        if text[i].islower():
            yield text[i]
    # If it's not a letter, it must be digit(s).  Convert to integer.
    else:
        sub_times = 0
        while text[i].isdigit():
            sub_times = sub_times * 10 + int(text[i])
            i += 1
        i += 1  # Skip past open-bracket
        # Start an iterator over the sub-chunk.
        for item in decomp(text, i, sub_times):
            # Iterator generates many characters, and then at the very end,
            # it generates an integer.  Provide characters up to our caller,
            # and use the integer to advance our index 'i' to end-of-chunk.
            if isinstance(item, basestring):
                yield item
            else:
                i = item
        # Advance 'i' to the next letter, or skip the close-bracket, whichever.
        i += 1
# Don't emit the trailing integer if we are doing the outermost call.  This
# test could be moved to the decompress() call instead; we would check there
# to see if the result item was basestring or int, just as we do above, but
# I suspect that check would be more expensive than a simple integer
# comparison here, where the type is known.
if start > 0:
    yield i
def decompress(text):
    # We could wrap 'text' to add a leading '1[' and trailing ']' to allow a
    # little bit of simplification in the logic in decomp(), but the
    # simplification would lead to harder-to-read code, as well as requiring
    # O(n) additional time, and a temporary requirement for O(n + 3) additional
    # space during the copy operation.
    #
    # This is O(decompressed-length) for speed (probably), and up to
    # O(compressed-length/4) for additional storage.  In this implementation,
    # the storage requirement is well-disguised: It appears on the function call
    # stack, rather than in declared variables.  E.g., consider a worst-case
    # input of: 1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[1[xx]]]]]]]]]]]]]]]]]]]]
    # which would require a stack depth of 20.
    #
    # The (probably) for the big-O depends on how well Python implements
    # resumption of nested iteration.  At worst, the string above would require
    # a full stack descent, then ascent for *each* of the two 'x' characters,
    # for a total of O(n^2) time.  Another very well-hidden possible cost.
    for letter in decomp(text):
        sys.stdout.write(letter)
    sys.stdout.write('
')
    if __name__ == '__main__':
        decompress(sys.argv[1])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment