stecman/dump-pyc-with-gdb.md

## dump-pyc-with-gdb.md

      
This is a technique for extracting all imported modules from a packaged Python application as .pyc files, then decompiling them. The target program needs to be run from scratch, but no debugging symbols are necessary (assuming an unmodified build of Python is being used).

This was originally performed on 64-bit Linux with a Python 3.6 target. The Python scripts have since been updated to handle pyc files for Python 2.7 - 3.9.
### Theory

In Python we can leverage the fact that any module import involving a .py* file will eventually arrive as a ready-to-execute Python code object at this function:
PyObject* PyEval_EvalCode(PyObject *co, PyObject *globals, PyObject *locals);
If a breakpoint is set here in gdb, the C implementation of marshal.dump() can be called to dump the bytecode to a file. Conveniently, the .pyc format is simply a marshaled PyCodeObject with a small header.
The script marshal-to-pyc.py below can be used to convert these raw marshaled code objects into .pyc files and decompile them if desired.
The script py-to-marshal.py can be used to create raw marshal files from Python source files to demonstrate or test this without needing to extract marshaled code from a runtime.
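
To make the idea concrete, here is a minimal sketch of the round trip that the dump relies on: a code object is serialised with `marshal`, read back, and executed. The source string and the `example.py` filename are placeholders for illustration and are not taken from a real target.

```python
import marshal

# Hypothetical module source, standing in for an imported .py file
source = "ANSWER = 6 * 7\n"

# compile() yields the same kind of code object that PyEval_EvalCode receives
code = compile(source, "example.py", "exec")

# marshal.dumps() produces the same bytes that marshal.dump() writes to a file
raw = marshal.dumps(code)

# Loading the bytes back gives an equivalent, executable code object
restored = marshal.loads(raw)
namespace = {}
exec(restored, namespace)
print(namespace["ANSWER"])  # prints 42
```

The marshal format differs between Python versions, so the bytes generally need to be loaded by the same version that produced them.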
### pyc file header

The format of pyc headers has changed between versions. The scripts handle this, but for completeness (since I haven't found it documented anywhere else all at once), here's the header format for each version:
All fields at the time of writing are written as little-endian 32-bit values.

- Python 2.7: [magic_num][source_modified_time]
- Python >= 3.2 (PEP-3147): [magic_num][source_modified_time][source_size]
- Python >= 3.7 (PEP-0552): [magic_num][bit-field][source_modified_time][source_size]

These details are also noted in code comments.
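
As a rough illustration of these layouts, the sketch below splits raw .pyc bytes into header fields and the marshaled payload. `split_pyc` is a hypothetical helper (it is not one of the scripts here), the version thresholds are the ones listed above, and it assumes the bit-field is zero (a traditional timestamp-based pyc). The magic field is read as a 16-bit number followed by `\r\n`, which is how trymagicnum.py treats it.

```python
import struct

def split_pyc(data, version):
    """Split raw .pyc bytes into (header fields, marshaled payload).

    `version` is a (major, minor) tuple for the Python that wrote the file.
    """
    # The first two bytes hold the magic number; bytes 2-3 are b"\r\n"
    magic_num = struct.unpack_from("<H", data, 0)[0]

    if version >= (3, 7):
        names = ("bit_field", "source_modified_time", "source_size")
    elif version >= (3, 2):
        names = ("source_modified_time", "source_size")
    else:
        names = ("source_modified_time",)

    # Every remaining header field is a little-endian unsigned 32-bit value
    values = struct.unpack_from("<%dI" % len(names), data, 4)
    header = dict(zip(names, values))
    header["magic_num"] = magic_num
    return header, data[4 + 4 * len(names):]

# Example with a hand-built Python 3.7 style header (values are made up)
example = struct.pack("<H2sIII", 3394, b"\r\n", 0, 1600000000, 120) + b"payload"
header, payload = split_pyc(example, (3, 7))
print(header, payload)
```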
### Implementation in GDB

Start the debugger in a stopped state:
gdb target_application
Then in the GDB console:
# Wait for the Python library to load if the symbol can't be found before runtime
catch load

# Run the program
run

# Continue until gdb breaks where the target Python .so is loading
continue
# ...

# Break on the target function
break PyEval_EvalCode

Now GDB can be automated to dump every PyCodeObject evaluated at runtime to disk. You may want to test and validate a single dump manually before proceeding with the automated version.
# Index for writing multiple files
set $index = 0

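# Define code dumping command (no symbols available)
# Passing $rdi here is equivalent to passing the `co` argument when debugging symbols are present
# The final argument is the marshal format version: 4 for Python 3.4 and later, 2 for Python 2.7
# Newer GDB versions refuse to call functions without debug info unless the return type is cast
define dump_pyc
  eval "set $handle = (void *) fopen(\"%s/%d.marshal\", \"w\")", $arg0, $index
  call (void) PyMarshal_WriteObjectToFile($rdi, $handle, 4)
  call (int) fclose($handle)
  set $index += 1
end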

command
dump_pyc "/tmp/"
continue
end

The first argument of PyEval_EvalCode is passed in the rdi register on x86_64 Linux (System V calling convention), but it differs on other platforms: for example, rcx on 64-bit Windows and x0 on AArch64. Once you know where the first argument lives on your platform, substitute it for $rdi above.
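
Because the dumps are numbered rather than named, it can help to see which source file each one came from. The following is a minimal sketch, assuming it is run with the same Python version as the target so that `marshal.load` accepts the data. `index_dumps` is a hypothetical helper and is not part of the scripts here.

```python
import glob
import marshal
import os

def index_dumps(dump_dir):
    """Map each *.marshal file in dump_dir to the co_filename of its code object."""
    index = {}
    for path in glob.glob(os.path.join(dump_dir, "*.marshal")):
        with open(path, "rb") as handle:
            try:
                code = marshal.load(handle)
            except (EOFError, ValueError, TypeError):
                # Not loadable by this interpreter (truncated dump or version mismatch)
                index[path] = None
                continue
        # co_filename records the path the module was originally compiled from
        index[path] = getattr(code, "co_filename", None)
    return index
```

Calling `index_dumps("/tmp")` after a run returns a dictionary such as `{"/tmp/0.marshal": "<original path>", ...}`, with `None` for any file that could not be unmarshaled.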
### Testing

The script py-to-marshal.py can be used to create a marshaled code object from a Python source file for testing:
# Compile and strip header to create yourfile.marshal
python py-to-marshal.py yourfile.py

# Build the pyc header again and decompile
python marshal-to-pyc.py yourfile.marshal

# Look at the output (assuming the above didn't fail)
cat yourfile.marshal.py
The header for pyc files has changed several times to date. If you're running into errors about bad marshal data (unknown code type), use this test to confirm the script works on your Python version.
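
Another way to check a rebuilt header, independent of the decompiler, is to ask the running interpreter to import the .pyc directly. This is a sketch assuming Python 3.5 or later, where the loader rejects a wrong magic number with an ImportError. `load_pyc` is a hypothetical helper. Note that this executes the module's top-level code, so only use it on test files you trust.

```python
import importlib.machinery
import importlib.util

def load_pyc(module_name, pyc_path):
    """Import a .pyc file directly; raises ImportError if the magic number is rejected."""
    loader = importlib.machinery.SourcelessFileLoader(module_name, pyc_path)
    spec = importlib.util.spec_from_loader(module_name, loader)
    module = importlib.util.module_from_spec(spec)
    # Reads the file, validates the header, unmarshals the code and runs it
    loader.exec_module(module)
    return module
```

For the test above this would be called as `load_pyc("yourfile", "yourfile.marshal.pyc")`.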

  
## marshal-to-pyc.py
from __future__ import print_function

import marshal
import struct
import sys
import time
import uncompyle6

def _pack_uint32(val):
    """ Convert integer to 32-bit little-endian bytes """
    return struct.pack("<I", val)

def code_to_bytecode(code, mtime=0, source_size=0):
    """
    Serialise the passed code object (PyCodeObject*) to bytecode as a .pyc file

    The args mtime and source_size are inconsequential metadata in the .pyc file.
    """

    # Get the magic number for the running Python version
    if sys.version_info >= (3,4):
        from importlib.util import MAGIC_NUMBER
    else:
        import imp
        MAGIC_NUMBER = imp.get_magic()

    # Add the magic number that indicates the version of Python the bytecode is for
    #
    # The .pyc may not decompile if this four-byte value is wrong. Either hardcode the
    # value for the target version (eg. b'\x33\x0D\x0D\x0A' instead of MAGIC_NUMBER)
    # or see trymagicnum.py to step through different values to find a valid one.
    data = bytearray(MAGIC_NUMBER)

    # Handle extra 32-bit field in header from Python 3.7 onwards
    # See: https://www.python.org/dev/peps/pep-0552
    if sys.version_info >= (3,7):
        # Blank bit field value to indicate traditional pyc header
        data.extend(_pack_uint32(0))

    data.extend(_pack_uint32(int(mtime)))

    # Handle extra 32-bit field for source size from Python 3.2 onwards
    # See: https://www.python.org/dev/peps/pep-3147/
    if sys.version_info >= (3,2):
        data.extend(_pack_uint32(source_size))

    data.extend(marshal.dumps(code))

    return data

if len(sys.argv) < 2:
    print("Usage: %s <marshal-dump-file>" % sys.argv[0])
    sys.exit(1)

path = sys.argv[1]
pycFile = path + ".pyc"
pythonFile = path + ".py"

# Open raw code that was saved using marshal.dump()
#
# It isn't strictly necessary to unmarshal this to write back as a .pyc file,
# but this validates that the marshalled code content is valid (an exception
# is thrown otherwise)
with open(path, 'rb') as handle:
    code = marshal.load(handle)
    pyc = code_to_bytecode(code, time.time())

    with open(pycFile, 'wb') as out:
        out.write(pyc)

# Use uncompyle6 to decompile the bytecode and write it to disk alongside the .pyc
with open(pythonFile, 'w') as decompiled:
    uncompyle6.main.decompile_file(pycFile, decompiled)

## py-to-marshal.py
from __future__ import print_function

import py_compile
import sys
import os

def py_to_marshal(input_file):
    """
    Create a .pyc and .marshal file for the given Python source file
    """
    base_file, ext = os.path.splitext(input_file)
    pyc_file = base_file + ".pyc"
    marshal_file = base_file + ".marshal"

    # Compile to a pyc file
    py_compile.compile(input_file, pyc_file)

    # Trim off the pyc header, leaving only the marshalled code
    if sys.version_info >= (3,7):
        # The header size is 4 bytes longer from Python 3.7
        # See: https://www.python.org/dev/peps/pep-0552
        header_size = 16
    elif sys.version_info >= (3,2):
        # Python 3.2 changed to a 3x 32-bit field header
        # See: https://www.python.org/dev/peps/pep-3147/
        header_size = 12
    else:
        # Python 2.x uses a 2x 32-bit field header
        header_size = 8

    with open(pyc_file, 'rb') as pyc_handle:
        with open(marshal_file, 'wb') as marshal_handle:
            marshal_handle.write(pyc_handle.read()[header_size:])

    return marshal_file


if __name__ == "__main__":
    print("Python %s " % sys.version)

    # Process all arguments as filenames
    for input_file in sys.argv[1:]:
        output_file = py_to_marshal(input_file)

        print("%s -> %s" % (input_file, output_file))

## trymagicnum.py
# Utility to try different .pyc file magic numbers to find one
# The uncompyle6 command-line tool can be used at each pause to test

import struct
import binascii
import sys

# This list is from https://github.com/google/pytype/blob/master/pytype/pyc/magic.py
# These constants are from Python-3.x.x/Lib/importlib/_bootstrap_external.py
PYTHON_MAGIC = {
    # Python 1
    20121: (1, 5),
    50428: (1, 6),

    # Python 2
    50823: (2, 0),
    60202: (2, 1),
    60717: (2, 2),
    62011: (2, 3, 'a0'),
    62021: (2, 3, 'a0'),
    62041: (2, 4, 'a0'),
    62051: (2, 4, 'a3'),
    62061: (2, 4, 'b1'),
    62071: (2, 5, 'a0'),
    62081: (2, 5, 'a0'),
    62091: (2, 5, 'a0'),
    62092: (2, 5, 'a0'),
    62101: (2, 5, 'b3'),
    62111: (2, 5, 'b3'),
    62121: (2, 5, 'c1'),
    62131: (2, 5, 'c2'),
    62151: (2, 6, 'a0'),
    62161: (2, 6, 'a1'),
    62171: (2, 7, 'a0'),
    62181: (2, 7, 'a0'),
    62191: (2, 7, 'a0'),
    62201: (2, 7, 'a0'),
    62211: (2, 7, 'a0'),

    # Python 3
    3000: (3, 0),
    3010: (3, 0),
    3020: (3, 0),
    3030: (3, 0),
    3040: (3, 0),
    3050: (3, 0),
    3060: (3, 0),
    3061: (3, 0),
    3071: (3, 0),
    3081: (3, 0),
    3091: (3, 0),
    3101: (3, 0),
    3103: (3, 0),
    3111: (3, 0, 'a4'),
    3131: (3, 0, 'a5'),

    # Python 3.1
    3141: (3, 1, 'a0'),
    3151: (3, 1, 'a0'),

    # Python 3.2
    3160: (3, 2, 'a0'),
    3170: (3, 2, 'a1'),
    3180: (3, 2, 'a2'),

    # Python 3.3
    3190: (3, 3, 'a0'),
    3200: (3, 3, 'a0'),
    3220: (3, 3, 'a1'),
    3230: (3, 3, 'a4'),

    # Python 3.4
    3250: (3, 4, 'a1'),
    3260: (3, 4, 'a1'),
    3270: (3, 4, 'a1'),
    3280: (3, 4, 'a1'),
    3290: (3, 4, 'a4'),
    3300: (3, 4, 'a4'),
    3310: (3, 4, 'rc2'),

    # Python 3.5
    3320: (3, 5, 'a0'),
    3330: (3, 5, 'b1'),
    3340: (3, 5, 'b2'),
    3350: (3, 5, 'b2'),
    3351: (3, 5),

    # Python 3.6
    3360: (3, 6, 'a0'),
    3361: (3, 6, 'a0'),
    3370: (3, 6, 'a1'),
    3371: (3, 6, 'a1'),
    3372: (3, 6, 'a1'),
    3373: (3, 6, 'b1'),
    3375: (3, 6, 'b1'),
    3376: (3, 6, 'b1'),
    3377: (3, 6, 'b1'),
    3378: (3, 6, 'b2'),
    3379: (3, 6, 'rc1'),

    # Python 3.7
    3390: (3, 7, 'a1'),
    3391: (3, 7, 'a2'),
    3392: (3, 7, 'a4'),
    3393: (3, 7, 'b1'),
    3394: (3, 7, 'b5'),

    # Python 3.8
    3400: (3, 8, 'a1'),
    3401: (3, 8, 'a1'),
    3410: (3, 8, 'a1'),
    3411: (3, 8, 'b2'),
    3412: (3, 8, 'b2'),
    3413: (3, 8, 'b4'),
}

if len(sys.argv) < 2:
    print("Usage: %s <pyc-file>" % sys.argv[0])
    sys.exit(1)

target = sys.argv[1]

if not target.endswith(".pyc"):
    print("Aborting: %s doesn't end with '.pyc' (careful mode)" % target)
    sys.exit(1)

for number in PYTHON_MAGIC:
    version = PYTHON_MAGIC[number]
    binary = struct.pack("<H", number)
    print(" --> Trying %d => %s" % (number, version))

    # Modify the first two bytes of the target file
    with open(target, "r+b") as handle:
        handle.seek(0)
        handle.write(binary)

    input("Press Enter to continue...")