pickle4 midterm report
Abstract
===============================================================================
As initially described in my proposal, my work up to the midterm involved
improving the Python version of pickle with the suggestions from PEP 3154 [1].
The progress so far has been tagged in the pickle4 repository at [2], under
the tag pickle4-midterm.
Features
===============================================================================
The following improvements have been made:
1) pickling of very large bytes and strings
- not yet tested; this needs a machine with lots of memory or a smarter
way to test (such as creating a NullStream which simply drops everything
written to it except for the first few bytes, plus an object which
behaves like a very large bytes object but doesn't actually store its
data); a sketch of the NullStream idea follows the opcode list below.
2) more space-efficient pickling of small strings and bytes.
Tested by: test_v4_efficient_unicode, test_v4_efficient_binbytes
3) native pickling with sets and frozensets, with support for
self-referential sets and frozensets.
Tested by: test_v4_sets, test_v4_set_refs, test_v4_sets_opcodes,
test_v4_cycles
4) removed the BINPUT opcode (the pickler and unpickler now agree to use
the same indexing technique and agree on which opcodes memoize
something).
Tested by pretty much all of the tests, because unpickling would fail
if the strategy weren't consistent between the pickler and the
unpickler.
5) better error checking in v4 and more consistent exceptions are raised
(typically only UnpicklingError, EOFError and MemoryError should be
raised by the unpickler in sane conditions).
Tested by: test_v4_exceptions
6) removed GLOBAL in v4 in favor of binary equivalents, with the addition
of a BINGLOBAL_COMMON opcode (which has a list of commonly-used module
names).
Tested by: test_v4_binglobal_common, test_v4_binglobal_big,
test_v4_nested_classes
7) pickling of nested classes, functions in classes and bound methods.
Tested by: test_v4_nested_classes, test_v4_pickle_bound_methods
8) pickling/unpickling calls of __new__ with keyword args.
Tested by: test_v4_new_kwargs
9) when pickling to a stream where data can't be "taken back" once it has
been written, the new BAIL_OUT opcode lets a pickler that fails "late"
inform an unpickler reading from the same stream that all the data it
has unpickled so far is garbage and should be abandoned.
Tested by: test_v4_bail_out
New opcodes: SEMISHORT_BINUNICODE, SHORT_BINUNICODE, BINUNICODE64,
SEMISHORT_BINBYTES, BINBYTES64, BINGLOBAL, BINGLOBAL_BIG, BINGLOBAL_COMMON,
EMPTY_SET, EMPTY_FROZENSET, UPDATE_SET, FROZENSET, BIND, NEWOBJ_KW,
BAIL_OUT
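As promised above, here is a minimal sketch of the NullStream idea from item 1;
the class, its byte counts and the object being pickled are illustrative only,
not code from the repository:

import pickle

class NullStream:
    """Keeps only the first few bytes written and counts the rest."""
    def __init__(self, keep=16):
        self.prefix = bytearray()
        self.keep = keep
        self.total = 0

    def write(self, data):
        # Remember a short prefix (enough to check the leading opcodes)
        # and otherwise just count, so huge pickles never sit in memory.
        if len(self.prefix) < self.keep:
            self.prefix += data[:self.keep - len(self.prefix)]
        self.total += len(data)
        return len(data)

out = NullStream()
pickle.Pickler(out, protocol=pickle.HIGHEST_PROTOCOL).dump(b'x' * (10 ** 7))
print(out.total, bytes(out.prefix))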
Example: small strings and no binput
===============================================================================
>>> l=[1, 2, 3, b'abc', 'abc']
>>> dis(dumps(l,4))
0: \x80 PROTO 4
2: ] EMPTY_LIST
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: K BININT1 3
10: C SHORT_BINBYTES 'abc'
15: \x8e SHORT_BINUNICODE 'abc'
20: e APPENDS (MARK at 3)
21: . STOP
highest protocol among opcodes = 4
>>> len(dumps(l,4)), len(dumps(l,3))
(22, 31)
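A rough encoder/decoder pair illustrating the framing that the offsets above
suggest for SHORT_BINUNICODE (one opcode byte, one length byte, then UTF-8
data); the 0x8e opcode value is taken from the disassembly, everything else is
a simplified sketch rather than the pickler's actual code:

def emit_short_binunicode(s):
    data = s.encode('utf-8')
    assert len(data) < 256, "longer strings need a wider length field"
    return b'\x8e' + bytes([len(data)]) + data

def read_short_binunicode(buf, pos):
    assert buf[pos] == 0x8e
    n = buf[pos + 1]
    text = buf[pos + 2:pos + 2 + n].decode('utf-8')
    return text, pos + 2 + n

frame = emit_short_binunicode('abc')
# 5 bytes here versus 8 with BINUNICODE and its 4-byte length field
print(len(frame), read_short_binunicode(frame, 0))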
Example: sets and frozensets
===============================================================================
>>> fs=frozenset([1,2,3]); s=set([1,2,fs,3])
>>> len(dumps(s,4)), len(dumps(s,3))
(20, 75)
>>> dis(dumps(s,4))
0: \x80 PROTO 4
2: \x94 EMPTY_SET
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: K BININT1 3
10: ( MARK
11: K BININT1 1
13: K BININT1 2
15: K BININT1 3
17: \x97 FROZENSET (MARK at 10)
18: \x96 UPDATE_SET (MARK at 3)
19: . STOP
highest protocol among opcodes = 4
>>> dis(dumps(s,3))
0: \x80 PROTO 3
2: c GLOBAL 'builtins set'
16: q BINPUT 0
18: ] EMPTY_LIST
19: q BINPUT 1
21: ( MARK
22: K BININT1 1
24: K BININT1 2
26: K BININT1 3
28: c GLOBAL 'builtins frozenset'
48: q BINPUT 2
50: ] EMPTY_LIST
51: q BINPUT 3
53: ( MARK
54: K BININT1 1
56: K BININT1 2
58: K BININT1 3
60: e APPENDS (MARK at 53)
61: \x85 TUPLE1
62: q BINPUT 4
64: R REDUCE
65: q BINPUT 5
67: e APPENDS (MARK at 21)
68: \x85 TUPLE1
69: q BINPUT 6
71: R REDUCE
72: q BINPUT 7
74: . STOP
highest protocol among opcodes = 2
>>> a=A(); s=set([1,2,a]); a.s=s
>>> dumps(s,4) # self-referential sets
b'\x80\x04\x94(K\x01K\x02\x93\x00\x01A)\x81}\x8e\x01sh\x00sb\x96.'
>>> dis(dumps(s,4))
0: \x80 PROTO 4
2: \x94 EMPTY_SET
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: \x93 BINGLOBAL_COMMON '0 A'
12: ) EMPTY_TUPLE
13: \x81 NEWOBJ
14: } EMPTY_DICT
15: \x8e SHORT_BINUNICODE 's'
18: h BINGET 0
20: s SETITEM
21: b BUILD
22: \x96 UPDATE_SET (MARK at 3)
23: . STOP
highest protocol among opcodes = 4
>>> a=A(); fs=frozenset([1,2,a]); a.fs=fs;
>>> dis(dumps(fs,4)) # self-referential frozensets
0: \x80 PROTO 4
2: ( MARK
3: K BININT1 1
5: K BININT1 2
7: \x93 BINGLOBAL_COMMON '0 A'
11: ) EMPTY_TUPLE
12: \x81 NEWOBJ
13: } EMPTY_DICT
14: \x8e SHORT_BINUNICODE 'fs'
18: ( MARK
19: K BININT1 1
21: K BININT1 2
23: h BINGET 1
25: \x97 FROZENSET (MARK at 18)
26: s SETITEM
27: b BUILD
28: 1 POP_MARK (MARK at 2)
29: h BINGET 4
31: . STOP
highest protocol among opcodes = 4
>>>
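The key to the self-referential example is that EMPTY_SET pushes the set (and,
per item 4, implicitly memoizes it) before its elements are loaded, so a later
BINGET can refer back to it; UPDATE_SET then fills it in place, while FROZENSET
only builds its object once all elements are on the stack. Below is a toy model
of the stack machine for these three opcodes; the mark and memo handling are
simplified stand-ins for the real unpickler, not its actual code:

stack, memo, marks = [], {}, []

def empty_set():
    stack.append(set())
    memo[len(memo)] = stack[-1]   # memoized before its elements exist,
                                  # which is what makes self-reference work

def mark():
    marks.append(len(stack))

def update_set():
    k = marks.pop()
    items, stack[k:] = stack[k:], []
    stack[-1].update(items)

def frozenset_op():
    k = marks.pop()
    items, stack[k:] = stack[k:], []
    stack.append(frozenset(items))

# Replaying the first disassembly above (EMPTY_SET, MARK, 1, 2, 3, MARK,
# 1, 2, 3, FROZENSET, UPDATE_SET) leaves {1, 2, 3, frozenset({1, 2, 3})}:
empty_set(); mark()
stack.extend([1, 2, 3]); mark(); stack.extend([1, 2, 3]); frozenset_op()
update_set()
print(stack[0])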
Example: Nested globals
===============================================================================
>>> class A:
...     class B:
...         def f(self):
...             return self.x
...
>>> dis(dumps(A.B.f,4))
0: \x80 PROTO 4
2: \x93 BINGLOBAL_COMMON '0 A.B.f'
10: . STOP
>>> b=A.B(); b.x=42
>>> loads(dumps(A.B.f,4))(b)
42
>>> dis(dumps(b.f,4))
0: \x80 PROTO 4
2: \x93 BINGLOBAL_COMMON '0 A.B'
8: ) EMPTY_TUPLE
9: \x81 NEWOBJ
10: } EMPTY_DICT
11: \x8e SHORT_BINUNICODE 'x'
14: K BININT1 42
16: s SETITEM
17: b BUILD
18: \x93 BINGLOBAL_COMMON '0 A.B.f'
26: \x98 BIND
27: . STOP
highest protocol among opcodes = 4
>>> loads(dumps(b.f,4))()
42
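For reference, the effect of BINGLOBAL_COMMON '0 A.B.f' followed by BIND can be
sketched outside the unpickler as a dotted-name lookup plus a descriptor bind.
The module name '__main__' is assumed from the interactive session above, and
resolve_dotted is an illustrative helper rather than the unpickler's code:

import importlib

def resolve_dotted(module_name, dotted_name):
    # Walk attribute by attribute so nested classes such as A.B stay reachable.
    obj = importlib.import_module(module_name)
    for part in dotted_name.split('.'):
        obj = getattr(obj, part)
    return obj

# BIND's effect: take an instance and a plain function and produce the bound
# method; f.__get__(instance) does exactly that for ordinary functions.
f = resolve_dotted('__main__', 'A.B.f')
b = resolve_dotted('__main__', 'A.B')()
b.x = 42
print(f.__get__(b)())   # 42, matching the loads() result above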
Work on the second part
===============================================================================
* Write patches for the discovered bugs and create an issue on the bug tracker.
* Implement in _pickle (the C version) the opcodes that have already been
implemented for the Python version, until all the tests pass.
* Finally, documentation and testing.
Conclusion
===============================================================================
I am on track with the work. Because the code is pretty well tested and
pickletools has been kept up to date accordingly, true test-driven development
can now be achieved for _pickle.
Here is a list of some other ideas that could be implemented before pickle4 is
released:
* Unicode compression of strings for more efficient pickling/unpickling
(possibly with an option to enable/disable this feature so the user can
choose a space-time tradeoff)
* Zigzag encoding of signed integers (see the sketch after this list)
* Implementation of SecurePickler/SecureUnpickler classes, which allow
unpickling of data coming from untrusted streams (such as networks).
Arbitrary code execution can be easily achieved at this time:
>>> from test.pickletester import AbstractPickleTests
>>> apt=AbstractPickleTests()
>>> evil=apt._make_pickle_str('{PROTO} \x04'
... '{BINGLOBAL} \x08builtins\x04eval'
... '{SHORT_BINUNICODE} \x0e print("evil!")'
... '{TUPLE1} {REDUCE} {STOP}')
>>> loads(evil)
evil!
A more secure pickle can currently be created by deriving from Unpickler
and overriding the find_class method, thus limiting which globals can be
looked up and which objects can be created (a minimal example of this
approach is sketched after this list). But this has several
disadvantages and limitations. A SecurePickler would have features such as
the following:
o Only a Python implementation, because security is more important than
speed here, and it's easier to screw up in C than in Python.
o Either never raise an exception in unpickling (such as by providing a
default value to the unpickler in case unpickling fails), or raise
only a known, small set of exceptions by using exception chaining
(hiding exceptions is bad).
o Disable obsolete opcodes (opcodes which have been replaced by newer
ones) to decrease the chance of finding security weaknesses.
o Have security parameters such as memo_maxsize, stack_maxsize,
recurse_maxdepth, allow_globals, limiting the resources that the
unpickler can use.
memo_maxsize => don't memoize more objects than what this value
indicates
stack_maxsize => bail out if the unpickler's stack goes above
this size
recurse_maxdepth => don't recurse above this depth when
unpickling
allow_globals => whether a limited subset of globals should be
allowed
o Have a few "profiles", where each profile is a dictionary with sane
values for the security parameters, so the developer doesn't have
to worry too much about what would be a good value for
memo_maxsize. E.g.:
basic_datatypes_only = {'memo_maxsize' : 100000,
'stack_maxsize' : 100000,
'recurse_maxdepth' : 10,
'allow_globals' : False}
o Take an allowed_globals security parameter, forbidding use of all
globals not listed in it.
o Maybe group certain opcodes (such as SHORT_BINBYTES,
SEMISHORT_BINBYTES, BINBYTES, BINBYTES64 in the group bytes) so the
developer can choose which groups of opcodes to allow.
Clearly, SecurePickler would only work with pickle>=4.
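As a concrete illustration of the zigzag idea, a minimal pure-Python sketch
(not tied to any particular opcode layout):

def zigzag_encode(n):
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..., so integers of small
    # magnitude get small unsigned encodings regardless of sign
    return n * 2 if n >= 0 else -n * 2 - 1

def zigzag_decode(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))

And a minimal example of the find_class approach mentioned above, which is the
baseline a SecureUnpickler would improve on; the whitelist here is only
illustrative:

import io, pickle

class RestrictedUnpickler(pickle.Unpickler):
    # Only the globals listed here may be loaded; everything else is refused.
    ALLOWED = {('builtins', 'set'), ('builtins', 'frozenset')}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(
                'global %s.%s is forbidden' % (module, name))
        return super().find_class(module, name)

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps({1, 2, 3})))    # allowed
# restricted_loads(pickle.dumps(print)) would raise UnpicklingError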
I doubt I can also implement the above within the given timeline, but I believe
they should be considered prior to releasing pickle4, because adding them later
won't be possible without creating incompatibilities (with the possible
exception of SecureUnpickler, although problems could arise there too if
implementing it requires changes to Unpickler to allow a greater degree of
control).
[1] http://www.python.org/dev/peps/pep-3154/
[2] https://bitbucket.org/mstefanro/pickle4/