pickle4 midterm report
Abstract
===============================================================================
As initially described in my proposal, my work up to the midterm involved
improving the Python version of pickle with the suggestions from PEP 3154 [1].
The progress so far has been tagged in the pickle4 repository at [2], under
the tag pickle4-midterm.
Features
===============================================================================
The following improvements have been made:
1) pickling of very large bytes and strings
- not yet tested; this needs a machine with lots of memory or a smarter
way to test (such as creating a NullStream which simply drops everything
written to it except for the first few bytes, plus an object which
behaves like a very large bytes object but doesn't actually store its
data); a sketch of the NullStream idea follows the opcode list below.
2) more space-efficient pickling of small strings and bytes.
Tested by: test_v4_efficient_unicode, test_v4_efficient_binbytes
3) native pickling with sets and frozensets, with support for
self-referential sets and frozensets.
Tested by: test_v4_sets, test_v4_set_refs, test_v4_sets_opcodes,
test_v4_cycles
4) removed the BINPUT opcode (the pickler and unpickler now agree to use
the same indexing technique and agree on which opcodes memoize
something).
Tested by pretty much all of the tests, because unpickling would fail
if the strategy weren't consistent between the pickler and the
unpickler.
5) better error checking in v4 and more consistent exceptions are raised
(typically only UnpicklingError, EOFError and MemoryError should be
raised by the unpickler in sane conditions).
Tested by: test_v4_exceptions
6) removed GLOBAL in v4 in favor of binary equivalents, with the addition
of a BINGLOBAL_COMMON opcode (which has a list of commonly-used module
names).
Tested by: test_v4_binglobal_common, test_v4_binglobal_big,
test_v4_nested_classes
7) pickling of nested classes, functions in classes and bound methods.
Tested by: test_v4_nested_classes, test_v4_pickle_bound_methods
8) pickling/unpickling calls of __new__ with keyword args.
Tested by: test_v4_new_kwargs
9) when pickling to a stream where data can't be "taken back" once it has
been written, the new BAIL_OUT opcode lets a pickler that fails "late"
inform an unpickler reading from the same stream that all the data it
has unpickled so far is garbage and should be abandoned.
Tested by: test_v4_bail_out
New opcodes: SEMISHORT_BINUNICODE, SHORT_BINUNICODE, BINUNICODE64,
SEMISHORT_BINBYTES, BINBYTES64, BINGLOBAL, BINGLOBAL_BIG, BINGLOBAL_COMMON,
EMPTY_SET, EMPTY_FROZENSET, UPDATE_SET, FROZENSET, BIND, NEWOBJ_KW,
BAIL_OUT
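As promised above, here is a minimal sketch of the NullStream idea from item 1;
the class, its byte counts and the object being pickled are illustrative only,
not code from the repository:

import pickle

class NullStream:
    """Keeps only the first few bytes written and counts the rest."""
    def __init__(self, keep=16):
        self.prefix = bytearray()
        self.keep = keep
        self.total = 0

    def write(self, data):
        # Remember a short prefix (enough to check the leading opcodes)
        # and otherwise just count, so huge pickles never sit in memory.
        if len(self.prefix) < self.keep:
            self.prefix += data[:self.keep - len(self.prefix)]
        self.total += len(data)
        return len(data)

out = NullStream()
pickle.Pickler(out, protocol=pickle.HIGHEST_PROTOCOL).dump(b'x' * (10 ** 7))
print(out.total, bytes(out.prefix))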
Example: small strings and no binput
===============================================================================
>>> l=[1, 2, 3, b'abc', 'abc']
>>> dis(dumps(l,4))
0: \x80 PROTO 4
2: ] EMPTY_LIST
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: K BININT1 3
10: C SHORT_BINBYTES 'abc'
15: \x8e SHORT_BINUNICODE 'abc'
20: e APPENDS (MARK at 3)
21: . STOP
highest protocol among opcodes = 4
>>> len(dumps(l,4)), len(dumps(l,3))
(22, 31)
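A rough encoder/decoder pair illustrating the framing that the offsets above
suggest for SHORT_BINUNICODE (one opcode byte, one length byte, then UTF-8
data); the 0x8e opcode value is taken from the disassembly, everything else is
a simplified sketch rather than the pickler's actual code:

def emit_short_binunicode(s):
    data = s.encode('utf-8')
    assert len(data) < 256, "longer strings need a wider length field"
    return b'\x8e' + bytes([len(data)]) + data

def read_short_binunicode(buf, pos):
    assert buf[pos] == 0x8e
    n = buf[pos + 1]
    text = buf[pos + 2:pos + 2 + n].decode('utf-8')
    return text, pos + 2 + n

frame = emit_short_binunicode('abc')
# 5 bytes here versus 8 with BINUNICODE and its 4-byte length field
print(len(frame), read_short_binunicode(frame, 0))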
Example: sets and frozensets
===============================================================================
>>> fs=frozenset([1,2,3]); s=set([1,2,fs,3])
>>> len(dumps(s,4)), len(dumps(s,3))
(20, 75)
>>> dis(dumps(s,4))
0: \x80 PROTO 4
2: \x94 EMPTY_SET
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: K BININT1 3
10: ( MARK
11: K BININT1 1
13: K BININT1 2
15: K BININT1 3
17: \x97 FROZENSET (MARK at 10)
18: \x96 UPDATE_SET (MARK at 3)
19: . STOP
highest protocol among opcodes = 4
>>> dis(dumps(s,3))
0: \x80 PROTO 3
2: c GLOBAL 'builtins set'
16: q BINPUT 0
18: ] EMPTY_LIST
19: q BINPUT 1
21: ( MARK
22: K BININT1 1
24: K BININT1 2
26: K BININT1 3
28: c GLOBAL 'builtins frozenset'
48: q BINPUT 2
50: ] EMPTY_LIST
51: q BINPUT 3
53: ( MARK
54: K BININT1 1
56: K BININT1 2
58: K BININT1 3
60: e APPENDS (MARK at 53)
61: \x85 TUPLE1
62: q BINPUT 4
64: R REDUCE
65: q BINPUT 5
67: e APPENDS (MARK at 21)
68: \x85 TUPLE1
69: q BINPUT 6
71: R REDUCE
72: q BINPUT 7
74: . STOP
highest protocol among opcodes = 2
>>> a=A(); s=set([1,2,a]); a.s=s
>>> dumps(s,4) # self-referential sets
b'\x80\x04\x94(K\x01K\x02\x93\x00\x01A)\x81}\x8e\x01sh\x00sb\x96.'
>>> dis(dumps(s,4))
0: \x80 PROTO 4
2: \x94 EMPTY_SET
3: ( MARK
4: K BININT1 1
6: K BININT1 2
8: \x93 BINGLOBAL_COMMON '0 A'
12: ) EMPTY_TUPLE
13: \x81 NEWOBJ
14: } EMPTY_DICT
15: \x8e SHORT_BINUNICODE 's'
18: h BINGET 0
20: s SETITEM
21: b BUILD
22: \x96 UPDATE_SET (MARK at 3)
23: . STOP
highest protocol among opcodes = 4
>>> a=A(); fs=frozenset([1,2,a]); a.fs=fs;
>>> dis(dumps(fs,4)) # self-referential frozensets
0: \x80 PROTO 4
2: ( MARK
3: K BININT1 1
5: K BININT1 2
7: \x93 BINGLOBAL_COMMON '0 A'
11: ) EMPTY_TUPLE
12: \x81 NEWOBJ
13: } EMPTY_DICT
14: \x8e SHORT_BINUNICODE 'fs'
18: ( MARK
19: K BININT1 1
21: K BININT1 2
23: h BINGET 1
25: \x97 FROZENSET (MARK at 18)
26: s SETITEM
27: b BUILD
28: 1 POP_MARK (MARK at 2)
29: h BINGET 4
31: . STOP
highest protocol among opcodes = 4
>>>
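The key to the self-referential example is that EMPTY_SET pushes the set (and,
per item 4, implicitly memoizes it) before its elements are loaded, so a later
BINGET can refer back to it; UPDATE_SET then fills it in place, while FROZENSET
only builds its object once all elements are on the stack. Below is a toy model
of the stack machine for these three opcodes; the mark and memo handling are
simplified stand-ins for the real unpickler, not its actual code:

stack, memo, marks = [], {}, []

def empty_set():
    stack.append(set())
    memo[len(memo)] = stack[-1]   # memoized before its elements exist,
                                  # which is what makes self-reference work

def mark():
    marks.append(len(stack))

def update_set():
    k = marks.pop()
    items, stack[k:] = stack[k:], []
    stack[-1].update(items)

def frozenset_op():
    k = marks.pop()
    items, stack[k:] = stack[k:], []
    stack.append(frozenset(items))

# Replaying the first disassembly above (EMPTY_SET, MARK, 1, 2, 3, MARK,
# 1, 2, 3, FROZENSET, UPDATE_SET) leaves {1, 2, 3, frozenset({1, 2, 3})}:
empty_set(); mark()
stack.extend([1, 2, 3]); mark(); stack.extend([1, 2, 3]); frozenset_op()
update_set()
print(stack[0])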
Example: Nested globals
===============================================================================
>>> class A:
...     class B:
...         def f(self):
...             return self.x
...
>>> dis(dumps(A.B.f,4))
0: \x80 PROTO 4
2: \x93 BINGLOBAL_COMMON '0 A.B.f'
10: . STOP
>>> b=A.B(); b.x=42
>>> loads(dumps(A.B.f,4))(b)
42
>>> dis(dumps(b.f,4))
0: \x80 PROTO 4
2: \x93 BINGLOBAL_COMMON '0 A.B'
8: ) EMPTY_TUPLE
9: \x81 NEWOBJ
10: } EMPTY_DICT
11: \x8e SHORT_BINUNICODE 'x'
14: K BININT1 42
16: s SETITEM
17: b BUILD
18: \x93 BINGLOBAL_COMMON '0 A.B.f'
26: \x98 BIND
27: . STOP
highest protocol among opcodes = 4
>>> loads(dumps(b.f,4))()
42
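For reference, the effect of BINGLOBAL_COMMON '0 A.B.f' followed by BIND can be
sketched outside the unpickler as a dotted-name lookup plus a descriptor bind.
The module name '__main__' is assumed from the interactive session above, and
resolve_dotted is an illustrative helper rather than the unpickler's code:

import importlib

def resolve_dotted(module_name, dotted_name):
    # Walk attribute by attribute so nested classes such as A.B stay reachable.
    obj = importlib.import_module(module_name)
    for part in dotted_name.split('.'):
        obj = getattr(obj, part)
    return obj

# BIND's effect: take an instance and a plain function and produce the bound
# method; f.__get__(instance) does exactly that for ordinary functions.
f = resolve_dotted('__main__', 'A.B.f')
b = resolve_dotted('__main__', 'A.B')()
b.x = 42
print(f.__get__(b)())   # 42, matching the loads() result above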
Work on the second part
===============================================================================
* Write patches for the discovered bugs and create an issue on the bug tracker.
* Implement in _pickle (the C version) the opcodes that have already been
implemented for the Python version, until all the tests pass.
* Finally, documentation and testing.
Conclusion
===============================================================================
I am on track with the work. Because the code is pretty well tested and
pickletools has been kept up to date accordingly, true test-driven development
can now be achieved for _pickle.
Here is a list of some other ideas that could be implemented before pickle4 is
released:
* Unicode compression of strings for more efficient pickling/unpickling
(possibly with an option to enable/disable this feature so the user can
choose a space-time tradeoff)
* Zigzag encoding of signed integers (see the sketch after this list)
* Implementation of SecurePickler/SecureUnpickler classes, which allow
unpickling of data coming from untrusted streams (such as networks).
Arbitrary code execution can be easily achieved at this time:
>>> from test.pickletester import AbstractPickleTests
>>> apt=AbstractPickleTests()
>>> evil=apt._make_pickle_str('{PROTO} \x04'
... '{BINGLOBAL} \x08builtins\x04eval'
... '{SHORT_BINUNICODE} \x0e print("evil!")'
... '{TUPLE1} {REDUCE} {STOP}')
>>> loads(evil)
evil!
A more secure pickle can currently be created by deriving from Unpickler
and overriding the find_class method, thus limiting which globals can be
looked up and which objects can be created (a minimal example of this
approach is sketched after this list). But this has several
disadvantages and limitations. A SecurePickler would have features such as
the following:
o Only a Python implementation, because security is more important than
speed here, and it's easier to screw up in C than in Python.
o Either never raise an exception in unpickling (such as by providing a
default value to the unpickler in case unpickling fails), or raise
only a known, small set of exceptions by using exception chaining
(hiding exceptions is bad).
o Disable obsolete opcodes (opcodes which have been replaced by newer
ones) to decrease the chance of finding security weaknesses.
o Have security parameters such as memo_maxsize, stack_maxsize,
recurse_maxdepth, allow_globals, limiting the resources that the
unpickler can use.
memo_maxsize => don't memoize more objects than what this value
indicates
stack_maxsize => bail out if the unpickler's stack goes above
this size
recurse_maxdepth => don't recurse above this depth when
unpickling
allow_globals => whether a limited subset of globals should be
allowed
o Have a few "profiles", where each profile is a dictionary with sane
values for the security parameters, so the developer doesn't have
to worry too much about what would be a good value for
memo_maxsize. E.g.:
basic_datatypes_only = {'memo_maxsize' : 100000,
'stack_maxsize' : 100000,
'recurse_maxdepth' : 10,
'allow_globals' : False}
o Take an allowed_globals security parameter, forbidding use of all
globals not listed in it.
o Maybe group certain opcodes (such as SHORT_BINBYTES,
SEMISHORT_BINBYTES, BINBYTES, BINBYTES64 in the group bytes) so the
developer can choose which groups of opcodes to allow.
Clearly, SecurePickler would only work with pickle>=4.
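As a concrete illustration of the zigzag idea, a minimal pure-Python sketch
(not tied to any particular opcode layout):

def zigzag_encode(n):
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..., so integers of small
    # magnitude get small unsigned encodings regardless of sign
    return n * 2 if n >= 0 else -n * 2 - 1

def zigzag_decode(z):
    return z // 2 if z % 2 == 0 else -(z + 1) // 2

assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))

And a minimal example of the find_class approach mentioned above, which is the
baseline a SecureUnpickler would improve on; the whitelist here is only
illustrative:

import io, pickle

class RestrictedUnpickler(pickle.Unpickler):
    # Only the globals listed here may be loaded; everything else is refused.
    ALLOWED = {('builtins', 'set'), ('builtins', 'frozenset')}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(
                'global %s.%s is forbidden' % (module, name))
        return super().find_class(module, name)

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps({1, 2, 3})))    # allowed
# restricted_loads(pickle.dumps(print)) would raise UnpicklingError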
I doubt I can also implement the above within the given timeline, but I believe
they should be considered prior to releasing pickle4, because adding them later
won't be possible without creating incompatibilities (with the possible
exception of SecureUnpickler, although problems could arise there too if
implementing it requires changes to Unpickler to allow a greater degree of
control).
[1] http://www.python.org/dev/peps/pep-3154/
[2] https://bitbucket.org/mstefanro/pickle4/