Created
July 10, 2012 22:33
-
-
Save mstefanro/3086647 to your computer and use it in GitHub Desktop.
pickle4 midterm report
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Abstract | |
=============================================================================== | |
As has been initially described in my proposal, my work until midterm involved | |
improving the python version of pickle with the suggestions in PEP 3154 [1]. | |
The progress so far has been tagged in the picklev4 repository at [2], under | |
the tag pickle4-midterm. | |
Features | |
=============================================================================== | |
Here follows a list of the improvements which were made: | |
1) pickling of very large bytes and strings | |
- not yet tested, need a PC with lots of memory or a smarter way to | |
test (such as creating a NullStream which simply drops everything | |
written to it except for the first few bytes and an object which | |
behaves like a very large bytes but doesn't actually store the data). | |
2) more space-efficient pickling of small strings and bytes. | |
Tested by: test_v4_efficient_unicode, test_v4_efficient_binbytes | |
3) native pickling with sets and frozensets, with support for | |
self-referential sets and frozensets. | |
Tested by: test_v4_sets, test_v4_set_refs, test_v4_sets_opcodes, | |
test_v4_cycles | |
4) removed the BINPUT opcode (the pickler and unpickler now agree to use | |
the same indexing technique and agree on which opcodes memoize | |
something). | |
Tested by pretty much all of the tests, because unpickling would fail | |
if the strategy won't be consistent between the pickler and the | |
unpickler. | |
5) better error checking in v4 and more consistent exceptions are raised | |
(typically only UnpicklingError, EOFError and MemoryError should be | |
raised by the unpickler in sane conditions). | |
Tested by: test_v4_exceptions | |
6) removed GLOBAL in v4 in favor of binary equivalents, with the addition | |
of a BINGLOBAL_COMMON opcode (which has a list of commonly-used module | |
names). | |
Tested by: test_v4_binglobal_common, test_v4_binglobal_big, | |
test_v4_nested_classes | |
7) pickling of nested classes, functions in classes and bound methods. | |
Tested by: test_v4_nested_classes, test_v4_pickle_bound_methods | |
8) pickling/unpickling calls of __new__ with keyword args. | |
Tested by: test_v4_new_kwargs | |
9) in case pickling on a stream where data can't be "taken back" once has | |
been written, the BAIL_OUT opcode has been added for the situations when | |
the pickler fails "late" and wants to inform an unpickler which reads from | |
the same stream that all the data it has unpickled so far is garbage and | |
should be abandoned. | |
- test_v4_bail_out | |
New opcodes: SEMISHORT_BINUNICODE, SHORT_BINUNICODE, BINUNICODE64, | |
SEMISHORT_BINBYTES, BINBYTES64, BINGLOBAL, BINGLOBAL_BIG, BINGLOBAL_COMMON, | |
EMPTY_SET, EMPTY_FROZENSET, UPDATE_SET, FROZENSET, BIND, NEWOBJ_KW, | |
BAIL_OUT | |
Example: small strings and no binput | |
=============================================================================== | |
>>> l=[1, 2, 3, b'abc', 'abc'] | |
>>> dis(dumps(l,4)) | |
0: \x80 PROTO 4 | |
2: ] EMPTY_LIST | |
3: ( MARK | |
4: K BININT1 1 | |
6: K BININT1 2 | |
8: K BININT1 3 | |
10: C SHORT_BINBYTES 'abc' | |
15: \x8e SHORT_BINUNICODE 'abc' | |
20: e APPENDS (MARK at 3) | |
21: . STOP | |
highest protocol among opcodes = 4 | |
>>> len(dumps(l,4)), len(dumps(l,3)) | |
(22, 31) | |
Example: sets and frozensets | |
=============================================================================== | |
>>> fs=frozenset([1,2,3]); s=set([1,2,fs,3]) | |
>>> len(dumps(s,4)), len(dumps(s,3)) | |
(20, 75) | |
>>> dis(dumps(s,4)) | |
0: \x80 PROTO 4 | |
2: \x94 EMPTY_SET | |
3: ( MARK | |
4: K BININT1 1 | |
6: K BININT1 2 | |
8: K BININT1 3 | |
10: ( MARK | |
11: K BININT1 1 | |
13: K BININT1 2 | |
15: K BININT1 3 | |
17: \x97 FROZENSET (MARK at 10) | |
18: \x96 UPDATE_SET (MARK at 3) | |
19: . STOP | |
highest protocol among opcodes = 4 | |
>>> dis(dumps(s,3)) | |
0: \x80 PROTO 3 | |
2: c GLOBAL 'builtins set' | |
16: q BINPUT 0 | |
18: ] EMPTY_LIST | |
19: q BINPUT 1 | |
21: ( MARK | |
22: K BININT1 1 | |
24: K BININT1 2 | |
26: K BININT1 3 | |
28: c GLOBAL 'builtins frozenset' | |
48: q BINPUT 2 | |
50: ] EMPTY_LIST | |
51: q BINPUT 3 | |
53: ( MARK | |
54: K BININT1 1 | |
56: K BININT1 2 | |
58: K BININT1 3 | |
60: e APPENDS (MARK at 53) | |
61: \x85 TUPLE1 | |
62: q BINPUT 4 | |
64: R REDUCE | |
65: q BINPUT 5 | |
67: e APPENDS (MARK at 21) | |
68: \x85 TUPLE1 | |
69: q BINPUT 6 | |
71: R REDUCE | |
72: q BINPUT 7 | |
74: . STOP | |
highest protocol among opcodes = 2 | |
>>> a=A(); s=set([1,2,a]); a.s=s | |
>>> dumps(s,4) # self-referential sets | |
b'\x80\x04\x94(K\x01K\x02\x93\x00\x01A)\x81}\x8e\x01sh\x00sb\x96.' | |
>>> dis(dumps(s,4)) | |
0: \x80 PROTO 4 | |
2: \x94 EMPTY_SET | |
3: ( MARK | |
4: K BININT1 1 | |
6: K BININT1 2 | |
8: \x93 BINGLOBAL_COMMON '0 A' | |
12: ) EMPTY_TUPLE | |
13: \x81 NEWOBJ | |
14: } EMPTY_DICT | |
15: \x8e SHORT_BINUNICODE 's' | |
18: h BINGET 0 | |
20: s SETITEM | |
21: b BUILD | |
22: \x96 UPDATE_SET (MARK at 3) | |
23: . STOP | |
highest protocol among opcodes = 4 | |
>>> a=A(); fs=frozenset([1,2,a]); a.fs=fs; | |
>>> dis(dumps(fs,4)) # self-referential frozensets | |
0: \x80 PROTO 4 | |
2: ( MARK | |
3: K BININT1 1 | |
5: K BININT1 2 | |
7: \x93 BINGLOBAL_COMMON '0 A' | |
11: ) EMPTY_TUPLE | |
12: \x81 NEWOBJ | |
13: } EMPTY_DICT | |
14: \x8e SHORT_BINUNICODE 'fs' | |
18: ( MARK | |
19: K BININT1 1 | |
21: K BININT1 2 | |
23: h BINGET 1 | |
25: \x97 FROZENSET (MARK at 18) | |
26: s SETITEM | |
27: b BUILD | |
28: 1 POP_MARK (MARK at 2) | |
29: h BINGET 4 | |
31: . STOP | |
highest protocol among opcodes = 4 | |
>>> | |
Example: Nested globals | |
=============================================================================== | |
>>> class A: | |
... class B: | |
... def f(self): | |
... return self.x | |
... | |
>>> dis(dumps(A.B.f,4)) | |
0: \x80 PROTO 4 | |
2: \x93 BINGLOBAL_COMMON '0 A.B.f' | |
10: . STOP | |
>>> b=A.B(); b.x=42 | |
>>> loads(dumps(A.B.f,4))(b) | |
42 | |
>>> dumps(b.f,4) | |
0: \x80 PROTO 4 | |
2: \x93 BINGLOBAL_COMMON '0 A.B' | |
8: ) EMPTY_TUPLE | |
9: \x81 NEWOBJ | |
10: } EMPTY_DICT | |
11: \x8e SHORT_BINUNICODE 'x' | |
14: K BININT1 42 | |
16: s SETITEM | |
17: b BUILD | |
18: \x93 BINGLOBAL_COMMON '0 A.B.f' | |
26: \x98 BIND | |
27: . STOP | |
highest protocol among opcodes = 4 | |
>>> loads(dumps(b.f,4))() | |
42 | |
Work on the second part | |
=============================================================================== | |
* Write patches for the discovered bugs and create an issue on the bugtracker | |
* Implement the opcodes that have already been implemented for the Python | |
version until all the tests pass. | |
* Finally, documentation and testing. | |
Conclusion | |
=============================================================================== | |
I am on track with the work. Because the code is pretty well tested and | |
pickletools has been kept up to date accordingly, true test-driven development | |
can now be achieved for _pickle. | |
Here is a list of some other ideas that could be implemented before pickle4 is | |
released: | |
* Unicode compression of strings for more efficient pickling/unpickling | |
(eventually with an option to enable/disable this feature so the user can | |
choose a space-time tradeoff) | |
* Zigzag encoding of signed integers | |
* Implementation of SecurePickler/SecureUnpickler classes, which allow | |
unpickling of data coming from untrusted streams (such as networks). | |
Arbitrary code execution can be easily achieved at this time: | |
>>> from test.pickletester import AbstractPickleTests | |
>>> apt=AbstractPickleTests() | |
>>> evil=apt._make_pickle_str('{PROTO} \x04' | |
... '{BINGLOBAL} \x08builtins\x04eval' | |
... '{SHORT_BINUNICODE} \x0e print("evil!")' | |
... '{TUPLE1} {REDUCE} {STOP}') | |
>>> loads(evil) | |
evil! | |
A more secure pickle can currently be created by deriving from Unpickler | |
and overriding the find_class method, thus limiting what globals can be | |
looked up and which objects can be created. But this has several | |
disatvantages and limitations. A SecurePickler would have features such as | |
the following: | |
o Only a python implementation, because security is more important than | |
speed here, and it's easier to screw up in C than Python. | |
o Either never raise an exception in unpickling (such as by providing a | |
default value to the unpickler in case unpickling fails), or raise | |
only a known, small set of exceptions by using exception chaining | |
(hiding exceptions is bad). | |
o Disable obsolete opcodes (opcodes which have been replaced by newer | |
ones) to decrease the chance of finding security weaknesses. | |
o Have security parameters such as memo_maxsize, stack_maxsize, | |
recurse_maxdepth, allow_globals, limiting the resources that the | |
unpickler can use. | |
memo_maxsize => don't memoize more objects than what this value | |
indicates | |
stack_maxsize => bail out if the unpickler's stack goes above | |
this size | |
recurse_maxdepth => don't recurse above this depth when | |
unpickling | |
allow_globals => whether a limited subset of globals should be | |
allowed | |
o Have a few "profiles", where each profile is a dictionary with sane | |
values for the security parameters, so the developer doesn't have | |
to worry too much about what would be a good value for | |
memo_maxsize. E.g.: | |
basic_datatypes_only = {'memo_maxsize' : 100000, | |
'stack_maxsize' : 100000, | |
'recurse_maxdepth' : 10, | |
'allow_globals' : False} | |
o Take a allowed_globals security parameter, forbidding use of all | |
other globals not here. | |
o Maybe group certain opcodes (such as SHORT_BINBYTES, | |
SEMISHORT_BINBYTES, BINBYTES, BINBYTES64 in the group bytes) so the | |
developer can choose which groups of opcodes to allow. | |
Clearly, SecurePickler would only work with pickle>=4. | |
I doubt I can also implement the above in the given timeline, but I believe | |
they should be considered prior to releasing pickle4, because updating later | |
won't be possible without creating incompatibilities (with the possible | |
exception of SecureUnpickler, though problems could arise there too if | |
implementing the SecureUnpickler requires changing stuff in Unpickler to allow | |
a greater degree of control). | |
[1] http://www.python.org/dev/peps/pep-3154/ | |
[2] https://bitbucket.org/mstefanro/pickle4/ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment