Sunday, January 27, 2019

Python Object Persistence Module, pickle — Quick Notes

Pickling (serialization) is the process of converting a Python object into a byte stream. Unpickling (deserialization) is the reverse: re-creating an equivalent in-memory Python object from that byte stream, not necessarily at the same memory address.

Python's pickle module provides the functions needed to pickle and unpickle Python object hierarchies.

pickle module:

  • is part of the Python standard library

  • converts arbitrary in-memory Python objects to/from byte streams
    • Those byte streams can be:

      1. saved to a file opened in binary mode (bytes cannot be written to a file opened in text mode) for later retrieval.
        e.g., save the progress of some activity so the activity can be paused and resumed

        - OR -
      2. sent over a network between Python end-points that expect binary data

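A variety of data types can be pickled, including the built-in types: numeric types (integers, floats, complex numbers), sequence types (lists, tuples), the text sequence type (strings), binary sequence types (bytes, bytearray), set types (set, frozenset), mapping types (dictionaries), as well as functions (built-in and user-defined) and classes defined at the top level of a module. Containers are picklable only if everything they contain is picklable. Functions and classes are pickled by reference (by name), not by value, so the unpickling side must be able to import them.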
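As a quick illustration of the types listed above, the following sketch round-trips a handful of built-in values through pickle in memory. The samples list and its contents are made up for this example.

```python
import pickle

# Hypothetical sample values covering the built-in types listed above
samples = [
    42, 3.14, 2 + 3j,                 # numeric types
    [1, 2, 3], (4, 5, 6),             # sequence types
    'text',                           # text sequence type
    b'raw', bytearray(b'buf'),        # binary sequence types
    {1, 2, 3}, frozenset({'a'}),      # set types
    {'name': 'Gary', 'id': 12345},    # mapping type
]

restored = pickle.loads(pickle.dumps(samples))
print(restored == samples)

# Built-in functions are pickled by reference (by name), so unpickling
# hands back the very same function object
restored_len = pickle.loads(pickle.dumps(len))
print(restored_len is len)
```

Both checks are expected to print True: the restored list compares equal to the original, and len comes back as the same object because only its name was stored.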

An attempt to pickle an unpicklable object raises an exception: typically PicklingError, or TypeError for some objects such as open file handles.

A couple of gotchas:
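A minimal sketch of what such a failure looks like in practice. The is_picklable helper is a hypothetical name introduced here, and the set of exceptions it catches is an assumption based on the failures commonly seen when pickling; it is not an exhaustive list.

```python
import pickle

def is_picklable(obj):
    """Hypothetical helper: True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

print(is_picklable({'name': 'Gary'}))   # a plain dictionary pickles fine
print(is_picklable(lambda x: x))        # a lambda has no importable name
```

A lambda fails because functions are pickled by name, and a lambda cannot be looked up by name in its module.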

  • Pickle is specific to Python, so pickled objects are best used within the Python ecosystem.
    • Avoid pickle when exchanging data with applications written in other programming languages, since they may not be able to reconstruct pickled Python objects. Even within the Python ecosystem, take care when different Python versions are involved, because an older interpreter cannot read a protocol newer than the ones it supports.

      • Alternatives: consider data formats designed for interoperability, such as JSON or XML
  • Pickled Python objects are binary data and therefore not human-readable, unless the original text-based protocol (protocol 0) is used to serialize the data
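The gotchas above can be sketched by serializing the same dictionary three ways: as JSON, as a protocol 0 pickle, and as a pickle using the highest available protocol. This is only an illustration; the variable names are made up for the example.

```python
import json
import pickle

EMP = {'name': 'Gary', 'id': 12345}

as_json = json.dumps(EMP)                     # text, readable from other languages
as_pickle_0 = pickle.dumps(EMP, protocol=0)   # ASCII text, but Python specific
as_pickle_hi = pickle.dumps(EMP, protocol=pickle.HIGHEST_PROTOCOL)  # binary

print(as_json)
print(as_pickle_0.decode('ascii'))
print(as_pickle_hi)

# All three representations restore an equal dictionary
print(json.loads(as_json) == EMP)
print(pickle.loads(as_pickle_0) == EMP)
print(pickle.loads(as_pickle_hi) == EMP)
```

The protocol 0 output is printable but still a pickle opcode stream rather than a general-purpose format, so JSON remains the better choice when non-Python consumers are involved.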

e.g.,

A trivial example demonstrating the calls to pickle (save data to a binary file) and unpickle (load data from the binary file) a Python data structure.

#!/usr/bin/env python3

import pickle

EMP = {}
EMP['name'] = 'Gary'
EMP['id'] = 12345

# pickle: write the object to a file opened in binary mode
with open('employee.db', 'wb') as f:
    pickle.dump(EMP, f, pickle.HIGHEST_PROTOCOL)

print('  Pickled data, EMP     ', EMP)

# unpickle: read the object back from the binary file
with open('employee.db', 'rb') as f:
    EMP_REC = pickle.load(f)

print('Unpickled data, EMP_REC ', EMP_REC, '\n')

# The restored object is equal to the original, but it is a distinct object
print('(EMP_REC is EMP)? : ', (EMP_REC is EMP))
print('(EMP_REC == EMP)? : ', (EMP_REC == EMP))

Running the above code shows the following on stdout.

  Pickled data, EMP      {'name': 'Gary', 'id': 12345}
Unpickled data, EMP_REC  {'name': 'Gary', 'id': 12345} 

(EMP_REC is EMP)? :  False
(EMP_REC == EMP)? :  True

The dump() function takes a serializable Python object as its first argument and writes the pickled representation of that object to a file. The second argument is a file object that has been opened for binary writing. The remaining arguments are optional.

    The third argument, if specified, is the protocol to use. pickle.HIGHEST_PROTOCOL tells the pickle module to use the highest protocol version available. When working with a mix of old and new Python versions, choosing an earlier protocol that every interpreter involved supports may ease or eliminate potential compatibility issues.

    As highlighted earlier, the protocols used by the pickle module are Python specific, so watch out for cross-language compatibility issues when working in heterogeneous environments.

The load() function reads a pickled object representation (serialized data) from a file and returns the reconstructed object. The protocol version is detected automatically, so it is not necessary to specify it when unpickling.
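To make the protocol argument concrete, here is a small sketch that writes the same dictionary with every protocol the running interpreter supports and reads each one back without naming a protocol. It uses io.BytesIO as an in-memory stand-in for a binary file; the results dictionary is just a name chosen for this example.

```python
import io
import pickle

EMP = {'name': 'Gary', 'id': 12345}

print('default protocol:', pickle.DEFAULT_PROTOCOL)
print('highest protocol:', pickle.HIGHEST_PROTOCOL)

results = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    buffer = io.BytesIO()                    # in-memory stand-in for a binary file
    pickle.dump(EMP, buffer, protocol)       # protocol chosen explicitly when writing
    buffer.seek(0)
    results[protocol] = pickle.load(buffer)  # no protocol given when reading

print(all(value == EMP for value in results.values()))
```

When the data must also be readable by Python 2, protocol 2 is the newest protocol that interpreter understands.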

In-Memory Pickling/Unpickling Operations

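If persistence is not a requirement, the dumps() and loads() functions in the pickle module can be used to serialize (pickle) and deserialize (unpickle) a Python object in memory. dumps() returns the pickled object as a bytes object instead of writing it to a file, and loads() reconstructs the object from such a bytes object. This is useful when sending Python objects over a network between compatible applications.

e.g.,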
#!/usr/bin/env python3

import pickle

EMP = {}
EMP['name'] = 'Gary'
EMP['id'] = 12345

# in-memory pickling: dumps() returns a bytes object
x = pickle.dumps(EMP, pickle.HIGHEST_PROTOCOL)

print('  Pickled data, EMP     ', EMP)

# in-memory unpickling: loads() rebuilds the object from the bytes
EMP_REC = pickle.loads(x)

print('Unpickled data, EMP_REC ', EMP_REC, '\n')

print('(EMP_REC is EMP)? : ', (EMP_REC is EMP))
print('(EMP_REC == EMP)? : ', (EMP_REC == EMP))

Running the above code produces output identical to that of the previous listing; the only difference is that no file is involved this time.

  Pickled data, EMP      {'name': 'Gary', 'id': 12345}
Unpickled data, EMP_REC  {'name': 'Gary', 'id': 12345} 

(EMP_REC is EMP)? :  False
(EMP_REC == EMP)? :  True
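The network use case mentioned above could look roughly like the following sketch. It assumes a simple framing scheme that is not part of pickle itself: each message is prefixed with its length as a 4-byte big-endian integer. The send_obj, recv_exact, and recv_obj helpers are hypothetical names, and socket.socketpair() stands in for a real connection between two Python end-points.

```python
import pickle
import socket
import struct

def send_obj(sock, obj):
    """Pickle obj and send it with a 4-byte length prefix (assumed framing)."""
    payload = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError('connection closed before the full message arrived')
        buf.extend(chunk)
    return bytes(buf)

def recv_obj(sock):
    """Receive one length-prefixed pickle and rebuild the object."""
    (length,) = struct.unpack('!I', recv_exact(sock, 4))
    # Only unpickle data from a trusted peer: loading a pickle can run code
    return pickle.loads(recv_exact(sock, length))

# A connected pair of sockets stands in for two Python end-points
left, right = socket.socketpair()
with left, right:
    send_obj(left, {'name': 'Gary', 'id': 12345})
    received = recv_obj(right)

print(received)
```

The length prefix is needed because a stream socket has no message boundaries; without it the receiver cannot tell where one pickled object ends.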

Exceptions

As mentioned earlier, any attempt to pickle or unpickle an object that is not suitable for serialization fails with an exception. Therefore, it is a good idea to safeguard the code with try-except blocks to handle unexpected failures.

Here is another trivial example demonstrating a pickling exception.

#!/usr/bin/env python3

import pickle

try:

    # An open file handle is not picklable
    with open('dummy.txt', 'a') as f:
        x = pickle.dumps(f)
    print('Pickled file handle')

except Exception as e:
    print('Caught ', type(e).__name__, '-', str(e))

Running the above code raises a TypeError. The wording of the message depends on the Python version; Python 2, for example, reports the following.

Caught  TypeError - can't pickle file objects
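Unpickling can fail as well, for instance when a file was only partially written. The sketch below simulates that by cutting off the end of a valid pickle. The load_or_none helper is a hypothetical name, and the exceptions it catches are the ones the Python documentation mentions as possible during unpickling, which may not be exhaustive.

```python
import pickle

def load_or_none(data):
    """Hypothetical helper: unpickle data, returning None if it is damaged."""
    try:
        return pickle.loads(data)
    except (pickle.UnpicklingError, EOFError, AttributeError,
            ImportError, IndexError) as e:
        print('Caught', type(e).__name__, '-', e)
        return None

good = pickle.dumps({'name': 'Gary', 'id': 12345})
truncated = good[:-5]      # simulate a partially written file

print(load_or_none(good))
print(load_or_none(truncated))
```

Returning None is just one way to signal failure; code that may legitimately pickle None would need a different signal, such as re-raising a custom exception.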

(Credit: Various Sources including Python Documentation)
