Internals

mmapdict files are pickle files containing a dictionary, but written in a special layout. The main idea is to give the file a predictable structure, so that the offsets needed for the memory maps can be computed. In addition, there must be a way to disable a specific key, either to replace it or to delete it, without changing the offsets of the rest of the file.

For example, for the following dictionary:

{'key': 'value', 'test': array([1, 2, 3], dtype=uint8)}

The normal pickle module would output:

    0: \x80 PROTO      4
    2: \x95 FRAME      172
   11: }    EMPTY_DICT
   12: \x94 MEMOIZE
   13: (    MARK
   14: \x8c     SHORT_BINUNICODE 'test'
   20: \x94     MEMOIZE
   21: \x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
   44: \x94     MEMOIZE
   45: \x8c     SHORT_BINUNICODE '_reconstruct'
   59: \x94     MEMOIZE
   60: \x93     STACK_GLOBAL
   61: \x94     MEMOIZE
   62: \x8c     SHORT_BINUNICODE 'numpy'
   69: \x94     MEMOIZE
   70: \x8c     SHORT_BINUNICODE 'ndarray'
   79: \x94     MEMOIZE
   80: \x93     STACK_GLOBAL
   81: \x94     MEMOIZE
   82: K        BININT1    0
   84: \x85     TUPLE1
   85: \x94     MEMOIZE
   86: C        SHORT_BINBYTES b'b'
   89: \x94     MEMOIZE
   90: \x87     TUPLE3
   91: \x94     MEMOIZE
   92: R        REDUCE
   93: \x94     MEMOIZE
   94: (        MARK
   95: K            BININT1    1
   97: K            BININT1    3
   99: \x85         TUPLE1
  100: \x94         MEMOIZE
  101: \x8c         SHORT_BINUNICODE 'numpy'
  108: \x94         MEMOIZE
  109: \x8c         SHORT_BINUNICODE 'dtype'
  116: \x94         MEMOIZE
  117: \x93         STACK_GLOBAL
  118: \x94         MEMOIZE
  119: \x8c         SHORT_BINUNICODE 'u1'
  123: \x94         MEMOIZE
  124: K            BININT1    0
  126: K            BININT1    1
  128: \x87         TUPLE3
  129: \x94         MEMOIZE
  130: R            REDUCE
  131: \x94         MEMOIZE
  132: (            MARK
  133: K                BININT1    3
  135: \x8c             SHORT_BINUNICODE '|'
  138: \x94             MEMOIZE
  139: N                NONE
  140: N                NONE
  141: N                NONE
  142: J                BININT     -1
  147: J                BININT     -1
  152: K                BININT1    0
  154: t                TUPLE      (MARK at 132)
  155: \x94         MEMOIZE
  156: b            BUILD
  157: \x89         NEWFALSE
  158: C            SHORT_BINBYTES b'\x01\x02\x03'
  163: \x94         MEMOIZE
  164: t            TUPLE      (MARK at 94)
  165: \x94     MEMOIZE
  166: b        BUILD
  167: \x8c     SHORT_BINUNICODE 'key'
  172: \x94     MEMOIZE
  173: \x8c     SHORT_BINUNICODE 'value'
  180: \x94     MEMOIZE
  181: u        SETITEMS   (MARK at 13)
  182: .    STOP
highest protocol among opcodes = 4

This works fine, but doesn’t allow random access.
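
A listing of this kind can be produced with the standard pickletools module. The sketch below is only an illustration, not necessarily how the listing above was generated. It uses a bytes value as a stand-in for the numpy array so that it runs without numpy.

```python
import io
import pickle
import pickletools

# Stand-in data: a bytes object replaces the numpy array of the example above.
data = {'key': 'value', 'test': b'\x01\x02\x03'}

# Protocol 4 matches the "PROTO 4" line of the listings in this section.
blob = pickle.dumps(data, protocol=4)

# pickletools.dis writes one line per opcode: offset, opcode name, argument.
out = io.StringIO()
pickletools.dis(blob, out=out)
listing = out.getvalue()
print(listing)
```

In such a stream each value starts wherever the previous one happens to end, so locating one value means decoding everything before it.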

Let’s look at what a mmappickle.dict file looks like, for the same data:

    0: \x80 PROTO      4
    2: \x95 FRAME      13
   11: J    BININT     1
   16: 0    POP
   17: J    BININT     2
   22: 0    POP
   23: (    MARK
   24: \x95     FRAME      20
   33: \x8c     SHORT_BINUNICODE 'key'
   38: \x8c     SHORT_BINUNICODE 'value'
   45: J        BININT     1
   50: 0        POP
   51: \x88     NEWTRUE
   52: 0        POP
   53: \x95     FRAME      110
   62: \x8c     SHORT_BINUNICODE 'test'
   68: \x8c     SHORT_BINUNICODE 'numpy.core.fromnumeric'
   92: \x8c     SHORT_BINUNICODE 'reshape'
  101: \x93     STACK_GLOBAL
  102: \x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
  125: \x8c     SHORT_BINUNICODE 'fromstring'
  137: \x93     STACK_GLOBAL
  138: \x8e     BINBYTES8  b'\x01\x02\x03'
  150: \x8c     SHORT_BINUNICODE 'uint8'
  157: \x86     TUPLE2
  158: R        REDUCE
  159: K        BININT1    3
  161: \x85     TUPLE1
  162: \x86     TUPLE2
  163: R        REDUCE
  164: J        BININT     0
  169: 0        POP
  170: \x88     NEWTRUE
  171: 0        POP
  172: \x95     FRAME      2
  181: d        DICT       (MARK at 23)
  182: .    STOP
highest protocol among opcodes = 4

We can note the following changes:

  • There are hidden values at the beginning (version = 1, file revision = 2).
  • Each key-value pair is in its own frame, which ends with a hidden int (memo max index) followed by a hidden TRUE.
  • The numpy array is created using numpy.core.fromnumeric.reshape(numpy.core.multiarray.fromstring(data, dtype), shape) instead of the “traditional” way.
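
The BINBYTES8 opcode in the listing is followed by an 8-byte length and then the raw bytes, so the array data sits at a computable offset (138 + 9 = 147 here). The following sketch uses only the standard library and a temporary file with an arbitrary name. It modifies such a payload in place through mmap and checks that a plain pickle.load sees the change. It illustrates the principle and is not mmappickle's implementation.

```python
import mmap
import os
import pickle
import struct
import tempfile

payload = b'\x01\x02\x03'

# PROTO 4, then BINBYTES8 <8-byte little-endian length> <raw bytes>, then STOP.
blob = b'\x80\x04' + b'\x8e' + struct.pack('<Q', len(payload)) + payload + b'.'

# The raw bytes start after PROTO (2 bytes), the opcode (1) and the length (8).
data_offset = 2 + 1 + 8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sketch.pickle')  # arbitrary file name
    with open(path, 'wb') as f:
        f.write(blob)

    # Map the file and overwrite one payload byte without touching anything else.
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[data_offset] = 42
            mm.flush()

    # The file is still a valid pickle and now carries the modified payload.
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

print(reloaded)
```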

The version field allows for future changes to the format, and is currently fixed to 1. The file revision is incremented each time a key of the dictionary changes, so that readers can cache data safely under concurrent access. The memo max index is stored because MEMOIZE/GET/PUT opcodes may need to be renumbered when pickling values; caching it avoids having to parse the whole file.
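
Since the header has a fixed layout, the two hidden values can be read at fixed offsets without unpickling anything. A minimal sketch, assuming the version 1 layout shown in the listing above (the function name read_hidden_header is hypothetical and not part of mmappickle):

```python
import struct


def read_hidden_header(buf):
    """Return (version, revision) from the first 24 bytes of such a file."""
    if len(buf) < 24:
        raise ValueError('header too short')
    # Offsets follow the listing: PROTO at 0, FRAME at 2, BININT at 11 and 17,
    # POP at 16 and 22, MARK at 23.
    if buf[0:2] != b'\x80\x04' or buf[2:3] != b'\x95':
        raise ValueError('not a framed protocol 4 pickle')
    if (buf[11:12] != b'J' or buf[16:17] != b'0'
            or buf[17:18] != b'J' or buf[22:23] != b'0'
            or buf[23:24] != b'('):
        raise ValueError('unexpected header layout')
    # BININT arguments are 4-byte little-endian signed integers.
    (version,) = struct.unpack_from('<i', buf, 12)
    (revision,) = struct.unpack_from('<i', buf, 18)
    return version, revision


# Hand-built header equivalent to the first 24 bytes of the listing.
header = (b'\x80\x04' + b'\x95' + struct.pack('<Q', 13)
          + b'J' + struct.pack('<i', 1) + b'0'
          + b'J' + struct.pack('<i', 2) + b'0' + b'(')
fields = read_hidden_header(header)
print(fields)
```

A reader could, for instance, compare the revision with the one it saw last and discard cached information only when the number has changed.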

Finally, the hidden TRUE is a “hack” that allows a key to be removed. Data cannot be moved while it is memory-mapped, so instead of rewriting the file, the TRUE is replaced by a POP when the key is deleted. In summary, the stack works as follows:

  • Key exists: KEY, VALUE, memo max index, POP, TRUE, POP. (reduced as KEY, VALUE)
  • Key doesn’t exist: KEY, VALUE, memo max index, POP, POP, POP. (disappears when reduced)
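
The two cases above can be reproduced with a hand-built pickle and the standard unpickler. The sketch below is an illustration under simplifying assumptions: values are plain strings, the memo max index is always 0, and the helper names (frame, binint, short_unicode, entry) are made up for this example.

```python
import pickle
import struct

POP, NEWTRUE, MARK, DICT, STOP = b'0', b'\x88', b'(', b'd', b'.'


def frame(payload):
    # FRAME opcode followed by the 8-byte little-endian payload length.
    return b'\x95' + struct.pack('<Q', len(payload)) + payload


def binint(n):
    # BININT opcode followed by a 4-byte little-endian signed integer.
    return b'J' + struct.pack('<i', n)


def short_unicode(text):
    # SHORT_BINUNICODE opcode, 1-byte length, UTF-8 bytes (at most 255 bytes).
    raw = text.encode('utf-8')
    return b'\x8c' + bytes([len(raw)]) + raw


def entry(key, value):
    # KEY, VALUE, memo max index, POP, TRUE, POP, all inside one frame.
    return frame(short_unicode(key) + short_unicode(value)
                 + binint(0) + POP + NEWTRUE + POP)


header = b'\x80\x04' + frame(binint(1) + POP + binint(2) + POP + MARK)
terminator = frame(DICT + STOP)
blob = bytearray(header + entry('key', 'value') + entry('test', 'other')
                 + terminator)

before = pickle.loads(bytes(blob))

# The validity byte is the second-to-last byte of the first entry.
valid_offset = len(header) + len(entry('key', 'value')) - 2
assert blob[valid_offset:valid_offset + 1] == NEWTRUE

# Deleting the key: swap one byte, so no other offset in the file changes.
blob[valid_offset:valid_offset + 1] = POP
after = pickle.loads(bytes(blob))

print(before)
print(after)
```

The validity byte ends up at offset 51, the same position as the NEWTRUE of the first entry in the listing above.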

We can see that the file is composed of three different parts (header, key-value entries and terminator), which are documented in the Internal API Documentation section below.

Extending mmappickle

To add support for a new memory-mapped value type, one should create a new subclass of mmappickle.picklers.base.BasePickler.

This requires some knowledge of the Python internal pickle format, but should be straightforward, using the numpy picklers as inspiration. Feel free to open an issue if more details are required.
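
As a rough idea of the shape such a pickler could take, the sketch below mirrors the method names documented for BasePickler on a stand-in base class, so that it runs without mmappickle. Everything here is an assumption for illustration: the stand-in class, the RawBytesPickler example, the choose_pickler helper, the way the stream is reached through parent_object._file, and the meaning given to the returned memo index. The real base class should be consulted before writing an actual pickler.

```python
import io
import struct
import types

BINBYTES8 = b'\x8e'   # pickle opcode: 8-byte length, then raw bytes
HEADER = 9            # opcode byte + 8-byte length


class StandInBasePickler:
    """Stand-in with the documented method names (not the real class)."""
    priority = 0

    def __init__(self, parent_object):
        # Assumption: the parent exposes its binary stream as `_file`.
        self._file = parent_object._file

    def is_picklable(self, obj):
        return False


class RawBytesPickler(StandInBasePickler):
    """Hypothetical pickler storing bytes objects as one BINBYTES8 op."""
    priority = 100

    def is_picklable(self, obj):
        return isinstance(obj, bytes)

    def is_valid(self, offset, length):
        pos = self._file.tell()
        try:
            self._file.seek(offset)
            head = self._file.read(HEADER)
        finally:
            self._file.seek(pos)  # keep the file position
        return (len(head) == HEADER and head[:1] == BINBYTES8
                and struct.unpack('<Q', head[1:])[0] == length - HEADER)

    def read(self, offset, length):
        pos = self._file.tell()
        try:
            self._file.seek(offset + HEADER)
            return self._file.read(length - HEADER), length
        finally:
            self._file.seek(pos)

    def write(self, obj, offset, memo_start_idx=0):
        blob = BINBYTES8 + struct.pack('<Q', len(obj)) + obj
        pos = self._file.tell()
        try:
            self._file.seek(offset)
            self._file.write(blob)
        finally:
            self._file.seek(pos)
        # Assumption: no MEMOIZE op is emitted, so the index is unchanged.
        return len(blob), memo_start_idx


def choose_pickler(picklers, obj):
    """Try picklers in decreasing priority order."""
    for pickler in sorted(picklers, key=lambda p: p.priority, reverse=True):
        if pickler.is_picklable(obj):
            return pickler
    raise TypeError('no pickler accepts %r' % type(obj))


parent = types.SimpleNamespace(_file=io.BytesIO())
picklers = [StandInBasePickler(parent), RawBytesPickler(parent)]
chosen = choose_pickler(picklers, b'\x01\x02\x03')
written, last_memo = chosen.write(b'\x01\x02\x03', 0)
restored, consumed = chosen.read(0, written)
```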

Internal API Documentation

class mmappickle.dict._header(mmapdict, _real_header_starts_at=0)

The file header is at the beginning of the file.

It consists of the following pickle ops:

PROTO 4                                (pickle version 4 header)
FRAME <length>
BININT <_file_version_number:32> POP   (version of the pickle dict, 1)
BININT <_file_commit_number:32> POP    (commit id of the pickle dict, incremented every time something changes)
<additional data depending on the _file_version_number> (none, for version 1)
MARK                                   (start of the dictionary)

__init__(mmapdict, _real_header_starts_at=0)
Parameters:
  • mmapdict – mmapdict object containing the data
  • _real_header_starts_at – Offset of the header (normally not used)
exists
Returns: True if file contains something
write_initial()

Write the initial header to the file

is_valid()
Returns: True if file has a valid mmapdict pickle header, False otherwise.
commit_number

Commit number (revision) in the file

__len__()
Returns: the total length of the header.
__weakref__

list of weak references to the object (if defined)

class mmappickle.dict._terminator(mmapdict)

Terminator is the suffix at the end of the mmapdict file.

It consists of the following pickle ops:

FRAME 2
DICT (make the dictionary)
STOP (end of the file)

__init__(mmapdict)
Parameters: mmapdict – mmapdict object containing the data
__len__()
Returns: the length of the terminator
exists
Returns: True if the file ends with the terminator, False otherwise
write()

Write the terminator at the end of the file, if it doesn’t exist

__weakref__

list of weak references to the object (if defined)

class mmappickle.dict._kvdata(mmapdict, offset)

kvdata is the structure holding a key-value data entry.

The trick is that it should be either two values, key and value, or nothing, if the value is deleted.

To do this, we put the key and the value on the stack. Then we either push a NEWTRUE+POP (which results in a NO-OP), or we push a POP+POP (which removes both the key and the value). Since NEWTRUE and POP both have length 1, it is easy to make the substitution.

Another trick is to cache the maximum value of the memoization index (for GET and PUT), to ensure that we have no duplicates.

The _kvdata structure has the following pickle ops:

FRAME <length>
SHORT_BINUNICODE <length> <key bytes>
<<< data >>>
BININT <max memo idx> POP (max memo index of this part)
NEWTRUE|POP POP (if NEWTRUE POP: entry is valid, else entry is deactivated.)

__init__(mmapdict, offset)
Parameters:
  • mmapdict – mmapdict object containing the data
  • offset – Offset of the key-value data
__len__()
Returns: the length of the key-value data
offset
Returns: the offset in the file of the key-value data
end_offset
Returns: the end offset in the file of the key-value data
_frame_length
Returns: the frame length for this _kvdata.

This is done either by reading it from the file, or by computing it if it doesn’t exist

_exists_initial
Returns: True if the file contains the header of the frame
data_offset
Returns: the offset of the pickled data
key_length
Returns: the binary length of the key
_valid_offset
Returns: the offset of the valid byte
_memomaxidx_offset
Returns: the offset of the max memo index
data_length
Returns: the length of the pickled data
key
Returns: the key as a unicode string
memomaxidx
Returns: the (cached) max memo index
valid
Returns: True if the key-value pair is valid, False otherwise (i.e. key was deleted)
_write_if_allowed()

Write to file, if it is possible to do so

__weakref__

list of weak references to the object (if defined)
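
The _kvdata layout above implies simple offset arithmetic, which can be checked against the mmapdict listing earlier on this page. The sketch below is a stand-alone illustration. The EntryLayout class is hypothetical, and treating the offset of the max memo index as the position of its BININT opcode is an assumption rather than a statement about the real properties.

```python
from dataclasses import dataclass

FRAME_HEADER = 9  # FRAME opcode + 8-byte length
KEY_HEADER = 2    # SHORT_BINUNICODE opcode + 1-byte length
TRAILER = 8       # BININT (1 + 4), POP (1), NEWTRUE|POP (1), POP (1)


@dataclass(frozen=True)
class EntryLayout:
    offset: int        # where the FRAME opcode of the entry starts
    frame_length: int  # value stored in the FRAME header
    key_length: int    # length of the UTF-8 encoded key

    @property
    def end_offset(self):
        return self.offset + FRAME_HEADER + self.frame_length

    @property
    def data_offset(self):
        return self.offset + FRAME_HEADER + KEY_HEADER + self.key_length

    @property
    def memomaxidx_offset(self):
        # Assumption: position of the BININT opcode, not of its argument.
        return self.end_offset - TRAILER

    @property
    def valid_offset(self):
        # NEWTRUE|POP is the second-to-last byte of the entry.
        return self.end_offset - 2

    @property
    def data_length(self):
        return self.memomaxidx_offset - self.data_offset


# Values taken from the listing: 'key' entry at 24 (FRAME 20),
# 'test' entry at 53 (FRAME 110).
first = EntryLayout(offset=24, frame_length=20, key_length=3)
second = EntryLayout(offset=53, frame_length=110, key_length=4)
print(first.valid_offset, second.valid_offset)
```

With these numbers the computed positions match the listing: the first entry ends at 53 with its validity byte at 51, and the second ends at 172, where the terminator frame begins.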

class mmappickle.picklers.base.BasePickler(parent_object)

Bases: object

Picklers will be attempted in decreasing priority order

__init__(parent_object)

is_valid(offset, length)

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)

Return True if object can be pickled with this pickler

read(offset, length)

Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

__weakref__

list of weak references to the object (if defined)

class mmappickle.picklers.base.GenericPickler(parent_object)

Bases: mmappickle.picklers.base.BasePickler

priority

Integer priority of this pickler. Picklers are attempted in decreasing priority order.

is_valid(offset, length)

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)

Return True if object can be pickled with this pickler

read(offset, length)

Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

mmappickle.utils.save_file_position(f)

Decorator to save the object._file stream position before calling the method

mmappickle.utils.require_writable(f)

Require the object’s _file to be writable, otherwise raise an exception.

mmappickle.utils.lock(f)

Lock the file during the execution of this method. This is a re-entrant lock.
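
To give an idea of what such method decorators can look like, here is a minimal stand-alone sketch of the first two. It is not the mmappickle source. The Demo class, the ReadOnlyBytesIO helper, the choice of io.UnsupportedOperation as the exception, and restoring the position after the call are all assumptions made for the example.

```python
import functools
import io


def save_file_position(method):
    """Remember self._file's position and restore it after the call."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        saved = self._file.tell()
        try:
            return method(self, *args, **kwargs)
        finally:
            self._file.seek(saved)
    return wrapper


def require_writable(method):
    """Refuse to run the method when self._file is not writable."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self._file.writable():
            # Assumption: the exception type is chosen for this sketch only.
            raise io.UnsupportedOperation('file is not writable')
        return method(self, *args, **kwargs)
    return wrapper


class ReadOnlyBytesIO(io.BytesIO):
    """In-memory stream that reports itself as read-only."""
    def writable(self):
        return False


class Demo:
    def __init__(self, stream):
        self._file = stream

    @save_file_position
    def peek(self, offset, n):
        self._file.seek(offset)
        return self._file.read(n)

    @require_writable
    @save_file_position
    def poke(self, offset, data):
        self._file.seek(offset)
        self._file.write(data)


demo = Demo(io.BytesIO(b'abcdef'))
demo._file.seek(2)
peeked = demo.peek(4, 2)
demo.poke(0, b'X')
position_after = demo._file.tell()
contents = demo._file.getvalue()

try:
    Demo(ReadOnlyBytesIO(b'abc')).poke(0, b'x')
    refused = False
except io.UnsupportedOperation:
    refused = True
```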