Thinking about Memory Consumption
John Arbash Meinel
john at arbash-meinel.com
Tue Sep 8 20:21:50 BST 2009
So I've been trying to get a handle on some of our memory consumption
concerns, and I'm noticing that we may need to re-think some of our
basic data structures.
The specific one that is showing up is our choice to represent items in
an index as a tuple of strings. To give specifics, if I do:
b = bzrlib.branch.Branch.open('launchpad/devel')
b.lock_read()
r = b.repository
m = []
m.append(r.revisions.get_parent_map(r.revisions.keys()))
m.append(r.signatures.get_parent_map(r.signatures.keys()))
m.append(r.inventories.get_parent_map(r.inventories.keys()))
m.append(r.chk_bytes.get_parent_map(r.chk_bytes.keys()))
m.append(r.texts.get_parent_map(r.texts.keys()))
bzrlib.trace.debug_memory()
I end up with ~300MB of memory used, just to store the parent_map dicts
for all keys in the repository along with the associated 'values'.
Now, there is some amount of waste present. Specifically, if I just
compute the size of the maps directly, I end up with:
2,381,318 objects
132,140,086 bytes
However, if I walk all of these items, find all the strings, and
compute their total length, I get:
23,808,498 bytes (sum(map(len, strs)))
or
35,962,218 bytes (sum(map(size_of, strs)))
Compare that to:
63,146,948 bytes (sum(map(size_of, tuples)))
So we have 22MB of actual string *content*, another 11.6MB of PyString
object overhead, and then 60MB of tuple overhead.
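The per-object overheads can be measured directly with sys.getsizeof. This is a minimal sketch of that measurement; the exact constants on a modern 64-bit CPython differ from the 32-bit numbers quoted above, but the pattern is the same:

```python
import sys

# The post's numbers assume a 32-bit CPython build (~24-byte PyString
# header, ~32-byte one-slot tuple).  On a 64-bit CPython the constants
# are larger, but sys.getsizeof shows the same pattern: per-object
# overhead dwarfs the content of short key strings.
key = ('sha1:2fd4e1c67a2d28fced849ee1bb76e7391b93eb12',)

content_bytes = sum(len(s) for s in key)           # raw character content
string_bytes = sum(sys.getsizeof(s) for s in key)  # content + str header
tuple_bytes = sys.getsizeof(key)                   # tuple header + pointers

assert string_bytes > content_bytes  # header overhead on every string
assert tuple_bytes > 0               # plus a tuple wrapper per key
```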
One answer could be: "don't load the whole graph into memory, stupid".
However, that is only really possible if we change how fetch is done
(say by telling it to fetch no more than 1000 revisions at a time, etc.)
Another obvious place is in the chk_bytes code. Specifically, there we
use a one-element tuple ('sha1:' + hex_sha1,).
If we just used the raw sha1, that would be 20 bytes of content + 24
bytes of string overhead, or 44 bytes per sha1, versus the current 101
bytes. (40 hex digits + 5 'sha1:' + 24 str + 32 tuple bytes)
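The arithmetic works out as follows. This sketch just reproduces the post's assumed 32-bit constants (24-byte string header, 32-byte one-slot tuple) rather than measuring anything at runtime:

```python
# Assumed 32-bit CPython constants from the post, not universal values.
STR_HEADER = 24    # PyString object overhead
TUPLE_1SLOT = 32   # tuple overhead for a one-element tuple

# Current scheme: 'sha1:' prefix + 40 hex digits, wrapped in a 1-tuple.
hex_key = len('sha1:') + 40 + STR_HEADER + TUPLE_1SLOT

# Proposed: the raw 20-byte binary digest as a bare string.
raw_key = 20 + STR_HEADER

assert hex_key == 101
assert raw_key == 44
```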
Of course, if we used a different data structure, we might be all the
way down to around 24 bytes (4 bytes for a pointer, and 20 bytes for the
binary sha1.)
The cheapest reasonable object I could come up with to replace an
individual key *and* a parents list would be something like:
bytes
4 long ob_refcnt
4 ptr type_ptr
1 char key_width (1 for revisions, 2 for file keys, etc)
1 char num_entries (max 256 entries)
4*k*n ptr pointers to either PyString or a raw chr buffer
This is effectively like a tuple, except:
1) it has a max size of 256 entries and a 256 key width, allowing us to
turn a 4-byte long into a 1-byte char. (even better on 64-bit)
2) It folds a parent list 'inline' rather than creating a tuple of
tuples.
3) It doesn't participate in GC, because it is only allowed to point
   to strings and thus cannot be part of a reference cycle. This also
   improves GC processing time, because the collector doesn't have to
   iterate over all of these objects that we know aren't in refcycles.
   (Setting gc.disable() in 'bzr' drops 'bzr log' for me from 1.1s =>
   0.9s or so.)
4) It can pun as either a parent list, or a single key. It might be
nice to distinguish between the two. I certainly wouldn't really
mind if we added another 'flags' field to make it clear.
5) If we really wanted to minimize things, I think we could also
create a minimal string type with something like:
2 bytes length
N bytes content
And then the final buffer ends up as (2+N)*k*n bytes, but that is a
lot less than the (24+N) bytes for all the strings.
The main reason I'm hesitant about that is that PyString is
pretty heavily optimized in Python, with intern(), object
comparison, insertion into dicts, etc.
Though by a similar token, we might want to cache a 'long hash'
value on these objects, since one hash would cover many strings
rather than one hash per string.
6) At the moment, a parents list costs us 64 bytes of tuple overhead
   (a tuple of tuples, both of length one); this would be closer to
   maybe 14-20 bytes of overhead.
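As a rough illustration of the packed layout in point 5: the real object would be a C extension type with pointer slots and no GC tracking, but a pure-Python sketch with the struct module shows the idea (pack_parents/unpack_parents are hypothetical names, not bzrlib APIs):

```python
import struct

def pack_parents(key_width, sha1s):
    """Pack raw 20-byte sha1 digests into one buffer: a 2-byte header
    (key_width, num_entries) followed by the digests inline, instead of
    a tuple of tuples of hex strings."""
    assert key_width < 256 and len(sha1s) < 256  # both fit in one char
    header = struct.pack('BB', key_width, len(sha1s))
    return header + b''.join(sha1s)

def unpack_parents(buf):
    """Recover (key_width, [digest, ...]) from a packed buffer."""
    key_width, n = struct.unpack_from('BB', buf)
    body = buf[2:]
    return key_width, [body[i * 20:(i + 1) * 20] for i in range(n)]

digest = b'\x2f\xd4\xe1\xc6' * 5              # a dummy 20-byte sha1
buf = pack_parents(1, [digest, digest])
assert len(buf) == 2 + 2 * 20                 # 2-byte header + inline sha1s
assert unpack_parents(buf) == (1, [digest, digest])
```

Two parents cost 42 bytes here, versus roughly (101 + 32) bytes each under the current hex-string-in-nested-tuples scheme.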
I'll probably do some probing here, to see how hard it is to get things
going.
John
=:->
PS> This was driven by noticing that 'bzr branch launchpad launchpad2'
peaks at around 500MB and 300MB is hit if we just load the indexes....
Which makes me think that the streaming code is actually fairly
efficient, and doing the right thing by not holding more than a small
handful of texts in memory at a time, but the index layer is chewing up
the rest of memory.