Thinking about Memory Consumption
Michael Hudson
michael.hudson at canonical.com
Wed Sep 9 22:25:27 BST 2009
John Arbash Meinel wrote:
> So I've been trying to get a handle on some of our memory consumption
> concerns, and I'm noticing that we may need to re-think some of our
> basic data structures.
Speaking as someone who worries about memory consumption on
bazaar.launchpad.net, thanks for looking at this :)
> The specific one that is showing up, is our choice to represent items in
> an index as a tuple of strings. To give specifics, if I do:
>
> import bzrlib.branch, bzrlib.trace
> b = bzrlib.branch.Branch.open('launchpad/devel')
> b.lock_read()
> r = b.repository
> m = []
> m.append(r.revisions.get_parent_map(r.revisions.keys()))
> m.append(r.signatures.get_parent_map(r.signatures.keys()))
> m.append(r.inventories.get_parent_map(r.inventories.keys()))
> m.append(r.chk_bytes.get_parent_map(r.chk_bytes.keys()))
> m.append(r.texts.get_parent_map(r.texts.keys()))
>
> bzrlib.trace.debug_memory()
>
> I end up with ~300MB of memory used, just to store the parent_map dicts
> for all keys in the repository along with the associated 'values'.
>
> Now, there is some amount of waste present. Specifically, if I just
> compute the size of the maps directly, I end up with:
> 2,381,318 objects
> 132,140,086 bytes
>
> However, if I walk all of these items, and find all the strings, and
> then compute their total length, I get:
> 23,808,498 bytes (sum(map(len, strs)))
> or
> 35,962,218 bytes (sum(map(size_of, strs)))
>
> Compare that to:
> 63,146,948 bytes (sum(map(size_of, tuples)))
>
> So we have 22MB of actual string *content*, another 11.6MB of PyString
> object overhead, and then 60MB of tuple overhead.
>
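Just to make those overheads concrete: on a 64-bit CPython (2.6 here;
the exact numbers vary across builds), sys.getsizeof shows the fixed
per-object costs directly:

  import sys

  # A PyString costs ~37 bytes of header on a 64-bit build,
  # plus one byte per character:
  print sys.getsizeof('')            # ~37
  print sys.getsizeof('x' * 40)      # ~77

  # A tuple costs ~56 bytes empty, plus 8 bytes per slot, on
  # top of the objects it points to:
  print sys.getsizeof(())            # ~56
  print sys.getsizeof(('a', 'b'))    # ~72

So every key tuple pays a string header *and* a tuple header before
storing any actual content.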
> One answer could be: "don't load the whole graph into memory, stupid".
I think this goal more-or-less has to be kept in mind, doesn't it?
Otherwise the hypothetical "million files, million revisions" tree is
going to be hard to work with.
> However, that is only really possible if we change how fetch is done
> (say by telling it to fetch no more that 1000 revisions at a time, etc.)
I don't really know the details, but is it possible to "load" the graph
into some kind of data structure that doesn't actually keep all the data
in RAM all the time?
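Purely as a sketch of the shape I mean, with stdlib shelve standing in
for whatever pack-index-backed lookup we would really use:

  import shelve

  # Hypothetical: spill the parent map to disk once, then pull
  # entries back into RAM only when they are actually asked for.
  pm = shelve.open('parent-map.db')
  pm['rev-id-1'] = ('parent-a', 'parent-b')
  pm['rev-id-2'] = ('rev-id-1',)
  pm.sync()
  # ... later, only the keys we touch get unpickled into memory:
  parents = pm['rev-id-1']
  pm.close()

shelve itself is obviously the wrong tool here, but something dict-like
that reads through to the indexes on demand seems plausible.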
> Another obvious place is in the chk_bytes code. Specifically, there we
> use a one-tuple key, ('sha1:' + hex(sha1),).
[snip more technical stuff]
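(Putting rough numbers on that, with a hypothetical key built the same
way:

  import sys, hashlib

  key = ('sha1:' + hashlib.sha1('some bytes').hexdigest(),)
  print len(key[0])             # 45 characters of content
  print sys.getsizeof(key[0])   # ~82 bytes as a PyString
  print sys.getsizeof(key)      # ~64 bytes more for the 1-tuple

so each chk key costs roughly three times the size of its hex digest.)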
> I'll probably do some probing here, to see how hard it is to get things
> going.
A constant factor improvement certainly sounds worthwhile, but I do
wonder if the aim should be a little higher. I guess the two aren't
mutually exclusive.
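One cheap constant-factor trick, for instance (just a sketch, not
tested against the real code), would be to intern the key tuples
themselves, so that a revision id appearing in hundreds of parent
tuples is stored once:

  _interned_keys = {}

  def intern_key(key):
      # Like the builtin intern(), but for key tuples: equal
      # tuples collapse to a single shared object.
      try:
          return _interned_keys[key]
      except KeyError:
          _interned_keys[key] = key
          return key

intern() already does this for individual strings; this just extends
the same trick to the tuples wrapping them.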
> PS> This was driven by noticing that 'bzr branch launchpad launchpad2'
> peaks at around 500MB, and 300MB of that is hit just loading the
> indexes... which makes me think the streaming code is actually fairly
> efficient, doing the right thing by not holding more than a small
> handful of texts in memory at a time, while the index layer is chewing
> up the rest of the memory.
This is certainly an interesting theory!
Cheers,
mwh