[plugin] dirstate experiments

Robert Collins robertc at robertcollins.net
Thu Jun 15 12:22:47 BST 2006


On Wed, 2006-06-14 at 14:34 -0500, John Arbash Meinel wrote:
> Robert Collins wrote:
> 
> ...
> 
> > 
> > Well, there are two cases here. The stat cache is just a validator for
> > the fs, so unpacking it does not make sense. However, the sha1 is stored
> > in the inventory in commit, and code expects to get at it, so I think
> > having the sha be accessible is a good idea - I'd rather pay a little
> > bit more in size for a format that is directly usable in the inventory
> > logic, as streaming reads are fast - it's seeks that are slow.
> > 
> >> It most certainly is not editable in that form, but it is readable.
> >>
> >> Also, it turns out that the major overhead at this point is the giant
> >> 'text.decode()', plus the overhead of operating on unicode objects
> >> rather than operating on str() objects.
> > 
> > I was wondering when that would start to bite :(.
> 
> 
> So I've hacked on the format some more. Now I'm able to include an
> arbitrary number of parent entries. And now the delimiter is '\n' all
> the way around.
> Instead of iterating and pulling fields off as I need them, I just split
> the text and use slicing to pull out the chunks.
> With 2 parent blobs (which are the same size as the current blob) I have
> the time down to 180ms for reading everything in.
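
A minimal sketch of the split-then-slice idea described above, written for
modern Python 3 (bytes, where the 2006 code would have used str). The
three-field record layout and the name parse_records are assumptions made
for illustration only; they are not the actual dirstate format.

```python
def parse_records(blob, fields_per_record):
    """Split a newline-delimited blob once, then slice it into records."""
    fields = blob.split(b"\n")
    # A trailing newline leaves one empty field at the end; drop it.
    if fields and fields[-1] == b"":
        fields.pop()
    if len(fields) % fields_per_record:
        raise ValueError("blob does not contain a whole number of records")
    # Slice fixed-width chunks out of the flat list instead of consuming
    # the fields one at a time.
    return [
        tuple(fields[i:i + fields_per_record])
        for i in range(0, len(fields), fields_per_record)
    ]


# Hypothetical layout: path, kind, file_id
blob = b"a.txt\nfile\nid-a\nb.txt\nfile\nid-b\n"
records = parse_records(blob, 3)
```

The point of the design is that the whole file goes through a single split()
call, and the per-record work is reduced to list slicing.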
> 
> I also worked out actually building up the lists of objects, and if I
> fast-path the cases of having 1 or 2 parents, then the total cost
> doesn't go up too much. I'm at 280ms for 2 parents.
> 
> It turns out that 'unicode.split' is 2x slower than 'string.split', so
> it takes 120ms (lsprof) versus 68ms. And then you pay another 60ms to
> decode('utf8') the whole file, for a total real cost of 360ms instead of
> 280ms.
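
The comparison above can be approximated on a current interpreter with
something like the following. This is only a sketch: it uses Python 3, where
bytes plays the role of the old str and str the role of unicode, and the
synthetic data and the function names are assumptions. The absolute numbers
will not match the ones quoted here.

```python
import timeit

# Synthetic stand-in for a dirstate-like file: 20k newline-delimited paths.
data = b"\n".join(b"path/to/file-%d" % i for i in range(20000))


def split_bytes():
    # Split the raw bytes directly, with no decoding step.
    return data.split(b"\n")


def split_text():
    # Decode the whole blob first, then split the resulting text.
    return data.decode("utf8").split("\n")


bytes_time = timeit.timeit(split_bytes, number=20)
text_time = timeit.timeit(split_text, number=20)
print("bytes split:        %.1f ms" % (bytes_time * 1000))
print("decode, then split: %.1f ms" % (text_time * 1000))
```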
> 
> At this point I'm running into the time it takes to do 60k additions,
> and 20k list.appends().
> So I don't think there is much more room for improvement.
> I'm curious if we would be better off having a separate file for basis
> information, just in case we ever have a use case for reading the
> working inventory without wanting the basis stuff.
> 
> This is the breakdown based on number of parent entries:
> parents  str    unicode  file_size
> 0        100ms  150ms    3.1MB
> 1        177ms  268ms    5.7MB
> 2        275ms  352ms    5.9MB
> 3        370ms  470ms    6.1MB
> 
> (With >1 parent, I just give the extra parents null: records. That's not
> perfectly reasonable, but I needed something.)
> 
> So if we restrict ourselves to not caching any parents, we can get this
> to be very fast. In comparison, hg's dirstate read of the same kernel
> tree takes 120ms. Though their dirstate only records 'state'
> (removed,added,needs merge), size, mtime, and path. They don't save the
> kind, hash, file_id, or the extra stat bits.
> 
> If I restrict my list to just include what they include, my file size
> drops to 1M, and it can be read in 70ms. (revno 42 explores an hg style
> dirstate). So we do pay for our extra information. 110ms versus 70ms for
> having file_ids. Though still faster than hg's struct packing.
> 
> And while I'm on it, adler32 checksum is really frickin' fast. On 6MB it
> takes about 4ms to complete. Way faster than sha1.

That's why I suggested it :).
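
For anyone wanting to reproduce the adler32 versus sha1 comparison, a rough
sketch using the standard zlib and hashlib modules might look like this. The
6MB random payload is an assumption standing in for the dirstate file, and
the timings will depend on the machine and Python version.

```python
import hashlib
import os
import timeit
import zlib

# Random 6MB payload standing in for the dirstate file discussed above.
payload = os.urandom(6 * 1024 * 1024)


def adler():
    # adler32 is a cheap checksum; it returns an unsigned 32-bit integer.
    return zlib.adler32(payload)


def sha():
    # sha1 is a cryptographic hash and does far more work per byte.
    return hashlib.sha1(payload).hexdigest()


runs = 5
adler_ms = timeit.timeit(adler, number=runs) / runs * 1000
sha_ms = timeit.timeit(sha, number=runs) / runs * 1000
print("adler32: %.2f ms per pass" % adler_ms)
print("sha1:    %.2f ms per pass" % sha_ms)
```

adler32 only guards against accidental corruption, which is all a cache
validator needs; it is not a substitute for the sha1 stored in the inventory.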

> I'm pretty happy with how everything has turned out. It ends up that I'm
> using a format almost exactly like what Robert mentioned, and with all
> our extra information, we can still read it really fast.

This is really good. I'm approaching from the other end - reworking
status at the moment to accept a walker with the appropriate information
in it. I'll dovetail working inventory/disk status and basis inventories
on the fly in existing formats, and do it straight from a dirstate for
format 4 trees. This is shaping up in my dirstate branch; if you'd like
to collaborate on this part of it, cool.

Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.