[plugin] dirstate experiments
Robert Collins
robertc at robertcollins.net
Thu Jun 15 12:22:47 BST 2006
On Wed, 2006-06-14 at 14:34 -0500, John Arbash Meinel wrote:
> Robert Collins wrote:
>
> ...
>
> >
> > Well, there are two cases here. The stat cache is just a validator for
> > the fs, so unpacking it does not make sense. However, the sha1 is stored
> > in the inventory in commit, and code expects to get at it, so I think
> > having the sha be accessible is a good idea - I'd rather pay a little
> > bit more in size for a format that is directly usable in the inventory
> > logic, as streaming reads are fast - its seeks that are slow.
> >
> >> It most certainly is not editable in that form, but it is readable.
> >>
> >> Also, it turns out that the major overhead at this point is the giant
> >> 'text.decode()', plus the overhead of operating on unicode objects
> >> rather than operating on str() objects.
> >
> > I was wondering when that would start to bite :(.
>
>
> So I've hacked on the format some more. Now I'm able to include an
> arbitrary number of parent entries. And now the delimiter is '\n' all
> the way around.
> Just instead of iterating and pulling things off as I need it, I split
> the text, and use slicing to pull out the chunks.
> With 2 parent blobs (which are the same size as the current blob) I have
> the time down to 180ms for reading everything in.
>
> I also worked out actually building up the lists of objects, and if I
> fast-path the cases of having 1 or 2 parents, then the total cost
> doesn't go up too much. I'm at 280ms for 2 parents.
>
> It turns out that 'unicode.split' is 2x slower than 'string.split', so
> it takes 120ms (lsprof) versus 68ms. And then you pay another 60ms to
> decode('utf8') the whole file, for a total real cost of 360ms instead of
> 280ms.
>
> At this point I'm running into the time it takes to do 60k additions,
> and 20k list.appends().
> So I don't think there is much more room for improvement.
> I'm curious if we would be better off having a separate file for basis
> information, just in case we ever have a use case for reading the
> working inventory without wanting the basis stuff.
>
> This is the breakdown based on number of parent entries:
> num str unicode file_size
> 0 100ms 150ms 3.1MB
> 1 177ms 268ms 5.7MB
> 2 275ms 352ms 5.9MB
> 3 370ms 470ms 6.1MB
>
> (With >1 parent, I just have the extra parents have null: records, not
> perfectly reasonable, but I needed something.)
>
> So if we restrict ourselves to not caching any parents, we can get this
> to be very fast. In comparison, hg's dirstate read of the same kernel
> tree takes 120ms. Though their dirstate only records 'state'
> (removed,added,needs merge), size, mtime, and path. They don't save the
> kind, hash, file_id, or the extra stat bits.
>
> If I restrict my list to just include what they include, my file size
> drops to 1M, and it can be read in 70ms. (revno 42 explores an hg style
> dirstate). So we do pay for our extra information. 110ms versus 70ms for
> having file_ids. Though still faster than hg's struct packing.
>
> And while I'm on it, adler32 checksum is really frickin' fast. On 6MB it
> takes about 4ms to complete. Way faster than sha1.
Thats why I suggested it :).
> I'm pretty happy with how everything has turned out. It ends up that I'm
> using a format almost exactly like what Robert mentioned, and with all
> our extra information, we can still read it really fast.
This is really good. I'm approaching from the other end - reworking
status at the moment to accept a walker with the appropriate information
in it. I'll dovetail working inventory/disk status and basis inventories
on the fly in existing formats, and do it straight from a dirstate for
format 4 trees. This is shaping up in my dirstate branch, if you'd like
to collaborate on this part of - cool.
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060615/a4923459/attachment.pgp
More information about the bazaar
mailing list