Sax and revision.weave

John A Meinel john at arbash-meinel.com
Fri Oct 21 03:16:51 BST 2005


Martin, you asked me if it was worth it to print out the xml directly,
versus using cElementTree.
The summary is that because a weave file is sensitive to how information
is stored (sometimes dramatically), I would say that *yes* it is
important. I was able to create a weave of the revision XML that was
1.1M instead of 1.3M, and 27k lines instead of 36k lines (for 2130
revisions).

Here is the long answer:

I went ahead and resurrected my bzr-sax branch. Available from:
http://bzr.arbash-meinel.com/branches/bzr/sax/

Actually, it merged against bzr.dev with only 2 or 3 conflicts.

Anyway, I thought I would play around with the conversion, to see how
much space it will take to switch to a revision weave.
http://bzr.arbash-meinel.com/plugins/revstore2weave/

The first thing I found is that there are a lot of tricks about how you
can format the output, in order to optimize for how our weave code works.
The idea is to keep chunks that change together, and keep ones that
don't change at all together. For example, here are some possible
layouts for the revision XML:

<revision format="5" committer="foo" inventory_sha1="baz" \
timezone="-1800" timestamp="123123123">
<message>blah</message>
<parents>
<revision_ref revision_id="blah"/>
</parents>
</revision>

or

<revision
 committer="foo"
 format="5"
 inventory_sha1="baz"
 timezone="-1800"
 timestamp="123123123"
>
<message>blah</message>
<parents>
<revision_ref
  revision_id="blah"
/>
</parents>
</revision>

or

<revision format="5"
 committer="foo" timezone="-1800"
 timestamp="123123123" inventory_sha1="baz">
<message>blah</message>
<parents><revision_ref revision_id="blah"/>
</parents>
</revision>

Now the first one, is the way we currently serialize the XML using
cElementTree, the second is what I did when I just tried to put each one
on its own line (and it is one of the worst), and the last one is when I
optimized the changes.

The first form is bad because it puts everything on one line, and there
is a lot of duplication. (committer, timezone, format are all common).
The second saves some space because it can share some of the lines, but
it suffers from a surplus of control lines. Because it gets divided up
into sections:

<revision  # Always common
 committer="foo" # semi common
 format="5" # Always common
 inventory_sha1="baz" # *never* common
 timezone="-1800" # Always
 timestamp="123123123" # Never
> # Always common
<message>blah</message> # Never common
<parents> # Always common
<revision_ref # Always common
  revision_id="blah" # Never common
/> #Always
</parents> # Always
</revision> # Always

Now, the problem is that whenever you switch from a section which is
"always common" to "never common" you have a set of control lines for
each revision (4, 2 adds, 2 deletes).

And it turns out that all those control lines quickly drown out the
space savings, and cause the number of lines to grow rapidly.
The last form has the property that the first few lines are almost
always the same.
Then you have a large chunk which changes almost every revision (up to
the </parent> and then the final section never changes.
This drastically reduces the total number of lines, along with the total
size of the file.

Here are some actual numbers, as taken from the bzr.dev revno=1321


2310 revisions

Size of uncompressed revision-store:	1012.6KB = 1036854
Size of compressed revision-store:	 788.3KB =  807197
Size of bzr.dev revision-store.weave:	   1.3MB = 1318580, 36170 lines
                          compressed:      372KB =  380700
revision-store.weave with 1 line per option:1.1MB =1183151, 50244 lines
			  compressed:	   378KB =  387496
revision-store.weave SAX style:		   1.1MB = 1106937, 27988 lines
			  compressed:	   326KB =  334322


Because we know in advance what the weave format is going to be, by
manually tweaking the XML, we can keep the file size down *and* keep the
number of lines smaller, so that performance is improved.

Above you can see that you get slightly better delta compression if you
have each entry on its own line. But you trade off having almost 2x as
many lines.

And because you have to read in the entire weave, and write it out each
time, it actually might be okay to compress it. And that shrinks it to
30% of the filesize. Much smaller than compressing each file individually.

John
=:->
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 253 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20051020/56528246/attachment.pgp 


More information about the bazaar mailing list