[MERGE] (robertc) Cap the amount of data we write in a single IO during local path pack operations to fix bug 255656. (Robert Collins)

Robert Collins robertc at robertcollins.net
Fri Aug 15 22:15:04 BST 2008


On Fri, 2008-08-15 at 11:27 -0500, John Arbash Meinel wrote:
> 
> Robert Collins wrote:
> > On Thu, 2008-08-14 at 23:08 -0500, John Arbash Meinel wrote:
> > 
> > 
> >> This at least feels like it should be a helper function akin to
> >> "osutils.pumpfile".
> > 
> > Ok, I'll put one [tested] together.
> > 
> >> BB:tweak
> >>
> >> The actual loop seems fine, though I would wonder about buffer() versus
> >> just bytes[start:end]. (I realize there is at least 1 copy in using
> >> slicing, but I also don't think we need to be using a 5MB buffer here.)
> > 
> > bytes[start:end] does a memcpy. buffer does not.
> > 
> > -Rob
> 
> As I said, "1 copy using slicing". I understand that buffer works
> without copying. I'm quite curious to probe deeper and see how it works.

I didn't read what you said carefully enough, sorry.
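
For the archives, a minimal Python 2 illustration of the distinction
(the names here are mine, not anything in bzrlib):

  data = 'x' * (5 * 1024 * 1024)   # a 5MB string
  chunk = data[1024:2048]          # slicing: allocates and memcpys a new str
  view = buffer(data, 1024, 1024)  # buffer(): zero-copy view into data
  assert str(view) == chunk        # same bytes; str() is where a copy happens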

> Specifically, does it cache its hash value? That property comes in handy
> for PatienceDiff and GroupCompress.

I don't think we'd want buffer objects in the compressor, because we
don't want to increase memory use by keeping already-compressed texts
alive. We may well want buffer objects in the decompressor, though, just
to avoid many temporary copies. OTOH, as you say, each one is still an
object in itself. I am wondering about creating something honouring the
buffer API that can be array-allocated in C, so that the decompressor
could know 'there are 2000 instructions in this compressed text',
allocate a 2000-long vector of buffer-like objects, and then fill them in.
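
Roughly this shape, sketched in Python rather than C, and purely
hypothetical - none of these names exist in bzrlib today:

  def expand(block, instructions):
      # block: the raw bytes of one compressed block (a str)
      # instructions: (offset, length) pairs pointing into block
      chunks = [None] * len(instructions)  # the pre-sized vector
      for i, (offset, length) in enumerate(instructions):
          chunks[i] = buffer(block, offset, length)  # no copying yet
      return chunks

The C version would do the equivalent in one arena allocation rather
than 2000 separate Python objects.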

> It still requires an object creation. There are also other limitations,
> such as:
> 
> >>> ''.join([y, z])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: sequence item 0: expected string, buffer found
> 
> So buffer() objects aren't supported for string.join()
> 
> That may be a big issue.

That failure surprises me.
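
The obvious workaround - str()ing each chunk before joining - does
work, but it reintroduces exactly the copies buffer() was supposed to
avoid:

  >>> y = buffer('abc')
  >>> z = buffer('def')
  >>> ''.join([str(y), str(z)])
  'abcdef'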

> My other concern for GroupCompress specifically is that the buffer
> object has to hold a reference to the original string. So imagine a
> worst-case scenario, where you have 3 10kB texts in 3 different 20MB GC
> blocks. If we used buffer() as the way to return chunks, you would end
> up holding on to 3*20MB = 60MB of memory for 30kB of actual texts.

Indeed. OTOH, if we move to a 'chunked' representation and only move to
actual lines for diff/merge operations, then I'd expect a chunks->lines
step to happen, at which point we'd want to go back to regular strings.
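
Something like this hypothetical helper (not existing bzrlib API) is
what I'd expect that step to look like; once it runs, the buffers and
the blocks they pin can be freed:

  def chunks_to_lines(chunks):
      # Collapse the zero-copy views into one real string, then split
      # with line endings kept, as the diff/merge code expects.
      return ''.join(str(c) for c in chunks).splitlines(True)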

> Note also that buffer(unicode, 2, 3) works, but it exposes the
> underlying unicode implementation:

That's expected, actually - buffers are really down at the byte-sequence
level.
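
For instance, on a narrow (UCS-2) little-endian build you get the raw
internal bytes rather than characters - a UCS-4 build would return
something different again:

  >>> str(buffer(u'abcdef', 2, 3))
  'b\x00c'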

> >>> sio.writelines(y)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: expected string or Unicode object, buffer found
> 
> Which I find odd, because
> >>> sio.write(y[0])

That is odd.
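
Though note that indexing a buffer copies the byte out as a plain str,
so .write() never actually sees a buffer in that last example:

  >>> type(buffer('abcdef')[0])
  <type 'str'>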

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.