[RFC] Multipart support for _urllib_
Martin Pool
mbp at canonical.com
Mon Jun 19 05:38:01 BST 2006
On 18/06/2006, at 2:26 AM, John Arbash Meinel wrote:
>
> Anyway, let's recap my previous timing tests:
>
> urllib get:          18m11s
> wget -r -np:         10m54s
> pycurl get:          18m38s
> modified pycurl:     11m17s
> Ellerman's urllib:   10m14s ***
>
>
> Now my max download speed is 160KB/s, so theoretically I can download
> all 30MB of bzr.dev in 3 minutes. So we still have some room for
> improvement. But after my combined changes we now have:
>
> pycurl w/ multirange support: 7m24s
>
> So we are now down to 40% (2.5x faster).
That is indeed impressive. I wonder if we can get down towards that
number by just progressively replacing things in urllib and not
depending on pycurl?
> Now, with a theoretical smart server, which can recompress things on
> the fly:
>
> current size of .bzr    => 26MB
> expanded size of .bzr   => 103MB
>   (zcat all of the .knit files, cat everything else)
> bzip2 of expanded texts => 7.9MB
>
> So a smart server could theoretically send only 1/3rd the amount of
> bytes over the wire, which could speed us up some more. (Giving us a
> max headroom of being approximately 6x faster than the current dumb
> protocol code: 2x for bandwidth, 3x for compression, though it makes
> things slower on a local network.)
I've started on one; I'll post it later on.
> Now my branch could probably be cleaned up a little bit, and we would
> want to write some tests for the range stuff. But the selftests pass,
> and we can do a 'get' of bzr.dev.
It looks reasonable to come in, but it does add some untested code.
One option would be to add range support to the server as you say,
but perhaps instead we could just have unit tests for each aspect:
generating and parsing multipart bodies and so on.
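To make that concrete, a unit test for the parsing side could feed a hand-built multipart/byteranges body to the parser and check the extracted ranges, with no server involved. A minimal sketch follows; parse_multipart_ranges here is a hypothetical stand-in helper, not the code from John's patch:

```python
def parse_multipart_ranges(body, boundary):
    """Split a multipart/byteranges body into (content_range, data) pairs.

    A deliberately naive parser, just enough to illustrate the kind of
    unit test meant above.
    """
    parts = []
    for chunk in body.split('--' + boundary)[1:]:
        if chunk.strip() in ('', '--'):
            continue  # the final '--BOUNDARY--' terminator
        headers, _, data = chunk.lstrip('\r\n').partition('\r\n\r\n')
        content_range = None
        for line in headers.split('\r\n'):
            name, _, value = line.partition(':')
            if name.lower() == 'content-range':
                content_range = value.strip()
        parts.append((content_range, data.rstrip('\r\n')))
    return parts

# A hand-built two-part response body, as a server might send for
# "Range: bytes=0-4,10-13" against a 20-byte file.
body = ('--BOUND\r\n'
        'Content-Type: text/plain\r\n'
        'Content-Range: bytes 0-4/20\r\n'
        '\r\n'
        'hello\r\n'
        '--BOUND\r\n'
        'Content-Range: bytes 10-13/20\r\n'
        '\r\n'
        'data\r\n'
        '--BOUND--\r\n')

parts = parse_multipart_ranges(body, 'BOUND')
assert parts == [('bytes 0-4/20', 'hello'), ('bytes 10-13/20', 'data')]
```

A matching test for the generating side would do the reverse: build a body from known ranges and check it round-trips through the parser.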
> I'm not sure how we want to write multirange support into
> SimpleHTTPServer, but we might consider merging this anyway.
> +
> +    def _handle_response(self, path, response):
> +        """Interpret the code & headers and return a HTTP response.
> +
> +        This is a factory method which returns an appropriate HTTP
> +        response based on the code & headers it's given.
> +        """
> +        content_type = response.headers['Content-Type']
> +        mutter('handling response code %s ctype %s', response.code,
> +               content_type)
> +
> +        if response.code == 206 and self._is_multipart(content_type):
> +            # Full fledged multipart response
> +            return HttpMultipartRangeResponse(path, content_type,
> +                                              response)
> +        elif response.code == 206:
> +            # A response to a range request, but not multipart
> +            content_range = response.headers['Content-Range']
> +            return HttpRangeResponse(path, content_range, response)
> +        elif response.code == 200:
> +            # A regular non-range response; unfortunately the result
> +            # from urllib doesn't support seek, so we wrap it in a
> +            # StringIO
> +            return StringIO(response.read())
> +        elif response.code == 404:
> +            raise NoSuchFile(path)
> +
> +        raise BzrError("HTTP couldn't handle code %s", response.code)
Perhaps this should be a TransportError instead, and include the
status text from the response?
> def put(self, relpath, f, mode=None):
> """Copy the file-like or string object into the location.
> @@ -341,6 +329,51 @@
> else:
> return self.__class__(self.abspath(offset))
>
> +    def _offsets_to_ranges(self, offsets):
> +        """Turn a list of offsets and sizes into a list of byte ranges.
> +
> +        :param offsets: A list of tuples of (start, size).
> +            An empty list is not accepted.
> +
> +        :return: a list of byte ranges (start, end). Adjacent ranges
> +            will be combined in the result.
> +        """
> +        # We need a copy of the offsets, as the caller might expect it
> +        # to remain unsorted. This doesn't seem expensive for memory
> +        # at least.
> +        offsets = sorted(offsets)
It's not clear from the docstring what the difference is between
"offsets" and "byte ranges" in this context, or whether there is one.
It looks like you just mean to join up adjacent ranges (?) - if so,
say so.
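For what it's worth, if joining is all that's meant, the whole transformation fits in a few lines. A sketch (not the patch's implementation; "adjacent" is taken to mean the next offset starts at or before end + 1):

```python
def offsets_to_ranges(offsets):
    """Turn (start, size) offsets into coalesced (start, end) byte ranges.

    Adjacent or overlapping offsets are merged into one range, so
    [(0, 5), (5, 5), (20, 4)] becomes [(0, 9), (20, 23)].
    """
    ranges = []
    for start, size in sorted(offsets):  # sorted() copies; caller's list is untouched
        end = start + size - 1           # ranges are inclusive at both ends
        if ranges and start <= ranges[-1][1] + 1:
            # This offset touches or overlaps the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges

assert offsets_to_ranges([(20, 4), (0, 5), (5, 5)]) == [(0, 9), (20, 23)]
```

Spelling out in the docstring that offsets are (start, size) pairs while ranges are inclusive (start, end) pairs, and that touching ranges get merged, would answer the question above.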
--
Martin
More information about the bazaar mailing list