StaticTuple... naming, maintenance, ...
Martin Pool
mbp at canonical.com
Tue Oct 6 01:08:16 BST 2009
2009/10/6 John Arbash Meinel <john at arbash-meinel.com>:
> I mentioned this in an old thread, but I have a branch here:
> lp:~jameinel/bzr/2.1-static-tuple
>
> In it, I worked on re-implementing a 'tuple-like' structure. The basics are:
>
> 1) It is limited in what it can point to, so it doesn't participate in
> the python garbage collector. (It cannot reference things which would
> cause a cycle.)
>
> 2) It also is 'static' in that you can intern it based on the hash.
>
> 3) I also implemented a custom 'Interner' class, that works basically
> like a PySet, but allows lookup and uses half the memory. (1/3rd
> the memory of a dict.)
>
> (In loading all of launchpad, 24MB was used for just the dict holding
> the interned keys. So I save 16MB just with that.)
>
> 4) The net effect for 'bzr branch launchpad' was 17% less peak memory
> consumed and *40%* faster (11min => 8min).
<loud applause>
That's pretty cool that you get such a win from just this change - on
the other hand I suppose it shows Python working under a considerable
burden the rest of the time.
>
>
> So overall, I'm quite happy, and I'd like to look into what it will take
> to land this. I have quite a few open questions.
>
> 1) Name... it is currently called "StaticTuple", but if it is an object
> we are going to make use of, that is a fairly 'heavy' name.
>
> Compare: "key = (file_id, revision_id)"
> versus: "key = StaticTuple(file_id, revision_id)"
> or even: "key = StaticTuple((file_id, revision_id))"
>
> I think lowercase is reasonable, though I wonder about
> "key = stuple(file_id, revision_id)"
Well, you could always do "... import StaticTuple as stuple". I'd be
inclined to make the canonical name fairly formal.
> 2) Constructor args:
> "tuple(X)" takes a sequence X, however if you want to create a
> 3-tuple you have the (a, b, c) short form
>
> As such, I was planning on making "StaticTuple(*args)", so that you
> can just change:
> foo = (a, b, c)
> into:
> foo = StaticTuple(a, b, c)
>
Sounds good
> I would probably have a separate "StaticTuple.from_sequence()" for
> the other form. You certainly can do StaticTuple(*t), however the
> main loss is that "tuple(tuple(t)) is t", while
> StaticTuple(*st) would have to be a new object.
Right, because *st would presumably turn it into a plain tuple.
But you could have StaticTuple(st) just return st, couldn't you?
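A rough pure-Python sketch of the constructor behaviour discussed here, assuming the `*args` form plus a `from_sequence()` classmethod. `StaticTupleSketch` is a stand-in name: it subclasses the builtin tuple, so it shows only the call signatures and identity behaviour, not the GC or memory properties of the C type.

```python
class StaticTupleSketch(tuple):
    """Illustrative stand-in for the proposed type (call semantics only)."""

    __slots__ = ()

    def __new__(cls, *args):
        # Assumed behaviour: passing a single existing instance returns it
        # unchanged, mirroring how tuple(t) returns t for a plain tuple.
        if len(args) == 1 and type(args[0]) is cls:
            return args[0]
        return tuple.__new__(cls, args)

    @classmethod
    def from_sequence(cls, seq):
        # The "other form": build from any sequence, reusing an existing
        # instance instead of copying it.
        if type(seq) is cls:
            return seq
        return tuple.__new__(cls, seq)


st = StaticTupleSketch("file-id", "revision-id")
same = StaticTupleSketch(st)      # identity preserved
copy = StaticTupleSketch(*st)     # unpacked, so a new object is built
from_list = StaticTupleSketch.from_sequence(["file-id", "revision-id"])
```

One cost of letting `StaticTuple(st)` return `st` is that a one-element tuple whose only item is itself an instance can no longer be spelled that way; it would need `from_sequence([st])`.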
>
> 3) C/Cython/Pyrex
> The #1 memory benefit is removing the python GC header from all of
> the objects. (16 bytes / object.)
> I can easily define such a type in C, and have done so.
>
> However, as you get into doing more with these objects, (like
> creating a C level api to share with other code), there is a *lot*
> more maintenance overhead in doing it from C.
>
> You have to do all the exception handling manually, *and* write all
> of the boilerplate for exposing the dynamic loading of functions.
> In Pyrex/Cython doing so is:
>
> cdef object myfunc(object):
>
> becomes
>
> cdef api object myfunc(object):
>
> Doing so in C is about 4 lines of boilerplate per function, type
> checking, etc. Plus another 20+ lines that you have to write to
> describe that you *have* a C api that should be loaded.
>
>
> In the end, I wrote "StaticTuple" in C, and "StaticTupleInterner" in
> Cython, and the latter took a day, and the former took a week. It is
> a "sunk cost", but ongoing maintenance is not.
>
> The main issue here is that Pyrex will not generate objects without
> the HAVE_GC flag set. Cython >= 0.11 can (as long as you don't have
> 'object' attributes, which is true here, because I have to use
> PyObject** because neither Pyrex nor Cython support C arrays of
> objects)
>
> The difficulty is that it would be a hard jump to go from Pyrex 0.8 or
> so (doesn't even support +=) to Cython 0.11 (it is in Karmic, but
> Jaunty only has Cython 0.10).
>
> I would *really* like to switch to Cython 0.11+, as I have specific
> benefits. One could argue that we could try to be compatible, and
> people can compile using Pyrex, and just wouldn't get the memory and
> speed improvement of avoiding the GC...
>
> I'm also using stuff like 'cpdef' and 'inline', but I can work
> around those things easily enough. I can't hack the 'HAVE_GC' flag
> easily.
I guess we could start checking in and shipping the C files, though
people have identified that this would cause some considerable churn,
and perhaps there were other problems.
I'd be reluctant to add such a high dependency, but if you really want
it I don't think we should block it. The dirstate code shows me we
should bias the dependencies/speed/clarity tripod more towards speed
and clarity.
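For anyone wanting to poke at the GC point in item 3 from the interpreter, the standard `gc.is_tracked()` call reports whether an object is currently tracked by the cycle collector. This is only a sketch: it says nothing about the header size itself, and whether a plain tuple of strings is tracked depends on the CPython version and on whether a collection has already run, so that result is printed rather than assumed.

```python
import gc

file_id = "file-id"
revision_id = "revision-id"
key = (file_id, revision_id)

# Strings cannot refer to other objects, so the collector never tracks them.
string_tracked = gc.is_tracked(file_id)

# A list can always take part in a cycle, so it is tracked.
list_tracked = gc.is_tracked([file_id, revision_id])

# For a plain tuple the answer varies by interpreter version and GC state.
tuple_tracked = gc.is_tracked(key)

print(string_tracked)
print(list_tracked)
print(tuple_tracked)
```

A type that can only hold things like strings and other instances of itself can safely opt out of tracking for good, which is what point 1 of the original mail relies on.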
> I'd like to get some feedback, so I have a feel what I need to do to
> finish this off and get it landed. I think this is a net win, and we
> just need to decide some of the finer details and balance points.
Would you like more feedback or a code review?
--
Martin <http://launchpad.net/~mbp/>