StaticTuple... naming, maintenance, ...

Tue Oct 6 01:08:16 BST 2009

2009/10/6 John Arbash Meinel <john at arbash-meinel.com>:
>
> So I mentioned in an old thread, but I have a branch here:
>  lp:~jameinel/bzr/2.1-static-tuple
>
> In it, I worked on re-implementing a 'tuple-like' structure. The basics are:
>
> 1) It is limited in what it can point to, so it doesn't participate in
>   the python garbage collector. (It cannot reference things which would
>   cause a cycle.)
>
> 2) It also is 'static' in that you can intern it based on the hash.
>
> 3) I also implemented a custom 'Interner' class, that works basically
>   like a PySet, but allows lookup and uses half the memory. (1/3rd
>   the memory of a dict.)
>
>   (In loading all of launchpad, 24MB was used for just the dict holding
>    the interned keys. So I save 16MB just with that.)
>
> 4) The net effect for 'bzr branch launchpad' was 17% less peak memory
>   consumed and *40%* faster (11min => 8min).

<loud applause>

That's pretty cool that you get such a win from just this change - on
the other hand I suppose it shows Python working under a considerable
burden the rest of the time.
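
For anyone following along, the interning idea in (2) and (3) can be sketched in pure Python with a plain dict. This is only an illustration of the concept, not the Interner class from the branch; the names `intern_key` and `_interned` are made up for the example.

```python
# Hypothetical illustration: keep one canonical copy of each equal key.
_interned = {}


def intern_key(key):
    """Return the canonical object equal to key, storing key if it is new."""
    # setdefault stores key -> key the first time and returns the stored
    # object afterwards. A dict needs a value slot as well as a key slot
    # per entry, which is the overhead a set-like interner would avoid.
    return _interned.setdefault(key, key)


# Build the tuples at runtime so they start out as two distinct objects.
a = intern_key(("file-" + str(1), "rev-1"))
b = intern_key(("file-" + str(1), "rev-1"))
assert a is b
```

After interning, every holder of an equal key shares one object, so the duplicate copies can be freed.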

>
>
> So overall, I'm quite happy, and I'd like to look into what it will take
> to land this. I have quite a few open questions.
>
> 1) Name... it is currently called "StaticTuple", but if it is an object
>   we are going to make use of, that is a fairly 'heavy' name.
>
>   Compare: "key = (file_id, revision_id)"
>   versus:  "key = StaticTuple(file_id, revision_id)"
>   or even: "key = StaticTuple((file_id, revision_id))"
>
>   I think lowercase is reasonable, though I wonder about
>            "key = stuple(file_id, revision_id)"

Well, you could always do "... import StaticTuple as stuple".  I'd be
inclined to make the canonical name fairly formal.

> 2) Constructor args:
>   "tuple(X)" takes a sequence X, however if you want to create a
>   3-tuple you have the (a, b, c) short form
>
>   As such, I was planning on making "StaticTuple(*args)", so that you
>   can just change:
>     foo = (a, b, c)
>   into:
>     foo = StaticTuple(a, b, c)
>

Sounds good

>   I would probably have a separate "StaticTuple.from_sequence()" for
>   the other form. You certainly can do StaticTuple(*t), however the
>   main loss is that "tuple(tuple(t)) is t", while
>   StaticTuple(*st) would have to be a new object.

Right, because *st would presumably turn it into a plain tuple.

But you could have StaticTuple(st) just return st, couldn't you?
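
To make the trade-off concrete, here is a rough pure-Python stand-in. It assumes a tuple subclass purely for illustration (the real type is a C extension and is not a tuple subclass), and it only mimics the constructor shapes discussed above.

```python
class StaticTuple(tuple):
    """Pure-Python stand-in used only to illustrate the constructor API."""

    __slots__ = ()

    def __new__(cls, *args):
        # StaticTuple(a, b, c) mirrors the (a, b, c) literal.
        return tuple.__new__(cls, args)

    @classmethod
    def from_sequence(cls, seq):
        # Hand back an existing StaticTuple unchanged, like tuple(t) is t.
        if type(seq) is cls:
            return seq
        return cls(*seq)


st = StaticTuple("file-id", "rev-id")
assert StaticTuple.from_sequence(st) is st   # identity preserved
assert StaticTuple(*st) is not st            # unpacking builds a new object
assert StaticTuple(*st) == st
assert StaticTuple(st) == (st,)              # one argument: a 1-tuple holding st
```

The last line shows the catch with a `*args` constructor: if StaticTuple(st) returned st itself, a 1-tuple whose only member is a StaticTuple would need some other spelling.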

>
> 3) C/Cython/Pyrex
>   The #1 memory benefit is removing the python GC header from all of
>   the objects. (16 bytes / object.)
>   I can easily define such a type in C, and have done so.
>
>   However, as you get into doing more with these objects, (like
>   creating a C level api to share with other code), there is a *lot*
>   more maintenance overhead in doing it from C.
>
>   You have to do all the exception handling manually, *and* write all
>   of the boilerplate for exposing the dynamic loading of functions.
>   In Pyrex/Cython doing so is:
>
>      cdef object myfunc(object):
>
>    becomes
>
>      cdef api object myfunc(object):
>
>    Doing so in C is about 4 lines of boilerplate per function, type
>    checking, etc. Plus another 20+ lines that you have to write to
>    describe that you *have* a C api that should be loaded.
>
>
>    In the end, I wrote "StaticTuple" in C, and "StaticTupleInterner" in
>    Cython, and the latter took a day, and the former took a week. It is
>    a "sunk cost", but ongoing maintenance is not.
>
>    The main issue here is that Pyrex will not generate objects without
>    the HAVE_GC flag set. Cython >= 0.11 can (as long as you don't have
>    'object' attributes, which is true here, because I have to use
>    PyObject** because neither Pyrex nor Cython support C arrays of
>    objects)
>
>    The difficulty is that it would be a hard jump to go from Pyrex 0.8 or
>    so (doesn't even support +=) to Cython 0.11 (it is in Karmic, but
>    Jaunty only has Cython 0.10).
>
>    I would *really* like to switch to Cython 0.11+, as I have specific
>    benefits. One could argue that we could try to be compatible, and
>    people can compile using Pyrex, and just wouldn't get the memory and
>    speed improvement of avoiding the GC...
>
>    I'm also using stuff like 'cpdef' and 'inline', but I can work
>    around those things easily enough. I can't hack the 'HAVE_GC' flag
>    easily.

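On the GC point in (3): which objects CPython tracks for cycle collection can be checked from Python with `gc.is_tracked`. A small sketch, only to show the distinction being relied on; it says nothing about the exact per-object saving.

```python
import gc

# Strings cannot hold references to other objects, so CPython does not
# track them in the cyclic garbage collector.
assert not gc.is_tracked("revision-id")

# A list can hold arbitrary objects and so could take part in a cycle;
# it is tracked and carries the extra GC bookkeeping.
assert gc.is_tracked(["revision-id"])
```

A type that can only point at untracked things like strings cannot form a cycle, which is why it can be left out of the collector.
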
I guess we could start checking in and shipping the C files, though
people have identified that this would cause some considerable churn,
and perhaps there were other problems.

I'd be reluctant to add such a high dependency, but if you really want
it I don't think we should block it.  The dirstate code shows me we
should bias the dependencies/speed/clarity tripod more towards speed
and clarity.

> I'd like to get some feedback, so I have a feel what I need to do to
> finish this off and get it landed. I think this is a net win, and we
> just need to decide some of the finer details and balance points.

Would you like more feedback or a code review?

-- 
Martin <http://launchpad.net/~mbp/>