StaticTuple... naming, maintenance, ...

John Arbash Meinel john at arbash-meinel.com
Mon Oct 5 23:58:02 BST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So I mentioned in an old thread, but I have a branch here:
  lp:~jameinel/bzr/2.1-static-tuple

In it, I worked on re-implementing a 'tuple-like' structure. The basics are:

1) It is limited in what it can point to, so it doesn't participate in
   the python garbage collector. (It cannot reference things which would
   cause a cycle.)

2) It also is 'static' in that you can intern it based on the hash.

3) I also implemented a custom 'Interner' class, that works basically
   like a PySet, but allows lookup and uses half the memory. (1/3rd
   the memory of a dict.)

   (In loading all of launchpad, 24MB was used for just the dict holding
    the interned keys. So I save 16MB just with that.)

4) The next effect for 'bzr branch launchpad' was 17% less peak memory
   consumed and *40%* faster (11min => 8min).


So overall, I'm quite happy, and I'd like to look into what it will take
to land this. I have quite a few open questions.

1) Name... it is currently called "StaticTuple", but if it is an object
   we are going to make use of, that is a fairly 'heavy' name.

   Compare: "key = (file_id, revision_id)"
   versus:  "key = StaticTuple(file_id, revision_id)"
   or even: "key = StaticTuple((file_id, revision_id))"

   I think lowercase is reasonable, though I wonder about
            "key = stuple(file_id, revision_id)"


2) Constructor args:
   "tuple(X)" takes a sequence X, however if you want to create a
   3-tuple you have the (a, b, c) short form

   As such, I was planning on making "StaticTuple(*args)", so that you
   can just change:
     foo = (a, b, c)
   into:
     foo = StaticTuple(a, b, c)

   I would probably have a separate "StaticTuple.from_sequence()" for
   the other form. You certainly can do StaticTuple(*t), however the
   main loss is that "tuple(tuple(t)) is t", while
   StaticTuple(*st) would have to be a new object.

3) C/Cython/Pyrex
   The #1 memory benefit is removing the python GC header from all of
   the objects. (16 bytes / object.)
   I can easily define such a type in C, and have done so.

   However, as you get into doing more with these objects, (like
   creating a C level api to share with other code), there is a *lot*
   more maintenance overhead in doing it from C.

   You have to do all the exception handling manually, *and* write all
   of the boilerplate for exposing the dynamic loading of functions.
   In Pyrex/Cython doing so is:

      cdef object myfunc(object):

    becomes

      cdef api object myfunc(object):

    Doing so in C is about 4 lines of boilerplate per function, type
    checking, etc. Plus another 20+ lines that you have to write to
    describe that you *have* a C api that should be loaded.


    In the end, I wrote "StaticTuple" in C, and "StaticTupleInterner" in
    Cython, and the latter took a day, and the former took a week. It is
    a "sunk cost", but ongoing maintenance is not.

    The main issue here is that Pyrex will not generate objects without
    the HAVE_GC flag set. Cython >= 0.11 can (as long as you don't have
    'object' attributes, which is true here, because I have to use
    PyObject** because neither Pyrex nor Cython support C arrays of
    objects)

    The difficulty is that would be a hard jump to go from Python 0.8 or
    so (doesn't even support +=) to Cython 0.11 (it is in Karmic, but
    Jaunty only have Cython 0.10).

    I would *really* like to switch to Cython 0.11+, as I have specific
    benefits. One could argue that we could try to be compatible, and
    people can compile using Pyrex, and just wouldn't get the memory and
    speed improvement of avoiding the GC...

    I'm also using stuff like 'cpdef' and 'inline', but I can work
    around those things easily enough. I can't hack the 'HAVE_GC' flag
    easily.


I'd like to get some feedback, so I have a feel what I need to do to
finish this off and get it landed. I think this is a net win, and we
just need to decide some of the finer details and balance points.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrKefoACgkQJdeBCYSNAAP8EACdFVN6zY0+1V6yEHr02oYkAPsM
RF8An22YD1mNSUpiHq6aEwEHgR5FDH5V
=f4Xf
-----END PGP SIGNATURE-----



More information about the bazaar mailing list