[MERGE] DirState pyrex helpers

Thu Jul 19 02:24:19 BST 2007

On 7/18/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
> > So the
> >
> > +    ctypedef int size_t
> >
> > isn't saying that it's an int, but rather that it should be treated
> > like an int when converting to and from Python objects?  Maybe it'd be
> > good to add an comment about this, just to save people worrying.
> > Maybe an unsigned long would be more generally right?
>
> I changed it to unsigned. I don't have a specific concern between unsigned an
> unsigned long if you prefer.
>
> I've tried to track down an "official" definition, but GNU seems to have it as
> part of the compiler. I see stuff like:

C99 doesn't say that it's either of them (iirc).  It can even be a
nonstandard type like long long on some platforms.  I think unsigned long
would be safest.

> > It can be a lot easier to debug these things if they occur for a user
> > if you print the repr of the object that was actually passed.
> >
>
> Done.
>
> I can say that the python stdlib doesn't do this:
>
> >>> os.lstat(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: coercing to Unicode: need string or buffer, int found
>
> But I won't say it is particularly helpful.

I meant to say, "or at least the type of the parameter".  Some builtin
errors say nothing about what was passed -- not even whether it was None
or not -- which seems to put in an extra debugging round trip.

> > +    void *PyTuple_GetItem_void_void "PyTuple_GET_ITEM" (void* tpl, int
> > index)
> >
> > Why is this not _void_int?
>
> Because the nomenclature is to indicate that it takes a void* instead of an
> <object> and *returns* a void* rather than an <object>.
>
> With Pyrex, if you name a function:
>
> object function(object param)
>
> It will do work to make sure objects are Py_INCREF and Py_DECREF at appropriate
>  times. However, I'm explicitly working in void* because of wanting to work
> closer to 'bare metal'.
>
> Specifically, I'm pulling a string out of a tuple which is in a list, but all I
> really want is the char* and int size.
>
> So I'm doing:
>
> void * cur
>
> cur = PyTuple_GetItem_void_void(
>         PyList_GetItem_object_void(list, offset))
> cur_cstr = PyString_AS_STRING_void(cur)
> cur_size = PyString_GET_SIZE_void(cur)
>
> What I'm really trying to do here is work in terms of the PyObject pointers,
> rather than a higher level "<object>" abstraction.
>
> I could change these to use PyObject* instead of void*, but then I feel it is a
> bit harder to tell the difference between PyObject *foo, and object foo. And
> when I first wrote it, I didn't realize you could do PyObject*.
>
> If you feel like it would be more obvious/better/something a different way, I'm
> willing to change it.

Oh I see.  I think using voids is fine.  Maybe again just a comment
explaining why it's so would be enough.  Is this a standard Pyrex
convention?

> > I do wonder if it would be simpler to just split the paths into lists
> > as we read them in.  I suppose that would require considerably more
> > allocation, and anyhow would be out of scope for this change.
>
> Simpler to split the paths into lists as we read them in...
>
> You mean instead of having a dirblock record be:
>
> (dirname, name, file_id)
>
> Such as:
>
> ('foo/bar', 'baz', 'baz-200720707-aoudnatoheu-1')
>
> You would have it be:
>
> (['foo', 'bar'], 'baz', 'baz-200720707')
>
> That certainly would be possible, and probably wouldn't be a lot of overhead
> during parsing. Right now _get_entry is already designed to reuse the 'dirname'
> object for all entries in the same directory (saves memory, and should save time).
>
> What you would add is a .split() call for all directories (~1 in 10 entries are
> directories in large projects).
>
> It would make bisecting more straightforward, but I don't know that it would be
> faster. cmp_by_dirs can actually be pretty fast because it is only a slight
> variation on memcmp based on how the first unmatched character differs. Versus
> comparing 2 lists.

Yes, that's what I meant.  We've had a few bugs because the sort order is
not what you directly get by comparing the key fields, and all other
things being equal having the structure such that people don't have to be
careful to get it right would be better.  I was thinking more about the
safety than the speed.  It might not be worth changing in dirstate
now but it's something to consider next time.

> > Uh, also you can't rely on ints being 4 characters.  (Unless I'm
> > missing something and they reliably are in pyrex?)
>
> I don't care if they are 4-chars. As long as they are a faster way to iterate
> through memory. I realized I have a typo, in that I used '%4' instead of '%
> sizeof(int)'. But I went ahead and fixed that.

I meant '4 bytes'.

> > More importantly: for python are we sure we're comparing the utf-8
> > values, not the interpreted unicode.  I'm pretty sure we are.  But in
> > utf-8 we will have some characters with the high bit set, and those
> > will compare the wrong way if you compare them as signed ints and
> > chars.  I think you definitely need some tests of this on utf-8 paths.
>
> Good points. I'll add the tests and switch them over to 'unsigned char' to be
> explicit. (good catch, on my Mac at least, char* is a signed char, which
> evaluates incorrectly)
> I'll also add a PyString_CheckExact() call for the cmp_by_dirs() function,
> which ensures that you have a real PyString, and not a PyUnicode.

Thanks, the tests look ok.

-- 
Martin