Encoding woes

Thu Dec 29 16:26:34 GMT 2005

Jan Hudec wrote:
> On Wed, Dec 28, 2005 at 10:52:28 -0600, John A Meinel wrote:
>> Jan Hudec wrote:
>>> On Mon, Dec 26, 2005 at 16:04:44 -0600, John A Meinel wrote:
>>>> I think there should be 3 types of strings inside bzrlib:
>>>>
>>>> 1) Plain ascii strings, these are isinstance(x, str), these should
>>>> not have characters outside the ascii set. (so x.decode() should always
>>>> work)
>>>> 2) Unicode strings, for anything outside of ascii, it should be a
>>>> unicode string.
>>> Why do these need to be two types of strings? Ascii is a subset of unicode.
>> Short answer, they already exist, and we are just not forbidding library
>> users from passing them in. The point is that I can call:
>>
>> b = Branch.open('.')
>>
>> Which is an ascii string, and I don't have to always call
>> b = Branch.open(u'.')
> 
> One of the things I really don't like about python is that I can't tell it to
> imply u for all string literals. I am used to that from both perl and tcl.
> 
> Anyway, I did not mean to forbid _that_. I meant it should be forbidden for
> that argument to remain undecoded inside. Eg. if it was directly stored in an
> attribute.
> 
>> Technically right now we always have to use the latter form, because we
>> need to use unicode internally for filesystem operations, but we should
>> turn them into unicode internally if we need that. We should not
>> *require* that unicode strings be passed in.
>>
>> Perhaps a better example is:
>>
>> t = b.working_tree()
>> t.commit('text message')
>>
>> That text message only has to be unicode if you are using characters
>> outside of the ascii subset.
> 
> How do I tell an ascii string originating from the source from an ebcdic
> string coming from an input stream? They are the same type.

Because the code which is reading the ebcdic string should know that it
is reading from the input stream and that it needs to be decoded. That
is actually my point. Code at the interface layer (such as coming in
from sys.argv or reading from stdin, etc) needs to do the translation
before it gets deeper inside bzrlib.
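
A minimal sketch of that boundary step, written in Python 3 terms (where
bytes plays the role of the plain string type and str the role of unicode).
The helper name decode_at_boundary and the stdin fallback are assumptions
made for illustration, not anything that exists in bzrlib:

```python
import sys


def decode_at_boundary(raw, encoding=None):
    """Hypothetical helper: decode raw input before it reaches library code."""
    if isinstance(raw, str):
        # Already decoded text; nothing to do.
        return raw
    if encoding is None:
        # Assumption: fall back to the encoding of stdin, else UTF-8.
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


# The code that read the bytes knows they are EBCDIC (code page 037 here),
# so it is the one that decodes them.
ebcdic_message = "fix typo".encode("cp037")
message = decode_at_boundary(ebcdic_message, encoding="cp037")
```

Only the caller knows where the bytes came from, so only the caller can
pick the right codec. Everything deeper in the library then sees text.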

> 
> Also the argument is likely to be from some text entry widget and if it is
> a plain string, it is likely a locale-encoded string quite possibly
> containing non-ascii characters. Or is that case forbidden?

Then the thing which handles the text entry widget needs to translate it
before sending it into bzrlib. bzrlib cannot be responsible for knowing
how widget foo works. It *can* tell people that "I expect this argument
to be in this form, if it is not, then it is your responsibility to make
it so".

The other possibility would be to create a new string type, which tracks
what encoding it is in. But really it is much easier to require the next
layer up to handle encode/decode.
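
For comparison, a rough sketch of what such an encoding-tracking type might
look like, again in Python 3 terms. TaggedBytes is a made-up name for
illustration only; nothing like it is proposed for bzrlib here:

```python
class TaggedBytes:
    """Hypothetical wrapper pairing raw bytes with the encoding they are in."""

    def __init__(self, data, encoding):
        self.data = bytes(data)
        self.encoding = encoding

    def to_text(self):
        # Decoding is possible at any layer because the encoding travels along.
        return self.data.decode(self.encoding)


# A locale-encoded value, as it might arrive from a text entry widget.
widget_value = TaggedBytes("p\u0159\u00edli\u0161".encode("iso-8859-2"), "iso-8859-2")
text = widget_value.to_text()
```

Every function that accepts a string would then have to understand this
wrapper, which is the extra cost compared with decoding once at the edge.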

> 
>>>> 3) Text blobs. These are just arrays of bytes. Stuff that we would never
>>>> try to encode/decode. This is stuff like file contents, etc. The only
>>>> thing we might do with these strings is split them on newlines.
>>> Hm, I believe there should be a special class made for them. So they could
>>> always be told from case 1. Also if all ascii strings are made unicode (which
>>> I think they can), then the plain string type can be outlawed except in the
>>> external interface (only the part for front-ends) so forgetting to classify
>>> the input would be immediately obvious.
>> I think Robert was specifically against forbidding plain ascii strings
>> because it makes the library harder to use. And I agree with him on that
>> point. Which is where I'm saying that if we have an object which is a
>> plain string type, it should be either a text blob which we aren't
>> planning on interpreting, or it must be ascii only.
>>
>> I think we can do okay by just properly naming our variables and
>> parameters. If it ends in 'text' or 'lines', it is a text blob, in all
>> other cases (committer, message, revision_id, etc) it needs to be either
>> a valid ascii string, or unicode.
> 
> Ok, so it's actually forbidden to pass in a locale-encoded string except for
> a blob? That'd work, but I am not sure it will be any easier to use. Because
> most of the developers don't regularly use non-ascii characters, it will be
> easy to forget to decode something. And that may be rather hard to hunt down
> later.

Well, actually you need them to "decode" it so that you have a full
unicode string. But yes, the idea is that inside bzrlib you would have
unencoded strings. If they are plain ascii, then they can be the plain
'string' type. Anything more, and they need to be unicode.
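
The rule could be enforced with a check along these lines. This is only a
sketch in Python 3 terms (bytes for the plain string type, str for
unicode), and to_internal_text is a hypothetical name, not a bzrlib
function:

```python
def to_internal_text(value):
    """Hypothetical check for non-blob arguments such as committer or message."""
    if isinstance(value, str):
        # Already unicode text: accepted as-is.
        return value
    if isinstance(value, bytes):
        try:
            # Plain strings are fine as long as they are pure ascii.
            return value.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError(
                "non-ascii byte string; decode it before passing it in"
            ) from None
    raise TypeError("expected text or an ascii byte string")


ok_plain = to_internal_text(b"text message")
ok_unicode = to_internal_text("p\u0159\u00edli\u0161")
```

A locale-encoded byte string with non-ascii characters fails immediately
at the call site, which addresses the worry that a forgotten decode would
be hard to hunt down later.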

John
=:->
