Encoding woes

Wed Dec 28 16:52:28 GMT 2005

Jan Hudec wrote:
> On Mon, Dec 26, 2005 at 16:04:44 -0600, John A Meinel wrote:
>> I think there should be 3 types of strings inside bzrlib:
>>
>> 1) Plain ascii strings, these are isinstance(x, str), these should
>> not have characters outside the ascii set. (so x.decode() should always
>> work)
>> 2) Unicode strings, for anything outside of ascii, it should be a
>> unicode string.
> 
> Why do these need to be two types of strings? Ascii is a subset of unicode.

Short answer: both types already exist, and we are simply not forbidding
library users from passing either one in. The point is that I can call:

b = Branch.open('.')

which passes a plain ascii string, so I don't always have to call
b = Branch.open(u'.')

Technically, right now we always have to use the latter form, because we
need unicode internally for filesystem operations. But bzrlib should do
that conversion itself whenever it needs unicode. We should not
*require* that unicode strings be passed in.
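
A minimal sketch of that internal conversion, written for present-day
Python, where bytes plays the role of the plain string type and str the
role of unicode. The helper name to_unicode is hypothetical and is not
part of bzrlib; it only illustrates accepting either type at the API
boundary and rejecting plain strings that are not ascii.

```python
def to_unicode(value):
    """Return value as unicode text (hypothetical helper, not bzrlib API)."""
    if isinstance(value, str):
        # Already unicode: pass it through untouched.
        return value
    if isinstance(value, bytes):
        # Plain strings must be ascii only; anything else raises
        # UnicodeDecodeError instead of being guessed at.
        return value.decode('ascii')
    raise TypeError('expected a string, got %r' % (type(value),))
```

With something along these lines, a call such as Branch.open('.') could
be normalized inside the library, while a plain string carrying
non-ascii bytes would fail loudly at the boundary.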

Perhaps a better example is:

t = b.working_tree()
t.commit('text message')

That text message only has to be unicode if you are using characters
outside of the ascii subset.

> 
>> 3) Text blobs. These are just arrays of bytes. Stuff that we would never
>> try to encode/decode. This is stuff like file contents, etc. The only
>> thing we might do with these strings is split them on newlines.
> 
> Hm, I believe there should be a special class made for them. So they could
> always be told from case 1. Also if all ascii strings are made unicode (which
> I think they can), then the plain string type can be outlawed except in the
> external interface (only the part for front-ends) so forgetting to classify
> the input would be immediately obvious.

I think Robert was specifically against forbidding plain ascii strings
because it makes the library harder to use, and I agree with him on that
point. That is why I'm saying that if we have an object of the plain
string type, it should either be a text blob which we aren't planning
on interpreting, or it must be ascii only.

I think we can do okay by just properly naming our variables and
parameters. If a name ends in 'text' or 'lines', it is a text blob. In
all other cases (committer, message, revision_id, etc) it needs to be
either a valid ascii string or unicode.
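
The naming convention could be checked mechanically, roughly like this.
It is only a sketch in present-day Python (bytes standing in for the
plain string type, str for unicode), and is_text_blob and check_string
are made-up names for illustration, not existing bzrlib functions.

```python
BLOB_SUFFIXES = ('text', 'lines')


def is_text_blob(name):
    """True if the parameter name marks an uninterpreted blob."""
    return name.endswith(BLOB_SUFFIXES)


def check_string(name, value):
    """Validate value according to the naming convention (illustrative)."""
    if is_text_blob(name):
        # Text blobs are never encoded or decoded, so nothing to check.
        return value
    if isinstance(value, bytes):
        # Not a blob, so a plain string must be ascii only.
        # Raises UnicodeDecodeError otherwise.
        value.decode('ascii')
    return value
```

So check_string('file_text', ...) would accept arbitrary bytes, while
check_string('committer', ...) would accept unicode or an ascii-only
plain string and reject anything else.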

> 
>> Stuff that is read from stdin, or read from the argument list needs to
>> be converted into one of those 3 strings.
> 

John
=:->
