format string should be unicode instead of byte string

Mon Sep 7 21:51:54 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> 2009/9/7 INADA Naoki <songofacandy at gmail.com>:
>> Related to: https://bugs.launchpad.net/bzr/+bug/404740
>> Human-readable format strings should be unicode even when they are ascii-only.
>>
>> When the code below is executed::
>>
>>  "path: %s" % (path,)
>>
>> If path is a unicode string, it may cause UnicodeEncodeError.
>> But next code::
>>
>>  u"path: %s" % (path,)
>>
>> It works fine whether path is unicode or bytes.
> 
> That seems to make sense.  There may be some cases where we're using
> format strings to produce something for a file required to be in a
> particular encoding and we would prefer to get a UnicodeError, but
> they seem rare.
> 

Actually, I think he has it backwards. If you do:

"path: %s" % (path,)

Then if 'path' is unicode, the result is upcast to unicode. If 'path' is
bytes, the result stays bytes, even when it contains non-ascii characters.

However if you do:

u"path: %s" % (path,)

If 'path' is unicode, things are fine, and if 'path' is an ascii byte
string things are fine (auto-upcasting ascii => unicode). However if
'path' is a byte string containing non-ascii characters, the implicit
ascii decode fails with UnicodeDecodeError.

>>> 'path: %s' % ('ascii-path',)
'path: ascii-path'
>>> 'path: %s' % (u'unicode-path',)
u'path: unicode-path'
>>> 'path: %s' % ('nonascii-\xb5path',)
'path: nonascii-\xb5path'
>>> u'path: %s' % ('ascii-path',)
u'path: ascii-path'
>>> u'path: %s' % (u'unicode-path',)
u'path: unicode-path'
>>> u'path: %s' % ('nonascii-\xb5path',)
UnicodeDecodeError
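
One way to sidestep the implicit ascii decode, as a rough sketch only, is
to decode byte strings explicitly before they reach a unicode format
string. The helper names to_text and format_path, the utf-8 default and
the choice of errors="replace" are all assumptions made for illustration;
none of them come from bzrlib.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings explicitly.

    Hypothetical helper: the encoding default is an assumption, the
    caller should pass whatever encoding the bytes really use.
    """
    if isinstance(value, bytes):
        # "replace" substitutes U+FFFD for undecodable bytes instead of
        # raising, which suits human-readable messages but loses data.
        return value.decode(encoding, "replace")
    return value


def format_path(path, encoding="utf-8"):
    """Format a path for display, whether it is bytes or unicode."""
    return u"path: %s" % (to_text(path, encoding),)
```

With this, the formatting never relies on the default ascii codec, so a
non-ascii byte string produces a (possibly lossy) unicode message rather
than an exception. Where a hard failure is preferred, as in the file
encoding case Martin mentions, the same sketch could use "strict" instead.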

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqlcmoACgkQJdeBCYSNAAMuSwCgmn6BRJwIK4ApBDJzRW5h3NX3
hW4AoJP0lGZa88Z6s3fU+MT1LgWXTtBS
=wOLx
-----END PGP SIGNATURE-----