s.decode('utf-8') vs. unicode(s, 'utf-8')
John Arbash Meinel
john at arbash-meinel.com
Sun Aug 16 18:07:21 BST 2009
Alexander Belchenko wrote:
> As far as I know, some code in bzr converts from utf-8 to unicode (e.g. dirstate?).
> A recent discussion in comp.lang.python shows that unicode(s, 'utf-8') is
> faster than s.decode('utf-8').
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/314a3043ea63319f/
>
> Maybe this will be useful to know for bzrlib.
I'll mention that internally when it really matters we tend to do:
  import codecs
  _utf8_decode = codecs.utf_8_decode
  for foo_utf8 in bar:
      foo = _utf8_decode(foo_utf8)[0]
I believe I've traced through the code paths, and that was about the
fastest I could find.
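
For anyone wondering about the trailing [0]: codecs.utf_8_decode returns a (text, bytes_consumed) tuple rather than just the decoded string. A minimal sketch of the idiom follows; the sample string and the names raw, result, text and consumed are made up for illustration and are not taken from bzrlib.

```python
import codecs

# Bind the decoder to a local name once, outside any hot loop, so the
# loop body does not repeat the attribute lookup on the codecs module.
_utf8_decode = codecs.utf_8_decode

# Hypothetical sample input: 'caf' plus e-acute, which is 5 bytes in UTF-8.
raw = u'caf\xe9'.encode('utf-8')

# utf_8_decode returns a (text, bytes_consumed) tuple, hence the [0].
result = _utf8_decode(raw)
text = result[0]
consumed = result[1]
```

The tuple has to be built and then indexed on every call, which is extra work that the unicode(s, 'utf-8') form does not do.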
$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" \
    "y = [z.decode('utf-8') for z in x]"
10 loops, best of 3: 254 msec per loop
jameinel at samus ~/downloads/Python-2.5.2
$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" \
    "y = [unicode(z, 'utf-8') for z in x]"
10 loops, best of 3: 125 msec per loop
jameinel at samus ~/downloads/Python-2.5.2
$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" \
    "y = [utf_8_decode(z)[0] for z in x]"
10 loops, best of 3: 133 msec per loop
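So it seems that the final [0] does slightly penalize the utf_8_decode
form (133 msec) versus the unicode(z, 'utf-8') form (125 msec), and both
are roughly twice as fast as z.decode('utf-8') (254 msec).
We should certainly keep it in mind. I wonder if it depends on the
python version at all?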
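
One way to check the version question is to run the same comparison as a script under each interpreter. This is only a sketch under some assumptions: it needs Python 2.6 or later for the b'' literals (so it would not run on the 2.5.2 build used above), and text_type, best_msec and the three via_* helpers are names invented here. It uses type(u'') to stand in for unicode, so the constructor form is also exercised on interpreters where that type is called str.

```python
import sys
import timeit
from codecs import utf_8_decode

# type(u'') is `unicode` on Python 2 and `str` on Python 3, so the same
# script exercises the constructor form on either interpreter.
text_type = type(u'')

# Same shape of data as the runs above: byte strings of 0 to 297 bytes.
x = [b'abc' * (n % 100) for n in range(100000)]


def via_method():
    return [z.decode('utf-8') for z in x]


def via_constructor():
    return [text_type(z, 'utf-8') for z in x]


def via_codecs():
    return [utf_8_decode(z)[0] for z in x]


def best_msec(func, number=10, repeat=3):
    """Best per-loop time in milliseconds over `repeat` runs of `number` loops."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1000.0


if __name__ == '__main__':
    print(sys.version)
    for func in (via_method, via_constructor, via_codecs):
        print('%-16s %8.1f msec per loop' % (func.__name__, best_msec(func)))
```

Wrapping each form in a function adds one call per timed loop, but that overhead is the same for all three forms, so the relative ordering should still be comparable to the command-line numbers.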
John
=:->