s.decode('utf-8') vs. unicode(s, 'utf-8')

Sun Aug 16 18:07:21 BST 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> As far as I know, some code in bzr converts utf-8 to unicode (e.g. dirstate?).
> Recently there was an interesting discussion in comp.lang.python showing that
> unicode(s, 'utf-8') is faster than s.decode('utf-8').
> 
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/314a3043ea63319f/
> 
> Maybe this will be useful to know for bzrlib.

I'll mention that internally when it really matters we tend to do:

import codecs

# Bound once so the loop avoids a repeated attribute lookup.
_utf8_decode = codecs.utf_8_decode

for foo_utf8 in bar:
    foo = _utf8_decode(foo_utf8)[0]

I believe I've traced through the code paths, and that was about the
fastest I could find.

$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" "y = [z.decode('utf-8') for z in x]"
10 loops, best of 3: 254 msec per loop

jameinel at samus ~/downloads/Python-2.5.2
$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" "y = [unicode(z, 'utf-8') for z in x]"
10 loops, best of 3: 125 msec per loop

jameinel at samus ~/downloads/Python-2.5.2
$ TIMEIT -s "from codecs import utf_8_decode; x = ['abc'*(n%100) for n in xrange(100000)]" "y = [utf_8_decode(z)[0] for z in x]"
10 loops, best of 3: 133 msec per loop

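So it seems that the final [0] does penalize the utf_8_decode form
relative to the unicode(z, 'utf-8') form (133 msec versus 125 msec per
loop), while z.decode('utf-8') is well behind both at 254 msec.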
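
For anyone wondering where that [0] comes from: codecs.utf_8_decode
returns a (text, bytes_consumed) pair rather than just the text, so the
caller has to index or unpack the result. A small sketch (the variable
names are only for illustration):

```python
import codecs

# utf_8_decode returns a 2-tuple: the decoded text and the number of
# input bytes that were consumed.
decoded, consumed = codecs.utf_8_decode(b'abc' * 2)

# The form used in the loop above keeps only the text.
text = codecs.utf_8_decode(b'abc' * 2)[0]
```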

We should certainly keep it in mind. I wonder if it depends on the
Python version at all?
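
One way to check would be to run the same comparison under each
interpreter. The script below is only a rough sketch, not something
taken from bzrlib: the function names and the best_time helper are made
up for illustration, it assumes a Python new enough for timeit.Timer to
accept a callable (2.6 or later), and on Python 3 it uses
str(z, 'utf-8') as the stand-in for unicode(z, 'utf-8').

```python
import codecs
import timeit

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3: str(b, 'utf-8') plays the same role

# Same shape of data as the timings above: 100000 short byte strings.
data = [b'abc' * (n % 100) for n in range(100000)]


def via_method(items):
    return [z.decode('utf-8') for z in items]


def via_constructor(items):
    return [text_type(z, 'utf-8') for z in items]


def via_codecs(items, _utf8_decode=codecs.utf_8_decode):
    return [_utf8_decode(z)[0] for z in items]


def best_time(func, items, number=10, repeat=3):
    """Best per-loop time in msec, in the spirit of 'best of 3'."""
    timer = timeit.Timer(lambda: func(items))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1000.0


if __name__ == '__main__':
    for func in (via_method, via_constructor, via_codecs):
        print('%-16s %.1f msec per loop' % (func.__name__, best_time(func, data)))
```

The numbers will of course differ by machine and build, so only the
relative ordering within one interpreter is meaningful.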

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqIPMkACgkQJdeBCYSNAAMXdgCfU3ufo8qr+usg+pSz3MrWGG/N
ECYAnRt22tstso4ekRDmJR/PoD1haibn
=5Hiu
-----END PGP SIGNATURE-----