Identify automatic str/unicode coercions

Tue Jun 10 20:28:24 BST 2008

Hi!

There is this interesting thread called "About encoding issues" on the 
bazaar mailing list, started by Jan Hudec: 
http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/10908

The idea is that automatic conversions between byte and unicode strings 
should be avoided, as they are bound to fail as soon as a string contains 
non-ASCII characters. Instead, all conversions should be done explicitly.
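
To make the failure mode concrete, here is a minimal sketch. It is 
written for Python 3, where the implicit coercion no longer exists, so 
the ASCII decode that the coercion would perform is spelled out by hand. 
The sample string and the choice of UTF-8 are assumptions made for the 
illustration only.

```python
# UTF-8 bytes for "café"; the sample data is an assumption for illustration.
raw = b"caf\xc3\xa9"

# An implicit str/unicode coercion effectively decodes with ASCII,
# which cannot represent the two bytes of the accented character.
try:
    raw.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# The explicit alternative: name the encoding that actually applies.
text = raw.decode("utf-8")
round_trip = text.encode("utf-8")
```

The explicit form only works if the code knows which encoding applies to 
the data at hand, which is exactly the information an implicit coercion 
hides.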

I liked the idea, especially as str/unicode conversions currently give 
me headaches in https://bugs.launchpad.net/bzr/+bug/128496 (in 
combination with bzr-svn). The main problem is the question of how to 
enforce this policy. One solution is to replace the default encoding with 
a special encoding that logs every use before performing the default 
ASCII conversion. That idea was originally proposed by Jan Hudec, but I 
have found no implementation of it so far. Now there is a basic proof of 
concept: 
https://code.launchpad.net/~gagern/bzr/str-unicode
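
The following is a rough sketch of the logging-codec idea, not the code 
from the branch above. It registers a codec under the hypothetical name 
"loggedascii" that records the calling stack and then delegates to the 
plain ASCII codec. It runs on Python 3, where the codec has to be named 
explicitly; making it the interpreter's default encoding under Python 2 
is a separate step that this sketch does not attempt. The in-memory list 
stands in for a replaceable log backend.

```python
import codecs
import traceback

CODEC_NAME = "loggedascii"  # hypothetical codec name
coercion_log = []           # stand-in for a replaceable log backend


def _record(direction):
    # Drop the two innermost frames (_record and the codec function)
    # so that the last entry is the code that triggered the conversion.
    frames = traceback.extract_stack()[:-2]
    stack = [(f.filename, f.lineno, f.name) for f in frames]
    coercion_log.append((direction, stack))


def _encode(text, errors="strict"):
    _record("encode")
    return codecs.ascii_encode(text, errors)


def _decode(data, errors="strict"):
    _record("decode")
    return codecs.ascii_decode(data, errors)


def _search(name):
    # Codec search function: answer only for our own codec name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode,
                                name=CODEC_NAME)
    return None


codecs.register(_search)

# Usage: every conversion through the codec leaves a log entry.
decoded = b"abc".decode(CODEC_NAME)
```

Non-ASCII input still fails with the usual UnicodeDecodeError or 
UnicodeEncodeError, but only after the call has been logged.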

Right now it simply writes to a linear log, which quickly grows to a 
size where it becomes difficult to manage. I tried to make the log 
writing module easily replaceable, and I am thinking of something like an 
SQLite-backed log with one table for backtraces (one row per backtrace 
line, each with a pointer to its parent) and one with counters for 
actual occurrences. Of course there would still be some post-processing 
overhead to turn this into something useful.
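
One possible shape for such a backend is sketched below. The table and 
column names, the functions open_log and record, and the convention of 
using parent 0 for the outermost frame are all assumptions made for 
this illustration; nothing like this exists in the branch yet.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the log database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS frames (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL,  -- 0 marks the outermost frame
            filename  TEXT NOT NULL,
            lineno    INTEGER NOT NULL,
            func      TEXT NOT NULL,
            UNIQUE (parent_id, filename, lineno, func)
        );
        CREATE TABLE IF NOT EXISTS counters (
            frame_id INTEGER PRIMARY KEY,  -- leaf frame of a backtrace
            hits     INTEGER NOT NULL
        );
    """)
    return db


def record(db, stack):
    """Record one coercion.

    stack is a list of (filename, lineno, func), outermost frame first.
    """
    if not stack:
        return
    parent = 0
    for filename, lineno, func in stack:
        key = (parent, filename, lineno, func)
        # Shared prefixes of different backtraces are stored only once.
        db.execute(
            "INSERT OR IGNORE INTO frames "
            "(parent_id, filename, lineno, func) VALUES (?, ?, ?, ?)", key)
        parent = db.execute(
            "SELECT id FROM frames WHERE parent_id = ? AND filename = ? "
            "AND lineno = ? AND func = ?", key).fetchone()[0]
    # parent now identifies the leaf frame of this exact backtrace.
    db.execute(
        "INSERT OR IGNORE INTO counters (frame_id, hits) VALUES (?, 0)",
        (parent,))
    db.execute(
        "UPDATE counters SET hits = hits + 1 WHERE frame_id = ?", (parent,))
    db.commit()
```

Because each row points at its parent, a repeated backtrace costs one 
counter update instead of another full copy in the log.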

As I don't plan to become a dedicated bzr developer in the near future, 
don't even speak Python fluently, and have invested more time already 
into bzr and bzr-svn than I can honestly afford, I can't take this idea 
much further all by myself. If someone else were working on this as 
well, I might be able to cooperate from time to time. I hope I can find 
somebody interested in taking this up.

My plan would be to somehow get useful logs out of this: dropping 
irrelevant entries, such as those where the string comes from a fixed 
literal in bzr code, grouping by the leaf function that actually 
performs the conversion, and sorting by the number of times that 
conversion occurred. Then those could be tackled one at a time, 
replacing implicit coercions with explicit encoding/decoding, preferably 
with the correct encoding applicable to the string at hand. Some way to 
measure progress would be helpful as well.
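
The grouping and sorting step could look roughly like the sketch below. 
It assumes each logged backtrace is available as a list of (filename, 
lineno, function) tuples with the leaf frame last; that format and the 
name summarize are hypothetical. Filtering out strings that come from 
literals is not covered, since that cannot be decided from the 
backtrace alone.

```python
from collections import Counter


def summarize(stacks):
    """Count coercions per leaf frame, most frequent first.

    stacks: iterable of backtraces, each a list of
    (filename, lineno, function) tuples with the leaf frame last.
    """
    counts = Counter(stack[-1] for stack in stacks if stack)
    return counts.most_common()
```

The total of all counts in such a summary would also give a crude 
progress measure: it should shrink as coercions are made explicit.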

Greetings,
  Martin von Gagern
