Identify automatic str/unicode coercions
Martin von Gagern
Martin.vGagern at gmx.net
Tue Jun 10 20:28:24 BST 2008
Hi!
There is this interesting thread called "About encoding issues" on the
bazaar mailing list, started by Jan Hudec:
http://thread.gmane.org/gmane.comp.version-control.bazaar-ng.general/10908
The idea is that automatic conversions between byte and unicode strings
should be avoided, as they are bound to fail if a string contains
non-ASCII characters. Instead, all conversions should be done explicitly.
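To make the failure mode concrete, here is a minimal sketch. It is written for Python 3, which no longer coerces automatically, so the ASCII decode that an implicit coercion would perform under the ASCII default encoding is spelled out by hand. The variable names are only illustrative.

```python
# A byte string holding UTF-8 encoded, non-ASCII text ("café").
raw = "caf\u00e9".encode("utf-8")

# An implicit coercion with the ASCII default encoding amounts to
# raw.decode("ascii"), which cannot handle the bytes of the "é".
try:
    raw.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

# The explicit conversion names the encoding that actually applies.
text = raw.decode("utf-8")
```

Pure ASCII data passes through the ASCII decode unnoticed, which is why such bugs tend to show up only once real-world file names or messages are involved.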
I liked the idea, especially as str/unicode conversions currently give
me headaches in https://bugs.launchpad.net/bzr/+bug/128496 (in
combination with bzr-svn). The main problem is the question of how to
enforce this policy. A solution is to override the default encoding with a
special encoding that logs every access before performing the default ASCII
encoding. That idea was originally proposed by Jan Hudec, but I have found
no implementation of it so far. Now there is a basic proof of concept:
https://code.launchpad.net/~gagern/bzr/str-unicode
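The general shape of such a logging encoding can be sketched as follows. This is not the code from the branch above, only an illustration under some assumptions: it targets Python 3, where the codec has to be named explicitly instead of being installed as the default encoding, and the names `logging_ascii`, `coercion_log` and the helper functions are made up for this example.

```python
import codecs
import traceback

CODEC_NAME = "logging_ascii"   # hypothetical codec name
coercion_log = []              # entries: (direction, list of stack frames)

def _log(direction):
    # Drop the two innermost frames (_log and the codec function itself),
    # so that the last remaining frame is the code asking for the conversion.
    stack = traceback.extract_stack()[:-2]
    frames = [(f.filename, f.lineno, f.name) for f in stack]
    coercion_log.append((direction, frames))

def _encode(text, errors="strict"):
    _log("encode")
    return codecs.ascii_encode(text, errors)   # plain ASCII behaviour

def _decode(data, errors="strict"):
    _log("decode")
    return codecs.ascii_decode(data, errors)   # plain ASCII behaviour

def _search(name):
    # Codec search function: only answer for our own codec name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None

codecs.register(_search)
```

With this in place, `"abc".encode("logging_ascii")` behaves like an ASCII encode but leaves a backtrace in `coercion_log`, and non-ASCII input still raises the usual error after being logged.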
Right now it simply writes to a linear log, which quickly grows to a size
where it becomes difficult to manage. I tried to make the log-writing
module easily replaceable, and I am thinking of something like an
SQLite-backed log with one table for backtraces (one row per backtrace
line, each with a pointer to its parent) and one with counters for actual
occurrences. Of course there would still be some post-processing overhead
to turn this into something useful.
As I don't plan to become a dedicated bzr developer in the near future,
don't even speak Python fluently, and have invested more time already
into bzr and bzr-svn than I can honestly afford, I can't take this idea
much further all by myself. If someone else were working on this as
well, I might be able to cooperate from time to time. I hope I can find
somebody interested in taking this up.
My plan would be to somehow achieve useful logs, dropping irrelevant
stuff like when the string comes from a fixed literal in bzr code,
grouping by leaf function that actually performs the conversion, sorting
by number of times that conversion occurred. Then those could be tackled
one at a time, replacing implicit coercions with explicit
encoding/decoding, preferably with the correct encoding applicable to
the string at hand. Some way to measure progress would be helpful as well.
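The grouping and sorting step could look roughly like the sketch below. It assumes each logged coercion is available as a list of `(filename, lineno, function)` tuples with the leaf frame last; the function names and the `is_relevant` filter are hypothetical, and how to detect strings that come from fixed literals is left open.

```python
from collections import Counter

def summarize(records, is_relevant=lambda stack: True):
    """Group logged backtraces by their leaf frame and return
    (leaf_frame, count) pairs, most frequent first."""
    counts = Counter()
    for stack in records:
        if stack and is_relevant(stack):
            counts[stack[-1]] += 1
    return counts.most_common()

def total_coercions(summary):
    # A crude progress measure: how many implicit coercions remain.
    return sum(count for _, count in summary)
```

Running the same workload before and after a fix and comparing `total_coercions` would give one simple way to measure progress.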
Greetings,
Martin von Gagern