Daniel Lemire's blog

Is Python going bad? or The curse of unicode….

I’ve wasted a considerable amount of time in the last two days upgrading my RSS aggregator so that it has better support for Atom feeds. I use the feedparser library.

One thing that gets to me is how unintuitive unicode is under Python. For example, the following is a string…

t="éee"

Just paste this into your Python interpreter, and it will work nicely. For example,


>>> t='éee'
>>> print t
éee

However, for some reason, if I just type “t”, then it can’t print it properly…

>>> t
'\xe9ee'

See how it is already confusing? (And we haven’t used unicode yet!)
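A minimal sketch of what is probably going on here, assuming the terminal handed Python the string as Latin-1 bytes: "print" writes the bytes through unchanged, while the bare prompt shows repr(), which escapes every non-ASCII byte. The names raw, shown_at_prompt and byte_values are made up for this illustration, and the code is written to run on both Python 2 and Python 3.

```python
# Assumption: "éee" arrived as Latin-1, so the string holds exactly three bytes.
raw = b'\xe9ee'  # the same bytes as t, spelled out explicitly

# The interactive prompt displays repr(), which escapes the non-ASCII byte
# instead of sending it to the terminal.
shown_at_prompt = repr(raw)

# Numeric value of each byte: 0xe9 for "é", then 0x65 twice for "ee".
byte_values = list(bytearray(raw))
```

So nothing is lost when typing just “t”: the escape is only a safe way of displaying the byte 0xe9.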

Next, we can map this string to unicode…

r=unicode(t)

which has the following result…

>>> r=unicode(t)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

Ah… so it tries to interpret t as ASCII… fair enough, we know it is “latin-1”, also known as “iso-8859-1”. It is already quite strange that “print” knows what to do with my string, but nothing else in Python seems to know… so we do


>>> r=unicode(t,'latin-1')
>>> r
u'\xe9ee'
>>> print r
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

because, see, you can’t print a unicode string directly: “print” has to encode it first, and here it falls back to ASCII… but you can do the following…


>>> print r.encode('latin-1')
éee
>>> print r.encode('iso-8859-1')
éee

but also


>>> r.encode('latin-1')
'\xe9ee'
>>> r.encode('iso-8859-1')
'\xe9ee'
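The whole round trip can be sketched in one place. This is only an illustration under the assumption that the original bytes are Latin-1; the variable names are invented here, and the code uses the decode and encode methods (rather than the unicode() call above) so that it runs on both Python 2 and Python 3.

```python
raw = b'\xe9ee'  # assumed Latin-1 bytes for "éee"

# Decoding as ASCII fails, as in the first traceback.
try:
    raw.decode('ascii')
    ascii_decode_ok = True
except UnicodeDecodeError:
    ascii_decode_ok = False

# Naming the codec explicitly succeeds.
text = raw.decode('latin-1')

# Encoding the result as ASCII fails, as in the second traceback.
try:
    text.encode('ascii')
    ascii_encode_ok = True
except UnicodeEncodeError:
    ascii_encode_ok = False

# 'latin-1' and 'iso-8859-1' are two names for the same codec,
# so both give back the original bytes.
round_trip = text.encode('latin-1')
same_codec = text.encode('iso-8859-1') == round_trip
```

Every step that names the codec works, and every step that leaves the choice to the ASCII default fails on the byte 0xe9.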

What is my beef?