Is Python going bad? or The curse of unicode….

I’ve wasted a considerable amount of time in the last two days upgrading my RSS aggregator so that it will have better support for Atom feeds. I use the feedparser library.

One thing that gets to me is how unintuitive unicode is under Python. For example, the following is a string…

t="éee"

Just copy this into your Python interpreter, and it will work nicely. For example,


>>> t='éee'
>>> print t
éee

However, for some reason, if I just type “t”, then it can’t print it properly…

>>> t
'\xe9ee'

See how it is already confusing? (And we haven’t used unicode yet!)

Next, we can map this string to unicode…

r=unicode(t)

which has the following result…

>>> r=unicode(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

Ah… so it tries to interpret t as ASCII… fair enough, we know it is “latin-1”, also known as “iso-8859-1”. It is already quite strange that “print” knows what to do with my string, but nothing else in Python seems to know… so we do


>>> r=unicode(t,'latin-1')
>>> r
u'\xe9ee'
>>> print r
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

because, see, you can’t print a unicode object directly: Python first tries to encode it as ASCII… but you can do the following…


>>> print r.encode('latin-1')
éee
>>> print r.encode('iso-8859-1')
éee

but also


>>> r.encode('latin-1')
'\xe9ee'
>>> r.encode('iso-8859-1')
'\xe9ee'
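
To keep the two directions straight, here is a minimal sketch of the same round trip. It is written for Python 3, which is an assumption (the session above is Python 2); there the byte string type is called `bytes` and the unicode type is called `str`. Decoding goes from bytes to text, encoding goes from text to bytes.

```python
# Python 3 sketch: bytes plays the role of the Python 2 str,
# and str plays the role of the Python 2 unicode type.
raw = b"\xe9ee"                 # the latin-1 bytes for "éee"
text = raw.decode("latin-1")    # bytes -> text, like unicode(t, 'latin-1')
back = text.encode("latin-1")   # text -> bytes, like r.encode('latin-1')
as_utf8 = text.encode("utf-8")  # same text, different bytes

print(ascii(text))   # '\xe9ee'
print(back == raw)   # True
print(as_utf8)       # b'\xc3\xa9ee'
```

The text is the same in both cases; only the bytes change with the encoding, which is why the encoding has to be named at every boundary.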

What is my beef?

  • If ‘print’ assumes ‘latin-1’ then shouldn’t everything else? Why is this not consistent? If it is unsafe to assume ‘latin-1’, then why does print do it?
  • The encode/decode thing is a mess. We had a perfectly valid construct for converting things to strings, and that’s ‘str’. Now we have a new one called ‘encode’, so that, given a unicode object t, I can do either t.encode('ascii') or str(t) for the same result. Bad. Now, I’m stuck forever in a world where I have to figure out whether I encode or decode a string, and which is which. This is hard. This is confusing.
  • A string object should know its encoding so I don’t have to. What happens if I receive a string from some library and I need to convert it to unicode? How am I supposed to know what the encoding of the string is? There is no sensible way to communicate this right now, which makes debugging a pain. The only excuse I see is that sometimes it is impossible for Python to know the encoding… well, then it should just fail and require the programmer to specify the encoding. There are way too many things that can go wrong when you expect the programmer to keep track of his strings and which is encoded how…
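
The third point can be made concrete with a small sketch of a byte string that carries its encoding along with it. `TaggedBytes` and `to_text` are hypothetical names invented for this illustration, not part of Python or of any library, and the code assumes Python 3.

```python
from typing import NamedTuple


class TaggedBytes(NamedTuple):
    """Hypothetical helper: raw bytes that remember how they were encoded."""

    data: bytes
    encoding: str

    def to_text(self) -> str:
        # The caller never has to guess: the encoding travels with the data.
        return self.data.decode(self.encoding)


title = TaggedBytes(b"\xe9ee", "latin-1")
print(ascii(title.to_text()))  # '\xe9ee'
```

A library that returned something like this, instead of a bare byte string, would let the receiving code convert to unicode without knowing where the bytes came from.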

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

3 thoughts on “Is Python going bad? or The curse of unicode….”

  1. I totally agree with you, this whole encoding thing is a real pain. I am in the process of writing an application that will, amongst other things, rename an mp3 file according to its id3 tag. This is how I got into the horribly confusing world of python encoding, as so far all the mp3’s I’ve come across are in latin-1, and turning that into something I can manipulate has been problematic. The fact that I’m going cross platform with this doesn’t help. I also want to add support for other encodings (utf-16,utf8, etc…). Like you said, how am I supposed to know what encoding was used in the mp3? I’m thinking about a series of try: except:, or maybe a loop that tries each encoding ?!? It’ll get done eventually, but for now latin1 will have to do – at least I can rename my Brassens titles without crashing my app.
    Anyway I would like to say that this page was informative by giving me a list of things to try out all in one convenient package, rather than searching through cryptic python doc pages. Merci l’ami !

    – ianaré

  2. Sometimes, for example when reading RSS feeds, even the programmer does not know what the encoding is.

    I agree with you that unicode is very badly supported in Python, but it is the best support we can find in a scripting language.
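
The loop over candidate encodings suggested in the first comment could look roughly like the sketch below. It assumes Python 3, and `decode_with_fallbacks` is a hypothetical helper name, not an existing function. Note that latin-1 accepts any byte sequence, so it has to come last, and that a clean decode only means the bytes were valid in that encoding, not that the guess was right.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each candidate encoding in order.

    Returns (text, encoding_name) for the first candidate that decodes
    without error. latin-1 never fails, so it must be the last candidate.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallbacks(b"\xe9ee")  # latin-1 bytes for "éee"
print(ascii(text), used)  # '\xe9ee' latin-1
```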
