First, the shameless plug: my long-time friend Jean-François Racine has published a book, available in both hardcover and paperback. The title is “The Text of Matthew in the Writings of Basil of Caesarea”.

More seriously, and maybe he had mentioned it before, he told me about the specialized word processor he uses, called Nota Bene. More interesting still is a component of this word processor called Orbis. Specifically, Orbis generates vocabulary lists along with frequencies of occurrence, and it lets you define synonym lists to expand search capabilities.

I have something of a debate going with my friend Yuhong about the correct way to use a Web Service. Yuhong seems to prefer SOAP. I much prefer REST.

What is a REST Web Service? For the most part, a REST Web Service is really, really simple. You simply follow a URL and, magically, you get back an XML file which is the answer to your query. One benefit of REST is that you can easily debug it and write quick scripts for it.

What is a SOAP Web Service? I don’t know. I really don’t get it. It seems to be a very complicated way to do the same thing: instead of sending the query as a URL, you send the query as an XML file. The bad thing about it is that if it breaks, you have no immediate way to debug it: you can’t issue a SOAP request using your browser (or maybe you can, but I just don’t know how).
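To make the REST idea concrete, here is a minimal sketch of how a query becomes a plain URL. It is written for Python 3, and the endpoint and parameter names are hypothetical, chosen only for illustration; they are not a real service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only (not a real service).
BASE_URL = "http://example.com/search/xml"

def build_rest_url(base_url, **params):
    """Return base_url followed by the parameters encoded as a query string."""
    # Sorting the parameters makes the resulting URL deterministic.
    return base_url + "?" + urlencode(sorted(params.items()))

# The whole query is visible in the URL, so it can be pasted into a browser.
url = build_rest_url(BASE_URL, Operation="ItemSearch", Artist="Miles Davis", Page=1)
print(url)

# Decoding the URL recovers the query, which is handy when reading a log.
query = parse_qs(urlsplit(url).query)
print(query["Artist"])
```

Because the request is nothing more than this string, debugging can be as simple as copying the URL out of a log file and opening it in a browser.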

Now, things never break, do they? Well, that is the problem: they break often, because I’m being stupid, or I don’t know what I’m doing, or the people on the other side don’t know what they are doing, or they are experimenting a bit, or whatever else. I find that being able to quickly debug my code is the primary feature any technology should have. The last thing I want from a technology is for it to be hard to debug.

Here is the problem that I solved this weekend. I have a list of artists and I want a list of all the corresponding music albums, so I can put it all into a relational (SQL) database. Assuming that your list of artists is in a file called artists_big.txt (one artist per line) and that you want the result to be in a file called amazonresults.sql, the following does a nice job thanks to the magic of Amazon Web Services:

Yes, the code runs past the margin, because I cannot let HTML wrap the lines (Python allows wrapping lines, but not arbitrarily, since white space is significant in Python). I know of no way around it: suggestions with sample HTML code are welcome.

import libxml2, urllib2, urllib, sys, re, traceback

ID = ""  # please enter your own ID here
uri = "http://webservices.amazon.com/AWSECommerceService/2004-10-19"
url = "http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&SubscriptionId=%s&Operation=ItemSearch&SearchIndex=Music&Artist=%s&ItemPage=%i&ResponseGroup=Request,ItemIds,SalesRank,ItemAttributes,Reviews"
outputcontent = ["ASIN","Artist","Title","Amount","NumberOfTracks","SalesRank","AverageRating","ReleaseDate"]
input = open("artists_big.txt")
output = open("amazonresults.sql", "w")
log = open("amazonlog.txt", "w")
output.write("DROP TABLE music;\nCREATE TABLE music (ASIN TEXT, Artist TEXT, Title TEXT, Amount INT, NumberOfTracks INT, SalesRank INT, AverageRating NUMERIC, ReleaseDate DATE);\n")

def getNodeContentByName(node, name):
    # return the content of the first node called `name` under `node`, or None
    for i in node:
        if i.name == name:
            return i.content
    return None

for artist in input:  # go through all artists
    artist = artist.strip()  # drop the trailing newline before quoting
    if not artist:
        continue  # skip blank lines
    print "Recovering albums for artist : ", artist
    page = 1
    while True:  # recover all pages
        resturl = url % (ID, urllib.quote(artist), page)
        log.write("Issuing REST request: " + resturl + "\n")
        try:
            data = urllib2.urlopen(resturl).read()
        except urllib2.HTTPError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not retrieve :\n" + resturl + "\n")
            break  # give up on this artist instead of retrying forever
        try:
            doc = libxml2.parseDoc(data)
        except libxml2.parserError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not parse (is valid XML?):\n" + data + "\n")
            break
        ctxt = doc.xpathNewContext()
        ctxt.xpathRegisterNs("aws", uri)
        isvalid = (ctxt.xpathEval("//aws:Items/aws:Request/aws:IsValid")[0].content == "True")
        if not isvalid:
            log.write("The query %s failed " % (resturl))
            errors = ctxt.xpathEval("//aws:Error/aws:Message")
            for message in errors:
                log.write(message.content + "\n")
            break  # the same query would fail again
        for itemnode in ctxt.xpathEval("//aws:Items/aws:Item"):
            attr = {}
            for nodename in outputcontent:
                content = getNodeContentByName(itemnode, nodename)
                if content is not None:
                    # double single quotes so the value is safe inside an SQL string literal
                    content = re.sub("'", "''", content)
                    if nodename == "SalesRank":
                        # "1,234" -> "1234" so the value fits an INT column
                        content = re.sub(",", "", content)
                    attr[nodename] = content
            if not attr:
                continue  # nothing to insert for this item
            keys = attr.keys()
            columns = "(" + ",".join(keys) + ")"
            row = "(" + ",".join(["'" + str(attr[k]) + "'" for k in keys]) + ")"
            command = "INSERT INTO music " + columns + " VALUES " + row + ";\n"
            output.write(command)
        NumberOfPages = int(ctxt.xpathEval("//aws:Items/aws:TotalPages")[0].content)
        if page >= NumberOfPages:
            break
        page += 1

input.close()
output.close()
log.close()
print "You should now be able to run the file in postgresql. Start the postgres client doing psql, and using \\i amazonresults.sql in the postgresql shell."
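The step that turns one album into an INSERT statement can be isolated and tried without calling Amazon at all. The following is a minimal sketch of that step, written for Python 3; the helper names sql_literal and insert_statement and the sample ASIN are made up for illustration and do not come from the script above. It assumes that doubling single quotes is an acceptable way to escape text values, which is the standard SQL form.

```python
def sql_literal(value):
    """Quote a value as an SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"

def insert_statement(table, row):
    """Build one INSERT statement from a dict of column -> value.

    Columns whose value is None are left out; returns None if nothing remains.
    """
    items = [(col, val) for col, val in row.items() if val is not None]
    if not items:
        return None
    columns = ", ".join(col for col, _ in items)
    values = ", ".join(sql_literal(val) for _, val in items)
    return "INSERT INTO %s (%s) VALUES (%s);" % (table, columns, values)

# Hypothetical album record; the ASIN is a placeholder, not a real product.
row = {
    "ASIN": "B000000000",
    "Artist": "Guns N' Roses",
    "SalesRank": "1,234".replace(",", ""),  # strip the thousands separator
    "ReleaseDate": None,                    # missing values are skipped
}
statement = insert_statement("music", row)
print(statement)
```

When the rows go straight into a database rather than into a .sql file, passing the values as query parameters through the database driver is generally a safer choice than building the string by hand.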

Update: the code above was revised to take into account these comments from Amazon following the upgrade of AWS 4.0 from beta to release:

1) Service Name change: You will need to modify the Service Name parameter in your application from AWSProductData to AWSECommerceService. We realize that it may take some time to implement this change in your applications. In order to make this transition as easy as possible, we will continue supporting AWSProductData for a short time.

2) REST/WSDL Endpoints: You will need to modify your application to connect to webservices.amazon.com instead of aws-beta.amazon.com. For other locales, the new endpoints are webservices.amazon.co.uk, webservices.amazon.de and webservices.amazon.co.jp.

I was an invited speaker at the AILIA (Canadian Language Industry Association) Annual Meeting. I talked about the Semantic Web and RDF. I met some pretty fantastic people, like Pierre Coulombe from Lingua Technologies and Fred Popowich from Axonwave.

I learned about a nice project by RALI called TransCheck. It is a computer-assisted revision tool to help automatically improve the quality of translations by checking for some common errors made by translators.

Today, I realized that the life of a researcher/professor is really a balancing act. A professor…

  • has a rich personal life;
  • gives great courses;
  • gets a lot of funding;
  • has many students;
  • publishes a lot of papers each year;
  • consults on industrial/governmental projects;
  • manages something (department, project, program).

It is no surprise that many professors end up being overworked. I think you simply cannot pull off all these things at once. Maybe 2 or 3 from the list. You have to choose, or life will choose for you.

Tall, Dark, and Mysterious is a Math. professor somewhere in Canada, possibly in British Columbia. She graduated from a big school and now teaches at a smaller (lesser?) school.

Well, is it a lesser school? That’s where her tale becomes interesting. Myself, I attended UofT. I don’t know if the rule is true, probably not, but it seems that the larger the school, the more it suffers from the jobs-are-for-little-people syndrome, as documented in a post by Tall, Dark, and Mysterious. Here is an insightful quote:

University isn’t job training, because universities are adamant about university not being job training. And it’s not because they’re too busy enriching students’ lives and fostering a love of learning. Underneath all of the cheap idealism – trumpeted by gainfully employed people, many of whom haven’t learned how to play a musical insturment, how to speak a foreign language, or how to play a new sport because none of those things are related to their jobs and because they’re too old to be doing that sort of thing – about learning for the sake of learning is a willful inability to confront the fact that students are not at universities to learn for the sake of learning.
