So, you want to do a Ph.D.?

Seb sent me this extract of a book. The extract is called So, you want to do a Ph.D.? As usual with this sort of book, it is delightful.

Here’s a fun quote:

One thing which is seldom mentioned is what happens to you after you finish the PhD. A classic story is as follows. A student focuses clearly, submits the thesis and starts looking for a lecturing job, only to discover that they need two years of lecturing experience and preferably a journal publication as well if they are to be appointable for a job in a good department in their field. If they had known this two years previously, they could have started doing some part-time lecturing and submitted a paper or two to a journal.

I haven’t read the entire book, of course, and I’m somewhat worried that the book might not be sufficiently focused on why one does a Ph.D. and might be a tad too cynical. Learning the rules is very nice and very important, and I wish I had learned them when the time was right. However, there is also the issue of figuring out whether these rules make sense, and of knowing when to break them. Well, I guess that learning the rules to begin with is a very good start.

Building the Open Warehouse

Here’s a link to slides from a talk by Roger Magoulas (O’Reilly Media, Inc.) about building the open warehouse. The talk was presented at the O’Reilly Open Source Convention 2004.

Commodity hardware, faster disks, and open source software now make building a data warehouse more of a resource and design issue than a cost issue for many organizations. Now a robust analysis infrastructure can be built on an open source platform with no performance or functional compromises.

This talk will cover a proven analysis architecture, the open source tool options for each architecture component, the basics of dimensional modeling, and a few tricks of the trade.

Why open source? Aside from the cost savings, open source lets you leverage what your staff already knows — tools like Perl, SQL and Apache — rather than having to procure and staff for the proprietary tools that dominate the commercial space.

Data Warehouse Architecture
– Consolidated Data Store (CDS)
– Process to condition, correlate and transform data
– Multi-topic data marts
  – dimensional models
– Multi-channel data access

Open Source Components
Database: MySQL
– fast, effective
Data Movement: Perl/DBI/SQL
– flexible data access
Data Access: Perl/Apache/SQL
– template toolkit for ad hoc SQL
– Perl hash for crosstabs/pivot
– Perl for reports
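The “Perl hash for crosstabs/pivot” trick translates directly to any language with hash tables. Here is a minimal sketch of the same idea in Python (the data and column names are invented for illustration):

```python
# Build a crosstab (pivot) of units sold by region and quarter using nested
# dictionaries, the same idea as the Perl-hash crosstab from the slides.
rows = [
    ("East", "Q1", 100), ("East", "Q2", 150),
    ("West", "Q1", 80), ("West", "Q1", 20), ("West", "Q2", 120),
]

crosstab = {}
for region, quarter, units in rows:
    cell = crosstab.setdefault(region, {})   # one inner dict per region
    cell[quarter] = cell.get(quarter, 0) + units

# crosstab now maps region -> {quarter -> total units}
```

Each cell accumulates as rows stream by, so the pivot is built in a single pass over the data.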

Dimensional Model
– organizes data for queries and navigation from detail to summary
– normalized fact table for quantitative data
– denormalized dimensions with descriptive data
– conforming dimensions available to multiple facts
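To make the terminology concrete, here is a minimal star schema sketched with Python’s sqlite3 module (all table and column names are invented; the talk itself used MySQL and Perl):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- fact table: normalized, holds the quantitative measures plus foreign keys
CREATE TABLE fact_sales (
    product_id INTEGER,
    day TEXT,
    units INTEGER,
    revenue NUMERIC
);
-- dimension table: denormalized, holds the descriptive attributes
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT,
    brand TEXT
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Gadgets', 'Acme');
INSERT INTO fact_sales VALUES (1, '2004-11-01', 3, 29.97);
INSERT INTO fact_sales VALUES (1, '2004-11-02', 2, 19.98);
""")

# Navigate from detail to summary: roll facts up along a dimension attribute.
summary = con.execute("""
    SELECT d.category, SUM(f.units), SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
""").fetchall()
```

A “conforming” dimension is one like dim_product above that several fact tables share, so the same descriptive attributes can slice any of them.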

Performance Considerations
– configuration
– indexing
– SQL-92 joins
– aggregate tables and aggregate navigation
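Aggregate tables are simply pre-computed rollups of the fact table; summary queries then scan a few aggregate rows instead of millions of detail rows. A sketch, again with sqlite3 and invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (day TEXT, store TEXT, units INTEGER)")
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [("2004-11-01", "A", 2), ("2004-11-01", "B", 5),
                 ("2004-11-02", "A", 1)])

# Pre-compute the rollup once (in practice, refresh it when the facts change)...
con.execute("""CREATE TABLE agg_sales_by_day AS
               SELECT day, SUM(units) AS units FROM fact_sales GROUP BY day""")

# ...and point summary queries at the much smaller aggregate table.
daily = con.execute("SELECT day, units FROM agg_sales_by_day ORDER BY day").fetchall()
```

“Aggregate navigation” is the layer that decides, per query, whether the detail table or one of these rollups can answer it.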

The presentation should provide you with the basic architecture, toolkit, design principles, and strategy for building an effective open source data warehouse.

Graduate student/faculty relations

Sharleen talks about how evil junior faculty can be in their approach with grad students:

(…) in academia, (…), there are limited options, and a poor grad student may have to work with the asshole who has naive, unethical, or objectionable approaches to working with grad students. Now, we could simply say, the ones who survive are the ones who deserve to get jobs/get the PhD. We could point out that the market is much tougher. But if we respond this way, we’re not critiquing the culture of academia (a culture which, if I may point out, is largely responsible for the other problems that we all bitch about); we’re justifying it.

I’m unsure why she points at junior faculty as the source of the problem. She’s probably speaking from personal experience.

However, I agree with her criticism of the tough love approach to supervising graduate students. I don’t think it can be justified from a pedagogical point of view, it is not justified from a management point of view, and so, indeed, it might be some kind of power trip.

On the other hand, I disagree with her implication that there are no choices. In most cases, the graduate student can go with another supervisor. It might be costly, but it is almost always an option. Or else, you can simply go out there, find a job, and be happy.

Repeat after me: the world is big and there are almost always options. Unless you are a slave stranded somewhere, you can almost certainly find another job, another graduate program, another project… it might be costly, it might imply extra work, but it is most often possible.

The reason why these professors are getting away with treating graduate students badly is that graduate students allow it. If they chose not to go with this “evil” supervisor, there wouldn’t be any problems any more.

That’s how the real world works. Evil employers will have trouble finding good employees. The good employees will leave for a better employer. That’s the market at work.

The day employees stop leaving, because they are scared or tired, is the day the market stops working and the trouble starts.

Generally speaking, academia doesn’t have so much a culture problem as it has a market problem: too many potential candidates for some positions leading to a general degradation of the working conditions for everyone involved.

The art of supervising students

I had an off-line discussion with a collaborator about student supervision and how frustrating it can be. As a professor, you have, from time to time, to supervise students. It could be a graduate student you are supervising as part of their studies, it could be an undergraduate project, it could be an assistant you’ve hired.

You know you have a bad student if the student

  • cannot keep track of tasks assigned to him and be responsible for such tasks;
  • lies to you about what has been done and what hasn’t been done;
  • repeatedly ignores some of your phone calls or emails.

In my experience, a bad student is a drain on your resources and a professor simply has to drop such a student as soon as possible. Even if you have funding or a need for a student, you are better off with no student than with a bad one.

So, what about my title? The art of supervising students?

My experience has been that there is no need to be tough or strict with the students. There is nothing magical you can do: forcefully organizing many meetings with the student often won’t help. If you have a bad student (see above), cut your losses as early as possible. Otherwise, trust the student.

Here are a few rules based on my experience:

  • Be clear about the tasks you expect the student to perform and the time it should take.
  • Be available to the student in a personalized way: some students benefit from frequent meetings, others do not.
  • Get to know and leverage the student’s strengths, and know his weaknesses: sometimes you are better off doing some of the tasks yourself.
  • Trust the student: most students have tremendous potential and will deliver greatness given a chance.

e-Learning or else…

Important post today by Yuhong, on her experience with e-Learning. She recalls a few facts:

  • a decent videoconference setup for a classroom is less than $5000;
  • MIT is setting itself up to become the major competitor in the future education market through e-Learning: WebLab and OpenCourseWare;
  • we know of some tremendously successful endeavours like MusicGrid, led by Martin Brooks.

I think that Yuhong misses the most important example of all: the U.K. Open University. An entire university based on e-Learning and distance education, and yet it is one of the best schools in the U.K.

I think Downes once wrote that while physical classrooms won’t go away, they will increasingly become a lifestyle choice. In the near future, when my son attends college (if he does so), he will find a very different landscape. There will be many high-quality learning opportunities outside classrooms, to the extent that he may avoid classrooms entirely and actually get an even better education. On the other hand, the remaining classrooms will be high-tech classrooms with remote instructors, remote laboratories and so on.

You don’t believe me? About a quarter of current students [in U.K.] are now doing all or part of their courses online.

How to Misuse SQL’s FROM Clause

I stumbled on an interesting SQL article on the Misuse of the FROM Clause. The author argues that FROM clauses should refer to only two types of tables:

  • those from which you want values returned
  • those needed to join two or more tables in the first category

In other words, if your select is on tables A and B, then you can select from tables A and B, and any table that can be joined with A and B, but no others.

The argument he offers is based on performance concerns. It does seem to me that any query not fulfilling this requirement would have to be relatively complex.
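The rule is easy to illustrate with sqlite3 (table names invented for the example): customers and orders supply the returned values, so they belong in FROM; a table used purely as a filter, like a blacklist, belongs in a subquery instead.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount NUMERIC);
CREATE TABLE blacklist (customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO customers VALUES (2, 'Bob');
INSERT INTO orders VALUES (1, 10.0);
INSERT INTO orders VALUES (2, 20.0);
INSERT INTO blacklist VALUES (2);
""")

# FROM lists only the tables whose values we return (customers, orders);
# blacklist contributes no output columns, so it is pushed into a subquery
# rather than joined in the FROM clause.
result = con.execute("""
    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON o.customer_id = c.id
    WHERE c.id NOT IN (SELECT customer_id FROM blacklist)
""").fetchall()
```

Putting blacklist in FROM would force the optimizer to treat it as a join partner, which is exactly the misuse the article warns against.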

If we taught you to memorize, we failed you

Tall, Dark, and Mysterious wrote about this student she has in her class who is actually a fairly typical student:

“I memorized how to do the problem you did in class, but then on the test you put a DIFFERENT problem, and you never showed us how to do THAT one, and it’s not fair! My method of doing math by memorizing formulas and then blindly applying them to problems that are identical to the ones I’ve seen has gotten me A’s until now, so what gives?”

Repeat after me: memorization is not learning. Learning has to be a higher level task.

More on the CS enrollment drop

I’ve written on this blog about the recent drop in enrollment in Computer Science degrees in North America: I estimated the drop at 25%. It looks like it is worse:

The number of new undergraduate majors in U.S. computer science programs has fallen 28 percent since 2000, reports the Computing Research Association, a group of more than 200 North American computer science, computer engineering and related academic departments.

The explanation would be that students do not want a Dilbertesque life:

One reason, say those in the field, is that technology jobs appear less lucrative than they did during the dot-com boom. Then, students thought a computer science degree would lead to riches and a quick retirement. Many took on the major.

Another reason might be that Business Schools are now competing with Computer Science departments for students:

Colleges have also begun to integrate computer instruction into other majors such as e-commerce programs in business schools. A computer science degree, therefore, can be unnecessary.

Don’t memorize, change your neural pathways!

Some days ago, I stated on this blog that I have a Ph.D. in mathematics (true) and that I know neither my own phone number nor the multiplication tables (also true). My wife knows it is true. She still claims she has superior brain power because not only does she know our phone number, but she even knows our postal code, and she knows many other things. There is no question that my wife is one of the smartest women in Montreal. Hey! There is a reason why I fell in love with her!

Still, I claim not to be a brain-damaged moron despite these apparent shortcomings. You see, I do not memorize on purpose because I think that my time is better used by solving problems and learning new tricks.

From Downes’, I got the following bit of wisdom telling me I’m not alone in thinking that memorizing facts is not key to learning…

My own research – research that can be extended through the many resources on this site – has already convinced me that neural structures are, as they say, plastic. For me what this means is that learning based on the fostering of habits is more important than learning based on transmission of facts; that, indeed, the facts aren’t that important at all, not nearly as important as modelling effective practice, paying attention to environment, and immersive, experience-based education.

So, please, do me a favor: if you teach, do not ask your students to memorize. Ask them to change their neural pathways, their thinking patterns… let their PDAs and the Web be a fact storage unit, don’t waste their brains.

Update: A colleague who trained in history and holds a Ph.D. says he could never remember dates, and only memorized one: December 25th, 800. So, I can say that I’m not alone in thinking that memorization is only a minor part of learning.

Don’t Be Afraid to Drop the SOAP

Through Downes’, I found another article speaking up against SOAP: Don’t Be Afraid to Drop the SOAP. Here are a few things it holds against SOAP, all of which I can testify to:

  • SOAP is difficult to debug. The SOAP message format is verbose even by XML standards, and decoding it by hand is a great way to waste an afternoon. As a result, development took almost twice as long as anticipated.
  • The fact that all requests happened live over the network further hampered debugging. Unless the user was careful to log debugging output to a file it was difficult to determine what went wrong.
  • SOAP doesn’t handle large amounts of data well. This became immediately apparent as we tried to load a large data import in a single request. Since SOAP requires the entire request to travel in one XML document, SOAP implementations usually load the entire request into memory. This required us to split large jobs into multiple requests, reducing performance and making it impossible to run a complete import inside a transaction.
  • Network problems affected operations that needed to access multiple machines, such as the program responsible for moving templates and elements. Requests would frequently timeout in the middle, sometimes leaving the target system in an inconsistent state.

SOAP leads to strongly coupled, poorly scalable, and bandwidth hungry solutions?

Here are some comments by Joe Walnes on his experience with SOAP. The scary thing is that he comes to exactly the same conclusions as I did on my own… Any SOAP supporter out there want to answer these?

On the last system I worked on, we were struggling with SOAP and switched to a simpler REST approach. It had a number of benefits.

Firstly, it simplified things greatly. With REST there was no need for complicated SOAP libraries on either the client or server, just use a plain HTTP call. This reduced coupling and brittleness. We had previously lost hours (possibly days) tracing problems through libraries that were outside of our control.

Secondly, it improved scalability. Though this was not the reason we moved, it was a nice side-effect. The web-server, client HTTP library and any HTTP proxy in-between understood things like the difference between GET and POST and when a resource has not been modified so they can offer effective caching – greatly reducing the amount of traffic. This is why REST is a more scalable solution than XML-RPC or SOAP over HTTP.

Thirdly, it reduced the payload over the wire. No need for SOAP envelope wrappers and it gave us the flexibility to use formats other than XML for the actual resource data. For instance a resource containing the body of an unformatted news headline is simpler to express as plain text and a table of numbers is more concise (and readable) as CSV.

Victor Shoup’s A Computational Introduction to Number Theory and Algebra

Through Didier, I got to Victor Shoup’s Home Page. He has an on-line textbook called A Computational Introduction to Number Theory and Algebra. It is unclear whether he intends the textbook to remain free, but it is pretty cool of him to post the book on his home page. Shoup is an expert in cryptography.

How Technology Will Destroy Schools

Through Downes’, I found an article by David Wiley with the provocative title How Technology Will Destroy Schools (he is actually being needlessly provocative; he means “schools as they exist now”). The gist of his argument goes as follows:

The development of (…) technology will obviate the need for certain types of instruction — like the teaching of facts. Why spend time memorizing when the same information is available just as quickly from the network as it is from your own memory? But never fear, schools! The technology will create the need for new types of instruction — in higher level information literacy skills. Perhaps this will finally force some change through the public schools.

Well, I must admit it: I have a Ph.D. in mathematics and I never learned my multiplication tables. There you go. I never saw the point of learning these tables, so I didn’t. Instead, I learned a few tricks to do multiplications… for example, 9 times 8 is 10 times 8 minus 1 times 8.

Mathematics is not about learning facts. I suspect that all disciplines have a component above learning the facts. You can’t be an expert in something if you only know the facts… because I can easily input the facts into a piece of software and compete with you, but we all know that software can’t compete (yet) against human experts. I’m not very good at memorizing facts; I’ve never been. In fact, I’m not good at memorizing anything, and that’s why I always have a PDA with me. Yet, I’m an expert at some things.

It is the difference between real knowledge and shallow knowledge. Most of our education system is based on acquiring and testing shallow knowledge. Most but not all.

How are you going to get past shallow knowledge through technology, as Wiley predicts we will? I think that blogs, games, and simulations are good examples. Yes, we can role-play without technology, but it becomes so much cheaper to deploy gaming scenarios through technology (because you only have to do it once) that it might become more commonplace in the future.

Maybe my son Lohan, by the time he makes it to school, will have “gaming instruction” where he will enter a gaming universe to learn basic mathematics. Who knows.

I’m not holding my breath though; I think we lack the human power to pull it off in the next 5 years.

What the Bubble Got Right

A beautiful article by Paul Graham: What the Bubble Got Right. It is a good analysis of the dot-com era. I totally agree with the analysis too! People tend to overestimate the impact of technology over the short term, but underestimate it over the longer term. The dot-com bubble was proof of that. It is not so much that the new economy was a sham… but rather that the new economy will take a bit more than 2 years to settle… Here’s the conclusion of this beautiful article:

When one looks over these trends, is there any overall theme? There does seem to be: that in the coming century, good ideas will count for more. That 26 year olds with good ideas will increasingly have an edge over 50 year olds with powerful connections. That doing good work will matter more than dressing up– or advertising, which is the same thing for companies. That people will be rewarded a bit more in proportion to the value of what they create.

If so, this is good news indeed. Good ideas always tend to win eventually. The problem is, it can take a very long time. It took decades for relativity to be accepted, and the greater part of a century to establish that central planning didn’t work. So even a small increase in the rate at which good ideas win would be a momentous change– big enough, probably, to justify a name like the “new economy.”

As a side-note, the mere fact that such a good article is waiting at the end of a URL, for all to see and absolutely free, should remind you of how powerful, after all, the Web really is. I was raised in an era where you needed to go buy a magazine to read such a great article. Then you’d get many bad articles, but what could you do: there were few magazines, and your choices were limited. Things have changed, they have changed tremendously.

Data centers as a utility?

Seems like Gartner predicts data centers are going to become a utility:

The office environment will dramatically change in 50 years’ time, with desktop computers disappearing, robots handling more manual tasks, and global connectivity enabling more intercontinental collaboration. Data centers located outside the city will run powerful database and processing applications, serving up computing power as a utility; many more people will work remotely, using handheld devices to stay connected wherever they go, although those devices will be much more sophisticated and easier to use than current handhelds.

If you haven’t switched to Firefox, do it now.

I’ve finally moved all my machines to Mozilla Firefox 1.0. It is, by far, the best browser I have ever used, and it is totally, truly free. Unfortunately, the French version is lagging behind a bit. Unless you are running something other than Windows, Linux, or MacOS, you have no excuse to use another browser. None.

Update: Sean asks why I switched away from Konqueror. The main reasons are XML support and Gmail. Gmail doesn’t support Konqueror for some reason, and I badly need a browser with decent support for XSLT. Also, there is a comment below saying that Firefox is not stable on OS X 10.2.

On tools for academic writing and a shameless plug

First, the shameless plug: my long-time friend Jean-François Racine published a book, available both as hardcover and paperback. The title is “The Text of Matthew in the Writings of Basil of Caesarea”.

More seriously, and maybe he had told me about this before, he uses a specialized word processor called Nota Bene. More interesting is a component of this word processor called Orbis. Specifically, Orbis generates vocabulary lists, as well as frequencies of occurrence, and it allows you to define synonym lists to expand search capabilities.

An Amazon Web Services (AWS) 4.0 application in just a few lines

I have somewhat of a debate with my friend Yuhong about the correct way to use a Web Service. Yuhong seems to prefer SOAP. I much prefer REST. What is a REST Web Service? For the most part, a REST Web Service is really, really simple. You simply follow a URL and, magically, you get back an XML file which is the answer to your query. One benefit of REST is that you can easily debug it and write quick scripts for it. What is a SOAP Web Service? I don’t know. I really don’t get it. It seems to be a very complicated way to do the same thing: instead of sending the query as a URL, you send the query as an XML file. The bad thing about it is that if it breaks, you have no immediate way to debug it: you can’t issue a SOAP request using your browser (or maybe you can, but I just don’t know how).

Now, things never break, do they? Well, that is the problem, they break often because either I’m being stupid or I don’t know what I’m doing or the people on the other side don’t know what they are doing or the people on the other side are experimenting a bit or whatever else. I find that being able to quickly debug my code is the primary feature any IT technology should have. The last thing I want from a technology is for it to be hard to debug.

Here is the problem that I solved this weekend. I have a list of artists and I want to get a list of all corresponding music albums so I can put it all into a relational (SQL) database. Assuming that your list of artists is in a file called artists_big.txt and that you want the result in a file called amazonresults.sql, the following does a nice job thanks to the magic of Amazon Web Services:

Yes: some lines run long, because I cannot let HTML wrap them arbitrarily (Python allows wrapping lines, but not arbitrarily so: white space is significant in Python). There is just no way around it that I know of: suggestions with sample HTML code are invited.

import libxml2, urllib2, urllib, sys, re, traceback

ID = ""  # please enter your own ID here
uri = "http://webservices.amazon.com/AWSECommerceService/2004-10-19"
url = "http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&SubscriptionId=%s&Operation=ItemSearch&SearchIndex=Music&Artist=%s&ItemPage=%i&ResponseGroup=Request,ItemIds,SalesRank,ItemAttributes,Reviews"
outputcontent = ["ASIN", "Artist", "Title", "Amount", "NumberOfTracks", "SalesRank", "AverageRating", "ReleaseDate"]
input = open("artists_big.txt")
output = open("amazonresults.sql", "w")
log = open("amazonlog.txt", "w")
output.write("DROP TABLE music;\nCREATE TABLE music (ASIN TEXT, Artist TEXT, Title TEXT, Amount INT, NumberOfTracks INT, SalesRank INT, AverageRating NUMERIC, ReleaseDate DATE);\n")

def getNodeContentByName(node, name):
    for i in node:
        if i.name == name: return i.content
    return None

for artist in input:  # go through all artists
    artist = artist.strip()  # drop the trailing newline before building the URL
    print "Recovering albums for artist: ", artist
    page = 1
    while True:  # recover all pages
        resturl = url % (ID, urllib.quote(artist), page)
        log.write("Issuing REST request: " + resturl + "\n")
        try:
            data = urllib2.urlopen(resturl).read()
        except urllib2.HTTPError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not retrieve:\n" + resturl + "\n")
            continue
        try:
            doc = libxml2.parseDoc(data)
        except libxml2.parserError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not parse (is it valid XML?):\n" + data + "\n")
            continue
        ctxt = doc.xpathNewContext()
        ctxt.xpathRegisterNs("aws", uri)
        isvalid = (ctxt.xpathEval("//aws:Items/aws:Request/aws:IsValid")[0].content == "True")
        if not isvalid:
            log.write("The query %s failed " % (resturl))
            errors = ctxt.xpathEval("//aws:Error/aws:Message")
            for message in errors: log.write(message.content + "\n")
            continue
        for itemnode in ctxt.xpathEval("//aws:Items/aws:Item"):
            attr = {}
            for nodename in outputcontent:
                content = getNodeContentByName(itemnode, nodename)
                if content is not None:
                    content = content.replace("'", "''")  # escape single quotes for SQL
                    if nodename == "SalesRank":
                        content = content.replace(",", "")  # drop thousands separators
                    attr[nodename] = content
            columns = "(" + ",".join(attr.keys()) + ")"
            row = "(" + ",".join(["'" + str(v) + "'" for v in attr.values()]) + ")"
            command = "INSERT INTO music " + columns + " VALUES " + row + ";\n"
            output.write(command)
        NumberOfPages = int(ctxt.xpathEval("//aws:Items/aws:TotalPages")[0].content)
        if page >= NumberOfPages: break
        page += 1

input.close()
output.close()
log.close()
print "You should now be able to load the file into PostgreSQL: start the client with psql, then type \\i amazonresults.sql at the prompt."

Update: the code above was updated to take into account the following comments from Amazon, after the upgrade of AWS 4.0 from beta to release:

1) Service Name change: You will need to modify the Service Name parameter in your application from AWSProductData to AWSECommerceService. We realize that it may take some time to implement this change in your applications. In order to make this transition as easy as possible, we will continue supporting AWSProductData for a short time.

2) REST/WSDL Endpoints: You will need to modify your application to connect to webservices.amazon.com instead of aws-beta.amazon.com. For other locales, the new endpoints are webservices.amazon.co.uk, webservices.amazon.de and webservices.amazon.co.jp.

Academic life: a balancing act

Today, I realized that the life of a researcher/professor is really a balancing act. A professor…

  • has a rich personal life;
  • gives great courses;
  • gets a lot of funding;
  • has many students;
  • publishes a lot of papers each year;
  • consults on industrial/governmental projects;
  • manages something (department, project, program).

It is no surprise that many professors end up overworked. I think you simply cannot pull off all these things at once. Maybe 2 or 3 from the list. You have to choose, or life will choose for you.