In the electronic world, less structure is better

Here’s a quote touching on something very important to me: people tend to try to reproduce, in the eWorld, structures they know work in the “real world”. So they create electronic management systems that are like real management: hierarchical, centralized and rigid. No, no and no! When will they learn that such things never win in the end?

What worked? Please look at what worked! Email and the Web. These are the things that worked… look at them!!! Both of these things have very minimal models of how people work. Look at Google: it doesn’t assume much at all about your work process. Look at Amazon: their main goals are fewer clicks and less structure.

I don’t want an electronic management system to tell me how to work. In fact, just stay away from monolithic tools: give me some simplistic tools (like a hammer) and let me work!

The general theme is that less management is better, and that individual learners could write all of their posts, assignments and papers from their own site, and these could be directed to each class as web feeds. The classes would aggregate the feeds from all the students and instructors. The beauty of this kind of system is that each student keeps all of his/her content, and it does not get locked away in an inaccessible archive of a centrally controlled LMS.

From a post by Harold, who was writing about a post by James Farmer.
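
Just to make the aggregation idea concrete, here is a minimal sketch of a “class page” built by merging student feeds. It assumes the feedparser Python library and made-up feed URLs; it is an illustration of the idea, not anything from Harold’s or James’s posts.

import feedparser

# hypothetical feeds, one per student, each hosted on the student's own site
student_feeds = [
    "http://student-one.example.org/rss",
    "http://student-two.example.org/rss",
]

# the "class" view is nothing more than everyone's posts pulled together;
# a real aggregator would also sort the entries by date and cache the feeds
for feedurl in student_feeds:
    feed = feedparser.parse(feedurl)
    author = feed.feed.get("title", feedurl)
    for entry in feed.entries:
        print author, ":", entry.title, "(" + entry.link + ")"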

Tim Berners-Lee’s first executive summary of the World Wide Web

I copy this here for historical reasons. Notice how Tim didn’t simply point to a specification; he actually pointed to a working demo of what the Web could be. (The complete version can be found on the W3C Web site.)

From: Tim Berners-Lee ([email protected])
Subject: WorldWideWeb: Summary
Newsgroups: alt.hypertext
Date: 1991-08-06 13:37:40 PST
(...)
     Information provider view
The WWW browsers can access many existing data systems via existing protocols (FTP, NNTP) or via HTTP and a gateway. In this way, the critical mass of data is quickly exceeded, and the increasing use of the system by readers and information suppliers encourage each other.

Making a web is as simple as writing a few SGML files which point to your existing data. Making it public involves running the FTP or HTTP daemon, and making at least one link into your web from another. In fact, any file available by anonymous FTP can be immediately linked into a web. The very small start-up effort is designed to allow small contributions. At the other end of the scale, large information providers may provide an HTTP server with full text or keyword indexing.

The WWW model gets over the frustrating incompatibilities of data format between suppliers and reader by allowing negotiation of format between a smart browser and a smart server. This should provide a basis for extension into multimedia, and allow those who share application standards to make full use of them across the web.

This summary does not describe the many exciting possibilities opened up by the WWW project, such as efficient document caching, the reduction of redundant out-of-date copies, and the use of knowledge daemons. There is more information in the online project documentation, including some background on hypertext and many technical notes. (...)

You can also check out Linus’ first email presenting Linux.

Mozilla Web Developer’s documentation

I wasted a lot of time last night searching for JavaScript documentation. My friend Scott Flinn was nice enough to give me these pointers regarding DOM and general Web work:

This is much better than flying blind, but I wish I had something more like the Java API documentation.

BTW if you don’t know Scott Flinn, you should. He is probably the best technical resource I ever met. And I don’t mean “technical resource” in an insulting way. He simply understands hands-on technology very deeply. He is also a pessimist like myself, so we do get along, I think.

Here’s some advice from Scott:

If you just want core JavaScript stuff, then you use Rhino or
SpiderMonkey (the Mozilla implementations in Java and C++ respectively).
I swear by Rhino. You just drop js.jar into your extensions directory
and add this simple script to your path:

#!/bin/sh
java org.mozilla.javascript.tools.shell.Main

Then ‘rhino’ will give you a command line prompt that will evaluate
arbitrary JavaScript expressions. The nice part is that you have
a bridge to Java, so you can do things like:

js> sb = new java.lang.StringBuffer( 'This' );
This
js> sb.append( ' works!' );
This works!
js> sb
This works!

What I did was to download Rhino, open the archive, and type “java -jar js.jar”. It brought up a console. System.out doesn’t work, but you can print using the “print” command. (Update: Obviously, you have to use java.lang.System.out…)

Joel on Software – Advice for Computer Science College Students

Through slashdot, I saw this nice article by Joel on what you should do if you want to become a programmer and are studying Computer Science:

From Joel on Software – Advice for Computer Science College Students, here are Joel’s Seven Pieces of Free Advice for Computer Science College Students (worth what you paid for them):

  • Learn how to write before graduating.
  • Learn C before graduating.
  • Learn microeconomics before graduating.
  • Don’t blow off non-CS classes just because they’re boring.
  • Take programming-intensive courses.
  • Stop worrying about all the jobs going to India.
  • No matter what you do, get a good summer internship.

Affordable TeraBytes

From Slashdot, I learned that for $3k, one can buy 1.6 TB of external storage built out of normal PC hard drives:

IO Data Device’s new ‘HDZ-UE1.6TS’ exemplifies the recent trend towards demand for higher storage capacities — it’s an external hard drive setup offering a total capacity of 1.6TB. Not much larger than four 3.5″ hard drives, the HDZ-UE1.6TS goes to show that any (rich) consumer can now easily have a boatload of storage space. (At current conversion rates, this would cost nearly $2,900.)

Maybe $3k seems like a lot but I bet that in 5 years, these beasts will cost under $1k and fit inside a normal PC.

The consequences of so much storage (nearly infinite) are still not well understood, but I believe it could bring about new killer applications we can’t even imagine right now.

Globalization and the American IT Worker

Norman Matloff wrote a solid paper called Globalization and the American IT Worker, published in the latest issue (Nov. 2004) of Communications of the ACM. Here’s a rather bleak quote:

University computer science departments must be honest with students regarding career opportunities in the field. The reduction in programming jobs open to U.S. citizens and green card holders is permanent, not just a dip in the business cycle. Students who want technological work must have less of a mindset on programming and put more effort into understanding computer systems in preparation for jobs not easily offshored (such as system and database administrators). For instance, how many graduates can give a cogent explanation of how an OS boots up?

Wal-Mart’s Data Obsession

According to this Slashdot thread, Wal-Mart has 460 terabytes of data stored on Teradata mainframes at its Bentonville headquarters. That’s something like 2^49 bytes if my computations are exact. That’s about 10,000 average hard drives (at 50 GB each) or 1,500 large hard drives (at 300 GB each). Suppose you want to store a matrix having 10 million rows by 10 million columns, with 4 bytes in each cell: that’s pretty much what Wal-Mart has in storage. Another way to put it is that you could store 400 KB of data every second for 30 years and not exceed this storage capacity. If you had Wal-Mart’s storage, you could pretty much make a video of your entire life. The Internet Wayback Machine is reported to have about twice this storage.
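
Here is a quick back-of-the-envelope check of these numbers in Python (a sketch; the drive sizes and rates are the ones used above, and everything is rounded):

import math

terabyte = 2 ** 40                                    # bytes in one (binary) terabyte
walmart = 460 * terabyte                              # the reported 460 TB
print "total bytes:", walmart                         # 505,775,348,776,960
print "as a power of two: 2^%.1f" % math.log(walmart, 2)   # about 2^48.8, roughly 2^49
print "50 GB drives:", walmart / (50 * 2 ** 30)       # 9420, about 10,000 in round numbers
print "300 GB drives:", walmart / (300 * 2 ** 30)     # 1570, about 1,500 in round numbers
print "10M x 10M matrix, 4-byte cells:", 10 ** 7 * 10 ** 7 * 4, "bytes"     # 4e14, the same ballpark
print "400 KB/s for 30 years:", 400000 * 3600 * 24 * 365 * 30, "bytes"      # about 3.8e14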

One can only assume that storage will increase considerably in coming years. The impact this will have on our lives is still unknown, but the fact that gmail offers 1 GB of storage for free right now is telling: almost infinite storage is right around the corner and it will change the way we think about information technology.

Cringely on Microsoft

Cringely has nice things to say about Microsoft this time around. The gist of his argument is that Microsoft is a better company because it is lean and mean. I have no idea how lean and mean Microsoft is, but I buy it. I remember seeing a picture of the entire Windows dev. team, and it was a reasonably small team (around 15 people). I think that Microsoft is not the type of company that would waste money on layers upon layers of management. I always felt that in an IT project, if you can’t code a script or install a database, you have no business being around the table.

The IBMs, EDSes, Lockheed-Martins, Computer Science Corps, and Boeings of the IT world make their money based on head count, while Microsoft makes its money primarily from software licenses. If any of those other big companies were to win the NHS IT contract, they’d budget 30 to 40 percent of the total amount for management, which is to say for doing very little, if any, real work on the project. They would throw a couple thousand bodies at the project whether those bodies were really needed or not. Microsoft, on the other hand, will put a dozen key developers on the project with probably two managers above them. They’ll get the work done in less time and for less money because they don’t have to carry the baggage of that old business model.

Backing up your data is hard!

I lost the last 3 months of data on my Palm. People who know me can imagine my face right now. I take all my notes on my Palm. It is my primary data source/data management tool.

It is stupid, really. Every time there are quarters in my pocket, the Palm goes crazy. I think the metal from the quarters does something to it. If there is a good engineer out there, maybe he could explain what happens? I have a Palm m500, and it comes with a built-in cover. In theory, a quarter should do nothing, but in practice, it destroys the Palm: I have to do a hard reset, losing all data.

Anyhow, today, I bought coffee and dropped a quarter in my Palm pocket (I always carry my Palm). I could tell it was dead when I picked it up: the green light was permanently on and the unit was unresponsive.

The good thing about the Palm m500 is that it has excellent (several weeks) battery life. It has no color, no Wifi, no… just plain Palm software. The downside is that when you lose your data, well, you might have an old backup.

In my case, things got worse. Some months ago, I switched from coldsync, a solid Linux desktop Palm syncing tool, to KPilot, a shiny but buggy one. The nice thing about KPilot is that it is integrated with KDE, so you get calendar and address book syncing for free.

However, I just learned that KPilot doesn’t have reliable backups. While I had synced two weeks ago, I was only able to retrieve my data up until September 15th. It might be the date I upgraded KDE from 3.2 to 3.3, I don’t know…

Now, I tried tweaking KPilot so that it would do better backups. I couldn’t find what could possibly have gone wrong, so I tried switching options randomly… but how do I know whether it makes reliable backups now?

I have no way to know. I could switch back to coldsync, but coldsync appears to be frozen in time.

This is a general problem. For example, the Mysql tables for this blog are dumped and saved carefully on another machine. How do I know that I could retrieve all of the data and put my blog back together should something bad happen? The only way to know would be to periodically roll back my blog from my backups. That seems awkward. So I’ll just wait until something bad happens and pray.
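
A sketch of what “periodically roll back from backups” could look like for the blog database: restore the latest dump into a scratch database and run a simple sanity check. The file name, scratch database name and table name are all made up for illustration; this is the idea, not something I actually run.

import subprocess

dump_file = "blog-backup.sql"        # hypothetical nightly mysqldump output
scratch_db = "blog_restore_test"     # hypothetical scratch database, safe to overwrite

# load the dump into the scratch database (same as: mysql blog_restore_test < blog-backup.sql)
subprocess.check_call("mysql %s < %s" % (scratch_db, dump_file), shell=True)

# sanity check: the restored posts table should not be empty
count = subprocess.Popen(["mysql", "-N", "-e", "SELECT COUNT(*) FROM posts", scratch_db],
                         stdout=subprocess.PIPE).communicate()[0]
print "rows restored in posts:", int(count)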

Funny differences between Mysql and Postgresql

I hope someone can explain these funny differences between Mysql and Postgresql. (Yes, see update below.)

Here’s an easy one… What is 11/5?

select 11/5;

What should a SQL engine answer? Does anyone know? I could check, as I used to be a member of the “SQL” ISO committee, but I’m too lazy and the ISO specs are too large. Mysql gives me 2.20 whereas Postgresql gives me 2 (integer division). It seems to me that Postgresql is more in line with most programming languages (when dividing integers, use integer division).
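
For comparison, here is what a typical programming language does with the same operands (a quick check in Python 2, the same language as the script further down this page):

print 11 / 5     # 2   (two integer operands give integer division, like Postgresql)
print 11.0 / 5   # 2.2 (make one operand a float and you get 2.2, like Mysql)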

It gets even weirder… how do you round 0.5? I was always taught that the answer is 1.

select round(0.5);

Mysql gives me 0 (which I feel is wrong) and Postgresql gives me 1 (which I feel is right).

On both counts, Mysql gives me an unexpected answer.

(The color scheme above for SQL statements shows I learned to program with Turbo Pascal.)

Update: Scott gave me the answer regarding Mysql’s rounding rule. It alternates between rounding up and rounding down (round half to even), so

select round(1.5);

gives you 2 under Mysql. The idea is that rounding should not, probabilistically speaking, favor “up” over “down”. Physicists know this principle well. Joseph Scott also gave me the answer, and in fact he gave me quite a detailed answer on his blog. I think Joseph’s answer is slightly wrong. I don’t think Mysql uses the standard C library, because the following code:

#include <cmath>
#include <iostream>
using namespace std;
int main() {
        cout << round(0.5) << endl;
        cout << round(1.5) << endl;
}

outputs 1 and 2 on my machine (not what Mysql gives me).
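
The behaviour Scott describes is usually called round-half-to-even (“banker’s rounding”). Here is a small sketch of the rule, just to make the 0.5-versus-1.5 asymmetry explicit; this is an illustration of the rule, not Mysql’s actual code:

import math

def round_half_to_even(x):
    # ties (exactly .5) go to the nearest even integer; everything else rounds normally
    floor = int(math.floor(x))
    if x - floor < 0.5:
        return floor
    if x - floor > 0.5:
        return floor + 1
    return floor if floor % 2 == 0 else floor + 1

print round_half_to_even(0.5)   # 0, what Mysql reports
print round_half_to_even(1.5)   # 2, what Mysql reports
print round_half_to_even(2.5)   # 2, not 3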

Building the Open Warehouse

Here’s a link to slides from a talk by Roger Magoulas (O’Reilly Media, Inc.) about building the open warehouse. The talk was presented at the O’Reilly Open Source Convention 2004.

Commodity hardware, faster disks, and open source software now make building a data warehouse more of a resource and design issue than a cost issue for many organizations. Now a robust analysis infrastructure can be built on an open source platform with no performance or functional compromises.

This talk will cover a proven analysis architecture, the open source tool options for each architecture component, the basics of dimensional modeling, and a few tricks of the trade.

Why open source? Aside from the cost savings, open source lets you leverage what your staff already knows — tools like Perl, SQL and Apache — rather than having to procure and staff for the proprietary tools that dominate the commercial space.

Data Warehouse Architecture:
– Consolidated Data Store (CDS)
– Process to condition, correlate and transform data
– Multi-topic data marts
– dimensional models
– Multi-channel data access

Open Source Components
Database: MySQL
– fast, effective
Data Movement: Perl/DBI/SQL
– flexible data access
Data Access: Perl/Apache/SQL
– template toolkit for ad hoc SQL
– Perl hash for crosstabs/pivot
– Perl for reports

Dimensional Model
– organizes data for queries and navigation from detail to summary
– normalized fact table for quantitative data
– denormalized dimensions with descriptive data
– conforming dimensions available to multiple facts

Performance Considerations
– configuration
– indexing
– SQL-92 joins
– aggregate tables and aggregate navigation

The presentation should provide you with the basic architecture, toolkit, design principles, and strategy for building an effective open source data warehouse.
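
To make the dimensional model part concrete, here is a toy star schema: one fact table for the quantitative data, denormalized dimension tables for the descriptive data, and a query that navigates from detail to summary. I sketched it with Python and SQLite rather than the MySQL/Perl stack from the slides, and the table and column names are invented:

import sqlite3

db = sqlite3.connect(":memory:")

# the fact table holds the quantitative data, one narrow row per event
db.execute("""CREATE TABLE sales_fact (
    date_key INTEGER, product_key INTEGER, store_key INTEGER,
    units_sold INTEGER, revenue REAL)""")

# denormalized dimension tables hold the descriptive data
db.execute("CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
db.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)")
db.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT)")

# a typical query: join the facts to the dimensions and summarize
for row in db.execute("""
        SELECT d.year, p.category, SUM(f.revenue)
        FROM sales_fact f
        JOIN date_dim d ON f.date_key = d.date_key
        JOIN product_dim p ON f.product_key = p.product_key
        GROUP BY d.year, p.category"""):
    print row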

What the Bubble Got Right

A beautiful article by Paul Graham: What the Bubble Got Right. It is a good analysis of the dot-com era, and I totally agree with it! People tend to overestimate the impact of technology over the short term, but underestimate it over the longer term. The dot-com bubble was proof of that. It is not so much that the new economy was a sham… but rather that the new economy will take a bit more than 2 years to settle… Here’s the conclusion of this beautiful article:

When one looks over these trends, is there any overall theme? There does seem to be: that in the coming century, good ideas will count for more. That 26 year olds with good ideas will increasingly have an edge over 50 year olds with powerful connections. That doing good work will matter more than dressing up– or advertising, which is the same thing for companies. That people will be rewarded a bit more in proportion to the value of what they create.

If so, this is good news indeed. Good ideas always tend to win eventually. The problem is, it can take a very long time. It took decades for relativity to be accepted, and the greater part of a century to establish that central planning didn’t work. So even a small increase in the rate at which good ideas win would be a momentous change– big enough, probably, to justify a name like the “new economy.”

As a side-note, the mere fact that such a good article is waiting at the end of a URL, for all to see and absolutely free, should remind you of how powerful, after all, the Web really is. I was raised in an era where you needed to go buy a magazine to read such a great article. Then you’d get many bad articles, but what could you do: there were few magazines, and your choices were limited. Things have changed, they have changed tremendously.

Data centers as a utility?

Seems like Gartner predicts data centers are going to become a utility:

The office environment will dramatically change in 50 years’ time, with desktop computers disappearing, robots handling more manual tasks, and global connectivity enabling more intercontinental collaboration. Data centers located outside the city will run powerful database and processing applications, serving up computing power as a utility; many more people will work remotely, using handheld devices to stay connected wherever they go, although those devices will be much more sophisticated and easier to use than current handhelds.

If you haven’t switched to Firefox, do it now.

I’ve finally moved all my machines to Mozilla Firefox 1.0. It is, by far, the best browser I have ever used, and it is totally, truly free. Unfortunately, the French version is lagging behind a bit. Unless you are running something other than Windows, Linux, or MacOS, you have no excuse to use another browser. None.

Update: Sean asks why I switched away from Konqueror. The main reasons are XML support and Gmail. Gmail doesn’t support Konqueror for some reason, and I badly need a browser with decent XSLT support. Also, there is a comment below saying that Firefox is not stable on OS X 10.2.

On tools for academic writing and a shameless plug

First, the shameless plug: my long-time friend Jean-François Racine published a book, available both in hardcover and paperback. The title is “The Text of Matthew in the Writings of Basil of Caesarea”.

More seriously, and maybe he had mentioned this to me before, he told me about a specialized word processor he uses called Nota Bene. More interesting is a component of this word processor called Orbis. Specifically, Orbis generates vocabulary lists along with frequencies of occurrence, and it lets you define synonym lists to expand search capabilities.
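
The vocabulary-list part is easy to picture; here is a rough sketch of the same idea in a few lines of Python (the manuscript file name is made up, and this is obviously nowhere near what Orbis actually does):

import re
from collections import defaultdict

counts = defaultdict(int)
for line in open("manuscript.txt"):
    for word in re.findall(r"[a-zA-Z']+", line.lower()):
        counts[word] += 1

# the vocabulary list, most frequent words first
for word, n in sorted(counts.items(), key=lambda item: -item[1]):
    print word, n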

An Amazon Web Services (AWS) 4.0 application in just a few lines

I have somewhat of a debate with my friend Yuhong about the correct way to use a Web Service. Yuhong seems to prefer SOAP. I much prefer REST. What is a REST Web Service? For the most part, a REST Web Service is really, really simple. You simply follow a URL and magically, you get back an XML file which is the answer to your query. One benefit of REST is that you can easily debug it and write quick scripts for it. What is a SOAP Web Service? I don’t know. I really don’t get it. It seems to be a very complicated way to do the same thing: instead of sending the query as a URL, you send the query as an XML file. The bad thing about it is that if it breaks, then you have no immediate way to debug it: you can’t issue a SOAP request using your browser (or maybe you can, but I just don’t know how).
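
To show how little there is to it, here is the whole REST interaction in a few lines of Python. You need your own subscription ID, the artist is just an example, and the full script below does the same thing with error handling and paging:

import urllib2

url = "http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&SubscriptionId=YOUR_ID&Operation=ItemSearch&SearchIndex=Music&Artist=Radiohead"
print urllib2.urlopen(url).read()   # the answer comes back as an XML document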

Now, things never break, do they? Well, that is the problem, they break often because either I’m being stupid or I don’t know what I’m doing or the people on the other side don’t know what they are doing or the people on the other side are experimenting a bit or whatever else. I find that being able to quickly debug my code is the primary feature any IT technology should have. The last thing I want from a technology is for it to be hard to debug.

Here is the problem that I solved this weekend. I have a list of artists and I want to get a list of all corresponding music albums so I can put it all into a relational (SQL) database. Assuming that your list of artists is in a file called artists_big.txt and that you want the result in a file called amazonresults.sql, the following does a nice job thanks to the magic of Amazon Web Services:

Yes: some code lines go over the page width because I cannot allow HTML to wrap lines (Python allows wrapping lines, but not arbitrarily: white space is significant in Python). There is just no way around it that I know of; suggestions with sample HTML code are welcome.

import libxml2, urllib2, urllib, sys, re, traceback

ID = ""  # please enter your own subscription ID here
uri = "http://webservices.amazon.com/AWSECommerceService/2004-10-19"
url = "http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&SubscriptionId=%s&Operation=ItemSearch&SearchIndex=Music&Artist=%s&ItemPage=%i&ResponseGroup=Request,ItemIds,SalesRank,ItemAttributes,Reviews"
outputcontent = ["ASIN", "Artist", "Title", "Amount", "NumberOfTracks", "SalesRank", "AverageRating", "ReleaseDate"]

input = open("artists_big.txt")
output = open("amazonresults.sql", "w")
log = open("amazonlog.txt", "w")
output.write("DROP TABLE music;\nCREATE TABLE music (ASIN TEXT, Artist TEXT, Title TEXT, Amount INT, NumberOfTracks INT, SalesRank INT, AverageRating NUMERIC, ReleaseDate DATE);\n")

def getNodeContentByName(node, name):
    # return the text content of the first child node with the given name
    for i in node:
        if i.name == name:
            return i.content
    return None

for artist in input:  # go through all artists, one per line
    artist = artist.strip()  # remove the trailing newline
    print "Recovering albums for artist:", artist
    page = 1
    while True:  # recover all result pages for this artist
        resturl = url % (ID, urllib.quote(artist), page)
        log.write("Issuing REST request: " + resturl + "\n")
        try:
            data = urllib2.urlopen(resturl).read()
        except urllib2.HTTPError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not retrieve:\n" + resturl + "\n")
            continue  # try the same page again
        try:
            doc = libxml2.parseDoc(data)
        except libxml2.parserError, e:
            log.write("\n")
            log.write(str(traceback.format_exception(*sys.exc_info())))
            log.write("\n")
            log.write("could not parse (is it valid XML?):\n" + data + "\n")
            continue  # try the same page again
        ctxt = doc.xpathNewContext()
        ctxt.xpathRegisterNs("aws", uri)
        isvalid = (ctxt.xpathEval("//aws:Items/aws:Request/aws:IsValid")[0].content == "True")
        if not isvalid:
            log.write("The query %s failed " % (resturl))
            errors = ctxt.xpathEval("//aws:Error/aws:Message")
            for message in errors:
                log.write(message.content + "\n")
            continue
        for itemnode in ctxt.xpathEval("//aws:Items/aws:Item"):
            attr = {}
            for nodename in outputcontent:
                content = getNodeContentByName(itemnode, nodename)
                if content is not None:
                    content = re.sub("'", "''", content)  # escape single quotes for SQL
                    if nodename == "SalesRank":
                        content = re.sub(",", "", content)  # strip thousands separators
                    attr[nodename] = content
            keys = attr.keys()
            columns = "("
            for i in range(len(keys) - 1):
                columns += keys[i] + ","
            columns += keys[len(keys) - 1] + ")"
            values = attr.values()
            row = "("
            for i in range(len(values) - 1):
                row += "'" + str(values[i]) + "',"
            row += "'" + str(values[len(values) - 1]) + "')"
            command = "INSERT INTO music " + columns + " VALUES " + row + ";\n"
            output.write(command)
        NumberOfPages = int(ctxt.xpathEval("//aws:Items/aws:TotalPages")[0].content)
        if page >= NumberOfPages:
            break
        page += 1

input.close()
output.close()
log.close()
print "You should now be able to run the file in postgresql: start the client with psql and type \\i amazonresults.sql in the postgresql shell."

Update: the code above was updated to take into account these comments from Amazon following the upgrade of AWS 4.0 from beta to release:

1) Service Name change: You will need to modify the Service Name parameter in your application from AWSProductData to AWSECommerceService. We realize that it may take some time to implement this change in your applications. In order to make this transition as easy as possible, we will continue supporting AWSProductData for a short time.

2) REST/WSDL Endpoints: You will need to modify your application to connect to webservices.amazon.com instead of aws-beta.amazon.com. For other locales, the new endpoints are webservices.amazon.co.uk, webservices.amazon.de and webservices.amazon.co.jp.

How Google is just plain better

What is one of the most visited pages on all of my sites? It is “Jolie, petite coquine” (“pretty little minx”), the Web page of our cat. The page was originally designed by my wife back when she had a Web site of her own. The page ranks high on Voila for some sex-related keyword searches, and people arrive at my cat’s page because they are horny. Now, that’s at least 10 hits per day. How come Google doesn’t fail in the same way? I don’t know, but somehow, Google knows my cat isn’t a sex object…

Or is she? Well, check it out for yourself!

Note: in doing research for this post, I found out that dmoz.org actually indexes various sex sites in a taxonomy. But my cat is nowhere to be found.

Most amazing Cringely article ever…

Cringely published an amazing article on crime in the USA. It turns out that in 1982, a study was paid for by the U.S. Department of Justice. Three people were involved: Michael Block, Fred Nold, and Sandy Lerner. Cringely believes their study showed that the sentencing guidelines of the day would lead to a poorer, more crime-ridden USA (and it did). The study was “hidden away”. It turns out that Nold killed himself in 1983. Block became a law professor and won’t comment to Cringely about the study. Sandy Lerner went on to found Cisco.

A few things are amazing. First, the suicide of a researcher who possibly felt like a loser. It reminds me of Wallace Carothers, who invented Nylon. It is unclear to me how you can feel like a loser after inventing Nylon, but apparently someone did. Second, the USA knew it was headed for a crime-ridden society and went ahead anyhow. Why? I can’t figure it out. Lastly, there is the little detail that the statistician on the study, Sandy Lerner, went on to found Cisco. This is an interesting contrast with the fellow who killed himself.

A megabyte is a mebibyte, and a kilobyte is a kibibyte

If you’ve been annoyed about the fact that a kilobyte has 1024 bytes and not 1000 bytes, well, you were right all along! What people call a kilobyte is really a kibibyte. (Thanks to Owen for pointing it out to me!)

Examples and comparisons with SI prefixes
one kibibit   1 Kibit = 2^10 bit = 1024 bit
one kilobit   1 kbit = 10^3 bit = 1000 bit
one mebibyte  1 MiB = 2^20 B = 1 048 576 B
one megabyte  1 MB = 10^6 B = 1 000 000 B
one gibibyte  1 GiB = 2^30 B = 1 073 741 824 B
one gigabyte  1 GB = 10^9 B = 1 000 000 000 B

Source: Definitions of the SI units: The binary prefixes