Yuhong Yan’s Tips to Graduate Students

I really like Yuhong Yan. She’s one of my favorite collaborators of all time. It is quite strange, too, because we were colleagues for a long time and never collaborated much at all. Then I left my NRC job and went to live something like 700 km away, and since then we’ve never been closer. Maybe this says something about how efficient technology has become.

In any case, if you are a graduate student or are thinking about becoming one, you should read Yuhong Yan’s Tips to Graduate Students. The mere fact that she put this page together is enough to make me like her! I find that very few schools care enough about their students to put together similar advice. It seems to be enough for many professors to just throw students into research and see who swims and who sinks. My advice to graduate students would be to seek supervisors who will give you such advice. I think Yuhong is probably a good supervisor.

She also posted a copy of an unpublished paper we wrote together. Myself, I tend to keep unpublished papers private, but at the same time, I keep arguing that researchers should promote their papers more aggressively, so I’m not going to complain about what Yuhong did.

Object Oriented Learning Objects

If you have broadband and want more details about the SCTIC-CREPUQ Meeting I described earlier, you can get Stephen Downes’ slides and audio stream on Object Oriented Learning Objects:

Slides and the MP3 audio (English and French, 7 megabytes) of my presentation in Montreal are now available (the audio also includes the presentations from other panelists, used with permission). In it, I present again the idea of “e-learning as dynamic, unstructured stream of learning resources obtained and organized by learners.” In this talk I extend the idea bit by elaborating on the community aspect of learning resources and outlining how the learning objects should be designed in order to facilitate this. More – much more – on this in the future.

By Stephen Downes, Stephen’s Web,
November 26, 2004

EC-Web 2005 (February 19, 2005 / August 23, 2005)

6th International Conference on
Electronic Commerce and Web Technologies

EC-Web 2005

August 23 – August 26, 2005
Copenhagen, Denmark


A large number of organizations are exploiting the opportunities offered by Internet-based technologies for electronic commerce and electronic business. Companies sell and purchase via the Internet, search engines and directories allow electronic market participants around the globe to locate potential trading partners, and a set of protocols and standards has been established to exchange goods and services via the Internet. The Internet is changing the way how companies and organizations are working, and the amount of innovation and change seems to accelerate. However, numerous technical issues need still to be resolved. The main objective of this conference is to bring together researchers and practitioners from different disciplines, all interested in electronic commerce and Web technologies and to assess current methodologies and new research directions. Although a natural focus will be on computer science issues, we welcome research contributions from economics, business administration, law, and other disciplines. EC-Web 2005 is organized by the DEXA Association in parallel with DEXA 2005 (16th International Conference on Database and Expert Systems Applications).

Suggested Topics
The major topics of interest include but are not limited to:

* Auction and Negotiation Technology
* Agent-Mediated Electronic Commerce
* Business Process Integration
* Business Process Modeling
* Customer Relationship Management
* Decision Support and Optimization in EC
* Digital Goods and Products
* Electronic Data Interchange (EDI)
* Enterprise Application Integration
* Electronic Contracting
* Formation of Supply Chains, Coalitions, and Virtual Enterprises
* Grid Computing for EC
* Intellectual Property Licensing
* Interorganizational Systems
* IPR, Legal and Privacy Issues
* Knowledge Discovery in Web-based IS and EC
* Languages and Ontologies for Describing Goods, Services, and Contracts
* Mobile Commerce
* P2P-Computing
* Pricing and Metering of On-Demand Services
* Quality of Service (Performance, Security, Reliability, etc.)
* Recommender Systems
* Rule Languages and Rule-based Systems
* Security and Trust in EC
* Semantic Web
* Supply Chain Management and Supplier Relationship Management
* Ubiquitous and Pervasive Technologies for EC
* Usability Issues for EC
* User Behavior, Web Usage Mining
* Web Data Quality Aspects
* Web Data Visualization
* Web Services Computing
* Web Site Monitoring
* XML-based Standards
* Applications and Case Studies in EC

Entering the Mainstream: The Quality and Extent of Online Education in the United States, 2003 and 2004

Here is a very important report: Entering the Mainstream: The Quality and Extent of Online Education in the United States, 2003 and 2004

This study takes a look at online learning in American universities. Here are a few facts the study brings to bear.

Will online enrollments continue their rapid growth?

  • Over 1.9 million students were studying online in the fall of 2003.
  • Schools expect the number of online students to grow to over 2.6 million by the fall of 2004.
  • Schools expect online enrollment growth to accelerate — the expected average growth rate for online students for 2004 is 24.8%, up from 19.8% in 2003.
  • Overall, schools were pretty accurate in predicting enrollment growth — last year’s predicted online enrollment for 2003 was 1,920,734; this year’s number from the survey is 1,971,397.

Are students as satisfied with online courses as they are with face-to-face instruction?

  • 40.7% of schools offering online courses agree that “students are at least as satisfied” with their online courses, 56.2% are neutral and only 3.1% disagree.

What about the quality of online offerings, do schools continue to believe that it measures up?

  • A majority of academic leaders believe that online learning quality is already equal to or superior to face-to-face instruction.
  • Three quarters of academic leaders at public colleges and universities believe that online learning quality is equal to or superior to face-to-face instruction.
  • Three quarters of all academic leaders believe that online learning quality will be equal to or superior to face-to-face instruction in three years.

In light of these facts, recall my earlier prediction:

I predict that in 5 years, students all over the world will learn Calculus with little input from instructors (but a lot of input from other students!). They will use sophisticated on-line laboratories and on-line testing, and on-line support. The technology is already here, but we still don’t know how to use it properly.

It looks like it might happen even faster than 5 years! But my prediction is bold enough as it is, so I’ll keep it in its current form.

Affordable TeraBytes

From Slashdot, I learned that for $3k, one can buy a 1.6TB hard drive similar to normal PC hard drives:

IO Data Device’s new ‘HDZ-UE1.6TS’ exemplifies the recent trend towards demand for higher storage capacities — it’s an external hard drive setup offering a total capacity of 1.6TB. Not much larger than four 3.5″ hard drives, the HDZ-UE1.6TS goes to show that any (rich) consumer can now easily have a boatload of storage space. (At current conversion rates, this would cost nearly $2,900.)

Maybe $3k seems like a lot but I bet that in 5 years, these beasts will cost under $1k and fit inside a normal PC.

The consequences of so much storage (nearly infinite) are still not well understood, but I believe it could bring about new killer applications we can’t even imagine right now.

Edd Dumbill on Web 2.0

Edd Dumbill has a cool post on what the future of the Web is.

What’s hot:

  • Intellectual property and privacy law. If exchange and manipulation of data is key to the future web, then we need to understand that and be the ones in control. If corporations have too much control of data, as they are striving for, that’s the equivalent of API lock-in, and we’ll all suffer. But on the other hand, we want to tightly control the data about ourselves. An interesting conflict, about which I’d like to write more in future.
  • Transformation and annotation. Nobody’s going to own a unique hold on the form of expression of the data flying around in “web 2.0”, but they’re certainly going to want to transform between those forms. From the crude “emergent keywords” of del.icio.us to the intensive but scope-limited integration done by Google and A9, there’s going to be a lot of value in joining together previously isolated data islands.
  • Network engineering. Dumb, happy protocols that give quick results are on the rise. Look at the RSS madness, servers being pummelled. And RSS isn’t even mainstream yet, though it’s about to get that way. It gets messier before it gets better.

What’s not:

  • Complicated web service standards. Forget the WS-I lunacy. Web applications for computers were happening before the web services standards junk. Amazon would still be providing their interfaces with or without SOAP, WSDL and UDDI, and indeed all the evidence is that their users prefer to use the simpler HTTP/XML APIs anyway. As far as the web is concerned, the WS-* work is about sprinkling XML pixie dust on a failing idea.
  • Frameworks and silos. Don’t believe anyone who claims to have a wonderful new framework that’ll solve your problems if only you’d migrate everything you do to it. The web is all about separate pieces, loosely joined. The really clever businesses know how to manage uncertainty, they’re not looking to eliminate it. Circling the wagons will not integrate you into the web, neither will it promote web-like innovation inside a business.

Computing argmax fast in Python

Update: see Fast argmax in Python for the final word.

Python doesn’t come with an argmax function built-in. That is, a function that tells you where a maximum is in an array. And I keep needing to write my own argmax function so I’m getting better and better at it. Here’s the best I could come up with so far:

from itertools import izip
argmax = lambda array: max(izip(array, xrange(len(array))))[1]

The really nice thing about izip and xrange is that they don’t actually output arrays, but only lazy iterators. You also have plenty of similar functions such as imap or ifilter. Very neat.

Here’s a challenge to you: can you do better? That is, can you write a faster, less memory hungry function? If not, did I find the optimal solution? If so, do I get some kind of prize?

Next, suppose you want to find argmax while excluding some set of bad indexes. You can do it this way…

from itertools import izip, ifilter
argmaxbad = lambda array,badindexes: max(ifilter(lambda t: t[1] not in badindexes,izip(array, xrange(len(array)))))[1]

Python is amazingly easy.
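
The snippets above are Python 2: izip, ifilter, and xrange are gone in Python 3, where zip, filter, and range are already lazy. As a rough sketch (the names argmax and argmax_excluding are mine, not from any library), the same idea might look like this in Python 3:

```python
def argmax(array):
    """Return the index of the largest value.

    Ties go to the largest index, because (value, index) tuples are compared.
    """
    # zip and range are lazy in Python 3, so no intermediate list is built.
    return max(zip(array, range(len(array))))[1]


def argmax_excluding(array, bad_indexes):
    """Same as argmax, but skip any index found in bad_indexes."""
    bad = set(bad_indexes)  # set membership tests are O(1) on average
    candidates = ((value, i) for i, value in enumerate(array) if i not in bad)
    return max(candidates)[1]


print(argmax([3, 9, 2, 9]))                 # 3
print(argmax_excluding([3, 9, 2, 9], [3]))  # 1
```

Comparing (value, index) tuples breaks ties toward the last index; max(range(len(array)), key=array.__getitem__) would return the first one instead.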

As a side-note, you can also do intersections and unions of sets in Python in a similar (functional) spirit:
def intersection(set1,set2): return filter(lambda s:s in set2,set1)
def union(set1, set2): return set1 + filter(lambda s:s not in set1, set2)

Update: you can do the same things with hash tables:
max(izip(hashtable.itervalues(), hashtable.iterkeys()))[1]
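
In Python 3, itervalues and iterkeys no longer exist. A minimal sketch of the same idea, assuming a plain dict called hashtable (the contents here are made up for illustration):

```python
hashtable = {"apples": 3, "pears": 7, "plums": 5}

# Pair each value with its key and take the maximum pair, as above.
best_key = max(zip(hashtable.values(), hashtable.keys()))[1]

# Arguably more direct: let max compare the keys by their values.
# (The two differ only in how ties between equal values are broken.)
best_key_alt = max(hashtable, key=hashtable.get)

print(best_key, best_key_alt)  # pears pears
```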

Aaron Straup Cope’s NYTimes Widgets

One of the most interesting talks we had at SWIG’04 was “Design Issues and Technical Challenges Making the Eatdrinkfeelgood Markup Language RDF” where Aaron showed why it was hard to use RDF in an XML project. I think it all boils down to the fact that we have no good widespread way of serializing RDF to XML. In any case, Aaron finally sent me a link to his NYTimes Widgets.

It lacks sufficient documentation for me to grok it quickly, but from what I understand, Aaron tried to create a useful and innovative RDF application. Here’s what he says about his widgets:

The New York Times includes a large amount of topical metadata with each article it publishes. These are widgets that, having harvested the data, try to do something interesting with it.

Good software engineering according to Paul Graham

Paul Graham describes what good software developers do:

In software, paradoxical as it sounds, good craftsmanship means working fast. If you work slowly and meticulously, you merely end up with a very fine implementation of your initial, mistaken idea. Working slowly and meticulously is premature optimization. Better to get a prototype done fast, and see what new ideas it gives you.

Globalization and the American IT Worker

Norman Matloff wrote a solid paper called Globalization and the American IT Worker, published in the latest issue (Nov. 2004) of Communications of the ACM. Here’s a rather bleak quote:

University computer science departments must be honest with students regarding career opportunities in the field. The reduction in programming jobs open to U.S. citizens and green card holders is permanent, not just a dip in the business cycle. Students who want technological work must have less of a mindset on programming and put more effort into understanding computer systems in preparation for jobs not easily offshored (such as system and database administrators). For instance, how many graduates can give a cogent explanation of how an OS boots up?

RSS is the Semantic Web

Here’s what Stephen Downes has to say about the Semantic Web:

RSS is the semantic web. It is not the official semantic web as I said, it is not sanctioned by any standards body or organization whatsoever. But RSS is what has emerged as the de facto description of online content, used by more than four million sites already worldwide, used to describe not only resources, but people, places, objects, calendar entries, and in my way of thinking, learning resources and learning objects.

What makes RSS work is that it approaches search a lot more like Google and a lot less like the Federated search described above. Metadata moves freely about the internet, is aggregated not by one but by many sources, is recombined, and fed forward. RSS is now used to describe the content of blogs, and when aggregated, is the combining of blog posts into new and novel forms. Sites like Technorati and Bloglines, Popdex and Blog Digger are just exploring this potential. RSS is the new syntax, and the people using it have found a voice.

TOOL: The Open Opinion Layer

Here’s an interesting paper by Hassan Masum, TOOL: The Open Opinion Layer. Here’s the abstract:

Shared opinions drive society: what we read, how we vote, and where we shop are all heavily influenced by the choices of others. However, the cost in time and money to systematically share opinions remains high, while the actual performance history of opinion generators is often not tracked.

This article explores the development of a distributed open opinion layer, which is given the generic name of TOOL. Similar to the evolution of network protocols as an underlying layer for many computational tasks, we suggest that TOOL has the potential to become a common substrate upon which many scientific, commercial, and social activities will be based.

Valuation decisions are ubiquitous in human interaction and thought itself. Incorporating information valuation into a computational layer will be as significant a step forward as our current communication and information retrieval layers.

Follow-up on “Sébastien Paquet on blogs and wikis”

In one of my previous posts, commenting on the fact that technology had dramatically changed learning, I predicted that in 5 years, it would be an accepted fact that some university courses are better taught using mostly technology and very little live input from an instructor…

I had one reply from an anonymous Scott (but I know who you are!) which is worth citing in full here:

If you equate learning with time spent in school, then I tend to agree (…) there is a lot of value in the traditional methods, and we would be foolish to replace them with untested modern contrivances (bordering here on the “computers in schools” debate).

But if you view learning as a continuous experience that is not confined to attendance at institutionalized schools, then I wholeheartedly agree (…). I left the research world for five years (1997-2002), and was astounded when I returned at how the process of dissemination and discovery has been completely transformed by the Internet. Academic discourse these days is utterly dependent on electricity.

And I see elements of the same transformation in schools at every level. I run a couple of historical Web sites, and I receive an endless stream of questions from students doing projects. But I think the more important observation is that students (of all ages) are applying the information discovery skills they learn in school (and on their own) to other activities. Here’s an example. Our 15 year old TV and two 15 year old VCRs all decided to die in October. So we’re now faced with the daunting task of choosing new technology. Buying a TV used to be easy: you chose your size, identified some trusted brands, then picked a cabinet to match your decor. These days, you have to choose between CRT vs LCD vs projection vs plasma, 16:9 vs 4:3, HD capable or not, progressive scan vs interlace, presence of RF/composite/s-video/component connectors, etc. And that’s just the TV. What about a replacement VCR? Really, you want a DVD recorder that plays about a dozen disk formats, and you also have to think about future requirements for networked content delivery throughout the house, and compatibility between all the components. The combinatorial explosion is overwhelming. I know that my father could no longer make a choice of television that was better than a random guess. So I asked the sales guy how much time they have to devote to educating customers these days regarding all the options, expecting to hear that people generally feel as overwhelmed as I do. But the answer was quite the opposite. The guy said that most customers (of all ages – I specifically asked about age) come to the store with a comprehensive understanding of the options. Not only do they understand the choices (often following guides from such places as Consumer Reports), but they come armed with reviews from epinions.com, advice from discussion forums, (wikis and blogs?), etc. The sales guy said that as often as not they learn something from the customer.

If that isn’t a fundamental (and welcome) change in how people learn, I don’t know what is. It suggests to me a process of continuous and pervasive learning that I rarely saw emerge from traditional schools. Yet that’s the culture that today’s children are experiencing. I don’t know if calculus teachers will be obsolete in five years, but neither do I see the pace of change slowing any time soon. If anything, I expect it to accelerate as technologies for continuous social communication and global network access (cell/PDA/SMS/IM/etc) go mainstream.

Wal-Mart’s Data Obsession

According to this Slashdot thread, Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters. That’s something like 2^49 bytes if my computations are exact. That’s about 10,000 average hard drives (at 50 GB each) or 1,500 large hard drives (at 300 GB each). Suppose you want to store a matrix having 10 million rows by 10 million columns. Assume you store 4 bytes in each cell. That’s pretty much what Wal-Mart has in storage. Another way to put it is that you can store 400 KB of data every second for 30 years and not exceed this storage capacity. If you had Wal-Mart’s storage, you could pretty much make a video of your entire life. The Internet Wayback machine is reported to have about twice this storage.
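
The back-of-the-envelope figures above can be checked in a few lines. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), which the post does not specify.

```python
import math

TB = 10**12  # decimal terabyte in bytes (an assumption, see above)
GB = 10**9
KB = 10**3

walmart = 460 * TB

print(math.log2(walmart))          # about 48.7, so roughly 2**49 bytes
print(walmart / (50 * GB))         # 9200.0 drives of 50 GB each
print(walmart / (300 * GB))        # about 1533 drives of 300 GB each

matrix = 10**7 * 10**7 * 4         # 10 million x 10 million cells, 4 bytes each
print(matrix / TB)                 # 400.0 TB

seconds = 30 * 365.25 * 24 * 3600  # 30 years, in seconds
print(400 * KB * seconds / TB)     # about 379 TB
```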

One can only assume that storage will increase considerably in coming years. The impact this will have on our lives is still unknown, but the fact that gmail offers 1 GB of storage for free right now is telling: almost infinite storage is right around the corner and it will change the way we think about information technology.

Sébastien Paquet on blogs and wikis

As pointed out by Nicolas, Sébastien Paquet was giving a talk on Friday. He talked about blogs and wikis for collaborative learning. As an interesting sidenote, the famous Gilles Brassard attended Sébastien’s talk since Gilles was Sébastien’s thesis supervisor. As reported by Nicolas, I think a number of professors do not like the bottom-up nature of these tools.

I think it is quite natural to get these reactions. Blogs and wikis are part of a larger trend where central control is being neutralized. I think the Web in general will eventually have a very deep effect on society: my son will live in a world where individuals are in charge much more than companies or governments. Universities will be hit hard too: I see students progressively taking charge of their learning. Currently, we decide what my son eats, but he has started to protest when he doesn’t like something, and soon, he alone will decide what he eats. I think technology will progressively put students more and more in charge. I don’t think this will diminish the need for university professors, but their role will change from spoon-feeding students to providing guidance.

I predict that in 5 years, students all over the world will learn Calculus with little input from instructors (but a lot of input from other students!). They will use sophisticated on-line laboratories and on-line testing, and on-line support. The technology is already here, but we still don’t know how to use it properly.

Update: Jeff Erickson seems to predict that in 5 years, I’ll still be predicting that in 5 years learning will change. Well, he is right. Learning is changing all the time. The role of university professors has changed quite a bit with technology, whether we care to admit it or not. What I’m saying in this post is that the Web has reached sufficient maturity now that it will soon be able to do away (mostly) with instructors in some of the more spoon-feeding courses.

Cringely on Microsoft

Cringely has nice things to say about Microsoft this time around. The gist of his argument is that Microsoft is a better company because it is lean and mean. I have no idea how lean and mean Microsoft is, but I buy it. I remember seeing a picture of the entire Windows dev. team, and it was a reasonably small team (around 15 people). I think that Microsoft is not the type of company that would waste money on layers upon layers of management. I always felt that in an IT project, if you can’t code a script or install a database, you have no business being around the table.

The IBMs, EDSes, Lockheed-Martins, Computer Science Corps, and Boeings of the IT world make their money based on head count, while Microsoft makes its money primarily from software licenses. If any of those other big companies were to win the NHS IT contract, they’d budget 30 to 40 percent of the total amount for management, which is to say for doing very little, if any, real work on the project. They would throw a couple thousand bodies at the project whether those bodies were really needed or not. Microsoft, on the other hand, will put a dozen key developers on the project with probably two managers above them. They’ll get the work done in less time and for less money because they don’t have to carry the baggage of that old business model.

Backing up your data is hard!

I lost the last 3 months of data on my Palm. People who know me can imagine my face right now. I take all my notes on my Palm. It is my primary data source/data management tool.

It is stupid, really. Every time there are some quarters in my pocket, the Palm goes crazy. I think the metal from the quarters does something to the Palm. If there is a good engineer out there, maybe he could explain what happens? I have a Palm m500, and it comes with a built-in cover. In theory, a quarter should do nothing, but in practice, it destroys the Palm: I have to do a hard reset, losing all data.

Anyhow, today, I bought coffee and dropped a quarter in my Palm pocket (I always carry my Palm). I could tell it was dead when I picked it up and the green light was permanently bright and the unit unresponsive.

The good thing about the Palm m500 is that it has excellent (several weeks) battery life. It has no color, no Wifi, no… just plain Palm software. The downside is that when you lose your data, well, you might have an old backup.

In my case, things got worse. Some months ago, I switched from coldsync, a solid Palm syncing tool for the Linux desktop, to KPilot, a shiny but buggy one. The nice thing about KPilot is that it is integrated with KDE, so you get calendar and address book syncing for free.

However, I just learned that KPilot doesn’t have reliable backups. While I had synced two weeks ago, I was only able to retrieve my data up until September 15th. It might be the date I upgraded KDE from 3.2 to 3.3, I don’t know…

I tried tweaking KPilot so that it would do better backups. I couldn’t find what could possibly have gone wrong, so I tried switching options randomly… but how do I know whether it has reliable backups now?

I have no way to know. I could switch back to coldsync, but coldsync appears to be frozen in time.

This is a general problem. For example, the Mysql tables for this blog are dumped and saved carefully on another machine. How do I know that I could retrieve all of the data and put my blog back together should something bad happen? The only way to know would be to periodically restore my blog from my backups. Seems awkward. So I’ll just wait until something bad happens and pray.
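
A periodic restore test of this kind can be scripted: load the dump into a scratch database and compare it against the live one. The sketch below uses Python’s built-in sqlite3 module as a stand-in, since it runs without a server; it is not Mysql, and the table name posts is made up for illustration. With Mysql, the same idea would apply to a dump file restored into a throwaway database.

```python
import sqlite3

# "Live" database: an in-memory stand-in for the blog's real tables.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
live.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [("Affordable TeraBytes",), ("Backing up your data is hard!",)],
)
live.commit()

# Backup: dump the whole database as SQL text, as a dump tool would.
dump_sql = "\n".join(live.iterdump())

# Restore the dump into a fresh scratch database.
scratch = sqlite3.connect(":memory:")
scratch.executescript(dump_sql)


def snapshot(conn):
    """Return all rows of posts, ordered by id, for comparison."""
    return conn.execute("SELECT id, title FROM posts ORDER BY id").fetchall()


# The backup is only trustworthy if the restored data matches the original.
backup_ok = snapshot(live) == snapshot(scratch)
print("backup restores cleanly:", backup_ok)
```

Comparing full row contents is fine for a small blog; for larger tables, comparing row counts or checksums per table would be a cheaper variant of the same check.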

Funny differences between Mysql and Postgresql

I hope someone can explain these funny differences between Mysql and Postgresql. (Yes, see update below.)

Here’s an easy one… What is 11/5?

select 11/5;

What should a SQL engine answer? Does anyone know? I could check, as I used to be a member of the “SQL” ISO committee, but I’m too lazy and the ISO specs are too large. Mysql gives me 2.20 whereas Postgresql gives me 2 (integer division). It seems to me that Postgresql is more in line with most programming languages (when dividing integers, use integer division).
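
For comparison, Python 3 exposes both behaviors through separate operators, so the choice is explicit rather than engine-dependent. A small illustration (Python is used here only as a neutral reference point, not as a statement about what the SQL standard requires):

```python
true_quotient = 11 / 5    # true division, like the Mysql result
floor_quotient = 11 // 5  # integer division, like the Postgresql result

print(true_quotient)   # 2.2
print(floor_quotient)  # 2
```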

It gets even weirder… how do you round 0.5? I was always taught that the answer is 1.

select round(0.5);

Mysql gives me 0 (which I feel is wrong) and Postgresql gives me 1 (which I feel is right).

On both counts, Mysql gives me an unexpected answer.

(The color scheme above for SQL statements shows I learned to program with Turbo Pascal.)

Update: Scott gave me the answer regarding Mysql’s rounding rule. It will alternate rounding up with rounding down, so

select round(1.5);

gives you 2 under Mysql. The idea is that rounding should not, probabilistically speaking, favor “up” over “down”. Physicists know this principle well. Joseph Scott also gave me the answer, and in fact he gave me quite a detailed answer on his blog. I think Joseph’s answer is slightly wrong. I don’t think Mysql uses the standard C libraries because the following code:

#include <cmath>
#include <iostream>
using namespace std;
int main() {
        // round() from the C math library rounds halfway cases away from zero
        cout << round(0.5) << endl;
        cout << round(1.5) << endl;
        return 0;
}

outputs 1 and 2 on my machine (not what Mysql gives me).
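
The two conventions are easy to compare side by side. Here is a short sketch using Python’s decimal module, which lets you pick the rounding rule explicitly. It illustrates round-half-up versus round-half-to-even in general and makes no claim about how Mysql implements round internally; the helper name round_with is mine.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN


def round_with(value, mode):
    """Round a decimal string to the nearest integer using the given mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=mode))


for v in ["0.5", "1.5", "2.5"]:
    print(v, round_with(v, ROUND_HALF_UP), round_with(v, ROUND_HALF_EVEN))
# 0.5 1 0
# 1.5 2 2
# 2.5 3 2
```

Round-half-to-even gives 0 for 0.5 and 2 for 1.5, which matches the two Mysql results reported above, while round-half-up matches the C++ output.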

The Public Referee Reports Debate

I think the wave was started by Seb this time as he hints we should consider publishing reviews when an academic paper is submitted.

A reply comes from Lance who says we should kill this idea quick. I think his counterargument is badly flawed. For example, he describes the review process as iterative:

The referees read the paper and write a report on whether to recommend the paper for publication and give suggested changes. They send this report to the editor who relays this information to the authors without revealing the identities of the referees. The authors will comment on the referee reports and revise the paper. The paper often goes back to the referees and the process repeats until the editor is happy with the paper or decides the paper is not worthy of the journal.

Here’s what I replied:

The back and forth you refer to is not present in 99% of conferences I know. You submit a paper, random reviewers write a review, you get the result, end of story. Because the selection process is based on a percentage of acceptance, even a good review is not sufficient: you need to have the random reviewers be delighted at your work. Now, I claim that more than half the reviewers don’t even read the papers. So, what do you think happens? Papers that are sufficiently fashionable get through, others get canned. The only way I can imagine to get this process fixed is to publish reviews.

Then he actually explains why he thinks public reviews are a bad idea:

The process of refereeing requires considerable back and forth conversation between the three parties: the authors, the editor and the referees. Posting the original reports will give a misleading view of the process and will cause referees to act far too cautiously about pointing out problems in a paper.

I’m sorry but how is having your name as the referee of a paper going to entice you to be lenient? How many people want to go down in history as having accepted a paper that’s flawed?

Quite the opposite from Lance, I believe that reviewers, in a public review system, have a strong incentive to be extremely hard. Nobody will care about the reviews a rejected paper got… nobody but the author… or if the review is really terrible… But people will read the reviews an accepted bad paper got and if they see that a certain individual is letting bad papers pass, his reputation will go down.

I argue quite the opposite: a public review system would make it very hard to get bad papers through.

As for the possible argument that it would make it harder to find reviewers, this argument (which Lance doesn’t push) is probably valid. You’ll have to work much harder to find capable reviewers willing to invest the time needed to produce thoughtful reviews. However, these people would get rewarded for their work since their review becomes public and hence contributes to their status in the community.