Computing argmax fast in Python

Update: see Fast argmax in Python for the final word.

Python doesn’t come with an argmax function built-in. That is, a function that tells you where a maximum is in an array. And I keep needing to write my own argmax function so I’m getting better and better at it. Here’s the best I could come up with so far:

from itertools import izip
argmax = lambda array: max(izip(array, xrange(len(array))))[1]

The really nice thing about izip and xrange is that they don’t actually output arrays, but only lazy iterators. You also have plenty of similar functions such as imap or ifilter. Very neat.

Here’s a challenge to you: can you do better? That is, can you write a faster, less memory hungry function? If not, did I find the optimal solution? If so, do I get some kind of prize?

Next, suppose you want to find argmax, but excluding some set of bad indexes, you can do it this way…

from itertools import izip
argmaxbad = lambda array,badindexes: max(ifilter(lambda t: t[1] not in badindexes,izip(array, xrange(len(array)))))[1]

Python is amazingly easy.

As a side-note, you can also do intersections and union of sets in Python in a similar (functional) spirit:
def intersection(set1,set2): return filter(lambda s:s in set2,set1)
def union(set1, set2): return set1 + filter(lambda s:s not in set1, set2)

Update: you can do the same things with hash tables:
max(izip(hashtable[max].itervalues(), hashtable[max].iterkeys()))[1]

Aaron Straup Cope’s NYTimes Widgets

One of the most interesting talk we had at SWIG’04 was “Design Issues and Technical Challenges Making the Eatdrinkfeelgood Markup Language RDF” where Aaron showed why it was hard to use RDF in a XML project. I think it all boils down to the fact that we have no good widespread way of serializing RDF to XML. In any case, Aaron finally sent me a link to his NYTimes Widgets.

It lacks sufficient documentation for me to grok it quickly, but from what I understand, Aaron tried to create a useful and innovative RDF application. Here’s what he says about his widgets:

The New York Times includes a large amount of topical metadata with each article it publishes. These are widgets that, having harvested the data, try to do something interesting with it.

Good software engineering according to Paul Graham

Paul Graham describes what good software developers do:

In software, paradoxical as it sounds, good craftsmanship means working fast. If you work slowly and meticulously, you merely end up with a very fine implementation of your initial, mistaken idea. Working slowly and meticulously is premature optimization. Better to get a prototype done fast, and see what new ideas it gives you.

Globalization and the American IT Worker

Norman Matloff wrote a solid paper called Globalization and the American IT Worker, published in the latest issue (Nov. 2004) of Communications of the ACM. Here’s a rather bleak quote:

University computer science departments must be
honest with students regarding career opportunities
in the field. The reduction in programming jobs
open to U.S. citizens and green card holders is per-
manent, not just a dip in the business cycle. Students
who want technological work must have less of a
mindset on programming and put more effort into
understanding computer systems in preparation for
jobs not easily offshored (such as system and data-
base administrators). For instance, how many gradu-
ates can give a cogent explanation of how an OS
boots up?

RSS is the Semantic Web

Here’s what Stephen Downes has to say about the Semantic Web:

RSS is the semantic web. It is not the official semantic web as I said, it is not sanctioned by any standards body or organization whatsoever. But RSS is what has emerged as the de facto description of online content, used by more than four million sites already worldwide, used to describe not only resources, but people, places, objects, calendar entries, and in my way of thinking, learning resources and learning objects.

What makes RSS work is that it approaches search a lot more like Google and a lot less like the Federated search described above. Metadata moves freely about the internet, is aggregated not by one but by many sources, is recombined, and fed forward. RSS is now used to describe the content of blogs, and when aggregated, is the combining of blog posts into new and novel forms. Sites like Technorati and Bloglines, Popdex and Blog Digger are just exploring this potential. RSS is the new syntax, and the people using it have found a voice.

TOOL: The Open Opinion Layer

Here’s an interesting paper by Hassan Masum, TOOL: The Open Opinion Layer. Here’s the abstract:

Shared opinions drive society: what we read, how we vote, and where we shop are all heavily influenced by the choices of others. However, the cost in time and money to systematically share opinions remains high, while the actual performance history of opinion generators is often not tracked.

This article explores the development of a distributed open opinion layer, which is given the generic name of TOOL. Similar to the evolution of network protocols as an underlying layer for many computational tasks, we suggest that TOOL has the potential to become a common substrate upon which many scientific, commercial, and social activities will be based.

Valuation decisions are ubiquitous in human interaction and thought itself. Incorporating information valuation into a computational layer will be as significant a step forward as our current communication and information retrieval layers.

Follow-up on “Sébastien Paquet on blogs and wikis”

In one my previous post commenting on the fact that technology had changed dramatically learning, I predicted that in 5 years, it would be an accepted fact that some university courses are better taught using mostly technology and very little live input from an instructor…

I had one reply from an anonymous Scott (but I know who you are!) which is worth citing in full here:

If you equate learning with time spent in school, then I tend to agree (…) there is a lot of value in the traditional methods, and we would be foolish to replace them with untested modern contrivances (bordering here on the “computers in schools” debate).

But if you view learning as a continuous experience that is not confined to attendance at institutionalized schools, then I wholeheartedly agree (…). I left the research world for five years (1997-2002), and was astounded when I returned at how the process of dissemintation and discovery has been completely transformed by the Internet. Academic discourse these days is utterly dependent on electricity.

And I see elements of the same transformation in schools at every level. I run a couple of historical Web sites, and I receive an endless stream of questions from students doing projects. But I think the more important observation is that students (of all ages) are applying the information discovery skills they learn in school (and on their own) to other activities. Here’s an example. Our 15 year old TV and two 15 year old VCRs all decided to die in October. So we’re now faced with the daunting task of choosing new technology. Buying a TV used to be easy: you chose your size, identified some trusted brands, then picked a cabinet to match your decor. These days, you have to choose between CRT vs LCD vs projection vs plasma, 19:6 vs 4:3, HD capable or not, progressive scan vs interlace, presence of RF/composite/s-video/component connectors, etc. And that’s just the TV. What about a replacement VCR? Really, you want a DVD recorder that plays about a dozen disk formats, and you also have to think about future requirements for networked content delivery throughout the house, and compatibility between all the components. The combinatorial explosion is overwhelming. I know that my father could no longer make a choice of television that was better than a random guess. So I asked the sales guy how much time they have to devote to educating customers these days regarding all the options, expecting to hear that people generally feel as overwhelmed as I do. But the answer was quite the opposite. The guy said that most customers (of all ages – I specifically asked about age) come to the store with a comprehensive understanding of the options. Not only do they understand the choices (often following guides from such places as Consumer Reports), but they come armed with reviews from, advice from discussion forums, (wikis and blogs?), etc. The sales guy said that as often as not they learn something from the customer.

If that isn’t a fundamental (and welcome) change in how people learn, I don’t know what is. It suggests to me a process of continuous and pervasive learning that I rarely saw emerge from traditional schools. Yet that’s the culture that today’s children are experiencing. I don’t know if calculus teachers will be obsolete in five years, but neither do I see the pace of change slowing any time soon. If anything, I expect it to accelerate as technologies for continuous social communication and global network access (cell/PDA/SMS/IM/etc) go mainstream.

Wal-Mart’s Data Obsession

According to this Slashdot thread, Wal-Mart has 460 terabytes of data stored on Teradata mainframes, at its Bentonville headquarters. That’s something like 249 if my computations are exact. That’s about 10,000 average hard drives (at 50 GB each) or 1,500 large hard drives (at 300 GB each). Suppose you want to store a matrix having 10 millions rows by 10 millions columns. Assume you store 4 bytes in each cell. That’s pretty much what Walmart has in storage. Another way to put it is that you can store 400 KB of data every second for 30 years and not exceed this storage capacity. If you had Walmart’s storage, you could pretty much make a video of your entire life. The Internet Wayback machine is reported to have about twice this storage.

One can only assume that storage will increase considerably in coming years. The impact this will have on our lives is still unknown, but the fact that gmail offers 1 GB of storage for free right now is telling: almost infinite storage is right around the corner and it will change the way we think about information technology.

Sébastien Paquet on blogs and wikis

As pointed out by Nicolas, Sébastien Paquet was giving a talk on Friday. He talked about blogs and wikis for collaborative learning. As an interesting sidenote, the famous Gilles Brassard attended Sébastien’s talk since Gilles was Sébastien’s thesis supervisor. As reported by Nicolas, I think a number of professors do not like the bottom-up nature of these tools.

I think it is quite natural to get these reactions. Blogs and wikis are part of a larger trend where central control is being neutralized. I think the Web in general will eventually have a very deep effect on society: my son will live in world where individuals are in charge much more than companies or governments. Universities will be hit hard too: I see students taking progressively charge of their learning. Currently, we decide what my son eats, but he started to protest when he doesn’t like something, and soon, he will alone decide what he eats. I think technology will progressively put students more and more in charge. I don’t think this will diminish the need for university professors, but their role will change from spoon-feeding students to providing guidance.

I predict that in 5 years, students all over the world will learn Calculus with little input from from instructors (but a lot of input from other students!). They will use sophisticated on-line laboratories and on-line testing, and on-line support. The technology is already here, but we still don’t know how to use it properly.

Update: Jeff Erickson seems to predict that in 5 years, I’ll still be predicting that in 5 years learning will change. Well, he is right. Learning is changing all the time. The role of university professors has changed quite a bit with technology, whether we care to admit it or not. What I’m saying in this post is that the Web has reached sufficient maturity now that it will soon be able to do away (mostly) with instructors in some of the more spoon-feeding courses.

Cringely on Microsoft

Cringely has nice things to say about Microsoft this time around. The gist of his argument is that Microsoft is a better company because it is lean and mean. I have no idea how lean and mean Microsoft is, but I buy it. I remember seeing a picture of the entire Windows dev. team, and it was a reasonably small team (around 15 people). I think that Microsoft is not the type of company that would waste money on layers upon layers of management. I always felt that in an IT project, if you can’t code a script or install a database, you have no business being around the table.

The IBMs, EDSes, Lockheed-Martins, Computer Science Corps, and Boeings of the IT world make their money based on head count, while Microsoft makes its money primarily from software licenses. If any of those other big companies were to win the NHS IT contract, they’d budget 30 to 40 percent of the total amount for management, which is to say for doing very little, if any, real work on the project. They would throw a couple thousand bodies at the project whether those bodies were really needed or not. Microsoft, on the other hand, will put a dozen key developers on the project with probably two managers above them. They’ll get the work done in less time and for less money because they don’t have to carry the baggage of that old business model.

Backing up your data is hard!

I lost the last 3 months of data on my Palm. People who know me can imagine my face right now. I take all my notes on my Palm. It is my primary data source/data management tool.

It is stupid, really. Every time there is some quarters in my pocket, the Palm goes crazy. I think the metal from the quarters does something to the Palm. If there is a good engineer out there, maybe he could explain what happens? I have a Palm m500, and it comes with a builtin cover. In theory, a quarter should do nothing, but in practice, it destroys the Palm: I have to do a hard reset, losing all data.

Anyhow, today, I bought coffee and dropped a quarter in my Palm pocket (I always carry my Palm). I could tell it was dead when I picked it up and the green light was permanently bright and the unit unresponsive.

The good thing about the Palm m500 is that it has excellent (several weeks) battery life. It has no color, no Wifi, no… just plain Palm software. The downside is that when you lose your data, well, you might have an old backup.

In my case, things got worse. Some months ago, I switched from coldsync, a solid Linux Desktop Palm syncing software, to KPilot, a shiny, but buggy Palm syncing software.The nice thing about KPilot is that it is integrated with KDE, so you get calendar and address book syncing for free.

However, I just learned that KPilot doesn’t have reliable backups. While I had synced two weeks ago, I was only able to retrieve my data up until September 15th. It might be the date I upgraded KDE from 3.2 to 3.3, I don’t know…

Now, I tried tweaking KPilot so that it would do better backups, I couldn’t find what could possibly have gone wrong, but I tried switching options randomly… but how do I know whether it has reliable backups now?

I have no way to know. I could switch back to coldsync, but coldsync appears to be frozen in time.

This is a general problem. For example, the Mysql tables for this blog are dumped and saved carefully on another machine. How do I know that I could retrieve all of the data and put back my blog together should something bad happen? Only way to know would be to periodically rollback my blog from my backups. Seems awkward. So I’ll just wait until something bad happens and pray.

Funny differences between Mysql and Postgresql

I hope someone can explain these funny differences between Mysql and Postgresql. (Yes, see update below.)

Here’s an easy one… What is 11/5?

select 11/5;

What should a SQL engine answer? Anyone knows? I could check as I used to be a member of the “SQL” ISO committee, but I’m too lazy and the ISO specs are too large. Mysql gives me 2.20 whereas Postgresql gives me 2 (integer division). It seems to me like Postgresql is more in line with most programming language (when dividing integers, use integer division).

It gets even weirder… how do you round 0.5? I was always taught that the answer is 1.

select round(0.5);

Mysql gives me 0 (which I feel is wrong) and Postgresql gives me 1 (which I feel is right).

On both counts, Mysql gives me an unexpected answer.

(The color scheme above for SQL statements shows I learned to program with Turbo Pascal.)

Update: Scott gave me the answer regarding Mysql rounding rule. It will alternate rounding up with rounding down, so

select round(1.5);

gives you 2 under Mysql. The idea is that rounding should not, probabilistically speaking, favor “up” over “down”. Physicists know this principle well. Joseph Scott also gave me the answer, and in fact he gave me quite a detailed answer on his blog. I think Joseph’s answer is slightly wrong. I don’t think Mysql uses the standard C librairies because the following code:

#include <cmath>
#include <iostream>
using namespace std;
int main() {
        cout  << round(0.5) << endl;
        cout  << round(1.5) <<endl;

outputs 1 and 2 on my machine (not what Mysql gives me).

The Public Referee Reports Debate

I think the wave was started by Seb this time as he hints we should consider publishing reviews when an academic paper is submitted.

A reply comes from Lance who says we should kill this idea quick. I think his counterargument is badly flawed. For example, he describes the review process as iterative:

The referees read the paper and write a report on whether to recommend the paper for publication and give suggested changes. They send this report to the editor who relays this information to the authors without revealing the identities of the referees. The authors will comment on the referee reports and revise the paper. The paper often goes back to the referees and the process repeats until the editor is happy with the paper or decides the paper is not worthy of the journal.

Here’s what I replied:

The back in forth you refer to is not present in 99% of conferences I know. You submit paper, random reviewers write a review, you get the result, end of story. Because the selection process is based on a percentage of acceptance, even a good review is not sufficient: you need to have the random reviewers be delighted at your work. Now, I claim that more than half the reviewers don’t even read the papers. So, what do you think happens? Papers who are sufficiently fashionable get through, others get canned. The only way I can imagine to get this process fixed is to publish reviews.

Then he actually explains why he thinks public reviews are a bad idea:

The process of refereeing requires considerable back and forth conversation between the three parties: the authors, the editor and the referees. Posting the original reports will give a misleading view of the process and will cause referees to act far too cautiously about pointing out problems in a paper.

I’m sorry but how is having your name as the referee of a paper going to entice you to be lenient? How many people wants to go down in history as having accepted a paper that’s flawed?

Quite the opposite from Lance, I believe that reviewers, in a public review system, have a strong incentive at being extremely hard. Nobody will care about the reviews a rejected paper got… nobody but the author… or if the review is really terrible… But people will read the reviews an accepted bad paper got and if they see that a certain individual is letting bad papers pass, his reputation will go down.

I argue quite the opposite: a public review system would make it very hard to get bad papers through.

As for the possible argument that it would make it harder to find reviewers. This argument (which Lance doesn’t push) is probably valid. First of all, you’ll have to work much harder to find capable reviewers willing to invest the time needed to produce thoughtful reviews. However, these people would get rewarded for their work since their review becomes public and hence, contribute to their status in the community.

Being a Nice Researcher and the Real World: pure, applied, and industrial research

This morning, I am deeply upset. Some of you who know me will know why. No, I don’t care so much that Buch was elected, though it does puzzle me. Read through, you might find out why I’m upset.

I did my Ph.D. with the intent of getting into “industrial research”. Yes, there are different types of research. You have “pure” research where people have no idea why they do the research except that it looks nice. For example, solving for all algebra having property X is pure research. You have “applied” research which is often closely related to “pure” research except that the topic is a bit closer to “real world” concerns. For example, finding a new way to solve Navier-Stokes equations is applied research. It should be noted that “applied” research doesn’t equate with “useful in the real world” research. In fact, it could very well be that some “pure” research is more applicable in the real world. Finally, you have “industrial” research. Industrial research is meant to be useful in the real world. It is the primary purpose of such research. The excitement comes not from elegance alone, but mostly by solving real problems people have. You might say that some people working on the linux kernel are industrial researchers, at least when they innovate. A given researcher may cover all three research types, but he may switch hat depending on the project he is working on.

All types of research are equally worthy, but they are not equal in all things. For example, “pure” research might give you a lot of prestige is some universities. Pure researchers have “pure” concerns and it is often thought that only the smarter researchers can be pure researchers. Applied researcher have often an edge when it comes time to get some research funding. That’s because the applied researcher can easily justify that his work might be applicable in the real world. Finally, the industrial researcher might be looked down upon by some people: because industrial research must be tied closely to the real world, it will sometimes trade fashion or elegance for convenience or applicability. Often, “industrial” research may appear to be simpler, maybe easier. Believe me, it is not easier. On the other hand, the “industrial” researcher can really license technology or even start a company. This, in turn, may bring tremendous leverage to such a researcher. Unfortunately, these things take time, and in a university setting where tenure should be first on your mind, the industrial researcher might have a harder time. Hence, in many schools, most professors are pure or applied researchers. Fortunately, the industrial researcher has a broader choice of employment.

This is actually a very interesting topic. Many questions may come to mind… For example, is there any such thing as industrial research in mathematics? You bet! See SIAM.

Now, that you see what I mean by an industrial researcher, you might understand that one of his strength is when dealing with companies. He is able to understand the concerns of their engineers and his research accounted for many of these concerns already. However, for him, it is crucial that people do not get in the way between him and a company. He needs, he wants technology transfert.

That’s why I’m upset today: I feel someone pushed me aside and got in the way. People are always eager to take someone else’s work and then ignore this person when money is involved and the author is not being difficult. I hope this will get fixed, but either way, I’ve been reminded that the industrial researcher must keep a very close eye and much control on the technology transfert process. And not be nice.

At Cross Purposes: What the experiences of today’s doctoral students reveal about doctoral education

Here’s a report released in 2001 on the state of doctoral education. It looks like a serious study and the conclusion is scary:

What we learned may not be entirely surprising because our findings confirm many of the concerns that have been raised in the last 10 years. However, our data provide detailed, confirmatory evidence of particular tension points. We found that:

  • The training doctoral students receive is not what they want, nor does it prepare them for the jobs they take.
  • Many students do not clearly understand what doctoral study entails, how the process works and how to navigate it effectively.

The Alps

Found an interesting UK Indie Rock bad on inDiscover: The Alps. They have a number of freely available songs, but I only listened to “The Shining”.

There are also some new bands to explore on inDiscover: Steel Poniez (bunch of ladies) and School for the Dead (bunch of dudes).

Go! Download now! It is free! It is legal! It is fun! And be sure to rate these artists so that you can help others find the good ones!

McGrath on XML usage for Web clients

Interesting post by Sean McGrath (not the inDiscover‘s Sean McGrath, the Propylon’s Sean McGrath) on how Gmail (Google Mail) was designed. For those who don’t know Gmail is a revolutionary Web mail service à la Hotmail, a step beyond anything else I had ever seen. He explains that Gmail is thin client running thanks to javascript (and not Java!!!).

This bring him to raise an interesting question. Why doesn’t Gmail sends XML back and forth? Indeed, isn’t XML the data format of the Web? Here’s what he has to say:

Web clients carry around a basic, low level programming language called Javascript. The real beauty of Javascript is that it is dynamic – you can blurr the distinction between code and data. You can hoist the level of abstraction you work with in your app by layering domain specific concepts on top of it in the form of functions and data structures. You can sling across data structures already teed up for use on the other end with the aid of the magic of “eval”. You can implement complex behaviour by sending across a program to be run rather than trying to explain what you want done declaratively to the other side.

Now, in such a world – would you send XML data to and from? Developers with a static typing programming language background might be inclined to say yes but I suspect javascriptophiles, lispers, pythoneers and rubyites are more likely to say no. Reason being, it is so much more natural to exchange lumps of code – mere text strings remember – that can be eval’ed to re-create the data structure you have in the XML.

I think he is very much on target in the sense that people who see everything as Java or C# are likely to perceive XML very much differently from people using higher level languages.

The lesson here people is that you should master a range of languages and not one or two. And no, taking a class in Haskell once in your life doesn’t qualify.

Tim Bray opposing Web Services

Tim Bray who invented XML among other things, takes a stand against Web Services. Here’s what he says:

No matter how hard I try, I still think the WS-* stack is bloated, opaque, and insanely complex. I think it’s going to be hard to understand, hard to implement, hard to interoperate, and hard to secure.

I look at Google and Amazon and EBay and Salesforce and see them doing tens of millions of transactions a day involving pumping XML back and forth over HTTP, and I can’t help noticing that they don’t seem to need much WS-apparatus.

I’m deeply suspicious of “standards” built by committees in advance of industry experience, and I’m deeply suspicious of Microsoft and IBM, and I’m deeply suspicious of multiple layers of abstraction that try to get between me and the messages full of angle-bracketed text that I push around to get work done.

It should be noted that Tim has recently taken a job with Sun Microsystems. His current employer is very actively involved in Web Services, so I believe he takes this stand despite the current interest of his employer.

So, you want to do a Ph.D.?

Seb sent me this extract of a book. The extract is called So, you want to do a Ph.D.? As usual with this sort of book, it is delightful.

Here’s a fun quote:

One thing which is seldom mentioned is what happens to you after you finish the PhD. A classic story is as follows. A student focuses clearly, submits the thesis and starts looking for a lecturing job, only to discover that they need two years of lecturing experience and preferably a journal publication as well if they are to be appointable for a job in a good department in their field. If they had known this two years previously, they could have started doing some part-time lecturing and submitted a paper or two to a journal.

I haven’t read the entire book, of course, and I’m somewhat worried that the book might not be sufficiently focused on why one does a Ph.D. and might be a tad too cynical. Learning the rules is very nice and very important, and I wished I had learned them when it was time. However, there is also the issue of figuring out whether these rules make sense, and knowing when to break them. Well, I guess that learning the rules to begin with is a very good start.