Edd Dumbill on Web 2.0

Edd Dumbill has a cool post on what the future of the Web is.

What’s hot:

  • Intellectual property and privacy law. If exchange and manipulation of data is key to the future web, then we need to understand that and be the ones in control. If corporations have too much control of data, as they are striving for, that’s the equivalent of API lock-in, and we’ll all suffer. But on the other hand, we want to tightly control the data about ourselves. An interesting conflict, about which I’d like to write more in future.
  • Transformation and annotation. Nobody’s going to own a unique hold on the form of expression of the data flying around in “web 2.0”, but they’re certainly going to want to transform between those forms. From the crude “emergent keywords” of del.icio.us to the intensive but scope-limited integration done by Google and A9, there’s going to be a lot of value in joining together previously isolated data islands.
  • Network engineering. Dumb, happy protocols that give quick results are on the rise. Look at the RSS madness, servers being pummelled. And RSS isn’t even mainstream yet, though it’s about to get that way. It gets messier before it gets better.

What’s not:

  • Complicated web service standards. Forget the WS-I lunacy. Web applications for computers were happening before the web services standards junk. Amazon would still be providing their interfaces with or without SOAP, WSDL and UDDI, and indeed all the evidence is that their users prefer to use the simpler HTTP/XML APIs anyway. As far as the web is concerned, the WS-* work is about sprinkling XML pixie dust on a failing idea.
  • Frameworks and silos. Don’t believe anyone who claims to have a wonderful new framework that’ll solve your problems if only you’d migrate everything you do to it. The web is all about separate pieces, loosely joined. The really clever businesses know how to manage uncertainty, they’re not looking to eliminate it. Circling the wagons will not integrate you into the web, neither will it promote web-like innovation inside a business.

Computing argmax fast in Python

Update: see Fast argmax in Python for the final word.

Python doesn’t come with a built-in argmax function, that is, a function that tells you where the maximum is in an array. I keep needing to write my own argmax function, so I’m getting better and better at it. Here’s the best I could come up with so far:

from itertools import izip
argmax = lambda array: max(izip(array, xrange(len(array))))[1]

The really nice thing about izip and xrange is that they don’t actually output arrays, but only lazy iterators. You also have plenty of similar functions such as imap or ifilter. Very neat.
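
For readers on Python 3, where izip and xrange are gone (zip and range are already lazy there), here is a minimal sketch of the same idea. The names argmax_pairs and argmax_key are made up for this illustration, and note that the two variants break ties differently:

```python
def argmax_pairs(array):
    # Same idea as the one-liner above: pair each value with its index
    # and take the largest pair. Tuples compare element by element, so
    # on ties the larger index wins.
    return max(zip(array, range(len(array))))[1]


def argmax_key(array):
    # Alternative: let max() walk the indexes and compare them by value.
    # No tuples are built, and on ties the first index wins.
    return max(range(len(array)), key=array.__getitem__)


print(argmax_pairs([3, 9, 2, 9]))  # prints 3
print(argmax_key([3, 9, 2, 9]))    # prints 1
```

Which tie-breaking rule is preferable depends on the application; neither is claimed here to be the fastest possible version.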

Here’s a challenge to you: can you do better? That is, can you write a faster, less memory hungry function? If not, did I find the optimal solution? If so, do I get some kind of prize?

Next, suppose you want to find the argmax while excluding some set of bad indexes. You can do it this way:

from itertools import izip, ifilter
argmaxbad = lambda array,badindexes: max(ifilter(lambda t: t[1] not in badindexes,izip(array, xrange(len(array)))))[1]

Python is amazingly easy.
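
A rough Python 3 equivalent of the filtered version could look like the sketch below. The function name argmax_excluding is hypothetical, and converting the bad indexes to a set is an assumption made here to keep each membership test cheap:

```python
def argmax_excluding(array, bad_indexes):
    # Convert once so that each "in" test is a hash lookup.
    bad = set(bad_indexes)
    # Lazily generate only the indexes that are allowed.
    candidates = (i for i in range(len(array)) if i not in bad)
    # Compare the surviving indexes by the value they point to.
    # Raises ValueError if every index is excluded.
    return max(candidates, key=array.__getitem__)


print(argmax_excluding([5, 1, 7, 3], [2]))  # prints 0
```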

As a side-note, you can also do intersections and union of sets in Python in a similar (functional) spirit:
def intersection(set1,set2): return filter(lambda s:s in set2,set1)
def union(set1, set2): return set1 + filter(lambda s:s not in set1, set2)
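
The two functions above work on lists, and each "in" test scans a whole list. As a sketch under the assumption that the elements are hashable, the same results can be had in Python 3 with a set used for lookups (the parameter names list1 and list2 are just illustrative):

```python
def intersection(list1, list2):
    lookup = set(list2)  # hash-based membership test
    return [s for s in list1 if s in lookup]


def union(list1, list2):
    seen = set(list1)
    # Keep list1 as is, then append the elements of list2 not already in it.
    return list1 + [s for s in list2 if s not in seen]


a = [1, 2, 3, 4]
b = [3, 4, 5]
print(intersection(a, b))  # prints [3, 4]
print(union(a, b))         # prints [1, 2, 3, 4, 5]
```

If the order of the elements does not matter, the built-in set type does this directly with set(a) & set(b) and set(a) | set(b).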

Update: you can do the same things with hash tables:
max(izip(hashtable.itervalues(), hashtable.iterkeys()))[1]
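
In Python 3, where itervalues and iterkeys no longer exist, one way to get the key holding the largest value is to pass the dictionary's own get method as the comparison key. This is only a sketch, and the scores dictionary is invented sample data:

```python
# Hypothetical data: any dictionary with comparable values would do.
scores = {"apple": 3, "pear": 7, "plum": 5}

# Iterating over a dict yields its keys; max() compares them by value.
best_key = max(scores, key=scores.get)

print(best_key)  # prints pear
```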

Aaron Straup Cope’s NYTimes Widgets

One of the most interesting talks we had at SWIG’04 was “Design Issues and Technical Challenges Making the Eatdrinkfeelgood Markup Language RDF”, where Aaron showed why it was hard to use RDF in an XML project. I think it all boils down to the fact that we have no good, widespread way of serializing RDF to XML. In any case, Aaron finally sent me a link to his NYTimes Widgets.

It lacks sufficient documentation for me to grok it quickly, but from what I understand, Aaron tried to create a useful and innovative RDF application. Here’s what he says about his widgets:

The New York Times includes a large amount of topical metadata with each article it publishes. These are widgets that, having harvested the data, try to do something interesting with it.

Good software engineering according to Paul Graham

Paul Graham describes what good software developers do:

In software, paradoxical as it sounds, good craftsmanship means working fast. If you work slowly and meticulously, you merely end up with a very fine implementation of your initial, mistaken idea. Working slowly and meticulously is premature optimization. Better to get a prototype done fast, and see what new ideas it gives you.

Globalization and the American IT Worker

Norman Matloff wrote a solid paper called Globalization and the American IT Worker, published in the latest issue (Nov. 2004) of Communications of the ACM. Here’s a rather bleak quote:

University computer science departments must be honest with students regarding career opportunities in the field. The reduction in programming jobs open to U.S. citizens and green card holders is permanent, not just a dip in the business cycle. Students who want technological work must have less of a mindset on programming and put more effort into understanding computer systems in preparation for jobs not easily offshored (such as system and database administrators). For instance, how many graduates can give a cogent explanation of how an OS boots up?

RSS is the Semantic Web

Here’s what Stephen Downes has to say about the Semantic Web:

RSS is the semantic web. It is not the official semantic web as I said, it is not sanctioned by any standards body or organization whatsoever. But RSS is what has emerged as the de facto description of online content, used by more than four million sites already worldwide, used to describe not only resources, but people, places, objects, calendar entries, and in my way of thinking, learning resources and learning objects.

What makes RSS work is that it approaches search a lot more like Google and a lot less like the Federated search described above. Metadata moves freely about the internet, is aggregated not by one but by many sources, is recombined, and fed forward. RSS is now used to describe the content of blogs, and when aggregated, is the combining of blog posts into new and novel forms. Sites like Technorati and Bloglines, Popdex and Blog Digger are just exploring this potential. RSS is the new syntax, and the people using it have found a voice.

TOOL: The Open Opinion Layer

Here’s an interesting paper by Hassan Masum, TOOL: The Open Opinion Layer. Here’s the abstract:

Shared opinions drive society: what we read, how we vote, and where we shop are all heavily influenced by the choices of others. However, the cost in time and money to systematically share opinions remains high, while the actual performance history of opinion generators is often not tracked.

This article explores the development of a distributed open opinion layer, which is given the generic name of TOOL. Similar to the evolution of network protocols as an underlying layer for many computational tasks, we suggest that TOOL has the potential to become a common substrate upon which many scientific, commercial, and social activities will be based.

Valuation decisions are ubiquitous in human interaction and thought itself. Incorporating information valuation into a computational layer will be as significant a step forward as our current communication and information retrieval layers.

Follow-up on “Sébastien Paquet on blogs and wikis”

In one of my previous posts, commenting on the fact that technology had dramatically changed learning, I predicted that in 5 years, it would be an accepted fact that some university courses are better taught using mostly technology and very little live input from an instructor…

I had one reply from an anonymous Scott (but I know who you are!) which is worth citing in full here:

If you equate learning with time spent in school, then I tend to agree (…) there is a lot of value in the traditional methods, and we would be foolish to replace them with untested modern contrivances (bordering here on the “computers in schools” debate).

But if you view learning as a continuous experience that is not confined to attendance at institutionalized schools, then I wholeheartedly agree (…). I left the research world for five years (1997-2002), and was astounded when I returned at how the process of dissemination and discovery has been completely transformed by the Internet. Academic discourse these days is utterly dependent on electricity.

And I see elements of the same transformation in schools at every level. I run a couple of historical Web sites, and I receive an endless stream of questions from students doing projects. But I think the more important observation is that students (of all ages) are applying the information discovery skills they learn in school (and on their own) to other activities. Here’s an example. Our 15 year old TV and two 15 year old VCRs all decided to die in October. So we’re now faced with the daunting task of choosing new technology. Buying a TV used to be easy: you chose your size, identified some trusted brands, then picked a cabinet to match your decor. These days, you have to choose between CRT vs LCD vs projection vs plasma, 16:9 vs 4:3, HD capable or not, progressive scan vs interlace, presence of RF/composite/s-video/component connectors, etc. And that’s just the TV. What about a replacement VCR? Really, you want a DVD recorder that plays about a dozen disk formats, and you also have to think about future requirements for networked content delivery throughout the house, and compatibility between all the components. The combinatorial explosion is overwhelming. I know that my father could no longer make a choice of television that was better than a random guess. So I asked the sales guy how much time they have to devote to educating customers these days regarding all the options, expecting to hear that people generally feel as overwhelmed as I do. But the answer was quite the opposite. The guy said that most customers (of all ages – I specifically asked about age) come to the store with a comprehensive understanding of the options. Not only do they understand the choices (often following guides from such places as Consumer Reports), but they come armed with reviews from epinions.com, advice from discussion forums, (wikis and blogs?), etc. The sales guy said that as often as not they learn something from the customer.

If that isn’t a fundamental (and welcome) change in how people learn, I don’t know what is. It suggests to me a process of continuous and pervasive learning that I rarely saw emerge from traditional schools. Yet that’s the culture that today’s children are experiencing. I don’t know if calculus teachers will be obsolete in five years, but neither do I see the pace of change slowing any time soon. If anything, I expect it to accelerate as technologies for continuous social communication and global network access (cell/PDA/SMS/IM/etc) go mainstream.

Wal-Mart’s Data Obsession

According to this Slashdot thread, Wal-Mart has 460 terabytes of data stored on Teradata mainframes at its Bentonville headquarters. That’s something like 2^49 bytes if my computations are exact. That’s about 10,000 average hard drives (at 50 GB each) or 1,500 large hard drives (at 300 GB each). Suppose you want to store a matrix having 10 million rows by 10 million columns. Assume you store 4 bytes in each cell. That’s pretty much what Wal-Mart has in storage. Another way to put it is that you can store 400 KB of data every second for 30 years and not exceed this storage capacity. If you had Wal-Mart’s storage, you could pretty much make a video of your entire life. The Internet Wayback Machine is reported to have about twice this storage.
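
For anyone who wants to check these back-of-the-envelope figures, here is a small sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and 365-day years, neither of which is stated above:

```python
import math

# Assumed decimal units; binary units would shift the numbers slightly.
TB, GB, KB = 10**12, 10**9, 10**3

walmart_bytes = 460 * TB

log2_bytes = math.log2(walmart_bytes)        # about 48.7, so roughly 2^49
drives_50gb = walmart_bytes / (50 * GB)      # 9,200 drives
drives_300gb = walmart_bytes / (300 * GB)    # about 1,533 drives

# 10 million x 10 million cells at 4 bytes each: 4e14 bytes = 400 TB.
matrix_bytes = 10_000_000 * 10_000_000 * 4

# 400 KB every second for 30 years: about 3.78e14 bytes = 378 TB.
seconds_30_years = 30 * 365 * 24 * 3600
stream_bytes = 400 * KB * seconds_30_years

print(round(log2_bytes, 1), round(drives_50gb), round(drives_300gb))
print(matrix_bytes / TB, stream_bytes / TB)
```

Both the matrix and the 30-year stream come in a little under 460 TB, which is consistent with the rough comparisons in the text.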

One can only assume that storage will increase considerably in coming years. The impact this will have on our lives is still unknown, but the fact that Gmail offers 1 GB of storage for free right now is telling: almost infinite storage is right around the corner, and it will change the way we think about information technology.