Data has natural layouts:

  • text is written from the first to the last word,
  • database tables are written one row at a time,
  • Google presents results one document at a time,
  • the early recommender systems compared users to other users,
  • discussions are organized in newsgroups and posting boards by topic,
  • research papers are organized in journals and conferences,
  • objects have attributes (a ball is red), and from these attributes we determine similarities between objects.

In database terminology, these are horizontal layouts.

We can rotate these models to create vertical layouts:

  • Instead of writing text sequentially, we can store the locations of each word in an inverted index.
  • Instead of writing database tables row-by-row (e.g., Oracle, MySQL), we can write databases column by column (e.g., C-Store/Vertica, LucidDB, Sybase IQ, and my Lemur Bitmap Index Library).
  • Instead of presenting result sets one document at a time, we present tag clouds and use faceted search to support exploration. Thus, instead of listing documents, we focus on attributes (date, topic, author).
  • Recommender systems are often more scalable when they compare items instead of users: the most famous example is Greg Linden’s Amazon recommender (if you liked this book, you may like…). For example, the Slope One algorithms outperform many user-to-user algorithms.
  • The social web started out with topic-oriented newsgroups and posting boards, but it is now dominated by user-centric blogs and social sites (such as Facebook or Twitter). Since then, we have realized that user-oriented blogs can be preferable.
  • While research papers are published in conferences and journals, I argue that we should turn this around and organize research papers by author through author-specific feeds.
  • Some AI researchers are suggesting that relations might be primary whereas attributes would be secondary.
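
To make the item-to-item bullet above concrete, here is a minimal Python sketch of the weighted Slope One scheme. The function names (`slope_one_deviations`, `predict`) and the toy ratings are assumptions made up for illustration, not taken from any particular library or dataset. The point is that the model is built from item pairs rather than from user pairs.

```python
from collections import defaultdict

def slope_one_deviations(ratings):
    """ratings maps user -> {item: rating}.

    Returns (dev, count), where dev[j][i] is the average of
    (rating of j - rating of i) over the users who rated both items,
    and count[j][i] is the number of such users.
    """
    diff_sum = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for j, rating_j in user_ratings.items():
            for i, rating_i in user_ratings.items():
                if i != j:
                    diff_sum[j][i] += rating_j - rating_i
                    count[j][i] += 1
    dev = {j: {i: diff_sum[j][i] / count[j][i] for i in diff_sum[j]}
           for j in diff_sum}
    return dev, count

def predict(user_ratings, target, dev, count):
    """Weighted Slope One prediction of `target` for one user.

    Returns None when no item rated by the user co-occurs with `target`.
    """
    numerator = 0.0
    denominator = 0
    for i, rating_i in user_ratings.items():
        if i == target:
            continue
        c = count.get(target, {}).get(i, 0)
        if c:
            numerator += (dev[target][i] + rating_i) * c
            denominator += c
    return numerator / denominator if denominator else None

# Hypothetical toy data: three users, three items.
ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}
dev, count = slope_one_deviations(ratings)
lucy_a = predict(ratings["Lucy"], "A", dev, count)  # 13/3, about 4.33
```

The deviations depend only on pairs of items, so they can be precomputed once and reused for every user.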

Many of the best solutions are hybrids. For example, text search sometimes requires full-text indexes such as suffix arrays, and Oracle recently announced a row/column hybrid.

Take-away message: If you are stuck, try to rotate your data model. If neither the vertical nor the horizontal model is a good fit, create a hybrid.
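
As a rough sketch of what rotating a layout can mean in practice, the Python snippet below turns sequential text into an inverted index and turns row-oriented records into columns. The helper names (`inverted_index`, `rows_to_columns`) and the sample data are hypothetical, and the column helper assumes every row has the same fields.

```python
from collections import defaultdict

def inverted_index(text):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)

def rows_to_columns(rows):
    """Turn a list of row dicts into one list of values per column.

    Assumes all rows share the same keys, so that the k-th entry of
    every column belongs to the k-th row.
    """
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)

# Horizontal: words in reading order. Vertical: positions per word.
index = inverted_index("the cat saw the dog")
# index["the"] == [0, 3]

# Horizontal: one record per object. Vertical: one list per attribute.
rows = [
    {"name": "ball", "color": "red"},
    {"name": "cube", "color": "blue"},
]
columns = rows_to_columns(rows)
# columns["color"] == ["red", "blue"]
```

Both helpers keep the same information and only change what is stored contiguously, which is what makes a question like "where does this word occur?" or "what are all the colors?" cheap to answer.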

Too many research papers in Computer Science are nonsense: they convey no worthy message. Yet they pass a Turing test of sorts: at a glance, they are indistinguishable from interesting research papers. In fact, they are designed as nonsense from the beginning: the authors mimic the output of good research. The goal is to appear to be doing valued research. Some of these papers appear in top conferences, and even go on to be highly cited.

What is the difference between a Stephen King novel and the average horror manuscript? At first glance, probably not much. After all, it is all a matter of taste. Yet if you have read enough novels, you can recognize King’s mastery. Avid readers know who the best writers are and they stick with them. Getting a manuscript accepted by King’s publisher is not sufficient to get the attention of his readers. People care first about the author. Stephen King is the brand.

In science, we publish by venue, by journal and by topic. Yet if we followed specific researchers, the same way we follow specific writers, these researchers would have a strong incentive to produce interesting work. It would not be profitable to write many uninteresting research papers since nobody would follow your work.

The idea is neither new nor original. Back in 2005, I urged researchers to set up RSS feeds for their publications. arXiv recently made author feeds available. Creating RSS feeds is technically easy. We have no excuse.
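
To illustrate how little is involved, here is a minimal Python sketch that builds an RSS 2.0 feed of publications using only the standard library. The function name `build_feed`, the author name, and the example.org URLs are placeholders made up for this example. A real feed would likely add publication dates and abstracts.

```python
import xml.etree.ElementTree as ET

def build_feed(author, home_url, papers):
    """Return an RSS 2.0 document (as a string) listing an author's papers.

    `papers` is a list of dicts with "title" and "url" keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, a link and a description for the channel.
    ET.SubElement(channel, "title").text = f"Research papers by {author}"
    ET.SubElement(channel, "link").text = home_url
    ET.SubElement(channel, "description").text = f"New publications from {author}"
    for paper in papers:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = paper["title"]
        ET.SubElement(item, "link").text = paper["url"]
    return ET.tostring(rss, encoding="unicode")

# Hypothetical author and papers, for illustration only.
feed_xml = build_feed(
    "Jane Doe",
    "https://example.org/~jdoe/",
    [
        {"title": "A first paper", "url": "https://example.org/~jdoe/p1.pdf"},
        {"title": "A second paper", "url": "https://example.org/~jdoe/p2.pdf"},
    ],
)
```

Writing `feed_xml` to a file served next to a publication list is enough for feed readers to pick up new papers.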

Please: if you want me to read you, make it easy to receive notice whenever you publish a new research paper. If enough researchers do it, we might improve the quality of science significantly. Why wouldn’t you want people to follow your work?

Further reading: Scholarly Communications must be Syndicated by Gideon Burton

Eating my own dog food: You can subscribe to my research papers by email or through a reader.

PLoS One is a new peer-reviewed journal (2006) with many interesting features:

Unfortunately, for a Computer Scientist, it is not yet attractive:

  • The Computer Science section is filled with biology and medicine papers making use of Information Technology. In other words, the PLoS One taxonomy confuses Information Technology and Computer Science! Thankfully, I could find one article in Natural Language Processing which might be the first and only Computer Science paper published in PLoS One. So there is hope.
  • As a related point, PLoS One is not indexed at the usual places as a Computer Science journal (DBLP, ACM DL, and so on). Of course, no Computer Science indexing is possible until PLoS One correctly classifies the Computer Science articles.

If they could fix these problems, I would gladly submit some of my work to them. PLoS One could become a useful journal in Computer Science over time. What about prestige? PLoS One uses article-level metrics. Instead of trying to be a prestigious journal, PLoS One helps you measure the impact of your own papers.
