Oracle and MySQL — is MySQL in a weak position?

Oracle has recently bought Innobase, the company behind InnoDB, a storage engine MySQL relies upon for storing its tables. One user on Slashdot had the following insightful comment:

Among the technologies that MySQL licenses from third parties under commercial redistribution licenses:

Berkeley DB (Sleepycat Software)
InnoDB (Oracle, formerly Innobase)
MaxDB (SAP AG)

See the problem? MySQL itself is largely a language parser and a simple and technically inadequate storage engine (for anything where data integrity matters). In other words they don’t own any of the foundations of their technologies.

This is interesting. We always encourage developers to use and reuse existing libraries. Should MySQL be blamed for doing so?

The comparison with PostgreSQL is interesting. PostgreSQL is developed in a decentralized way, whereas MySQL is developed by a single company that relies on third-party libraries.

I think that MySQL could definitely be a fragile product whose development could be impaired by various business decisions. However, I think this has nothing to do with the fact that MySQL relies on libraries it hasn’t written, and everything to do with the fact that there is no community of MySQL developers.

Free Software is not a cure for world hunger. However, building software with a highly distributed community might very well be the best possible way to develop generic software.

Analyzing Large Collections of Electronic Text Using OLAP at APICS 2005

Steven will be presenting our paper Analyzing Large Collections of Electronic Text Using OLAP at APICS 2005. This work is based on an idea by Owen Kaser: what happens if we apply multidimensional databases (OLAP) to literary research?

Data Mining and Information Retrieval techniques are used routinely for literary research or processing text in general, but decision support techniques commonly used in the business world (sometimes called “Business Intelligence”) have not seen much use yet in text processing. The main difference between decision support systems and data mining is the fact that in decision support, the user remains in control, thus simple yet extremely efficient algorithms are favoured over sophisticated, but possibly expensive algorithms. Ideally, all decision support algorithms should be O(1) after accounting for precomputations. With infinite storage almost available now, decision support research is due for a technological and scientific boom.
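To make the "O(1) after accounting for precomputations" idea concrete, here is a minimal sketch using prefix sums. It is only an illustration, not the method from the paper: the function names and the per-chapter word counts are made up for the example. One linear pass builds running totals, after which any range total costs a single subtraction.

```python
from itertools import accumulate


def build_prefix_sums(values):
    """Precompute running totals: prefix[i] is the sum of values[:i]."""
    return [0] + list(accumulate(values))


def range_sum(prefix, start, stop):
    """Return the sum of values[start:stop] in O(1) from the precomputed totals."""
    return prefix[stop] - prefix[start]


# Hypothetical data: occurrences of one word in each chapter of a book.
word_counts_per_chapter = [3, 0, 7, 2, 5, 1]

# One O(n) precomputation pass...
prefix = build_prefix_sums(word_counts_per_chapter)

# ...then every range query is a single subtraction.
chapters_2_to_4 = range_sum(prefix, 2, 5)  # chapters at index 2, 3 and 4
whole_book = range_sum(prefix, 0, len(word_counts_per_chapter))
```

The trade-off is the one described above: extra storage for the precomputed totals buys constant-time answers, which keeps the user in control of an interactive session.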

Computer-assisted reading and analysis of text has various applications in the humanities and social sciences. The increasing size of many electronic text archives has the advantage of a more complete analysis but the disadvantage of taking longer to obtain results. On-Line Analytical Processing is a method used to store and quickly analyze multidimensional data. By storing text analysis information in an OLAP system, a user can obtain solutions to inquiries in a matter of seconds as opposed to minutes, hours, or even days. This analysis is user-driven allowing various users the freedom to pursue their own direction of research.

2005 QlikView Think Outside The Cube Contest

If you are into Business Intelligence, this might interest you: the 2005 QlikView Think Outside The Cube Contest is open.

We’re inviting you to “Think Outside The Cube” and move beyond the limits of traditional OLAP-based BI. Your application/entry can be a QlikView application that covers anything – results and statistics from a favorite sport, stock tracking, movies, fantasy football, financial results from obscure stock exchanges, your kids’ swim team times, the price of tea in China. Anything that generates data is fair game.

(Organized by QlikTech.)

Now, that’s fun stuff!

Shane Butler is building a learner to predict AFL game results. I don’t know why, but this sounds like fun stuff!

Its been a while since my I last wrote that I was thinking of building a learner to predict AFL game results. Obviously with the finals starting this week I’m a bit late in the season but oh well! I can still simulate the season using the data I’ve collected. I’ve developed a number of Perl screen scrapers to collect all the data nessasary, which so far includes things like the match fixtures, game results, player stats and people’s tips (yes, this will be a hyprid of both historical data and human input). I’m at the stage now of using SQL generate things like a Home Ground Advantage (HGA) parameter – I guess you might call this the ETL stage!!

If he fails with his learner, he could turn this project into an OLAP project! That’d be cool!

Here’s why we are soon going to be flooded by data

Paul Graham says that transparency (and thus data recording) is the way out of corruption:

How do you break the connection between wealth and power? Demand transparency. Watch closely how power is exercised, and demand an account of how decisions are made. Why aren’t all police interrogations videotaped? Why did 36% of Princeton’s class of 2007 come from prep schools, when only 1.7% of American kids attend them? Why did the US really invade Iraq? Why don’t government officials disclose more about their finances, and why only during their term of office?

This is very important. Being rich brings you security, but not a lot of power if you have to go through the same process as everyone else. A just society is an open society where we record everything.

Big Brother might actually be the ticket to a just society.

Pentaho – Open Source Business Intelligence

In relation to a previous post of mine about open source Business Intelligence, where I wrote “So, maybe someone out there should start a support company for Open Source Business Intelligence?”, Krishnaswamy Ram pointed me to Pentaho, which seems to be exactly what I had imagined a smart businessman could do.

The Pentaho BI Project provides enterprise-class reporting, analysis, dashboard, data mining and workflow capabilities that help organizations operate more efficiently and effectively. The software offers flexible deployment options that enable use as embeddable components, customized BI application solutions, and as a complete out-of-the-box, integrated BI platform.

PODS 2006 (December 1st, 2005 / June 26-28, 2006)

The PODS 2006 call for papers is out. It will be held in Chicago along with SIGMOD.

The PODS symposium series, held in conjunction with the SIGMOD conference series, provides a premier annual forum for the communication of new advances in the theoretical foundation of database systems. For the 25th edition, original research papers providing new insights in the specification, design, or implementation of data management tools are called for.

SIGMOD 2006 (November 17, 2005 / June 27-29, 2006)

The SIGMOD call for papers is out. It will be held in Chicago (cool!).

The annual ACM SIGMOD conference is a leading international forum for database researchers, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. We invite the submission of original research contributions as well as proposals for demonstrations, tutorials, industrial presentations, and panels. We encourage submissions relating to all aspects of data management defined broadly and particularly encourage work that represent deep technical insights or present new abstractions and novel approaches to problems of significance. We especially welcome submissions that help identify and solve data management systems issues by leveraging knowledge of applications and related areas, such as information retrieval and search, operating systems & storage technologies, and web services.

Generalized Multi-dimensional Data Mapping and Query Processing

Rui Zhang, a student at the National University of Singapore, has made his Generalized Multi-dimensional Data Mapping and Query Processing software package available as a tarball. I started browsing the related paper (Generalized Multi-dimensional Data Mapping and Query Processing. TODS, September 2005) and it is interesting work if you care about UB-Trees and this sort of (OLAP-related) stuff.

A future in IT? Think about IT security!

Cringely predicts certified security experts will see a rapid appreciation of their value. I agree. It is quite an obvious prediction, but well worth stating in an era where IT experts are often struggling. Can a job in IT security be interesting, be fun? I don’t know.

All in all, there is clearly a bright future for IT experts, but you’ve got to go do the difficult stuff: security and data warehousing.

BI is Ready for Open Source – Is Open Source Ready for BI?

A recent article in DW Review asks whether Open Source is ready for Business Intelligence. There are already some interesting Open Source Business Intelligence tools such as Enhydra Octopus, Eclipse BIRT and Mondrian, all of them in Java, by the way! Add to this MySQL 5.0, with features matching IBM DB2, or PostgreSQL, with its legendary reliability, and you have everything you need for Open Source Business Intelligence. Or do you? Well, support still sucks.

So, maybe someone out there should start a support company for Open Source Business Intelligence? This is compelling:

As companies invest to stay compliant with the host of new government regulations, improve operational efficiency and retain customers, business intelligence (BI) has continued its march up the corporate food chain. BI is a top IT priority. At $16.8 billion in 2001 and growing to $29 billion in 2006, the BI and data warehousing market is exploding.

A “Measure of Transaction Processing” 20 Years Later

I just read an interesting short report by Jim Gray. The gist of the matter is that, since 2000, the rate of increase in computer performance per dollar has gone down. It is still exponential, but the rate of growth is much, much smaller. Gray blames memory latency.

As a side-note, how fast can you sort 16 GB of data on a typical PC? The answer is about 16 minutes.

More on IBM versus Essbase

I wrote earlier that IBM announced it would no longer sell its DB2 OLAP Server. It looks like the move by IBM might mean that they plan to focus on their own OLAP product:

in fact it’s more to do with their current focus on their Cube Views product, which in his opinion is more likely to be IBM’s future OLAP direction.

So DB2 Cube Views will be the main IBM OLAP product?

IBM killed its DB2 OLAP Server

According to COMPUTERWOCHE ONLINE, IBM is killing its DB2 OLAP Server by breaking its deal with Hyperion. This somewhat surprising move raises questions as to what IBM will do in the Business Intelligence arena… partner with Oracle or Microsoft, or do something else? Maybe get out of the OLAP business altogether?

High demand for storage

In What was Sun thinking? (CNET News.com), Charles Cooper tells us that storage is now in high demand:

What with some of the confusing–make that idiotic–federal regulations governing corporate behavior that have appeared the last couple years, there’s a near bottomless demand for big storage systems. After the passage of Sarbanes-Oxley and HIPAA, CEOs are so keen on covering their posteriors these days that there’s no such thing as too much documentation. Identity and management access is the hot ticket these days as every management team worth its salt wants to tout how tough it now is on compliance.

Wow. Seems like it is going to be a nice era for data warehousing and OLAP, no?

ACM Queue – A Conversation with Tim Bray

This is brilliant! ACM Queue is publishing an interview with Tim Bray (of XML fame) done by Jim Gray (of data cube and database transactions fame). Tim now runs Web technologies for Sun Microsystems. Tim Bray basically says that RDF and Semantic Web are a no go but we knew that’s what he thought.

However, there are many cool quotes. Try to find the pattern in these:

My CEO, Tom Jenkins, agreed to turn me loose to work on it myself, and I spent six months basically doing nothing else and built the crawler and the interfaces. (…) I lost weeks and weeks and weeks of sleep, hacking and patching and kludging to keep this thing on the air under the pressure of the load.

Lark was the first XML processor, implemented in Java. I wrote it myself. I used it also as a vehicle to learn Java. It shipped in January 1997 and actually got used by a bunch of people. (…) So, I let Lark go. It was fun to write and I think it was helpful, but it hasn’t been maintained since 1998.

Some of the people working in syndication were extremely upset about XML’s strictness, saying, “Well, you know, people just can’t be expected to generate well-formed data.” And I said, “Yes they can.” I went looking around and found that there are some quite decent libraries capable of doing that for Java and Perl and Python, but there didn’t seem to be one for C.

So sitting on the beach in Australia I wrote this little library in C called Genx that generates XML efficiently and guarantees that it is well-formed and canonical.

See the pattern? Tim Bray is a hacker with a degree in mathematics and computer science. [Tim doesn’t have a graduate degree.] And he changed the world.

But his life was not always easy:

Microsoft really went insane. There was a major meltdown and a war, and I was temporarily fired as XML coeditor. There was an aggressive attempt to destroy my career over that.

(Note that the interviewer, Jim Gray, works for Microsoft!)

Inexpensive ubiquitous mass storage is closer than you think!

I started using Google Mail (Gmail) last year because I want to be able to read my mail from everywhere, all the time. Google offered me 1000 MB of free storage and one of the greatest user interfaces for a mail client. Oh! Did I mention it has nearly perfect spam filtering, without any effort on my part?

I wondered what would happen when I reached 1000 MB of mail. For me, that’s about 2 years of incoming mail, maybe a bit less.

Well, my account now has 2057 MB of storage. That’s about 3 years’ worth of storage. It seems like Google increases your limit as the need arises.

Affordable TeraBytes

From Slashdot, I learned that for $3k, one can buy a 1.6TB hard drive similar to normal PC hard drives:

IO Data Device’s new ‘HDZ-UE1.6TS’ exemplifies the recent trend towards demand for higher storage capacities — it’s an external hard drive setup offering a total capacity of 1.6TB. Not much larger than four 3.5″ hard drives, the HDZ-UE1.6TS goes to show that any (rich) consumer can now easily have a boatload of storage space. (At current conversion rates, this would cost nearly $2,900.)

Maybe $3k seems like a lot but I bet that in 5 years, these beasts will cost under $1k and fit inside a normal PC.

The consequences of so much storage (nearly infinite) are still not well understood, but I believe it could bring about new killer applications we can’t even imagine right now.

Wal-Mart’s Data Obsession

According to this Slashdot thread, Wal-Mart has 460 terabytes of data stored on Teradata mainframes at its Bentonville headquarters. That’s something like 2^49 bytes if my computations are exact. That’s about 10,000 average hard drives (at 50 GB each) or 1,500 large hard drives (at 300 GB each). Suppose you want to store a matrix having 10 million rows by 10 million columns. Assume you store 4 bytes in each cell. That’s pretty much what Wal-Mart has in storage. Another way to put it is that you can store 400 KB of data every second for 30 years and not exceed this storage capacity. If you had Wal-Mart’s storage, you could pretty much make a video of your entire life. The Internet Wayback Machine is reported to have about twice this storage.
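For the curious, the back-of-the-envelope figures above can be checked with a few lines of Python. This is just a sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and 365-day years, and the variable names are mine.

```python
import math

# Assumption: decimal units throughout.
KB, GB, TB = 10**3, 10**9, 10**12

total_bytes = 460 * TB

# Exponent of the nearest power of two (a bit under 49).
log2_bytes = math.log2(total_bytes)

# Number of drives needed at each capacity.
drives_50gb = total_bytes / (50 * GB)
drives_300gb = total_bytes / (300 * GB)

# A 10 million by 10 million matrix with 4 bytes per cell.
matrix_bytes = 10**7 * 10**7 * 4

# Recording 400 KB every second for 30 years (365-day years assumed).
seconds_in_30_years = 30 * 365 * 24 * 3600
stream_bytes = 400 * KB * seconds_in_30_years

print(f"log2(bytes)      = {log2_bytes:.1f}")
print(f"50 GB drives     = {drives_50gb:,.0f}")
print(f"300 GB drives    = {drives_300gb:,.0f}")
print(f"matrix size      = {matrix_bytes / TB:.0f} TB")
print(f"30-year stream   = {stream_bytes / TB:.0f} TB")
```

Both the matrix and the 30-year stream come in somewhat under 460 TB, so the comparisons hold as rough orders of magnitude.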

One can only assume that storage will increase considerably in coming years. The impact this will have on our lives is still unknown, but the fact that Gmail offers 1 GB of storage for free right now is telling: almost infinite storage is right around the corner and it will change the way we think about information technology.