IRMA 2007 – Data Warehousing and Mining track (submission deadline: October 1, 2006; conference: May 19-23, 2007)

The IRMA 2007 International Conference has a Data Warehousing and Mining track organized by Andrew Kusiak and Qiang Zhu. The conference will be held in Vancouver.

A key to success for enterprises in today’s competitive markets is their ability to manage the staggering volumes and complexity of data from various sources in an efficient and economic manner. Data warehousing and mining have become prevailing technologies for data analysis, knowledge extraction and decision support in modern enterprises and organizations. The emergence of vast applications and other related software/hardware technologies continues to raise challenges (e.g., demand for real-time, active, mobile, parallel, distributed, secure, and spatio-temporal characteristics) for data warehousing and mining. The objective of this track is to provide a forum for researchers and practitioners to disseminate and exchange ideas on both the technical and managerial issues associated with data warehousing and mining.

Papers on virtually any related topic are invited.

Kamel Aouiche in Montreal!

I met Kamel in Montreal today. I invited him over for a post-doc and he arrived two months ahead of time (talk about eagerness!) from Lyon (ERIC laboratory). Kamel is a very promising, young (well, younger than me) researcher in OLAP, Data Warehousing and Data Mining. Among other things, Kamel will be working on view size estimation in data warehouses.

Oh! How did Kamel learn about me and my work? Nah! Not a conference or a talk… nothing like that, he just got to me through my blog! So this blog has a purpose after all!

Efficient Storage Methods for a Literary Data Warehouse


MCS Degree

Efficient Storage Methods for a Literary Data Warehouse
Steven W. Keith
Examining Committee:
Supervisors: Dr. Owen Kaser (UNBSJ),
Dr. Daniel Lemire (Adjunct Prof., Univ. of Quebec)
Chairperson: Dr. Larry Garey (UNBSJ)
Internal Reader: Dr. Weichang Du
External Reader: Dr. George Stoica (UNBSJ)
Monday, May 29, 2006
10:30 a.m.
UNBSJ LOCATION: MacMurray Room-Oland Hall- Rm 203
UNBF LOCATION: Multimedia Center (1st floor, room 126), Marshall D'Avray Hall


Computer-assisted reading and analysis of text has applications in the
humanities and social sciences. Ever-larger electronic text archives have
the advantage of allowing a more complete analysis but the disadvantage
of forcing longer waits for results. This thesis addresses the issue of
efficiently storing data in a literary data warehouse. The method in which
the data is stored directly influences the ability to extract useful, 
analytical results from the data warehouse in a timely fashion. 
A variety of storage methods including mapped files, trees, hashing,
and databases are evaluated to determine the most efficient method
of storing cubes in the data warehouse. Each storage method's ability
to insert and retrieve data points as well as slice, dice, and roll-up a
cube is evaluated. The amount of disk space required to store
the cubes is also considered. Five test cubes of various sizes are used to 
determine which method being evaluated is most efficient. The results lead
to various storage methods being efficient, depending on properties of the 
cube and the requirements of the user.



Multidimensional OLAP Server for Linux as Open Source Software

Jedox will release a free open source Linux MOLAP server by the end of the year. A pre-release of the software is expected by mid-2005.

All data is stored entirely in memory. Data can not only be read from the cubes but also written back to them. As in a spreadsheet, all calculations and consolidations are carried out within milliseconds in server memory before being written back to the cube.

I sure hope that by “memory” they include “external” memory because otherwise, their cubes will have to be quite small. Normally, you’d at least memory-map large files, as Lemur OLAP does.
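To illustrate the point, memory mapping lets the operating system page a large cube file in and out of RAM on demand, so the cube never has to fit entirely in physical memory. Here is a minimal Python sketch; the file name and the flat float64 layout are made up for illustration:

```python
import mmap
import struct

# A toy "cube" file of 1 million float64 cells (8 MB on disk).
CELLS = 1_000_000
with open("cube.bin", "wb") as f:
    f.write(b"\x00" * (8 * CELLS))

with open("cube.bin", "r+b") as f:
    # Memory-map the file: the OS pages data in and out on demand,
    # so the whole cube never needs to be loaded at once.
    mm = mmap.mmap(f.fileno(), 0)
    # Update a single cell in place (cell index 123456).
    struct.pack_into("<d", mm, 123_456 * 8, 3.14)
    # Read it back without touching the rest of the file.
    value, = struct.unpack_from("<d", mm, 123_456 * 8)
    mm.flush()
    mm.close()

print(value)  # 3.14
```

The same idea scales to files far larger than RAM, which is exactly what an in-memory-only design cannot do.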

Open-Source MOLAP with PHP and Linux

There is now an online demo of the open source MOLAP engine Palo, running a PHP application on a Linux-based Palo server. The application is somewhat limited, but it is a proof of concept. This is a very small step for open source Business Intelligence, but a step forward nonetheless.

(Our very own Lemur OLAP once had an online demo in Python, but it has since died.)

What are the computer language people waiting for?

The glorious time when people could design a new insightful computer language is gone.

Or is it? In our Data Warehousing and OLAP classes, we cover MDX and various APIs for OLAP. Arguably, MDX is the de facto standard OLAP language. But as far as languages go, it is just ugly. Microsoft chose to mimic SQL closely, and yet extended it dramatically into a multidimensional setting with a large dose of abstraction. I’ve never designed computer languages, but I’ve used them, and just as a painter can recognize a bad brush even if he can’t make one, I just don’t like MDX.

But even if I’m wrong, you can’t hope to teach MDX to a busy decision maker, even one with sufficient programming experience:

I believe that OLAP using MDX with Mondrian requires expert language knowledge and it would be very difficult for a user, with only domain knowledge, to be able to issue correct queries. (Hazel Webb)

What is needed is a simpler, easier language. Something that someone who knows about control structures (loops and if statements) and has a basic understanding of what a data cube is (drilling down, rolling up, slicing and so on) can pick up and use quickly, say within a day.
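To give a flavour of what such a user would need to express, here is a minimal Python sketch of slicing and rolling up a cube stored as a plain dictionary. All dimension names and numbers are invented; this is an illustration of the operations, not a proposed language:

```python
from collections import defaultdict

# A toy cube: (year, region, product) -> sales.
cube = {
    (2005, "East", "widgets"): 100,
    (2005, "West", "widgets"): 150,
    (2006, "East", "widgets"): 120,
    (2006, "East", "gadgets"): 80,
}

# Slice: fix one dimension (keep only the cells where year == 2006).
slice_2006 = {k: v for k, v in cube.items() if k[0] == 2006}

# Roll-up: aggregate away the region dimension.
rollup = defaultdict(int)
for (year, region, product), sales in cube.items():
    rollup[(year, product)] += sales

print(slice_2006)
print(dict(rollup))  # e.g. (2005, 'widgets') maps to 100 + 150 = 250
```

If loops, if statements and dictionaries suffice to express slicing and rolling up, a purpose-built query language should not need to be harder than this.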

Would make a great Ph.D. thesis.

Slashdot: Why Is Data Mining Still A Frontier?

Slashdot asks “Why Is Data Mining Still A Frontier?” The article itself is not very exciting, but the comments are great. Here are some I like:

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie.

Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

The ultimate problem, is that for most datasets, there are an infinite (at least), set of relations that can be induced from the data. This doesn’t even address the issue, that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

To sum it up:

  • Data Mining requires hard and fancy Mathematics.
  • Data cleaning and integration is hard.
  • There are infinitely many ways to mine data and it is not obvious a priori what is useful.

I think Data Mining is a beautiful research topic. However, as the comments indicate, it is very hard and requires wide-ranging expertise.

Science in an exponential world

Many predict dramatic changes to the way science is done, and suspect that few traditional processes will survive in their current form by 2020. (…) The wireless sensors that were US$300 a year ago are $100 today, and will be $30 next year. A similar phenomenon occurred with DNA chips and gene sequencers. It is important to recognize this pattern; it is universal. And so although some sub-disciplines may reach a plateau in data generation, other technological innovations will take their place. Scientists in 2020 will continue to work in an exponential world.

Jim Gray, Alex Szalay, Science in an exponential world, Nature, Vol. 440, 23 March 2006. (HTML)

AI requires huge volumes of data to exist: what about learning?

This has been around for quite some time, but it keeps on popping up left and right. “Google (…) believes that strong AI requires huge data volumes to really exist.”

Since nobody knows what is required for strong AI to exist, this is, for now, a non-falsifiable conjecture. One thing is for sure: it takes a human being several years before it can hope to pass the famous Turing test. My 4-month-old baby can’t pass the Turing test.

So, there is strong evidence that you need lots and lots of data before intelligence, as defined by the Turing test, can emerge.

Now, what does this say about “learning”? It seems to imply that to “learn”, you need to be exposed to lots and lots of data. This suggests, maybe, that the web is the real future of learning because, face it, there is only so much an instructor can convey to a group while spending hours in front of a blackboard. I can look up facts and theories much faster through the web, though this is recent: until a few years ago, the blackboard was still a more efficient way to gather data and, in some instances, like mathematics, it still is.

One interesting conclusion, though, is that broadband ought to be very useful for learning. If being exposed to lots and lots of data is required, then you need broadband. What am I doing here, in my basement, with my cable modem? I need a T1, stat! Oh! Right! I’d still be limited by how fast others can deliver the information.

Theorem: A large data output is necessary for a rich learning experience.

So, if you have online content for a given course, the relative performance of the server does matter. Multimedia content does matter.

Or does it? Notice I didn’t attempt to prove my theorem. So, let’s call it the “Lemire conjecture” for now.

Giving Efficient Distance Lectures on a Budget

As part of CS6905, I lecture from my basement near Montreal to New Brunswick, in Eastern Canada. Sounds crazy? Well, according to my collaborators, this works well for everyone involved. We use webhuddle to broadcast the lectures. We lack a video stream, but it seems that this is not a critical flaw.

If, like me, you prepare your slides using LaTeX, then you have a small problem because webhuddle expects either PowerPointish files or a zip file containing JPG or GIF images. Don’t go for JPG images: they are too big when your slides are mostly text on a flat background. Here’s a script to convert a PDF file into a bunch of GIF files; pass the PDF file name as the first argument (it can be improved upon):

pdftoppm $1 ${1}ppm
for i in ${1}ppm*.ppm
do
  echo "converting file "$i
  convert -resize 800x600 $i $i.gif
  listoffiles=$i".gif "$listoffiles
  echo "you can now delete "$i
done
echo "now zipping "$listoffiles
zip -9 $1.zip $listoffiles
echo "zip file is "$1.zip

Technorati allows time-based text mining

Matthew is reporting that technorati now allows you to plot word usage frequency over time in the blogosphere. Here’s the usage of the word “segmentation” over time:

I think BlogPulse has been offering this sort of thing for some time. I’m confused by the relationship between these various services. However, these services could benefit from OLAPish concepts (shameless plug):

Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, APICS 2005, Wolfville, Canada, October 2005.

JOLAP versus the Oracle Java API

Some years ago (in 2000), the Java OLAP (JOLAP) spec was proposed, and it was eventually ratified by all parties (including Oracle, Sun and Apple, but not IBM and Microsoft). One point that has been puzzling me is why JOLAP wasn’t more widely adopted, at least partially. (Update: though the Final JOLAP Draft was approved, the spec was never released and there is currently no license allowing anyone to implement the Final Draft.) For our course CS6905 Advanced Technologies for E-Business, I prepared what must be too many slides on JOLAP and the Oracle Java API. I didn’t find a comparison between the Oracle OLAP API and JOLAP, so here’s my own analysis:

  • Firstly, Oracle doesn’t implement the Common Warehouse Model (CWM). I have no experience working with CWM, but it seems like CWM is quite complex. Maybe they figured it wasn’t worth the trouble?
  • Secondly, in its OLAP API, Oracle doesn’t implement the J2EE Connector model, or anything having to do with J2EE. I suspect Oracle is not eager to depend on J2EE.
  • Thirdly, the Oracle OLAP API doesn’t have Cube and Edge objects. To me, this is a shame because I really like the Edge objects JOLAP defines. Does anyone know why Oracle didn’t integrate them into its revised 10g API?

So, we are left in an OLAP world where Microsoft’s MDX is the sole cross-vendor query language. How ironic!

Grad Course by Kaser and Lemire: Data Warehousing and OLAP

This term, for the third time, we are teaching CS6905 Advanced Technologies for E-Business: Data Warehousing and OLAP. The course material is mostly original content gathered from OLAP and Data Warehousing papers. This is probably a unique course and I’m very proud of its quality, though there is at least one similar course out there, offered by Dimitra Vista at Drexel University. Here’s the summary:

On-line Analytical Processing (OLAP) aims to accelerate typical queries in large databases or data warehouses so that on-line performance is possible. OLAP constitutes one of the core data-mining technologies used in industry. This section of the course will target the core technology behind OLAP.

This is the second half of a two-part course; the first part is on web services (mostly SOAP-related material). Yuhong has not yet had time to post her new course content, but you can check out the 2004 version.

Eventually, I’d be interested in either offering the course in Montreal, making a tutorial out of it, or making an online version of it.

Java OLAP Interface (JOLAP) is dead?

It looks like JOLAP is dead. The final specification was approved on June 15th, 2004. However, to this day, except for Mondrian and Xelopes, I know of no implementation of JOLAP. According to this thread, Oracle has no intention of ever supporting JOLAP.

On the other hand, Oracle doesn’t support, nor does it plan to support, MDX or derived technologies such as XML for Analysis (XMLA) and more recent specifications. But you can get MDX support in Mondrian and in SQL Server Standard Edition or better. I am pretty sure IBM supports MDX and maybe XMLA, but with recent changes in their OLAP product line, I must admit I’m a bit confused.

This leaves us with no cross-platform OLAP query standard. After all these failed attempts, it is very depressing.

Update: Daniel Guerrero from Ideasoft correctly pointed out to me that the current JOLAP spec has not been published as a Final Release, only as a Final Draft. The Final Draft was approved in June 2004 (though IBM abstained), and normally it ought to have become a Final Release by now, but this didn’t happen. The difference is significant because, right now, the JOLAP license granted by Hyperion is for evaluation purposes only. This means you can’t go out and implement JOLAP without risking legal trouble. We can imagine many scenarios as to what is happening, but I’ll vote for an intellectual property issue.

IBM, Oracle and Microsoft freeing their databases

Oracle has recently made available its Oracle Database 10g Express Edition. Its limitations are that it uses only one processor, 1GB of memory and up to 4GB of disk space. That is not sufficient for even a small data warehousing project, but it is great for teaching a class. It is available for Linux and Windows.

Microsoft recently made its SQL Server 2005 Express Edition available for free. It is, obviously, only available under Windows. It lacks enterprise features and is limited to one CPU, 1GB of memory and 4GB of disk space: basically the same limitations as the Oracle Database 10g Express Edition.

IBM is thinking about doing the same with DB2. Currently, it offers the free Java-based Cloudscape database, which runs on any standard Java Virtual Machine (JVM). It also offers a free PHP-bound version of DB2 called Zend Core, available for Linux and AIX, with a Windows version to come.

However, you are not limited to what IBM, Oracle and Microsoft have to offer, nor do you have to accept the limitations of their “free” products. There are many good free and open source databases such as MySQL, PostgreSQL, MaxDB, Firebird and Ingres. None of these free alternatives is as powerful as an Oracle database, but when you compare what you can buy with zero dollars, the big guys don’t necessarily come out on top.

Mondrian to partner up with Pentaho for Open Source Business Intelligence

In an earlier post, I asked whether Open Source was ready for Business Intelligence. As it turns out, yesterday the Mondrian team announced that they are partnering with Pentaho, which they claim is the world’s leading provider of open source Business Intelligence (BI).

The average of averages is not the average

A fact that we teach in our OLAP class is that you can’t take the average of averages and hope it will match the overall average. This is a common mistake for people working with databases and doing number crunching. The average of averages equals the overall average only if all the averages are computed over sets of the same cardinality; otherwise it generally does not. In fancy terms, the average is not distributive, though it is algebraic. This phenomenon has a name: the fact that the average of averages is not the average is what underlies Simpson’s Paradox.

Here is an example, consider the following list of numbers:

  • 3
  • 4
  • 6
  • 5
  • 4.5

The average is 4.5. However, we can split the list in two:
The average of the first list is 3.5:

  • 3
  • 4

The average of the second list is approximately 5.2:

  • 6
  • 5
  • 4.5

However, the average of the two averages is (3.5 + 5.2)/2 = 4.35, which is less than 4.5!

This discrepancy can only happen when the two sets have different numbers of elements.
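The arithmetic is easy to check in a few lines of Python, and the check also shows the fix: weight each partial average by the cardinality of its set.

```python
numbers = [3, 4, 6, 5, 4.5]
first, second = numbers[:2], numbers[2:]

avg = sum(numbers) / len(numbers)        # 4.5
avg_first = sum(first) / len(first)      # 3.5
avg_second = sum(second) / len(second)   # about 5.17

# Naive average of averages: wrong when the sets differ in size.
naive = (avg_first + avg_second) / 2     # about 4.33, not 4.5

# Weighting each average by its set's cardinality recovers
# the overall average exactly.
weighted = (avg_first * len(first) + avg_second * len(second)) / len(numbers)

print(avg, naive, weighted)
```

The weighted sum is just the original sum in disguise, which is why it always matches the overall average.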

MySQL 5.0 Now Available for Production Use

MySQL 5.0 is out. It now supports:

  • Stored Procedures and SQL Functions (about time!);
  • Triggers (about time!);
  • Views (about time!);
  • Cursors (about time!);
  • Information Schema — to provide easy access to metadata (I don’t know what this is);
  • XA Distributed Transactions — supports complex transactions across multiple databases in heterogeneous environments (sounds good);
  • SQL Mode — provides server-enforced data integrity for new and existing data (about time!).

However, I wouldn’t switch any serious enterprise project over to MySQL, what with Oracle buying InnoDB and all; but if you are already using MySQL, this is good news indeed.