I’m giving a talk next week at the Text Analysis and Machine Learning Group (TAMALE) seminar at the University of Ottawa. I will talk on Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. It is not directly related to text and machine learning, but many of the ideas from time series data mining port over to text processing. After all, a sequence is a sequence. I see Joel Martin will also give a talk there this Spring on “Libminer”. Here’s the abstract for my talk:

Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.

Some of you who tried to access my web site in recent days have noticed that it was getting increasingly sluggish. In an earlier post, I reported that Google accounted for 25% of my page hits, sometimes much more. As it turns out, these two issues are related. Google was eating all my bandwidth.

I investigated the matter and found out that Google was spending a lot of time spidering some posting boards I host. So, I did two things: I created a robots.txt file which tells Google to stop indexing the content of the posting boards, and I deleted all messages older than 90 days in these posting boards (which resulted in the deletion of 200,000+ messages). Both of these actions are bad for the web. I wanted people to have access to these archives. I wanted to keep them. I have gigabytes of storage, but I’m far more limited on bandwidth!
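To make the robots.txt step concrete, here is a minimal sketch in Python of what such a file could look like and how to check its effect. The `/boards/` path and the example.com URLs are assumptions for illustration only, not the actual layout of my site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the /boards/ path is an assumption,
# standing in for wherever the posting boards actually live.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /boards/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/boards/thread1", "http://example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

A check like this only tells you what a well-behaved crawler should do; it does nothing about the bandwidth already being used until the crawler re-reads the file.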

I’ll report here about how it goes, but this tells me that Google has reached the limits of freshness and exhaustivity. And no, I’m not the only one worrying about Google using up too much of my bandwidth. If we get to a point where Google accounts for 25% of all web traffic, what are we going to do collectively?

I don’t believe the solution lies with the webmasters. I don’t want to have to tell Google, in detail, what to index and when to index it. However, I could imagine a standard by which Google could query a web service and determine what content, if any, has changed. Similarly, given a directory of static HTML pages, there has got to be a way for Apache to tell Google what files have changed in the recent past. I’m amazed there isn’t a standard way to do this yet.

I know Robin will tell me to use Sitemaps. While it looks easy to create a Google Sitemap for static content, creating one for a complex site made of static content, WordPress pages, posting boards and so on is far more daunting. I don’t want to spend the next week working on such a stupid project. This has to be automated.
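For what it is worth, the XML-writing part of the automation is small. Here is a rough Python sketch that merges (URL, last-modified date) pairs into a single Sitemap document. The `build_sitemap` function and the sample URLs are hypothetical, and I am assuming the sitemaps.org 0.9 schema. The hard part, which this sketch does not attempt, is getting each source (static files, WordPress, posting boards) to produce those pairs in the first place.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace of the sitemaps.org 0.9 protocol (assumed target format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a Sitemap XML string from (url, last_modified_date) pairs.

    `entries` can be the merged output of several sources; how each
    source produces its pairs is left out of this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical entries standing in for static pages and blog posts.
    static_pages = [("http://example.com/about.html", date(2006, 1, 15))]
    blog_posts = [("http://example.com/blog/some-post/", date(2006, 2, 1))]
    print(build_sitemap(static_pages + blog_posts))
```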

Matthew is reporting that Technorati now allows you to plot word usage frequency over time in the blogosphere. Here’s the usage of the word “segmentation” over time:

Technorati Chart

I think BlogPulse has been offering this sort of thing for some time. I’m confused by the relationship between these various services. However, these services could benefit from OLAPish concepts (shameless plug):

Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, APICS 2005, Wolfville, Canada, October 2005.

Some years ago (in 2000), the Java OLAP (JOLAP) spec. was proposed and it was finally ratified by all parties (including Oracle, Sun, Apple but not IBM and Microsoft). One point that has been puzzling me is why JOLAP wasn’t more widely adopted, at least partially. (Update: though the Final JOLAP Draft was approved, the spec. was never released and there is no license available right now allowing anyone to implement the Final Draft.) For our course CS6905 Advanced Technologies for E-Business, I prepared what must be too many slides on the JOLAP and Oracle Java API. I didn’t find a comparison between the Oracle OLAP API and JOLAP, so here’s my own analysis:

  • Firstly, Oracle doesn’t implement the Common Warehouse Model (CWM). I have no experience working with CWM, but it seems like CWM is quite complex. Maybe they figured it wasn’t worth the trouble?
  • Secondly, in its OLAP API, Oracle doesn’t implement the J2EE Connector model, or anything having to do with J2EE. I suspect Oracle is not eager to depend on J2EE.
  • Thirdly, the Oracle OLAP API doesn’t have Cube and Edge objects. To me, this is a shame because I really like the Edge objects JOLAP defines. Does anyone know why Oracle didn’t integrate those in its revised 10g API?

So, we are left in an OLAP world where Microsoft’s MDX is the sole cross-vendor query language. How ironic!

I just stumbled on the Java 4K Game Programming Contest. This looks like an excellent contest for programmers out there trying to get into the video game industry or just trying to prove their hacking skills:

The Java 4K Programming Contest is the ultimate byte-squeezing Java challenge! Using only 4096 bytes, competitors use every trick up their sleeve to create an entire game. The current Java 4K is running from December 1st, 2005 – March 1st, 2006.

I really like the size limit. In the good old days, we really just had 4KB of internal memory. I programmed a few cool games back when computers still had green on black screens, including a full Othello implementation (including the AI), and it ran very well on probably less than 4K.

And I think 4K ought to be enough for a great game prototype. Myself, I would try to do it without using obfuscated code and try to get points for programming elegance.
