I’m giving a talk next week at the Text Analysis and Machine Learning Group (TAMALE) seminar at the University of Ottawa. I will talk on Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation. It is not directly related to text and machine learning, but many of the ideas from time series data mining port over to text processing. After all, a sequence is a sequence. I see Joel Martin wil also give a talk there this Spring on “Libminer”. Here’s the abstract for my talk:
Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array in up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on novel formalism, and compare experimentally its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.
I don’t usually post abstracts of papers other than my own here, but this one is particularly significant to me though I don’t know the author.
This paper gives optimal algorithms for determining real-valued univariate unimodal regressions, that is, for determining the optimal regression which is increasing and then decreasing. Such regressions arise in a wide variety of applications. They are a form of shape-constrained nonparametric regression, closely related to isotonic regression. For the L2 metric our algorithm requires only O(n) time for regression on n points, while for the L1 metric it requires O(n logn) time. Previous algorithms only considered the L2 metric and required (n^2) time. All previous algorithms used multiple calls to isotonic regression, and our major contribution is to organize these into a prefix isotonic regression, whereby one computes the regression on all initial segments. The prefix approach utilizes the solution for one initial segment to aid in the solution of the next, which considerably reduces the total time required. Our prefix isotonic regression algorithm for
the L1 metric also supplies the first O(n log n) algorithm for L1 isotonic regression.
Steven will be presenting our paper Analyzing Large Collections of Electronic Text Using OLAP at APICS 2005. This work is based on an idea by Owen Kaser: what happens if we apply multidimensional databases (OLAP) to literary research?
Data Mining and Information Retrieval techniques are used routinely for literary research or processing text in general, but decision support techniques commonly used in the business world (sometimes called “Business Intelligence”) have not seen much use yet in text processing. The main difference between decision support systems and data mining is the fact that in decision support, the user remains in control, thus simple yet extremely efficient algorithms are favoured over sophisticated, but possibly expensive algorithms. Ideally, all decision support algorithms should be O(1) after accounting for precomputations. With infinite storage almost available now, decision support research is due for a technological and scientific boom.
Computer-assisted reading and analysis of text has various applications in the humanities and social sciences. The increasing size of many electronic text archives has the advantage of a more complete analysis but the disadvantage of taking longer to obtain results. On-Line Analytical Processing is a method used to store and quickly analyze multidimensional data. By storing text analysis information in an OLAP system, a user can obtain solutions to inquiries in a matter of seconds as opposed to minutes, hours, or even days. This analysis is user-driven allowing various users the freedom to pursue their own direction of research.