Saturday, September 15, 2007

Trying to resist Scala...must...not...succumb...

I've frequently been engaged in discussions about future Java language extensions (closures, etc.) on forums like Javalobby and on blogs like Neal Gafter's. My co-workers aren't as big fans of Java as I am (being more inclined to Python and C), but moving away from Java is hard because of the incredible number of useful libraries available, the mature VM, and our favorite IDE, IntelliJ IDEA.

I would like to have a language, right now, that includes a better static type system, first-class functions, type inference, and so on, but that also lets me use all of my existing Java code and runs on top of a mature VM. Having Java-like performance would be a big plus.

And Scala seems to be it. I had looked at Scala a long time ago, but I hadn't realized how much progress had been made since. Recently, I took another look, and boy oh boy, is it tempting. There seems to be almost no downside to writing future code in Scala: it runs in the same VM as my other code and seamlessly interoperates with Java classes, yet offers similar performance. And don't even get me started on Scala's DSL creation capability, which exceeds even Ruby's (though still not Scheme's).

However, two missing wishlist features gave me an excuse to leave it alone for now. For one, I told my coworkers that I am not switching to any new language that does not have a REPL, and I had assumed that Scala was compiler-only. Secondly, the lack of IntelliJ support is a big turnoff.

To my surprise, Scala *does have* a REPL. Whoops.

Only one excuse left. Please JetBrains, do not add first-class Scala support to IDEA. Do not tempt me to switch languages, I'm begging you! :)


Thursday, September 13, 2007

Knowledge extraction with 2 dimensional regular expressions?

Some of my best and worst ideas come when I'm sitting on the toilet. I keep an informal bookshelf next to it, which I use from time to time to refresh and reinforce my memory of Galois theory, probability, or, in this case, automata. I'm not getting any younger (almost 36), and it's been ages since college, so whatever time I can use to reinforce those neurons and keep those skills from slipping away is golden.

So, there I was, automata book in hand, thinking about how I could simplify and modularize Tardis, Timepedia's web crawler, automated knowledge-discovery extractor, and temporal query engine. One part of what Tardis does is look for time-related information in documents and extract timeseries and timeline information.

How it does this exactly, I won't say at this time, but it uses a large battery of heuristics and other classical algorithms. The problem is that this logic is cemented inside Java code, a sort of implicit 'expert system' with a huge number of cases and complex interactions. Extending it to support newly discovered rules is hard enough with the source code in hand, but impossible for the end user.

Wouldn't a DSL be better? But if so, which one? An expert system like CLIPS/JESS? Those risk a lot of complexity as well as scalability issues. If this were simple text extraction, I'd give users the ability to author custom regexp match/rewrite rules for particular documents, a sort of mini online SED. But as you can imagine, a lot of time-related information is two-dimensional in nature, stored in documents in tabular form, so ordinary regexps don't capture the problem.

Aha! What about two-dimensional regular expressions? Do they even exist? By the time I had gotten off the toilet, I had already devised replacements for the regular expression concatenation operator in the form of row/column concatenation operators. I knew, however, that I was not the first to think of this, and a quick Google Scholar search led me to this paper: "Two-Dimensional Languages" (1997), Dora Giammarresi and Antonio Restivo, in the Handbook of Formal Languages.

At first glance, this looks like a fertile field, previously applied to recognizing graphical features in pictures. Could I extend it to my needs? More on that later. First, let's look at an example.

As you may recall, a classical regular expression consists of operations like concatenation, union, and Kleene star. Concatenation is usually represented by placing two symbols next to one another: "ab" means "a + b", where + is the concatenation operation. Union is typically represented via the pipe character "|", so "a|b" means "a union b". Kleene star, of course, is the union of all possible concatenations, e.g. "(a|b)*" matches ε, a, b, a+b, a+a, b+b, aa+b, ab+a, and so on.
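As a quick sanity check, these three one-dimensional operations map directly onto `java.util.regex` (just a small demo of the operators described above):

```java
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Concatenation: "ab" matches only the string "ab"
        System.out.println(Pattern.matches("ab", "ab"));        // true
        // Union: "a|b" matches either "a" or "b"
        System.out.println(Pattern.matches("a|b", "b"));        // true
        // Kleene star: "(a|b)*" matches any string over {a, b},
        // including the empty string
        System.out.println(Pattern.matches("(a|b)*", ""));      // true
        System.out.println(Pattern.matches("(a|b)*", "abba"));  // true
        System.out.println(Pattern.matches("(a|b)*", "abc"));   // false
    }
}
```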

The two-dimensional analogues of these would be row concatenation, column concatenation, and row/column Kleene star operators. Let us denote column concatenation as "+" and row concatenation as "/", so "a + b" recognizes "ab" as long as the symbols are in the same row, and "a/b" recognizes "ab" as long as they are in the same column. Similarly, we write column-wise Kleene star as "+*" and row-wise Kleene star as "/*".

Now let's try a problem, recognize the set of all "chessboards", encoded with the symbols 'b' for black, and 'w' for white squares. Here's an example chessboard:

b w b w b w b w
w b w b w b w b
b w b w b w b w
w b w b w b w b

First, we match the first row with "(b + w)+*". Next, we match the second row with "(w + b)+*", and then we concatenate these two expressions row-wise to form

"(b + w)+* / (w + b)+*"

This says: match "bwbwbw..." followed on the next row by "wbwbwb...". Now all we've got to do is repeat this pattern row-wise.

"( (b + w)+* / (w + b)+* )/*"

This will match the above chessboard pattern. I realize it may be confusing to represent column-wise concatenation via '+' instead of the empty string, as is traditional, but I wanted to make the separate row and column operators explicit and equal in importance.
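To make the chessboard example concrete, here is a small Java sketch. To be clear, this is not a general 2D-regex engine; it is just a direct membership test for the language that "( (b + w)+* / (w + b)+* )/*" denotes: an even number of equal-length, even-width rows of alternating cells (even, because each starred unit is a bw column pair or a two-row band).

```java
public class ChessboardMatcher {
    // Returns true iff the grid is in the language of ((b + w)+* / (w + b)+*)/*:
    // the cell at (row r, column c) must be 'b' when r + c is even, 'w' when odd.
    static boolean isChessboard(String[] rows) {
        if (rows.length % 2 != 0) return false;           // rows come in bw/wb pairs
        if (rows.length == 0) return true;                // Kleene star accepts empty
        int width = rows[0].length();
        if (width % 2 != 0) return false;                 // columns come in bw pairs
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length() != width) return false;  // ragged grids never match
            for (int c = 0; c < width; c++) {
                char expected = ((r + c) % 2 == 0) ? 'b' : 'w';
                if (rows[r].charAt(c) != expected) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] board = { "bwbwbwbw", "wbwbwbwb", "bwbwbwbw", "wbwbwbwb" };
        System.out.println(isChessboard(board));                      // true
        // Second row fails "(w + b)+*": it must start with 'w'.
        System.out.println(isChessboard(new String[] { "bw", "bw" })); // false
    }
}
```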

Now, I was delighted that something like this could be done, and furthermore that theoretical results for NFAs, DFAs, and closure properties existed for such languages. But I still faced the problem that knowledge extraction seems to require processing values semantically, not just lexically, so I tabled it as a nifty and cool area of research with no application to what we were doing.

That was, until I showed it to Mat, who came up with an additional step that would enable 2D regexps such as those described above to be used by Tardis for time-related extraction. That story I will leave for a future article. :) The lesson I learned, however, is not to give up on an idea that may seem impractical until you've gotten input from other people.

In the meantime, check out the paper I linked.


Wednesday, September 12, 2007

When algorithms work better than you dream...

So, Timepedia is building a time machine, right? It sounds pretentious, but for us it's really a geeky moniker of love for our project. After all, is Google's "search engine" really an "engine"? How much horsepower does it have? :)

One part of Timepedia, readers of this blog are already familiar with: Chronoscope. With Chronoscope, we are attempting to build an open platform of visualization tools for time-oriented data, in much the same way that Google Maps and Google Earth deal with spatial data.

However, what good is a time machine if you don't know where to go, or don't understand what you're looking at? Timepedia has another platform, aimed at mining time-related information, called Everett (owned and implemented by another Timepedia founder, Mat). Everett is a collection of many algorithms for both data mining and forecasting, some of them drawn from bleeding-edge academic research. When we started, we weren't sure which of them would work, or how well; we only knew that they had promising features. So Everett was less an end-user product and more a research platform.

One of the tools in Everett is an algorithm that lets us find hidden recurring patterns in data, even in the presence of noise or scaling. Last week, we tested the algorithm on real-life data for the first time and had one of those "holy cow!" moments, which don't occur too often for me personally, where your own code surprises you.

To give you an example, I fed Everett an 18,000-data-point series of federal funds rates over the last few decades, and it identified a pattern that occurred 3 times in history. Visualizing this in another tool we call Timelord (a Chronoscope married to Everett and other server-side services), I was puzzled as to the significance of these three sequences. My co-founder Shawn spent about an hour Googling until he found the correlation: these sequences corresponded to international financial/currency crises (such as the Mexican currency crisis), in which the Fed was forced to take action. The leadup to the crises appeared identical each time. A fluke? It sure as hell was interesting.

I was worried it was a fluke, so I tried something more mundane: a time series of unemployment benefit expenditures in Indiana. Once again, Everett identified a series of puzzling repetitive sequences. What were they? The dates looked very familiar, 1980-81, 1990-91, 2000-01... were they recessions? To check, I used Timelord to overlay the National Bureau of Economic Research's official measure of economic expansions and contractions, and sure enough, these patterns intersected with NBER recessions. One other interesting property stood out: the patterns preceded the recessions. That is, Everett was showing us a pattern that leads into a recession.

How cool is that? Ambition got the best of me, and I went for broke: I tried a historical time series of average hurricane strength (on the Saffir-Simpson scale), as well as a yearly count. There appears to be good evidence that a 40-60 year hurricane cycle exists, and I was hoping that Everett could find these patterns, but alas, it did not.

Still, the initial results are promising, and we hope that Everett will give average users an ability to query time in ways that have not been previously available.

So, if you're wondering why I haven't released Chronoscope yet, it's because I've been working on integrating Timelord with Everett. :)

p.s. Timelord is another GWT application, making it our 4th major GWT application. Everett is C++, coupled via JNI to a Java/GWT RPC interface, since performance is absolutely critical in Everett.

Converting to Guice, easier than I ever imagined.

I've spent the past month revising the data layer of Timepedia, which has, shall we say, somewhat interesting storage requirements. The RDBMS-related portions had become an enormous eyesore, with tons of handcoded Hibernate DAO methods. I have been admiring Guice from afar for a while, but delayed the pain of refactoring the RDBMS code until I had no choice.

Surprisingly, there was no pain at all!

I took a chance: I used Wideplay's warp-persist framework for Guice, deleted all of my DAO implementation classes in favor of Dynamic Finders, wrote one Guice module (mostly 5 lines of Hibernate config code), compiled, deployed, and prayed. Amazingly, it worked the first time. I was done; total conversion time: 30 minutes.

Now, it helps that I wasn't using Spring or J2EE (I dislike bulky, overdesigned frameworks with hideous XML configurations), but I think this is a good result for Guice. It dramatically reduced the number of lines of code (especially via the Dynamic Finders), as well as the amount of configuration.
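For the curious, the shape of the conversion looks roughly like the sketch below. Caveats: this is written from memory of warp-persist's documented style, not from my actual module; the entity name, HQL query, and finder interface are made up for illustration, and the exact warp-persist API may differ slightly from what's shown here.

```java
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.wideplay.warp.persist.PersistenceService;
import com.wideplay.warp.persist.UnitOfWork;
import com.wideplay.warp.persist.dao.Finder;
import java.util.List;

public class DataLayerSketch {
    // A Dynamic Finder: warp-persist generates the implementation from the
    // @Finder annotation, replacing an entire hand-coded Hibernate DAO class.
    // "Dataset" is a hypothetical entity, not one of Timepedia's real classes.
    public interface DatasetDao {
        @Finder(query = "from Dataset")
        List<?> allDatasets();
    }

    public static void main(String[] args) {
        // The one module that replaced the hand-coded DAO layer: configure
        // Hibernate-backed persistence, scoped one session per transaction,
        // and register the finder interface.
        Injector injector = Guice.createInjector(
            PersistenceService.usingHibernate()
                .across(UnitOfWork.TRANSACTION)
                .addAccessor(DatasetDao.class)
                .buildModule());
        // Starting PersistenceService from the injector would then boot
        // Hibernate from the hibernate.cfg.xml on the classpath.
    }
}
```

The appeal is that the module above is the whole configuration; there is no XML beyond Hibernate's own mapping files.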