Day 2 of my tour of column-based storage brings me to Infobright Community Edition (ICE). The first impressive point: following yesterday's blog post I had already received an email from Mark in Community Relations at Infobright offering help and advice. Despite me calling him the wrong name (I was having a bad day!) he was immediately helpful and also offered to get some of his team to look into my queries.
As an aside, John from Calpont was also kind enough to drop by and respond to some of my points. It gives me a warm fuzzy feeling that both Infobright and Calpont are taking the community seriously - I guess for these products to gain traction they need to make sure people can get motoring with them quickly to improve adoption.
Infobright is “designed to deliver a scalable data warehouse optimized for analytic queries” - which is exactly what we’re looking for. The main shortcoming at first glance for us is the lack of DML in the Community Edition (it is available in the Enterprise Edition, for which a free trial is available). As with InfiniDB, the range of data types is slightly more restricted than MySQL’s - but currently more flexible than InfiniDB’s. For example, Infobright can support longer varchars - a feature that John from Calpont says is on its way in InfiniDB.
To get started we downloaded the 64-bit RPM direct from the Infobright website - at the time of testing this was ICE v3.2.2. Installation was straightforward, with the RPM installing first go. Infobright installs on a different port to MySQL by default, so there was no conflict with the existing MySQL installation. It’s also worth pointing out that we’re just running with the default configurations for these systems to begin with.
On my CentOS 5 system I had to manually run chkconfig --add mysql-ib, after which service mysql-ib start cranked up Infobright. After tweaking our table definitions to account for the supported data types and values we were able to create our tables. We were also able to dump the data from our MySQL tables and import it using the Infobright loader with relative ease.
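The dump-and-reload path can be sketched roughly as follows - note that the database name (mydb), the table names, the dump directory and the ICE port (5029) are all assumptions standing in for our real setup, and DUMP_CMD/LOAD_CMD are made overridable purely to allow dry runs:

```shell
# A sketch of our dump-and-reload path. The database name (mydb), table
# names, dump directory and the ICE port (5029) are assumptions -- adjust
# for your own setup. DUMP_CMD/LOAD_CMD are overridable to allow dry runs.
dump_and_reload() {
    dump_dir=${DUMP_DIR:-/tmp/ib_load}          # must be writable by mysqld
    mkdir -p "$dump_dir"

    # 1. Dump data only (no CREATE statements) from the stock MySQL server;
    #    --tab writes one tab-separated .txt file per table.
    ${DUMP_CMD:-mysqldump} --tab="$dump_dir" --no-create-info mydb soh so o h

    # 2. Replay each file into Infobright via its loader (LOAD DATA INFILE),
    #    talking to the mysql-ib server on its own port.
    for t in soh so o h; do
        ${LOAD_CMD:-mysql -P5029 -uroot mydb -e} \
            "LOAD DATA INFILE '$dump_dir/$t.txt' INTO TABLE $t;"
    done
}
```

With the defaults this assumes both servers run locally, and the tweaked CREATE TABLE statements need applying on the Infobright side first.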
With our tables populated we were ready to get querying. It’s only fair to point out that this is based on taking our MySQL tables straight into Infobright with no tweakage - as such it probably won’t take full advantage of the features on offer. Our cold-run figures are an average of 5 hopefully-cold runs - i.e. we restart MySQL/Infobright between runs in the hope of clearing the caches.
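The restart-then-time loop can be sketched as a small shell helper - the mysql-ib service name, the 5029 port and the testdb database name are assumptions, and RESTART_CMD/CLIENT_CMD are overridable so the same loop can drive either server:

```shell
# A sketch of the cold-run timing loop: restart the server before each run
# in the hope of clearing the caches, then average the wall-clock time.
# The mysql-ib service name, ICE port (5029) and database name are
# assumptions; RESTART_CMD/CLIENT_CMD are overridable.
time_cold_runs() {
    qfile=$1
    runs=${2:-5}
    restart=${RESTART_CMD:-"service mysql-ib restart"}
    client=${CLIENT_CMD:-"mysql -h127.0.0.1 -P5029 -uroot testdb"}
    total=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        $restart >/dev/null 2>&1            # "cold" is hopeful, not guaranteed
        start=$(date +%s.%N)
        $client < "$qfile" >/dev/null       # run the query, discard the rows
        end=$(date +%s.%N)
        total=$(awk -v t="$total" -v s="$start" -v e="$end" 'BEGIN{print t + (e - s)}')
        i=$((i + 1))
    done
    awk -v t="$total" -v n="$runs" 'BEGIN{printf "avg %.2f secs over %d runs\n", t / n, n}'
}
```

Running time_cold_runs query.sql 5 against each server in turn gives the pairs of figures below.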
SELECT soh.co, h.ca, o.i, o.n, so.r, so.a, so.w,
       soh.lhi, soh.lca, soh.lsfhi, soh.lsfhca,
       h.c, h.r, h.c, h.sti, h.st
FROM soh
INNER JOIN so ON (so.si = soh.si AND so.oi = soh.oi)
INNER JOIN o ON o.i = soh.oi
INNER JOIN h ON soh.lsfhi = h.i
WHERE soh.si = 125
  AND soh.co BETWEEN "2009-01-01" AND "2009-10-10"
ORDER BY soh.co DESC, soh.lsfhca DESC
LIMIT 0,30;
I’ve obfuscated the column names to protect the innocent. On a cold run our InnoDB (XtraDB) table responded in 1.84 secs vs 8.94 secs for Infobright. A second (warm) run gave 1.51 secs vs 4.40 secs.
SELECT SQL_CALC_FOUND_ROWS
       soh.co, soh.lca, soh.oi, soh.lhi, soh.lsfhi, soh.lsfhca
FROM soh
WHERE soh.si = 125
  AND soh.co BETWEEN "2009-01-01" AND "2009-10-10"
ORDER BY soh.co DESC, soh.lsfhca DESC
LIMIT 0,30;
A cold run clocked MySQL at 1.62 secs vs 0.08 secs for Infobright. It’s also worth noting that SQL_CALC_FOUND_ROWS is left in only as a point of interest - Infobright doesn’t support it, although it doesn’t choke on it if included. Removing it made no significant difference to the times in either database.
SELECT soh.co, soh.lca, soh.oi, soh.lhi, soh.lca, soh.lsfhi, soh.lsfhca
FROM soh
WHERE soh.si = 125
  AND soh.co BETWEEN "2009-01-01" AND "2009-10-10"
  AND soh.oi NOT IN (68633, 4)
ORDER BY soh.co DESC, soh.lsfhca DESC
LIMIT 0,30;
Cold runs gave MySQL 1.93 secs vs 0.53 secs for Infobright.
SELECT t1.oi, t1.co, t1.pv, t1.rv, t1.tpv, t1.tnv, t1.trv
FROM soh AS t1
INNER JOIN (
    SELECT oi, MAX(co) AS mco
    FROM soh
    WHERE si = 125 AND co <= '2009-10-24 08:12:23'
    GROUP BY oi
) AS t2 ON t1.oi = t2.oi AND t1.co = t2.mco
WHERE t1.oi IN (1420, 13501, 85778, 49753, 288095, 21883, 21623, 1198,
                123438, 54699, 145, 50293, 38747, 63918, 1997, 35299,
                97864, 321139, 138088, 38578, 105140, 5440, 73439, 132415,
                452, 180846, 4384, 29049, 206296, 113606)
  AND t1.si = 125;
MySQL finished in 1.39 secs vs 0.57 secs for Infobright.
It was a real bonus to be able to get up and running with minimal effort - the queries were pretty much able to run unmodified for testing, although both the schema and the queries may benefit from optimisation to get the best out of Infobright. As a drop-in replacement for a struggling MySQL DB it also holds promise - most of the queries we ran were at least twice as quick as on MySQL, the exception being the first one, which was slower, so we’ll need to work out why.
These tests are in no way exhaustive - as a first pass I’m just getting things installed and into a position to play with, so I can see what each system is like to work with and work out which ones are worth investing more time in. For a more thorough review on a more easily explainable dataset, check out Baron’s post on the MySQL Performance Blog.
The only downside for us right now is the lack of insert/update/delete in the Community Edition - it means we’d have to move toward a batch data load, and re-architecting to support this would of course involve effort - which brings questions around using Hadoop back into the thought process.
I’m also considering taking a look at Greenplum as they now offer the single-node edition for free. Otherwise, it’s off to MonetDB-land today - my RPMs have just finished compiling for CentOS 5, so I’m ready to install!