With MariaDB ColumnStore 1.1 Beta now released I quickly checked Twitter today to see what the response so far has been. I noticed that someone had posted up a benchmark comparison of MariaDB ColumnStore against a couple of other databases with data that doesn’t quite add up.
You can see the blog post here. I can’t say much for the other two databases but the results and testing itself for ColumnStore did not seem right so I thought I would try them myself here.
The author was using a Ubuntu 14.04 VM with 8GB RAM. To make this test as fair as possible I decided to take my 2012 Intel i5 Lenovo X220 and run the VM I gave to participants at a recent QuruLabs workshop I gave (ignore the cancelled status, it was an issue with Meetup). The VM is CentOS 7 with 6GB RAM and only configured to use only 1 CPU core with a stock MariaDB ColumnStore 1.0.11 RPM install. This is pretty crippled and we don’t recommend running it on such a low specification.
The data set used is the MovieLens database which is about 1GB of data uncompressed (although the parts of the test used is only about 2/3rds of it). This is an incredibly small data set for MariaDB ColumnStore, it is designed for tens/hundreds of GB and TB ranges of data. But when giving laptop demos on limited amounts of RAM it is not a bad data size.
Now, the schemas used by the author are a little strange. DECIMAL types when INTs should be used for example, and one of the tables the author created has a column which isn’t in the CSV files. So for this test I used the following (the original schemas won’t make much difference, this just seems more sane):
CREATE TABLE `dim_links` ( `movieID` int(11) DEFAULT NULL, `imdbID` int(11) DEFAULT NULL, `tmdbID` int(11) DEFAULT NULL ) ENGINE=Columnstore; CREATE TABLE `dim_movies` ( `movieID` int(11) DEFAULT NULL, `title` varchar(200) DEFAULT NULL, `genres` varchar(120) DEFAULT NULL ) ENGINE=Columnstore; CREATE TABLE `dim_tags` ( `userID` int(11) DEFAULT NULL, `movieID` int(11) DEFAULT NULL, `tag` varchar(50) DEFAULT NULL, `tstamp` bigint(20) DEFAULT NULL ) ENGINE=Columnstore; CREATE TABLE `fact_ratings` ( `userID` int(11) DEFAULT NULL, `movieID` int(11) DEFAULT NULL, `rating` double DEFAULT NULL, `tstamp` bigint(20) DEFAULT NULL ) ENGINE=Columnstore;
For my testing I used cpimport but LOAD DATA would have also worked with a small performance hit. Loading the four tables took 24.83 seconds. Much lower than the author’s 4 minutes.
I then ran the four queries the author posted. I modified the first query to be the AVG on “rating” since “fact_id” isn’t even a column and I believe that is what was intended. The rest of the queries are the same:
-- Query 1 SELECT AVG(rating) FROM fact_ratings; -- Query 2 SELECT rating, count(*) from fact_ratings group by rating order by rating; -- Query 3 select dim_movies.title, avg(fact_ratings.rating) from dim_movies join fact_ratings on dim_movies.movieId = fact_ratings.movieID group by dim_movies.title DESC order by dim_movies.title limit 20; -- Query 4 select dim_movies.title, fact_ratings.rating, dim_links.imdbid from dim_movies left outer join fact_ratings on dim_movies.movieID = fact_ratings.movieID left outer join dim_links on dim_movies.movieID = dim_links.movieID where fact_ratings.rating > (select avg(fact_ratings.rating) from fact_ratings) order by fact_ratings.rating DESC limit 20;
This gives the following resulting times:
Now, these times are not great, but I’m doing this on a very limited VM inside 5 year old hardware and with no optimisation/tuning. We also ran them on a modern i7 laptop with more cores enabled and more RAM, these times were around 1/3rd of the above. But the original author was observing times of more than 5 minutes for the last two queries.
So why did the original author observe such high execution times? My best guess is they didn’t set the engine to ColumnStore on CREATE TABLE so it defaulted to using InnoDB (I haven’t tried with InnoDB to compare). Also we don’t publish an Ubuntu 14.04 build so they may have created their own debug build instead which does give a performance penalty. There are not enough details to see how they built/installed it (or even what version they are running).
The moral of this story is don’t trust other people’s benchmarks. Yes, I may be biased so have deliberately tried to cripple myself where it counts. MariaDB ColumnStore definitely should not be that slow on that setup. I highly recommend trying MariaDB ColumnStore for themselves to see how well it works with your analytical use case. MariaDB has fantastic consulting and support teams if you need help setting up and optimising your database and queries.
I recently acquired an Amiga 1200 motherboard in a spares/repairs condition for about £100 recently.…
Whilst I do like working with STM32 development boards, some basic information I need can…
When I last left this blog series, the first of Stoo Cambridge's A4000s had gone…
At the beginning of November, I started working for wolfSSL. Now that I'm a week…
In my previous post, I had the Amiga 4000 known as Jools mostly repaired, there…
In my last post, I created a list of things I needed to repair on…
View Comments