Coding

Building BLOBs in MariaDB ColumnStore

My team and I are currently finalizing the feature set for MariaDB ColumnStore 1.1, and I wanted to take a bit of time to talk about one of the features I created for it: BLOB/TEXT support.

For those who don’t know, MariaDB ColumnStore is a fork of InfiniDB which has been brought up to date by making it work with MariaDB 10.1 instead of MySQL 5.1, and it has many new features and bug fixes.

ColumnStore’s storage works by having columns with a fixed size of 1, 2, 4 or 8 bytes. These are stored in 8KB blocks (everything in ColumnStore is accessed using logical block IDs) inside extents of ~8M rows. This is fine until you want to store data that is longer than 8 bytes, such as CHAR/VARCHAR.
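To get a feel for these numbers, here is some back-of-envelope layout maths for an 8-byte fixed-width column. This is a sketch of the arithmetic only: it takes the ~8M rows per extent as exactly 2^23 and ignores per-block headers, so real capacities are slightly lower.

```cpp
#include <cstdint>

// Approximate layout maths for an 8-byte fixed-width column in ColumnStore.
// Treats an extent as exactly 2^23 rows and ignores per-block headers.
constexpr uint64_t kBlockBytes = 8ULL * 1024;                        // 8 KB block
constexpr uint64_t kExtentRows = 8ULL * 1024 * 1024;                 // ~8M rows per extent
constexpr uint64_t kValuesPerBlock = kBlockBytes / 8;                // 1024 values per block
constexpr uint64_t kBlocksPerExtent = kExtentRows / kValuesPerBlock; // ~8192 blocks per extent
```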

To solve this for columns longer than VARCHAR(7) and CHAR(8) we have the concept of a dictionary column. This is a column whose 8KB blocks contain the CHAR/VARCHAR data, paired with an additional 8-byte-wide column that stores a pointer to each value, called a “token”.

The token has the following format (link to code):

    struct Token {
        uint64_t op       :  10;   // ordinal position within a block
        uint64_t fbo      :  36;   // file block number
        uint64_t spare    :  18;   // spare
    };

So, we have a 10-bit “op” containing the offset of the value inside a block, a 36-bit “fbo” block number (a logical block ID) pointing to the block the data is stored in, and 18 spare bits that are used as a bitmask when the block is read and processed.
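You can check that these three bitfields pack into a single 8-byte column value, which is what lets a token sit in an ordinary fixed-width column. A minimal sketch using the struct above:

```cpp
#include <cstdint>

// The 1.0 token layout: 10 + 36 + 18 = 64 bits, i.e. one 8-byte column value.
struct Token {
    uint64_t op    : 10;   // ordinal position within a block
    uint64_t fbo   : 36;   // file block number (logical block ID)
    uint64_t spare : 18;   // spare
};

static_assert(sizeof(Token) == 8, "a token must fit one 8-byte column slot");
```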

This is great, but it limits each value to 8KB of data (minus some block header information), since a token points at a single block. ColumnStore is designed to read and write at the block level, so trying to read something larger than one block would be a disaster. This is why ColumnStore limits CHAR/VARCHAR to 8000 bytes.

So, for 1.1 one of the first things I did was to change the token structure to this:

    struct Token {
        uint64_t op       :  10;   // ordinal position within a block
        uint64_t fbo      :  36;   // file block number
        uint64_t bc       :  18;   // block count
    };

You can see here that the spare bits have been changed to a block count: the number of blocks an entry consumes. This means an entry can now use roughly 2^18 * 8 KB = 2 GB. The actual figure is a little less than that because each block contains header information, such as the length of the data in the block. Although this doesn’t meet the LONGBLOB/LONGTEXT specification of 4 GB, it is actually a lot more than the 1 GB maximum the MySQL/MariaDB protocol allows for transferring a row.
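The arithmetic behind the new limit can be checked directly (ignoring the per-block headers mentioned above, so this is the theoretical ceiling rather than the exact usable figure):

```cpp
#include <cstdint>

// Capacity implied by an 18-bit block count over 8 KB blocks.
// Per-block headers make the real usable limit slightly lower.
constexpr uint64_t kBlockSize     = 8ULL * 1024;             // 8 KB per block
constexpr uint64_t kMaxBlocks     = 1ULL << 18;              // 262,144 blocks
constexpr uint64_t kMaxEntryBytes = kMaxBlocks * kBlockSize; // 2 GB per entry
```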

With a bit of modification I was able to make sure this didn’t collide with the bitmasking when the blocks were read, and I wrote code to read the additional blocks and stitch them together when a block count is encountered.
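The stitching idea can be sketched roughly as follows. This is an illustration, not ColumnStore’s actual code: the `readBlock`/`storeValue` helpers, the in-memory fake storage, and the 4-byte in-block length header are all assumptions made so the sketch is self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

constexpr size_t BLOCK_SIZE = 8192;

// Fake storage: a vector of 8 KB blocks indexed by "logical block ID".
static std::vector<std::vector<uint8_t>> fakeStorage;

const uint8_t* readBlock(uint64_t lbid) { return fakeStorage.at(lbid).data(); }

// Spread a value across as many blocks as needed, each block starting with a
// hypothetical 4-byte header giving the bytes used. Returns (fbo, block count).
std::pair<uint64_t, uint64_t> storeValue(const std::string& value) {
    const size_t payload = BLOCK_SIZE - sizeof(uint32_t);
    uint64_t fbo = fakeStorage.size();
    for (size_t pos = 0; pos < value.size(); pos += payload) {
        uint32_t len = static_cast<uint32_t>(std::min(payload, value.size() - pos));
        std::vector<uint8_t> block(BLOCK_SIZE, 0);
        std::memcpy(block.data(), &len, sizeof(len));
        std::memcpy(block.data() + sizeof(len), value.data() + pos, len);
        fakeStorage.push_back(std::move(block));
    }
    return {fbo, fakeStorage.size() - fbo};
}

// Stitch `bc` consecutive blocks starting at `fbo` back into a single value.
std::string fetchValue(uint64_t fbo, uint64_t bc) {
    std::string value;
    for (uint64_t i = 0; i < bc; ++i) {
        const uint8_t* block = readBlock(fbo + i);
        uint32_t len;
        std::memcpy(&len, block, sizeof(len));
        value.append(reinterpret_cast<const char*>(block) + sizeof(len), len);
    }
    return value;
}
```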

Many parts of ColumnStore needed modification so that the new data types were understood and passed around correctly. For example, when a storage engine returns BLOB/TEXT data to the MariaDB server, it actually returns a pointer to where the data is stored in memory along with the length of the data.

In addition, when serializing a row to send it to the various parts of ColumnStore, our string data storage could only cope with a maximum of 64 KB per entry. There have been two passes at improving this code. The first worked well but had a performance penalty for small strings. The second attempt is much better and does not have that penalty.
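One way to lift a 64 KB ceiling without penalizing short strings is a variable-width length prefix. The sketch below illustrates that general idea only and is not ColumnStore’s actual serialization format: the encoding (a cheap 2-byte length for the common case, with an escape marker switching to a 4-byte length) is an assumption for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Short strings keep a 2-byte length; the reserved value 0xFFFF acts as an
// escape meaning "a 4-byte little-endian length follows". This keeps the
// common small-string case as cheap as a fixed 2-byte prefix.
void serializeString(std::vector<uint8_t>& out, const std::string& s) {
    uint32_t len = static_cast<uint32_t>(s.size());
    if (len < 0xFFFF) {
        out.push_back(len & 0xFF);
        out.push_back((len >> 8) & 0xFF);
    } else {
        out.push_back(0xFF);
        out.push_back(0xFF);   // escape: long string, 4-byte length follows
        for (int i = 0; i < 4; ++i) out.push_back((len >> (8 * i)) & 0xFF);
    }
    out.insert(out.end(), s.begin(), s.end());
}

std::string deserializeString(const std::vector<uint8_t>& in, size_t& pos) {
    uint32_t len = in[pos] | (static_cast<uint32_t>(in[pos + 1]) << 8);
    pos += 2;
    if (len == 0xFFFF) {       // long-string escape
        len = 0;
        for (int i = 0; i < 4; ++i) len |= static_cast<uint32_t>(in[pos + i]) << (8 * i);
        pos += 4;
    }
    std::string s(in.begin() + pos, in.begin() + pos + len);
    pos += len;
    return s;
}
```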

It is hard to estimate how much code had to be modified to add support for this feature because the work was split over several pull requests, but the initial pull request alone modified 45 source code files.

So far the feature appears to work very well and has had a lot of testing internally. We have one remaining issue/feature I’m working on for BLOB/TEXT support: making ORDER BY work correctly for longer entries. This will hopefully make the first beta release.

If you want to try the BLOB/TEXT feature, you can check out the develop branch of the ColumnStore Engine on GitHub. Please note that the develop branch should currently be considered Alpha quality.

With 1.1 we really will be able to store Big Data!

Image Credit: Eindhoven: Building of the “Blob” by harry_nl, used under a Creative Commons license

LinuxJedi
