Implementing A Fast Queryable Storage with Apache Avro and Azure Block Blobs

  • We need to allow querying on a given sensor for a given time interval
    Each sensor on a wind turbine has a unique sensor id. When you need to query this storage, you always query for a sensor id in a specific time interval. This is the only query type we need to support.
  • We need to allow incremental updates of the data
    Sometimes sensors go offline or turbines are a little slow to catch up. This means that for a given interval, we're not sure when we'll have all the sensor data. Our solution needs to be able to handle updating the data efficiently.
  • Our solution must be cost-efficient
    We also need to keep an eye towards cost. While we could fulfill all the other criteria by dumping all of our data inside a managed time-series database, it’s going to end up being very expensive.

Avro Row

After the header, the avro file consists of multiple rows. Each row contains data encoded according to the schema in the header. At the end of each row is the sync marker, so we can tell where one row ends and another begins. These are officially called data blocks, but since we have another type of block later, we'll stick to calling them rows.

The structure of an avro file
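To make that layout concrete, here is a minimal sketch using the fastavro Python library; the schema and sensor names are made up for illustration. It writes the same schema twice, once with no records and once with some, so the byte ranges of the header and of a row become visible:

```python
import io
import os
import fastavro

# A made-up schema for illustration: one reading from one sensor.
schema = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "double"},
    ],
}

records = [
    {"sensor_id": "sensor-42", "timestamp": 1600000000, "value": 13.37},
    {"sensor_id": "sensor-42", "timestamp": 1600000060, "value": 13.39},
]

# Fix the sync marker up front; fastavro would otherwise pick a random one.
sync_marker = os.urandom(16)

# Writing zero records produces just the header (magic bytes, schema, sync marker).
header_only = io.BytesIO()
fastavro.writer(header_only, schema, [], sync_marker=sync_marker)
header_bytes = header_only.getvalue()

# Writing the records produces the header followed by one row (data block).
full_file = io.BytesIO()
fastavro.writer(full_file, schema, records, sync_marker=sync_marker)
row_bytes = full_file.getvalue()[len(header_bytes):]

# Each row ends with the sync marker, which is how readers find the block boundaries.
assert row_bytes.endswith(sync_marker)
```

This split between a header byte range and row byte ranges is exactly what we'll exploit with Azure block blobs below.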

The structure of an Azure Block Blob

The next component we're using is Azure Blob Storage, Azure's general-purpose object storage service. It has a few more tricks up its sleeve than just uploading and downloading arbitrary blobs: it offers three different types of blobs, each with its own characteristics. The ones we'll be using are called block blobs.

The structure of an Azure Block Blob
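The sketches that follow use the azure-storage-blob v12 Python SDK. The connection string, container, and blob names below are placeholders, and the naming scheme of one avro file per interval is an assumption for illustration:

```python
from azure.storage.blob import BlobClient

# Placeholder connection details - substitute your own storage account.
blob_client = BlobClient.from_connection_string(
    conn_str="<your-connection-string>",
    container_name="sensor-data",
    blob_name="2020-09-13.avro",  # assumed naming: one avro file per interval
)
```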

Writing files

Writing a file via the block blob interface consists of two steps.

Stage your blocks

The first step is staging the new blocks you want to be part of your blob. You upload some bytes, tag them with a block id, and you're golden. You can stage any number of blocks, but they're considered uncommitted, which means they aren't part of the readable blob yet.
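A minimal staging sketch with the SDK. Block ids must be base64-encoded, and all ids within a blob must have the same encoded length, so we pad the names to a fixed width first; the helper below is our own convention, not something the SDK requires:

```python
import base64

def to_block_id(name: str, width: int = 32) -> str:
    # Pad to a fixed width so every block id in the blob has the same length,
    # then base64-encode as the service requires.
    return base64.b64encode(name.ljust(width).encode("utf-8")).decode("utf-8")

# Stage some bytes under a block id - they are not part of the readable blob yet.
blob_client.stage_block(block_id=to_block_id("sensor-42"), data=b"<some bytes>")
```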

Commit the blocks

The next step is committing the blocks, which creates a fully formed file. You commit the blocks by sending a list of block ids, and the file will then consist of these blocks in the order the ids are provided. The block ids can either be uncommitted blocks we've staged in the previous step or blocks that already exist in the file. That means adding a new block to an existing file boils down to four steps:

  • Stage the block with a unique id
  • Ask the blob for all the current block ids
  • Put our new block id at the appropriate place in the list of current block ids
  • Commit the list of block ids with our new block id in it
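Those four steps could look like this with the SDK, reusing blob_client and to_block_id from the sketches above; new_row_bytes stands in for whatever bytes we want to add:

```python
from azure.storage.blob import BlobBlock

new_block_id = to_block_id("sensor-43")
new_row_bytes = b"<bytes for the new block>"  # placeholder

# 1. Stage the new block with a unique id.
blob_client.stage_block(block_id=new_block_id, data=new_row_bytes)

# 2. Ask the blob for its currently committed block ids.
committed_blocks, _ = blob_client.get_block_list(block_list_type="committed")
block_ids = [block.id for block in committed_blocks]

# 3. Put the new block id at the appropriate place - for our avro files a new
#    row can simply go after the existing ones.
block_ids.append(new_block_id)

# 4. Commit the full list; the blob now consists of exactly these blocks.
blob_client.commit_block_list([BlobBlock(block_id=bid) for bid in block_ids])
```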

Reading from block blobs

We can query an Azure block blob for the list of blocks it contains and how many bytes are in each. Our avro files are stored with the following block layout:

  • One block for the avro header. The block id is something unique such as $$$_HEADER_$$$
  • One block for each row in the avro file. The block id here is the sensor id.
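Because the block list also tells us how many bytes are in each block, we can turn a block id into a byte range without downloading any data. A sketch, reusing blob_client and to_block_id from above:

```python
def find_block_ranges(blob_client):
    """Map each committed block id to its (offset, length) within the blob."""
    committed_blocks, _ = blob_client.get_block_list(block_list_type="committed")
    ranges = {}
    offset = 0
    for block in committed_blocks:
        ranges[block.id] = (offset, block.size)
        offset += block.size
    return ranges

ranges = find_block_ranges(blob_client)
header_range = ranges[to_block_id("$$$_HEADER_$$$")]
sensor_range = ranges[to_block_id("sensor-42")]
```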

Querying For Specific Sensor Data

We allow querying for a time interval on a specific sensor id. First off, given an interval to search for, we have to calculate which files are relevant; depending on the interval, this could be multiple .avro files. Then, for each relevant file, we:

  • Use the block list to figure out which block contains the avro row
  • Fetch the header block and the block containing the sensor data
  • Stitch the two blocks together to form a complete avro file.
  • Parse the avro file and return the result
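Here is how those steps could look for a single file (which files fall inside the interval depends on your naming scheme, so that part is left out). We reuse blob_client, to_block_id, and find_block_ranges from the earlier sketches and parse the stitched bytes with fastavro:

```python
import io
import fastavro

def read_sensor(blob_client, sensor_id):
    """Fetch one sensor's rows from one avro file stored as a block blob."""
    ranges = find_block_ranges(blob_client)

    header_offset, header_length = ranges[to_block_id("$$$_HEADER_$$$")]
    sensor_offset, sensor_length = ranges[to_block_id(sensor_id)]

    # Fetch only the header block and the block containing the sensor data.
    header_bytes = blob_client.download_blob(
        offset=header_offset, length=header_length).readall()
    sensor_bytes = blob_client.download_blob(
        offset=sensor_offset, length=sensor_length).readall()

    # Stitched together, the two blocks form a small but complete avro file.
    return list(fastavro.reader(io.BytesIO(header_bytes + sensor_bytes)))

readings = read_sensor(blob_client, "sensor-42")
```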

Writing and updating

The first time we're writing data to an interval is pretty straightforward. We generate a header and stage that as the first block. Afterwards we stage each row of our avro file as a separate block. We then commit all the blocks we just staged.
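A sketch of that first write, again with fastavro and the helpers from above; records_by_sensor maps each sensor id to its list of records:

```python
import io
import os
import fastavro
from azure.storage.blob import BlobBlock

def write_initial_file(blob_client, schema, records_by_sensor):
    """First write: one header block plus one avro row per sensor, then a single commit."""
    sync_marker = os.urandom(16)

    # Writing zero records yields just the avro header.
    header_buf = io.BytesIO()
    fastavro.writer(header_buf, schema, [], sync_marker=sync_marker)
    header_bytes = header_buf.getvalue()

    block_ids = [to_block_id("$$$_HEADER_$$$")]
    blob_client.stage_block(block_id=block_ids[0], data=header_bytes)

    for sensor_id, records in records_by_sensor.items():
        # Write a full file with the same schema and sync marker, then strip
        # the header so only the row (data block) remains.
        buf = io.BytesIO()
        fastavro.writer(buf, schema, records, sync_marker=sync_marker)
        row_bytes = buf.getvalue()[len(header_bytes):]

        block_id = to_block_id(sensor_id)
        blob_client.stage_block(block_id=block_id, data=row_bytes)
        block_ids.append(block_id)

    # Commit everything we just staged - the blob is now a complete avro file.
    blob_client.commit_block_list([BlobBlock(block_id=bid) for bid in block_ids])
```

When data for an interval arrives late, we instead update the file that is already there: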

  • We fetch the header block from the existing file, and extract the schema and sync marker
  • We stage each new row with the sensor id as the block id, using the schema and sync marker retrieved above.
  • We commit the old block ids along with the newly staged ones. This allows us to update the old file with the new data, without ever touching the old data.
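A sketch of that update, under the same assumptions as before: the header is the first block, its last 16 bytes are the sync marker, and the sensor ids being added are new to this file:

```python
import io
import fastavro
from azure.storage.blob import BlobBlock

def append_sensor_rows(blob_client, records_by_sensor):
    """Add new sensor rows to an existing avro block blob without rewriting it."""
    # Fetch the header block and recover the schema and sync marker from it.
    committed_blocks, _ = blob_client.get_block_list(block_list_type="committed")
    header_length = committed_blocks[0].size  # the header is always the first block
    header_bytes = blob_client.download_blob(offset=0, length=header_length).readall()

    schema = fastavro.reader(io.BytesIO(header_bytes)).writer_schema
    sync_marker = header_bytes[-16:]  # the avro header ends with the 16-byte sync marker

    # Re-serialize a header with the recovered schema so we know how many
    # bytes to strip from the files we write below.
    header_buf = io.BytesIO()
    fastavro.writer(header_buf, schema, [], sync_marker=sync_marker)
    header_prefix_length = len(header_buf.getvalue())

    block_ids = [block.id for block in committed_blocks]

    for sensor_id, records in records_by_sensor.items():
        # Encode the new rows with the existing schema and sync marker,
        # then keep only the row bytes.
        buf = io.BytesIO()
        fastavro.writer(buf, schema, records, sync_marker=sync_marker)
        row_bytes = buf.getvalue()[header_prefix_length:]

        block_id = to_block_id(sensor_id)
        blob_client.stage_block(block_id=block_id, data=row_bytes)
        block_ids.append(block_id)

    # Commit the old block ids along with the new ones - the old data is never touched.
    blob_client.commit_block_list([BlobBlock(block_id=bid) for bid in block_ids])
```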

In Summary

Our old system was based on the same principles of saving avro files to a blob storage, but it didn't use any of the fancy block functionality. This meant that when we had to query for a specific sensor, we had to fetch the entire file and then parse it on the client.
