Implementing A Fast Queryable Storage with Apache Avro and Azure Block Blobs

  • We need to allow querying on a given sensor for a given time interval
    Each sensor on a wind turbine has a unique sensor id. When you need to query this storage, you always query for a sensor id in a specific time interval. This is the only query type we need to support.
  • We need to allow incremental updates of the data
    Sometimes sensors go offline or turbines are a little slow to catch up. This means that for a given interval, we can't know when all of the sensor data will have arrived. Our solution needs to handle late updates to the data efficiently.
  • Our solution must be cost-efficient
    We also need to keep an eye on cost. We could meet all the other criteria by putting all of our data in a managed time-series database, but that would end up being very expensive.

Avro Row

The structure of an Avro file

The structure of an Azure Block Blob


Writing files

Stage your blocks

Commit the blocks

  • Stage the block with a unique id
  • Ask the blob for all the current block ids
  • Put our new block id at the appropriate place in the list of current block ids
  • Commit the list of block ids with our new block id in it
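
The four steps above can be sketched in Python. This is a minimal sketch, not the Azure SDK: `InMemoryBlockBlob` and `insert_block` are hypothetical names, and the in-memory class only imitates the stage/commit behaviour of a block blob so that the example runs on its own. Against real storage, the same three calls would go to the storage client instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InMemoryBlockBlob:
    """Hypothetical stand-in for a block blob (not the Azure SDK)."""
    staged: Dict[str, bytes] = field(default_factory=dict)     # uncommitted blocks
    committed: Dict[str, bytes] = field(default_factory=dict)  # committed blocks
    order: List[str] = field(default_factory=list)             # committed ids, in blob order

    def stage_block(self, block_id: str, data: bytes) -> None:
        # Staged blocks are not part of the blob until they are committed.
        self.staged[block_id] = data

    def get_block_list(self) -> List[str]:
        return list(self.order)

    def commit_block_list(self, block_ids: List[str]) -> None:
        # A commit may reference freshly staged blocks or already committed ones;
        # a staged block wins over a committed block with the same id.
        available = {**self.committed, **self.staged}
        self.committed = {bid: available[bid] for bid in block_ids}
        self.order = list(block_ids)
        self.staged.clear()

    def read_all(self) -> bytes:
        return b"".join(self.committed[bid] for bid in self.order)


def insert_block(blob: InMemoryBlockBlob, block_id: str, data: bytes, position: int) -> None:
    blob.stage_block(block_id, data)      # 1. stage the block with a unique id
    block_ids = blob.get_block_list()     # 2. ask the blob for the current block ids
    block_ids.insert(position, block_id)  # 3. put the new id at the appropriate place
    blob.commit_block_list(block_ids)     # 4. commit the list with the new id in it


blob = InMemoryBlockBlob()
insert_block(blob, "header", b"H", 0)
insert_block(blob, "row-b", b"B", 1)
insert_block(blob, "row-a", b"A", 1)  # lands between the header and row-b
print(blob.read_all())                # b'HAB'
```

Nothing becomes visible until the commit, so a reader never sees a half-written blob.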

Reading from block blobs

  • One block for the Avro header. Its block id is something unique, such as $$$_HEADER_$$$.
  • One block for each row in the Avro file. The block id here is the sensor id.
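
One practical detail when using sensor ids as block ids: at the REST level, Azure expects block ids to be Base64-encoded, at most 64 bytes before encoding, and of equal length for every block in a blob. The sketch below shows one way to satisfy that. The helper names, the fixed width of 32 bytes and the space padding are our own assumptions, and some client libraries may do the Base64 step for you.

```python
import base64

HEADER_ID = "$$$_HEADER_$$$"
ID_WIDTH = 32  # assumed fixed width in bytes; must not exceed 64


def encode_block_id(name: str, width: int = ID_WIDTH) -> str:
    """Pad a header or sensor id to a fixed width, then Base64-encode it."""
    raw = name.encode("utf-8")
    if len(raw) > width:
        raise ValueError(f"id {name!r} is longer than {width} bytes")
    return base64.b64encode(raw.ljust(width, b" ")).decode("ascii")


def decode_block_id(block_id: str) -> str:
    """Reverse encode_block_id (assumes ids never end in a space)."""
    return base64.b64decode(block_id).decode("utf-8").rstrip(" ")


header_block_id = encode_block_id(HEADER_ID)
sensor_block_id = encode_block_id("sensor-42")  # "sensor-42" is a made-up sensor id
print(len(header_block_id) == len(sensor_block_id))  # True
print(decode_block_id(sensor_block_id))              # sensor-42
```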

Querying For Specific Sensor Data

  • Use the block list to figure out which block contains the Avro row
  • Fetch the header block and the block containing the sensor data
  • Stitch the two blocks together to form a complete Avro file
  • Parse the Avro file and return the result
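
The lookup and stitching steps can be sketched as follows. We assume here that the committed block list gives us each block's id and size in blob order, so a block's byte range is the running sum of the sizes before it. `block_range`, `fetch_sensor_file` and the `read_range` callback are hypothetical names; in practice `read_range` would be a ranged download from the blob.

```python
from typing import Callable, List, Tuple

HEADER_ID = "$$$_HEADER_$$$"


def block_range(blocks: List[Tuple[str, int]], block_id: str) -> Tuple[int, int]:
    """Return (offset, length) of a block, given (id, size) pairs in blob order."""
    offset = 0
    for current_id, size in blocks:
        if current_id == block_id:
            return offset, size
        offset += size
    raise KeyError(block_id)


def fetch_sensor_file(
    read_range: Callable[[int, int], bytes],
    blocks: List[Tuple[str, int]],
    sensor_id: str,
) -> bytes:
    """Fetch only the header block and one sensor block, and join them."""
    header = read_range(*block_range(blocks, HEADER_ID))
    row = read_range(*block_range(blocks, sensor_id))
    return header + row  # header + one data block forms a small, complete file


# Fake blob contents standing in for real storage.
blob_bytes = b"HEADaaabbbbb"
blocks = [(HEADER_ID, 4), ("sensor-1", 3), ("sensor-2", 5)]
stitched = fetch_sensor_file(lambda off, n: blob_bytes[off:off + n], blocks, "sensor-2")
print(stitched)  # b'HEADbbbbb'
```

The stitched bytes can then be handed to an Avro library for parsing. For example, `fastavro.reader(io.BytesIO(stitched))` should accept them, assuming the header block holds exactly the Avro header and the sensor block holds one complete Avro data block.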

Writing and updating

  • We fetch the header block from the existing file, and extract the schema and sync marker.
  • We stage each new row with the sensor id as the block id, using the schema and sync marker retrieved above.
  • We commit the old block ids along with the newly staged ones. This allows us to update the old file with the new data, without ever touching the old data.
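
A rough sketch of the byte-level work behind these steps, under a few stated assumptions: the header block holds exactly the Avro header (which ends with the 16-byte sync marker), the records are already serialized with the existing schema, and no compression codec is used. All function names are our own. Extracting the schema itself means parsing the header's metadata map, which is best left to an Avro library and is not shown.

```python
from typing import List

SYNC_SIZE = 16  # Avro sync markers are 16 bytes long


def encode_long(n: int) -> bytes:
    """Encode an Avro long: zigzag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def sync_marker_from_header(header: bytes) -> bytes:
    """The Avro header ends with the sync marker, so take its last 16 bytes."""
    return header[-SYNC_SIZE:]


def frame_row(serialized_records: bytes, record_count: int, sync: bytes) -> bytes:
    """Wrap serialized records as an Avro data block: count, size, data, sync."""
    return (
        encode_long(record_count)
        + encode_long(len(serialized_records))
        + serialized_records
        + sync
    )


def merged_block_ids(old_ids: List[str], new_ids: List[str]) -> List[str]:
    """Keep the old order; append ids that are new. A re-staged id keeps its slot."""
    merged = list(old_ids)
    for block_id in new_ids:
        if block_id not in merged:
            merged.append(block_id)
    return merged


fake_header = b"Obj\x01" + b"<metadata>" + bytes(range(16))  # made-up header bytes
sync = sync_marker_from_header(fake_header)
new_block = frame_row(b"abc", 1, sync)  # b"abc" stands in for one encoded record
to_commit = merged_block_ids(["header", "sensor-1"], ["sensor-1", "sensor-2"])
print(to_commit)  # ['header', 'sensor-1', 'sensor-2']
```

Each framed row would be staged under its sensor id, and `to_commit` is the list that gets committed. Reusing the existing sync marker is what lets new blocks sit next to old ones under the same header.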

In Summary




Software Developer at SCADA Minds

Gustav Wengel
