Big Data: How Should I Store It?
When faced with the prospect of collecting large sets of data, you have to make a decision about how you’re going to store it. There are three things to consider before getting to the data analysis.
Consideration #1: What type of data needs to be stored?
Your first step is to determine which pieces of data you want and what format you want them in. A “kitchen sink” approach, storing every measurement and piece of metadata you receive, is a safe option for future-proofing; the trade-off is that you’ll be storing much more data. If storage space is a concern, think critically about what you will actually need and then filter the data accordingly, as in the sketch below.
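As a minimal sketch of that kind of filtering, assuming pandas and an invented raw_measurements.csv that contains more columns than you need:

    import pandas as pd

    # Hypothetical: keep only the fields we expect to analyze,
    # rather than ingesting every column in the raw file.
    wanted = ["timestamp", "sensor_id", "value"]
    df = pd.read_csv("raw_measurements.csv", usecols=wanted)
    print(df.columns.tolist())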
You can also consider a change of data type: would a boolean or integer work better than a string (e.g., “True” or “False”)? If a piece of data will only take on a relatively small set of values, mapping it to an enumerated data type may be worthwhile. Look at the ranges your numeric data is likely to occupy: could you scale it to use more compact representations, such as single precision instead of double precision, or 8-bit integers instead of 16-bit?
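In pandas, for example, those conversions might look something like this (the column names and values here are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({
        "is_active": ["True", "False", "True"],
        "status": ["OK", "WARN", "OK"],
        "reading": [0.12345678, 0.98765432, 0.55555555],
        "count": [12, 340, 7],
    })

    df["is_active"] = df["is_active"] == "True"      # string -> boolean
    df["status"] = df["status"].astype("category")   # small value set -> enumerated type
    df["reading"] = df["reading"].astype("float32")  # double -> single precision
    df["count"] = pd.to_numeric(df["count"], downcast="unsigned")  # smallest unsigned int that fits

    print(df.dtypes)
    print(df.memory_usage(deep=True))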
Consideration #2: Is storage space prohibitive?
When you’re working in a limited-space environment, or with a huge amount of data, filtering and data type conversions alone may not be enough. Human-readable, text-based formats, like CSV or JSON, will generally require more space than binary formats, like Parquet. Some file formats, like HDF5, Avro, and Parquet, support compression of the data, although the degree of compression often depends on the type and nature of the data being stored. Some benchmarking with representative sample data may be necessary to determine which format provides the best compression ratio. Additionally, some file formats, like Avro, are splittable, meaning they can easily be broken into smaller files even when compression is applied.
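A minimal benchmarking sketch along those lines, assuming pandas with pyarrow installed and some invented, fairly repetitive sample data:

    import os
    import numpy as np
    import pandas as pd

    # Hypothetical sample: one million repetitive sensor readings.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "sensor": rng.choice(["a", "b", "c"], size=1_000_000),
        "value": rng.normal(size=1_000_000),
    })

    df.to_csv("sample.csv", index=False)
    df.to_parquet("sample_snappy.parquet", compression="snappy")
    df.to_parquet("sample_gzip.parquet", compression="gzip")

    for path in ["sample.csv", "sample_snappy.parquet", "sample_gzip.parquet"]:
        print(path, os.path.getsize(path), "bytes")

The ratios you see will vary with your own data, which is why benchmarking with representative samples matters.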
Consideration #3: Is write speed or read (query) speed more critical?
File formats also differ in how quickly data can be written to and read from them. Compressed formats may slow down both write and read operations, so you may face the common “speed versus space” trade-off. The storage layout within the file matters as well. For example, Avro files use a row-based layout: all of the data fields for a record are stored contiguously. In contrast, Parquet files use a columnar layout: the data fields of a given column, across all records, are stored contiguously. This has implications for how data is inserted into or queried from the file. If you frequently collect all values of a given column across all records, a Parquet file is better suited than an Avro file to provide quick access to that data.
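To illustrate the columnar case, here is a sketch using pyarrow to read a single column from the Parquet file written in the earlier (invented) example, without touching the other columns:

    import pyarrow.parquet as pq

    # The columnar layout lets the reader pull just the "value" column;
    # a row-based format would have to scan every full record instead.
    table = pq.read_table("sample_snappy.parquet", columns=["value"])
    print(table.num_rows, table.column_names)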
Summary
You should address these considerations before you start analyzing your data. Once you’ve addressed them, you’ll be well on your way to selecting the best file format for your big data processing needs.
Hellebore combines machine learning and big data to build world-class analysis tools. Let us know how we can help guide or implement a big data solution for you.
