When faced with the prospect of collecting large sets of data, you have to decide how you’re going to store it. There are three things to consider before you get to the analysis: which data you keep and in what types, how much space it takes up, and how quickly it can be written and read.
Your first step is to determine which pieces of data you want, and what format you want them in. A "kitchen sink" approach, storing every measurement or piece of metadata you get, is a safe option for future-proofing, but the trade-off is that you’ll store much more data. If storage space is a concern, think critically about what you will actually need and filter the data accordingly.
You can also consider a change of data type: would a boolean or integer work better than a string (e.g. "True" or "False")? If a piece of data will only take on a relatively small set of values, mapping it first to an enumerated data type may be worthwhile. Look at the likely ranges numeric data will occupy: could you scale them to make use of more compact representations, such as single-precision instead of double-precision floats, or 8-bit integers instead of 16-bit?
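As a rough illustration of these type choices, the sketch below counts the bytes needed for the same values under different representations, using only the Python standard library. The `Status` enum, the `storage_sizes` function, and the sample readings (0.0 to 100.0 in steps of 0.5) are hypothetical and exist only for this example.

```python
from array import array
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical enumerated type for a field with few distinct values."""
    OK = 0
    WARN = 1
    FAIL = 2


def storage_sizes(n: int) -> dict:
    """Return the bytes needed to hold n values under several representations."""
    flags = [i % 2 == 0 for i in range(n)]
    statuses = [("OK", "WARN", "FAIL")[i % 3] for i in range(n)]
    # Hypothetical sensor readings between 0.0 and 100.0 in steps of 0.5.
    readings = [(i % 201) * 0.5 for i in range(n)]

    return {
        # Booleans written out as the strings "True" / "False".
        "bool_as_text": sum(len(str(f)) for f in flags),
        # One byte per boolean.
        "bool_as_byte": len(bytes(flags)),
        # Category names as text versus one-byte enum codes.
        "status_as_text": sum(len(s) for s in statuses),
        "status_as_enum": len(bytes(Status[s] for s in statuses)),
        # Double precision versus single precision.
        "float64": len(array("d", readings).tobytes()),
        "float32": len(array("f", readings).tobytes()),
        # Scaling by 2 maps 0.0..100.0 onto integers 0..200, which fit in 8 bits.
        "scaled_uint8": len(array("B", (round(r * 2) for r in readings)).tobytes()),
    }


if __name__ == "__main__":
    for name, size in storage_sizes(1000).items():
        print(f"{name:>15}: {size} bytes")
```

The scaled 8-bit version only works because the assumed readings have a known range and resolution; real data would need that checked before converting.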
When working in limited-space environments, or with a huge amount of data, filtering and data type conversions alone may not be enough. Human-readable text-based formats, like CSV or JSON, will generally require more space than binary formats, like Parquet. Some file formats, like HDF5, Avro, and Parquet, allow for compression of the data, although the degree of compression they provide will often depend on the type and nature of the data being stored. Some benchmarking with representative sample data may be necessary to determine which format provides the best compression ratio. Additionally, some file formats (like Avro) are splittable, meaning that they can easily be broken into smaller files even when compression is applied.
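A benchmark of this kind can be quite small. The sketch below uses only the Python standard library as a stand-in: it compares a JSON encoding against a fixed-width binary encoding of the same records, raw and under two general-purpose compressors. It does not use Parquet, Avro, or HDF5 themselves, and the record layout and the `make_sample` and `benchmark` functions are assumptions made for illustration. For a real decision you would substitute your own sample data and the formats you are actually considering.

```python
import json
import lzma
import struct
import zlib


def make_sample(n: int) -> list:
    """Build n hypothetical records; replace with representative real data."""
    return [
        {"id": i, "temp": 20.0 + (i % 50) * 0.1, "ok": i % 7 != 0}
        for i in range(n)
    ]


def benchmark(records: list) -> dict:
    """Return raw and compressed sizes in bytes for a text and a binary encoding."""
    as_json = json.dumps(records).encode("utf-8")

    # Fixed-width binary layout: uint32 id, float32 temperature, 1-byte flag.
    packer = struct.Struct("<If?")
    as_binary = b"".join(
        packer.pack(r["id"], r["temp"], r["ok"]) for r in records
    )

    results = {}
    for name, payload in (("json", as_json), ("binary", as_binary)):
        results[name] = {
            "raw": len(payload),
            "zlib": len(zlib.compress(payload)),
            "lzma": len(lzma.compress(payload)),
        }
    return results


if __name__ == "__main__":
    for fmt, sizes in benchmark(make_sample(10_000)).items():
        print(fmt, sizes)
```

Which combination comes out smallest depends on the data, which is the reason to run the comparison on a sample that resembles what you will actually store.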
File formats differ in how quickly data can be written and read. Compression may slow down both write and read operations, so you may face the common "speed versus space" tradeoff. The layout of the data within the file can affect this as well. For example, Avro files are stored in a row-based format (all data fields for a record are stored contiguously) while Parquet files are stored in a columnar format (the values of a given column across all records are stored contiguously). This has implications for the kinds of inserts and queries each format handles well. If you frequently collect all values from a given column across all records, a Parquet file will be better suited than an Avro file to provide quick data access.
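The difference between the two layouts can be shown with a toy example. The sketch below is not how Avro or Parquet encode data internally; it only packs the same two-field records (a hypothetical integer `id` and a float `value`) row by row and column by column, then reads back the `value` column from each. All function names are made up for this illustration.

```python
import struct

# One record in the row layout: uint32 id followed by float32 value.
ROW = struct.Struct("<If")


def to_row_layout(ids, values) -> bytes:
    """Store each record's fields next to each other: id0 v0 id1 v1 ..."""
    return b"".join(ROW.pack(i, v) for i, v in zip(ids, values))


def to_column_layout(ids, values) -> bytes:
    """Store each column as one contiguous block: id0 id1 ... v0 v1 ..."""
    n = len(ids)
    return struct.pack(f"<{n}I", *ids) + struct.pack(f"<{n}f", *values)


def values_from_rows(buf: bytes) -> list:
    """Reading one column means visiting every record and skipping its id."""
    return [value for _id, value in ROW.iter_unpack(buf)]


def values_from_columns(buf: bytes, n: int) -> list:
    """Reading one column is a single contiguous read after the n ids."""
    offset = 4 * n  # n uint32 ids, 4 bytes each
    return list(struct.unpack_from(f"<{n}f", buf, offset))


if __name__ == "__main__":
    ids = [1, 2, 3]
    values = [0.5, 1.5, 2.5]
    rows = to_row_layout(ids, values)
    cols = to_column_layout(ids, values)
    print(values_from_rows(rows))
    print(values_from_columns(cols, len(ids)))
```

Both layouts hold the same bytes in a different order. The row layout makes appending or fetching a whole record simple, while the column layout lets a single-column query skip every field it does not need.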
You should address these considerations before you start analyzing your data. Once you’ve addressed them, you’ll be well on your way to selecting the best file format for your big data processing needs.
Hellebore combines machine learning and big data to build world-class analysis tools. Let us know how we can help guide or implement a big data solution for you.