
Comparison of compression libs on HDF in pandas

I needed to compare storage options for the company I am currently working for. They have quite specific data: very sparse (density around 10%), very wide (around 10k columns), with small datatypes (int8 or float16). Hence, I didn't trust the benchmarks that can be found on the internet to decide what to use in my application.

The most important factor for me was processing speed (during compression/decompression), but because the data will be downloaded from and uploaded to S3, size is also a concern. Since our data are still small-enough-to-fit-into-memory-but-with-caution, I also measured memory usage.
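
I can't share the company data, but a frame with a roughly similar shape can be generated along the lines below. This is just a quick sketch: make_sparse_frame, the sizes and the value range are made up; only the ~10% density, the width and the int8 dtype mirror my case.

import numpy as np
import pandas as pd

def make_sparse_frame(n_rows=50_000, n_cols=10_000, density=0.10, seed=0):
    # Wide int8 frame where roughly `density` of the cells are non-zero.
    rng = np.random.default_rng(seed)
    values = rng.integers(1, 100, size=(n_rows, n_cols), dtype=np.int8)
    # keep ~`density` of the cells, zero out the rest
    mask = rng.integers(0, 100, size=(n_rows, n_cols), dtype=np.int8) < int(density * 100)
    values[~mask] = 0
    return pd.DataFrame(values, columns=[f'c{i}' for i in range(n_cols)])

# roughly 0.5 GB in memory with these defaults; bump n_rows for a heavier test
df = make_sparse_frame()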

I used this code (which you can probably reuse for your own benchmark):

import os
from time import time

import pandas as pd
from memory_profiler import memory_usage

FILENAME = 'compressed_df'

def get_size(flnm):
    # File size on disk in MB
    return round(os.path.getsize(flnm) / (1024 * 1024), 2)

def store_df(original_df: pd.DataFrame, flnm: str, clib: str):
    # Write the frame as HDF5 with the given compression library at the highest level
    original_df.to_hdf(flnm, key='df', complib=clib, complevel=9)

def benchmark(original_df: pd.DataFrame):
    res = {}
    for clib in ['zlib', 'lzo', 'bzip2', 'blosc', 'blosc:blosclz', 'blosc:lz4',
                 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd']:
        flnm = f'{FILENAME}_{clib}.hdf'

        def strdf():
            return store_df(original_df, flnm, clib)

        started = time()
        # memory_usage runs strdf and samples memory every second (values in MiB)
        memus = memory_usage(strdf, interval=1)
        res[clib] = {'time [s]': time() - started,
                     'size [MB]': get_size(flnm),
                     'memory_usage': memus}  # the table below reports max(memus)
    return res
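
The table below also has a read_time column, which the code above doesn't measure. Here is a minimal sketch of how the read side can be timed; it reuses FILENAME and the imports from above, and benchmark_read is just my helper name, assuming the files written by benchmark() are still on disk.

def benchmark_read(clibs):
    # Time how long pd.read_hdf needs for each file written by benchmark()
    res = {}
    for clib in clibs:
        flnm = f'{FILENAME}_{clib}.hdf'
        started = time()
        df = pd.read_hdf(flnm, key='df')  # loads the whole frame back into memory
        res[clib] = {'read_time [s]': round(time() - started, 2)}
        del df
    return res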

And the results:

Algo           time [s]    size [MB]   max_memory_usage [MB]   ratio      read_time [s]
blosc          61.828629   244.00      4908.906250             6.452162   3.54
blosc:blosclz  59.289034   244.00      4907.546875             6.452162   4.01
blosc:lz4      59.681940   174.00      4906.246094             9.047859   3.41
blosc:snappy   3.978290    394.06      1812.039062             3.995147   4.43
bzip2          115.019357  69.70       4963.253906             22.587196  24.89
lzo            63.976084   171.00      4977.160156             9.206594   7.64
zlib           719.349377  84.60       5798.046875             18.609072  8.77
no-comp        4.300000    1545.00     1813.000000             1.018982   4.00
csv-no-comp    800.000000  1743.00     1813.000000             0.800000   600.00
  • no-comp stands for HDF without compression
  • csv-no-comp is the CSV variant without compression (you could compress it with e.g. gzip, but it is already slow enough as it is)
  • blosc and blosc:blosclz are identical ('blosc' simply uses the default blosclz codec)
  • I rounded the numbers for CSV to whole numbers myself; it just took ages

Conclusion

So for my specific case, I am going with snappy or lz4. The former has brilliant speed (it is actually faster than plain HDF!?) but produces a noticeably bigger file than lz4. Nevertheless, since we are going to download the files from Amazon S3 onto an AWS machine, the size difference matters less than the roughly 15 times faster compression. Furthermore, snappy doesn't need any more memory than the dataset itself takes up.

It is clear that using CSV doesn't make any sense for this kind of dataset. The file is big enough that readability, the main advantage of CSV, is ruined anyway; you are not going to open it in a text editor either. Plus, you effectively lose metadata about the columns, such as their datatypes. Reading time is also terrible.
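
To illustrate the dtype point with a toy example (column names and values are made up; this is not the company data): a round trip through HDF keeps the narrow dtypes, while a round trip through CSV silently inflates them to int64/float64.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int8),
                   'b': np.array([0.5, 1.5, 2.5], dtype=np.float16)})

df.to_hdf('roundtrip.hdf', key='df', complib='blosc:lz4', complevel=9)
print(pd.read_hdf('roundtrip.hdf', key='df').dtypes)   # a: int8, b: float16

df.to_csv('roundtrip.csv', index=False)
print(pd.read_csv('roundtrip.csv').dtypes)             # a: int64, b: float64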