I needed to compare storage options for the company I currently work for. Their data is quite specific: very sparse (density around 10%), very wide (10k columns), and stored in small datatypes (int8 or float16). Because of that, I didn't trust the generic benchmarks found on the internet to decide what to use in my application.

Processing speed (compression and decompression) mattered most to me, but because the data will be uploaded to and downloaded from S3, file size is also a concern. Since our data is still small enough to fit into memory, but only with caution, I also measured memory usage.

I used this code (you can probably reuse it for your own benchmark):

import os
from time import time
import pandas as pd
from memory_profiler import memory_usage

FILENAME='compressed_df'

def get_size(flnm):
    return round(os.path.getsize(flnm) / (1024*1024), 2)

def store_df(original_df: pd.DataFrame, flnm: str, clib: str):
    original_df.to_hdf(flnm, key='df', complib=clib, complevel=9)

def benchmark(original_df: pd.DataFrame):
    res = {}
    for clib in ['zlib', 'lzo', 'bzip2', 'blosc', 'blosc:blosclz', 'blosc:lz4',
                 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd']:
        flnm = f'{FILENAME}_{clib}.hdf'
        def strdf():
            return store_df(original_df, flnm, clib)
        started = time()
        # memory_usage runs strdf and samples process memory (in MiB) once per second
        memus = memory_usage(strdf, interval=1)
        # the elapsed time therefore includes the profiler's sampling overhead
        res[clib] = {'time [s]': time() - started, 'size [MB]': get_size(flnm),
                     'max_memory_usage': max(memus)}
    return res
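
To try the benchmark without access to the real data, a stand-in DataFrame of a similar shape can be generated. The sketch below is only an assumption about what such data might look like: int8 values, roughly 10% non-zero cells, and "sparse" taken to mean "mostly zeros". The function name `make_sparse_wide_df` and its parameters are hypothetical and not part of the original benchmark.

```python
import numpy as np
import pandas as pd


def make_sparse_wide_df(n_rows=1_000, n_cols=10_000, density=0.1, seed=0):
    """Build a wide int8 DataFrame in which about `density` of the cells are non-zero."""
    rng = np.random.default_rng(seed)
    # True where a cell should carry a value; expected share of True cells is `density`.
    mask = rng.random((n_rows, n_cols)) < density
    # Non-zero payload in the range 1..99, which fits comfortably into int8.
    values = rng.integers(1, 100, size=(n_rows, n_cols), dtype=np.int8)
    data = np.where(mask, values, 0).astype(np.int8)
    return pd.DataFrame(data, columns=[f"c{i}" for i in range(n_cols)])
```

Calling `benchmark(make_sparse_wide_df())` would then exercise every compressor on the synthetic frame. Compression behaviour depends heavily on the actual value distribution, so numbers obtained this way are not comparable with the ones reported for the real data.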

And the results:

Algo	time [s]	size [MB]	max_memory_usage [MiB]	ratio	read_time [s]
blosc	61.828629	244.00	4908.906250	6.452162	3.54
blosc:blosclz	59.289034	244.00	4907.546875	6.452162	4.01
blosc:lz4	59.681940	174.00	4906.246094	9.047859	3.41
blosc:snappy	3.978290	394.06	1812.039062	3.995147	4.43
bzip2	115.019357	69.70	4963.253906	22.587196	24.89
lzo	63.976084	171.00	4977.160156	9.206594	7.64
zlib	719.349377	84.60	5798.046875	18.609072	8.77
no-comp	4.300000	1545.00	1813.000000	1.018982	4.00
csv-no-comp	800.000000	1743.00	1813.000000	0.800000	600.00

no-comp stands for HDF without compression
csv-no-comp is plain CSV without compression (you could add e.g. gzip, but it is already slow enough...)
blosc and blosc:blosclz produce identical files (blosclz is the default Blosc codec, so one is effectively an alias for the other)
the numbers for CSV are rounded by me to whole numbers; it just took ages
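
The ratio and read_time [s] columns in the table are not produced by the benchmark function shown earlier. A minimal sketch of how such values could be obtained follows. It assumes that the ratio is the uncompressed in-memory size divided by the file size on disk, which may not be exactly how the figures in the table were computed. The helper names `compression_ratio` and `timed` are hypothetical.

```python
import os
from time import perf_counter


def compression_ratio(in_memory_bytes: int, flnm: str) -> float:
    """Uncompressed in-memory size divided by the on-disk file size."""
    return in_memory_bytes / os.path.getsize(flnm)


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    started = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - started
```

With pandas, the read time for one file could then be taken as `timed(pd.read_hdf, flnm, key='df')[1]`, and the in-memory size as `int(original_df.memory_usage(deep=True).sum())`.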

Conclusion

So for my specific case, I am going with snappy or lz4. Snappy has brilliant speed (it is actually faster than plain HDF!?), but its file is relatively big compared to lz4's (394 MB vs. 174 MB). Nevertheless, since we are going to download from Amazon S3 to an AWS machine, the size difference matters less than the 15 times shorter compression time. Furthermore, snappy doesn't need any more memory than the size of the dataset itself.
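
The size-versus-speed trade-off can be made concrete with a little arithmetic. The sketch below takes the write time, file size and read time for snappy and lz4 from the results table and adds a single transfer at a hypothetical sustained bandwidth. The bandwidth values and the model itself (one write, one transfer, one read, no storage cost) are my assumptions, not measurements.

```python
# Figures copied from the results table (write time, file size, read time).
RESULTS = {
    "blosc:snappy": {"write_s": 3.978290, "size_mb": 394.06, "read_s": 4.43},
    "blosc:lz4": {"write_s": 59.681940, "size_mb": 174.00, "read_s": 3.41},
}


def end_to_end_seconds(algo: str, bandwidth_mb_s: float) -> float:
    """Write + one transfer + read, for an assumed bandwidth in MB/s."""
    r = RESULTS[algo]
    return r["write_s"] + r["size_mb"] / bandwidth_mb_s + r["read_s"]


def break_even_bandwidth_mb_s(fast: str, small: str) -> float:
    """Bandwidth below which the smaller-but-slower file wins end to end."""
    f, s = RESULTS[fast], RESULTS[small]
    # Extra compute time paid for the smaller file.
    extra_cpu_s = (s["write_s"] + s["read_s"]) - (f["write_s"] + f["read_s"])
    # Extra data that has to be transferred for the faster codec.
    extra_mb = f["size_mb"] - s["size_mb"]
    return extra_mb / extra_cpu_s


if __name__ == "__main__":
    print(break_even_bandwidth_mb_s("blosc:snappy", "blosc:lz4"))
    for bw in (1, 10, 100):  # hypothetical bandwidths in MB/s
        print(bw, {a: round(end_to_end_seconds(a, bw), 1) for a in RESULTS})
```

Under this simple model the break-even sits at roughly 4 MB/s: on any faster link snappy finishes first, which is consistent with choosing it for transfers between S3 and an AWS machine.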

It is clear that CSV doesn't make sense for this kind of dataset. The file is big enough that readability, the main advantage of CSV, is lost anyway: you are not going to open it in a text editor. Plus, you effectively lose column metadata such as datatypes. Reading time is also terrible.