Comparison of compression libs on HDF in pandas
I needed to compare storage options for the company I am currently working for. They have quite specific data: very sparse (density is around 10%), very wide (around 10k columns), with small datatypes (`int8` or `float16`). Hence, I didn't trust the benchmarks that can be found on the internet to decide what to use in my application.
The most important factor for me was processing speed (during compression/decompression), but since the data will be uploaded to and downloaded from S3, size is also a concern. And because our data are still small-enough-to-fit-into-memory-but-with-caution, I also measured memory usage.
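I obviously can't share the real data, but if you want to reproduce the benchmark, a DataFrame with roughly the same characteristics can be generated synthetically. This is a hypothetical sketch; the shape, seed and column names are my own placeholders, not the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: ~10% non-zero cells,
# thousands of columns, small dtypes (int8 here).
n_rows, n_cols, density = 100_000, 10_000, 0.1

rng = np.random.default_rng(42)
values = rng.integers(0, 127, size=(n_rows, n_cols), dtype=np.int8)
mask = rng.random((n_rows, n_cols)) < density  # keep ~10% of the values
df = pd.DataFrame(np.where(mask, values, 0).astype(np.int8),
                  columns=[f'c{i}' for i in range(n_cols)])
```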
I used the following code (which you can probably reuse for your own benchmarks):
```python
import os
from time import time

import pandas as pd
from memory_profiler import memory_usage

FILENAME = 'compressed_df'


def get_size(flnm):
    # file size in MB
    return round(os.path.getsize(flnm) / (1024 * 1024), 2)


def store_df(original_df: pd.DataFrame, flnm: str, clib: str):
    original_df.to_hdf(flnm, key='df', complib=clib, complevel=9)


def benchmark(original_df: pd.DataFrame):
    res = {}
    for clib in ['zlib', 'lzo', 'bzip2', 'blosc', 'blosc:blosclz', 'blosc:lz4',
                 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd']:
        flnm = f'{FILENAME}_{clib}.hdf'

        def strdf():
            return store_df(original_df, flnm, clib)

        started = time()
        # memory_usage runs strdf and samples memory once per second;
        # the table below reports the maximum of these samples
        memus = memory_usage(strdf, interval=1)
        res[clib] = {'time [s]': time() - started,
                     'size [MB]': get_size(flnm),
                     'memory_usage': memus}
    return res
```
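The code above only measures writing. The `read_time [s]` column in the table was measured separately; a minimal sketch of how that can be done, reusing the imports from the snippet above (this helper is my addition, not part of the original benchmark):

```python
def benchmark_read(flnm: str) -> float:
    # time how long it takes to load the stored DataFrame back, in seconds
    started = time()
    pd.read_hdf(flnm, key='df')
    return time() - started
```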
And the results:
Algo | time [s] | size [MB] | max_memory_usage [MB] | ratio | read_time [s] |
---|---|---|---|---|---|
blosc | 61.828629 | 244.00 | 4908.906250 | 6.452162 | 3.54 |
blosc:blosclz | 59.289034 | 244.00 | 4907.546875 | 6.452162 | 4.01 |
blosc:lz4 | 59.681940 | 174.00 | 4906.246094 | 9.047859 | 3.41 |
blosc:snappy | 3.978290 | 394.06 | 1812.039062 | 3.995147 | 4.43 |
bzip2 | 115.019357 | 69.70 | 4963.253906 | 22.587196 | 24.89 |
lzo | 63.976084 | 171.00 | 4977.160156 | 9.206594 | 7.64 |
zlib | 719.349377 | 84.60 | 5798.046875 | 18.609072 | 8.77 |
no-comp | 4.300000 | 1545.00 | 1813.000000 | 1.018982 | 4.00 |
csv-no-comp | 800.000000 | 1743.00 | 1813.000000 | 0.800000 | 600.00 |
- `no-comp` stands for HDF without compression
- `csv-no-comp` is the CSV variant without compression (you could e.g. gzip it, but it is already slow as it is...)
- `blosc` and `blosc:blosclz` are identical (`blosclz` is presumably an alias for `blosc`, or vice versa)
- the numbers for CSV are rounded to whole numbers by me; it just took ages
Conclusion
So for my specific case, I am going with `snappy` or `lz4`. The former has brilliant speed (it's actually faster than plain HDF!?), but produces a relatively large file compared to `lz4`. Nevertheless, since we are going to download from Amazon S3 onto an AWS machine anyway, the difference in size matters far less than the roughly 15 times faster compression. Furthermore, it doesn't need more memory than the size of the dataset itself.
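For completeness, switching codecs is just a matter of the `complib` argument. A minimal usage sketch; the file name and the toy DataFrame are arbitrary placeholders:

```python
import numpy as np
import pandas as pd

# toy stand-in for the real (sparse, wide, int8) data
df = pd.DataFrame(np.zeros((1_000, 100), dtype=np.int8))

df.to_hdf('data_snappy.hdf', key='df', complib='blosc:snappy', complevel=9)
# 'blosc:lz4' works exactly the same way
restored = pd.read_hdf('data_snappy.hdf', key='df')
```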
It is clear that using CSV doesn't make any sense for this kind of dataset. The file is big enough that readability (the main advantage of CSV) is ruined anyway, and you are not going to open it in a text editor either. Plus, you effectively lose metadata about the columns, such as their datatypes. Reading time is also terrible.
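To illustrate the metadata point: a round trip through CSV silently widens the small dtypes, while HDF preserves them. A tiny sketch with throwaway file names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5, dtype=np.int8)})

df.to_csv('roundtrip.csv', index=False)
df.to_hdf('roundtrip.hdf', key='df')

print(pd.read_csv('roundtrip.csv')['a'].dtype)            # int64 -- the int8 is gone
print(pd.read_hdf('roundtrip.hdf', key='df')['a'].dtype)  # int8 -- preserved
```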