I went from CSV to HDF storage, because I was able to get much faster read/write times and hence much better feedback loop when testing my analysis.
Here is a very short list of current limitations of both
table formats I found very frustrating.
- Can't store category format
- Can't use usecols or specific ranges of query (like from row 100th to 200th)
- Can store category format, have usecols and query of specific ranges
- Supports only non-wide dataframes. Snippet below will end with very uninformative error:
import pandas as pd df = pd.DataFrame(columns=['Some Teribly Long String With Many Characters' + str(i) for i in range(10000)]) df.loc = 3 df.to_hdf('test.hdf', 'main', format='table')