Reducing the dtype size of a NumPy/pandas array

I ran into memory problems when processing a very large dataframe.

The problem is that pandas uses the float64 and int64 NumPy dtypes by default, even when that is totally unnecessary (e.g. when you only have binary values). Furthermore, it is not even possible to change this default behaviour.
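To make the defaults concrete, here is a minimal sketch that builds the same binary data with and without an explicit dtype and compares the storage used. The variable names are only illustrative. Passing `dtype=` per Series is a workaround for individual columns, not a change to the global default.

```python
import pandas as pd

# Default inference: Python ints become int64, Python floats become float64
ints = pd.Series([1, 0, 1, 0])
floats = pd.Series([1.0, 0.0, 1.0, 0.0])
print(ints.dtype, floats.dtype)  # int64 float64

# The same binary values stored with an explicitly requested 1-byte dtype
compact = pd.Series([1, 0, 1, 0], dtype="int8")
print(ints.nbytes, compact.nbytes)  # 32 4
```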

Hence, I wrote a function that finds the smallest possible dtype for a given array.

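import numpy as np
import pandas as pd

def safely_reduce_dtype(ser):  # pandas.Series or numpy.array
    # Keep only the letters of the dtype name: "float64" -> "float", "int32" -> "int"
    orig_dtype = "".join([x for x in ser.dtype.name if x.isalpha()])
    mx = 1
    for val in np.asarray(ser):  # works for both a Series and an ndarray
        # np.min_scalar_type picks an unsigned type for non-negative integers,
        # so mirror them to a negative value to get the signed size they need
        if orig_dtype == "int" and val >= 0:
            val = -val - 1
        new_itemsize = np.min_scalar_type(val).itemsize
        if mx < new_itemsize:
            mx = new_itemsize
    new_dtype = orig_dtype + str(mx * 8)
    return new_dtype  # or convert the pandas.Series with ser.astype(new_dtype)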

So, e.g.:

>>> import pandas as pd
>>> serie = pd.Series([1,0,1,0], dtype='int32')
>>> safely_reduce_dtype(serie)
'int8'  # from int32

>>> float_serie = pd.Series([1,0,1,0], dtype='float64')
>>> safely_reduce_dtype(float_serie)
'float16'  # from float64
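For large integer columns, looping over every value in Python can be slow. A possible alternative, sketched here under the assumption that the column holds signed integers only, compares the column's minimum and maximum against the limits reported by `np.iinfo`. The helper name `smallest_int_dtype` is hypothetical and not part of NumPy or pandas.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(ser):
    """Hypothetical helper: smallest signed integer dtype that holds every value."""
    lo, hi = ser.min(), ser.max()
    # Try the candidates from narrowest to widest and stop at the first that fits
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit into a signed 64-bit integer")

print(smallest_int_dtype(pd.Series([1, 0, 1, 0])))  # int8
print(smallest_int_dtype(pd.Series([0, 200])))      # int16
```

Only two reductions (min and max) touch the data, so the cost no longer depends on a per-element Python loop.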

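Using safely_reduce_dtype you can reduce the size of your dataframe significantly: by up to a factor of 4 for float columns (float64 to float16) and up to a factor of 8 for integer columns (int64 to int8).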
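The saving can be checked with `DataFrame.memory_usage`. The sketch below uses a made-up two-column frame (the column names `flag` and `score` are assumptions for illustration) and casts it with `astype`. The float column goes to float32 rather than float16 as a more conservative choice for precision.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 1000 rows, one binary int64 column and one float64 column
df = pd.DataFrame({
    "flag": np.tile([0, 1], 500).astype("int64"),
    "score": np.linspace(0.0, 1.0, 1000),
})
before = df.memory_usage(index=False).sum()  # bytes used by the column data

# Cast each column to a smaller dtype that still holds its values
small = df.astype({"flag": "int8", "score": "float32"})
after = small.memory_usage(index=False).sum()

print(before, after)  # 16000 5000
```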


pandas 0.19 added pd.to_numeric(series, downcast='float'). The function above was written before that release and can still be used with older versions.
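A short sketch of the built-in route, assuming pandas 0.19 or later. Note that `downcast='float'` goes no lower than float32, while `downcast='integer'` can go down to int8.

```python
import pandas as pd

ints = pd.Series([1, 0, 1, 0], dtype="int64")
floats = pd.Series([1.0, 0.0, 1.0, 0.0], dtype="float64")

# downcast picks the smallest dtype of the requested kind that fits the data
small_ints = pd.to_numeric(ints, downcast="integer")
small_floats = pd.to_numeric(floats, downcast="float")

print(small_ints.dtype)    # int8
print(small_floats.dtype)  # float32
```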