
How do I import a gzip file larger than the RAM limit into a Pandas DataFrame? ("Killed: 9" — use HDF5?)

I would do it this way:

import pandas as pd

filename = 'filename.gzip'      # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA','colB','colC','colZ'] # put here a list of all your columns
cols_to_index = ['colA','colZ'] # put here the list of YOUR columns, that you want to index
chunksize = 10**6               # you may want to adjust it ...

store = pd.HDFStore(hdf_fn)

# read the gzipped file in chunks, so it never has to fit into RAM at once
for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in the HDFStore in a single pass at the end
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
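Once the store is built, the indexed columns let you pull back just the rows you need instead of reloading 90 GB. A minimal sketch of how you might query it (the filter value 'some_value' is a placeholder, not from the original data):

import pandas as pd

# select only rows matching a condition on an indexed data column;
# 'some_value' is hypothetical - substitute a value from your own data
df_subset = pd.read_hdf('result.h5', key='my_huge_df',
                        where="colA == 'some_value'")

# or iterate over the stored table in chunks to stay within RAM
for chunk in pd.read_hdf('result.h5', key='my_huge_df', chunksize=10**6):
    pass  # process each chunk here

This is why index=False is passed during the append loop: building the index once at the end with create_table_index is much faster than rebuilding it on every chunk, and the where= queries above only work on columns listed in data_columns.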