Pandas "group by" query on large data in HDFStore?

Here is a complete example.

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, declaring as data_columns the columns
# we may want to group or filter on
with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store)

    print("\ndf:\n%s" % store['df'])

    # get the groups
    groups = store.select_column('df', 'A').unique()
    print("\ngroups:%s" % groups)

    # iterate over the groups and apply the aggregation to each one
    results = []
    for g in groups:

        # %r quotes the value so it is compared as a string literal
        grp = store.select('df', where='A == %r' % g)

        # this is a regular frame, aggregate however you would like
        results.append(grp[['D', 'E', 'F']].sum())


    print("\nresult:\n%s" % pd.concat(results, keys=groups))

os.remove(fname)

Output

store:
<class 'pandas.io.pytables.HDFStore'>
File path: groupby.h5
/df            frame_table  (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])

df:
      A    B      C         D         E         F
0   foo  one   dull -0.815212 -1.195488 -1.346980
1   foo  one   dull -1.111686 -1.814385 -0.974327
2   foo  one  shiny -1.069152 -1.926265  0.360318
3   foo  two   dull -0.472180  0.698369 -1.007010
4   bar  one   dull  1.329867  0.709621  1.877898
5   bar  one  shiny -0.962906  0.489594 -0.663068
6   bar  one  shiny -0.657922 -0.377705  0.065790
7   bar  two   dull -0.172245  1.694245  1.374189
8   foo  two  shiny -0.780877 -2.334895 -2.747404
9   foo  two  shiny -0.257413  0.577804 -0.159316
10  foo  one  shiny  0.737597  1.979373 -0.236070

groups:Index([bar, foo], dtype=object)

result:
bar  D   -0.463206
     E    2.515754
     F    2.654810
foo  D   -3.768923
     E   -4.015488
     F   -6.110789
dtype: float64

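Some caveats:

1) This approach makes sense if the number of groups is relatively low, on the order of hundreds or thousands. With more groups than that, there are more efficient (but more complicated) methods, and the function you apply (sum in this case) becomes more restricted.

Essentially, you would iterate over the entire store in chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine computing a mean: you would need to keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but this could potentially handle many groups (and be really fast).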

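2) Efficiency could be improved by saving the coordinates (e.g. the row locations of each group), but this is a bit more complicated.

3) Multi-grouping is not possible with the scheme above (it is possible, but requires an approach more like 2).

4) The columns you want to group by must be data_columns!

5) You can combine any other filter you wish in the select, by the way (which is a sneaky way of doing multi-grouping: you just form two lists of unique group values and iterate over their product. This is not very efficient if you have lots of groups, but it can work).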
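
To make the chunked idea from caveat 1 concrete, here is a minimal sketch of a group-wise mean that keeps only a running sum and a running count per group. The helper name `chunked_group_mean` and the chunk size are illustrative assumptions, not part of pandas. The demo feeds it in-memory slices so that it runs anywhere; with a store you would pass the iterator returned by `store.select(..., chunksize=...)` instead.

```python
import numpy as np
import pandas as pd


def chunked_group_mean(chunks, key, value_cols):
    """Group-wise mean over an iterable of DataFrame chunks.

    Hypothetical helper: only per-group running sums and counts are
    kept in memory, never the full data set.
    """
    total = None
    count = None
    for chunk in chunks:
        grouped = chunk.groupby(key)[value_cols]
        part_sum, part_count = grouped.sum(), grouped.count()
        # align on the group labels; a group missing from one side counts as 0
        total = part_sum if total is None else total.add(part_sum, fill_value=0)
        count = part_count if count is None else count.add(part_count, fill_value=0)
    # divide once at the end to turn the semi-collapsed state into a mean
    return total / count


# Demo data, sliced into chunks of 100 rows to imitate chunked reads.
# With an HDFStore the chunks could come from something like:
#     store.select('df', columns=['A', 'D', 'E'], chunksize=50000)
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.choice(['foo', 'bar', 'baz'], size=1000),
                   'D': rng.standard_normal(1000),
                   'E': rng.standard_normal(1000)})
chunks = (df.iloc[i:i + 100] for i in range(0, len(df), 100))

result = chunked_group_mean(chunks, 'A', ['D', 'E'])
expected = df.groupby('A')[['D', 'E']].mean()
print(result)
```

The same pattern works for any aggregation that can be rebuilt from partial results (sum, count, min, max, mean); something like a median cannot be combined this way, which is the restriction caveat 1 alludes to.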
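
The multi-grouping trick from caveat 5 could look roughly like the following. This is a sketch under assumptions: the file name, the tiny frame, and the variable names are made up for illustration, and it issues one query per (A, B) combination, so it is only reasonable when the number of combinations is small.

```python
import itertools
import os
import tempfile

import numpy as np
import pandas as pd

# small illustrative frame; A and B are the grouping columns
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar', 'foo', 'bar'],
                   'B': ['one', 'two', 'one', 'two', 'one', 'one'],
                   'D': np.arange(6, dtype='float64')})

pieces = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'multigroup.h5')  # hypothetical file name
    with pd.HDFStore(path) as store:
        # both grouping columns must be data_columns to be queryable
        store.append('df', df, data_columns=['A', 'B'])

        a_values = store.select_column('df', 'A').unique()
        b_values = store.select_column('df', 'B').unique()

        # one select per combination, with both filters combined in the where
        for a, b in itertools.product(a_values, b_values):
            where = '(A == %r) & (B == %r)' % (str(a), str(b))
            grp = store.select('df', where=where)
            if len(grp):  # skip combinations that do not occur
                pieces[(str(a), str(b))] = grp['D'].sum()

# tuple keys become a two-level index, like groupby(['A', 'B'])
result = pd.Series(pieces)
print(result)
```

For data that fits in memory this should match `df.groupby(['A', 'B'])['D'].sum()`; the point of the loop is that only one group's rows are loaded at a time.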

HTH

Let me know if this works for you.

