
Performance: Python pandas DataFrame.to_csv append becomes gradually slower

In this situation you should profile your code (to see which function calls are taking the most time), so that you can check empirically whether it really is read_csv that is slow rather than something else...
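
For example, a minimal profiling sketch with Python's built-in cProfile (process_all_pickles here is a hypothetical placeholder for your own loop, not an existing function):

import cProfile
import pstats

def process_all_pickles():
    # placeholder standing in for the real loop that reads the pickles,
    # normalizes them and appends to the csv
    pass

cProfile.run('process_all_pickles()', 'profile_stats')
pstats.Stats('profile_stats').sort_stats('cumulative').print_stats(20)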

Looking through your code: first, there is a lot of copying and a lot of looping here (not enough vectorization)... whenever you see a loop, look for a way to remove it. Second, when you use things like zfill, I wonder whether you actually want to_fwf (fixed-width format) rather than to_csv?
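
For instance, zfill can be applied as a vectorized string operation instead of in a Python-level loop (the tiny frame below is toy data, just to show the idiom):

import pandas as pd

# toy frame standing in for your data
df = pd.DataFrame({'FileID': [1, 2], 'ID': [7, 42]})

# build the concatenated index and zero-pad it without looping over rows
df['ConcatIndex'] = 100000 * df.FileID + df.ID
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)
print(df)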

Do some sanity testing: are some files much bigger than others (which could cause you to hit swap)? Are you sure the largest file really has only 1200 rows? Have you checked this, e.g. using wc -l?
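
If you'd rather check from Python, a quick sketch like this would do (it assumes the same path and pickleFiles variables used in your loop):

import os
import pandas as pd

# quick sanity check on each pickle: row count and size on disk
for picklefile in pickleFiles:
    full = os.path.join(path, picklefile)
    n_rows = len(pd.read_pickle(full))
    print(picklefile, n_rows, 'rows,', os.path.getsize(full), 'bytes')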

IMO, I think this is unlikely to be garbage collection (as suggested in the other answer).

Here are some improvements to your code that should improve the runtime.

Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

import pandas as pd
from datetime import datetime

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)

# 'Real X' columns are normalized by the max of the corresponding 'X' column
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.replace('Real ', '') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)

# 'child' columns are normalized by the max of the corresponding 'desc' column
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child' + 'desc') for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)  # real and child columns were already removed above
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important, then discard the entries of m where .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

Stylistically, I would probably choose to wrap these parts up into functions; that also means more things can be garbage-collected, if that really was the issue...
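
As a rough sketch of what that could look like (the function names and signatures here are just illustrative, not a drop-in replacement):

def normalize(df, columns_to_drop, real_cols, real_cols_suffix,
              child_cols, child_cols_desc, remaining):
    # drop the excluded columns, then scale each group of columns by its max
    dftemp = df.drop(list(columns_to_drop), axis=1)
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()
    return dftemp

def append_to_csv(dftemp, csv_path, first):
    # write the header only for the very first file
    dftemp.to_csv(csv_path, mode='w' if first else 'a', header=first)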

A faster option would be to use pytables (HDF5Store), if you don't need the resulting output to be csv (but I expect you do)...
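
A minimal sketch of that alternative, assuming PyTables is installed (the store filename and the toy frame below are placeholders):

import pandas as pd

# append each normalized frame to an HDF5 table instead of a growing csv
store = pd.HDFStore('normalized.h5')                    # placeholder filename
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})   # toy frame standing in for dftemp
store.append('normalized', df)                          # appends rows to the 'normalized' table
store.close()

print(pd.read_hdf('normalized.h5', 'normalized'))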

The best thing to do is to profile your code, e.g. with %prun in IPython (see http://pynash.org/2013/03/06/timing-and-profiling.html). Then you can see whether it really is read_csv, and specifically where (which line of your code and which lines of pandas code).

Ah ha, I’d missed that you are appending all these to a single csv file. And in your prun it shows most of the time is spent in close, so let’s keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()

Each time the file is opened, it needs to seek to the end before it can append and start writing. It could be that this is the expensive part (I don't see why it should be that bad, but keeping the file open removes the need to do it).
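
As an illustrative micro-benchmark of that idea (plain text writes rather than pandas, just to compare re-opening the file on every append against keeping it open):

import timeit

def reopen_each_time(n=200):
    # open, append one line, and close the file on every iteration
    for i in range(n):
        with open('append_test_a.txt', 'a') as f:
            f.write('row %d\n' % i)

def keep_open(n=200):
    # open once and append all lines through the same handle
    with open('append_test_b.txt', 'a') as f:
        for i in range(n):
            f.write('row %d\n' % i)

print('reopen each time:', timeit.timeit(reopen_each_time, number=10))
print('keep file open:  ', timeit.timeit(keep_open, number=10))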
