
Performance: Python pandas DataFrame.to_csv append becomes gradually slower

In this situation you should profile your code (to see which function calls are taking the most time), so that you can check empirically whether it really is read_csv that is slow rather than something else...
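
For example, a minimal profiling sketch with Python's built-in cProfile (process_all_pickles here is a hypothetical placeholder for your own loop, not an existing function):

import cProfile
import pstats

def process_all_pickles():
    # placeholder standing in for the real loop that reads the pickles,
    # normalizes them and appends to the csv
    pass

cProfile.run('process_all_pickles()', 'profile_stats')
pstats.Stats('profile_stats').sort_stats('cumulative').print_stats(20)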

Looking through your code: first, there is a lot of copying and a lot of looping here (not enough vectorization)... whenever you see a loop, look for a way to remove it. Second, when you use things like zfill, I wonder whether you actually want to_fwf (fixed-width format) rather than to_csv?
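
For instance, zfill can be applied as a vectorized string operation instead of in a Python-level loop (the tiny frame below is toy data, just to show the idiom):

import pandas as pd

# toy frame standing in for your data
df = pd.DataFrame({'FileID': [1, 2], 'ID': [7, 42]})

# build the concatenated index and zero-pad it without looping over rows
df['ConcatIndex'] = 100000 * df.FileID + df.ID
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)
print(df)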

Do some sanity testing: are some files much bigger than others (which could cause you to hit swap)? Are you sure the largest file really has only 1200 rows? Have you checked this, e.g. using wc -l?
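
If you'd rather check from Python, a quick sketch like this would do (it assumes the same path and pickleFiles variables used in your loop):

import os
import pandas as pd

# quick sanity check on each pickle: row count and size on disk
for picklefile in pickleFiles:
    full = os.path.join(path, picklefile)
    n_rows = len(pd.read_pickle(full))
    print(picklefile, n_rows, 'rows,', os.path.getsize(full), 'bytes')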

IMO, I think this is unlikely to be garbage collection (as suggested in the other answer).

Here are some improvements to your code that should improve the runtime.

Since the columns are fixed, I would extract the column calculations and vectorize the real, child and other normalizations. Use apply rather than iterating (for zfill).

import pandas as pd
from datetime import datetime

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)

# 'Real X' columns are normalized by the max of the corresponding 'X' column
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.replace('Real ', '') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)

# 'child' columns are normalized by the max of the corresponding 'desc' column
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child' + 'desc') for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)  # real and child columns were already removed above
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important, then discard the entries of m where .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

Stylistically, I would probably choose to wrap these parts up into functions; that also means more things can be garbage-collected, if that really was the issue...
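
As a rough sketch of what that could look like (the function names and signatures here are just illustrative, not a drop-in replacement):

def normalize(df, columns_to_drop, real_cols, real_cols_suffix,
              child_cols, child_cols_desc, remaining):
    # drop the excluded columns, then scale each group of columns by its max
    dftemp = df.drop(list(columns_to_drop), axis=1)
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()
    return dftemp

def append_to_csv(dftemp, csv_path, first):
    # write the header only for the very first file
    dftemp.to_csv(csv_path, mode='w' if first else 'a', header=first)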

A faster option would be to use pytables (HDF5Store), if you don't need the resulting output to be csv (but I expect you do)...
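
A minimal sketch of that alternative, assuming PyTables is installed (the store filename and the toy frame below are placeholders):

import pandas as pd

# append each normalized frame to an HDF5 table instead of a growing csv
store = pd.HDFStore('normalized.h5')                    # placeholder filename
df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})   # toy frame standing in for dftemp
store.append('normalized', df)                          # appends rows to the 'normalized' table
store.close()

print(pd.read_hdf('normalized.h5', 'normalized'))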

The best thing to do is to profile your code, e.g. with %prun in IPython (see http://pynash.org/2013/03/06/timing-and-profiling.html). Then you can see whether it really is read_csv, and specifically where (which line of your code and which lines of pandas code).

Ah ha, I’d missed that you are appending all these to a single csv file. And in your prun it shows most of the time is spent in close, so let’s keep the file open:

# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()

Each time the file is opened, it needs to seek to the end before it can append and start writing. It could be that this is the expensive part (I don't see why it should be that bad, but keeping the file open removes the need to do it).
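
As an illustrative micro-benchmark of that idea (plain text writes rather than pandas, just to compare re-opening the file on every append against keeping it open):

import timeit

def reopen_each_time(n=200):
    # open, append one line, and close the file on every iteration
    for i in range(n):
        with open('append_test_a.txt', 'a') as f:
            f.write('row %d\n' % i)

def keep_open(n=200):
    # open once and append all lines through the same handle
    with open('append_test_b.txt', 'a') as f:
        for i in range(n):
            f.write('row %d\n' % i)

print('reopen each time:', timeit.timeit(reopen_each_time, number=10))
print('keep file open:  ', timeit.timeit(keep_open, number=10))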
