
python – Random sampling from a file

5b51 2022/1/14 8:22:49 python Word count 6413 Views 585 Source www.jb51.cc/python

Overview

I have a CSV file of about 40 GB and 1,800,000 lines. I want to randomly sample 10,000 lines and print them to a new file.

Right now, my approach is to use sed as follows:

(sed -n '$vars' < input.txt) > output.txt

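where $vars is a randomly generated list of lines (e.g. 1p; 14p; 1700p; ……; 10203p).

While this works, it takes about 5 minutes per execution. That is not a huge amount of time, but I was wondering if anybody has ideas on how to make it quicker?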

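Seeking will also work with files that have unequal line lengths (e.g. non-ASCII characters in UTF-8 encoding), but it requires a minor modification to the method. If your lines are unequal, you can seek to an estimated location and then scan to the start of the next line. This is still quite efficient, because you skip roughly 4 MB (40 GB spread over 10,000 samples) for every ~20 KB you actually read. Since you are selecting byte positions rather than line positions, your sampling uniformity is slightly compromised, and you cannot know for sure which line number you are reading.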
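
As a back-of-the-envelope check using the numbers from the question, the sketch below estimates the average line size and the average distance between two sampled positions. The helper name `estimate_sizes` is made up for illustration, and treating "40 GB" as 40 × 10^9 bytes is an assumption.

```python
def estimate_sizes(file_size, line_count, selection_count):
    """Return (average bytes per line, average bytes between samples)."""
    avg_line_bytes = file_size / line_count
    avg_gap_bytes = file_size / selection_count
    return avg_line_bytes, avg_gap_bytes


# Figures from the question; "40 GB" is taken as 40 * 10**9 bytes (assumption)
avg_line, avg_gap = estimate_sizes(40 * 10**9, 1_800_000, 10_000)
print('average line: {:.1f} KB'.format(avg_line / 1000))
print('average gap between samples: {:.1f} MB'.format(avg_gap / 10**6))
```

With these inputs the average line comes out at about 22 KB and the average gap at about 4 MB, so almost all of the file is skipped rather than read.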

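You can implement the solution directly in Python, generating the line numbers in the same script. Here is a sample of how to handle a file in which all lines have the same number of bytes (typically ASCII encoding):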

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
# Binary mode, so that tell() and seek() work in plain byte offsets
with open(file_name, 'rb') as file:
    # Read the first line to get the length in bytes (newline included)
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size
    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is the start of the file,
        # not from current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        # The file is opened in binary mode, so decode the line yourself
        print('Line #{}: {}'.format(line_index, file.readline().decode('ascii')), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
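
The sample above relies on every line occupying the same number of bytes. If you are not completely sure about that, a cheap spot check along the following lines may help. This is only a sketch: the function name `lines_look_fixed_width` and the `probe_count` parameter are hypothetical, and a passing check makes the assumption plausible without proving it.

```python
import os
import random


def lines_look_fixed_width(file_name, probe_count=100):
    """Spot-check that the lines of `file_name` all have the same byte length."""
    file_size = os.path.getsize(file_name)
    with open(file_name, 'rb') as file:
        line_size = len(file.readline())
        if line_size == 0:
            return False  # Empty file: nothing to sample
        # The size must be a whole number of lines, or one byte short
        # when the last line has no trailing newline.
        if file_size % line_size not in (0, line_size - 1):
            return False
        line_count = file_size // line_size
        for _ in range(probe_count):
            # Jump to where a random line should start and measure it
            index = random.randrange(line_count)
            file.seek(index * line_size)
            if len(file.readline()) != line_size:
                return False
    return True
```

Calling `lines_look_fixed_width('file.csv')` before sampling reads only a handful of lines, so it adds very little to the total run time.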

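If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You simply generate random byte offsets and skip to the next full line after each offset. The following implementation assumes that you know for a fact that no line is longer than 40 KB. You have to do something like this if your CSV contains non-ASCII Unicode characters encoded in UTF-8, because even lines with the same number of characters will then contain different numbers of bytes. In that case you have to open the file in binary mode; otherwise, when you skip to a random byte that happens to fall in the middle of a character, you may run into a decoding error: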

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)
# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)
with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the rest of the current (partial) line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
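
The code above leaves `make_offsets` unspecified. One possible sketch, not the original author's implementation, is to draw distinct values from a shortened range and then spread them out by adding the minimum gap back in. Using a margin equal to `max_line_bytes` at the end of the file is an assumption made here for illustration.

```python
import random


def make_offsets(selection_count, file_size, max_line_bytes):
    """Return sorted byte offsets that are at least `max_line_bytes` apart.

    All offsets fall in [0, file_size - max_line_bytes); the margin of
    `max_line_bytes` at the end of the file is an assumption.
    """
    usable = file_size - max_line_bytes
    # Set aside the room that the mandatory gaps will take up
    reduced = usable - (selection_count - 1) * (max_line_bytes - 1)
    # random.sample raises ValueError if the file is too small
    base = sorted(random.sample(range(max(reduced, 0)), selection_count))
    # Distinct sorted values differ by at least 1, so after the shift
    # neighbouring offsets differ by at least max_line_bytes
    return [b + i * (max_line_bytes - 1) for i, b in enumerate(base)]
```

Like `random.sample` in the fixed-width example, this raises `ValueError` when the file is too small to hold the requested number of well-separated offsets.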

All the code here is Python 3.



