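There is only one way to avoid reading the whole file sequentially up to the last line you sample, and I am surprised that none of the answers so far has mentioned it: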
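You have to seek to an arbitrary position inside the file and read some bytes. If you have a typical line length, as you said, 3 or 4 times that value should be enough. Then split the chunk you read on the newline character ("\n") and pick the second field: the first field is usually only the tail of the line you landed in, while the second is a complete line at a random position.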
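To see why the second field is the one to keep, here is a minimal sketch on an in-memory byte string. The sample data and the names `data`, `offset` and `chunk_size` are made up for illustration only.

```python
# Illustrative data: four short lines, laid out as they would be in a file
data = b"alpha\nbravo\ncharlie\ndelta\n"

offset = 2        # an arbitrary position, here in the middle of "alpha"
chunk_size = 15   # a few times the typical line length

chunk = data[offset:offset + chunk_size]
fields = chunk.split(b"\n")

partial_tail = fields[0]   # rest of the line we happened to land in
picked_line = fields[1]    # first complete line after the offset

print(partial_tail, picked_line)
```

The last field (`fields[-1]`) can be cut off by the end of the chunk in the same way, which is why the chunk should span several typical lines.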
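Also, to be able to seek consistently within the file, open it in "binary read" mode, which means you have to handle the conversion of end-of-line markers yourself.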
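As a rough illustration of the binary-mode point, the sketch below writes a small file with "\r\n" line endings, seeks to an arbitrary byte offset, and strips the "\r" by hand. The helper name `read_line_after` and the sample file are assumptions made for this example. In Python 3, a file opened in text mode only supports seeking to offsets previously returned by `tell()` (or zero), which is one more reason to use `"rb"` here.

```python
import os
import tempfile


def read_line_after(path, offset, chunk_size=64):
    """Return the first complete line after `offset`, or None if none was read."""
    with open(path, "rb") as f:   # binary mode: offsets are plain byte counts
        f.seek(offset)
        chunk = f.read(chunk_size)
    fields = chunk.split(b"\n")
    if len(fields) < 3:           # no line bounded by two newlines in this chunk
        return None
    # Binary mode does no newline translation, so drop a trailing "\r" ourselves
    return fields[1].rstrip(b"\r").decode("utf-8")


# Hypothetical sample file with Windows-style line endings
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"first\r\nsecond\r\nthird\r\nfourth\r\n")
    sample_path = tmp.name

line = read_line_after(sample_path, 3)
os.remove(sample_path)
print(line)
```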
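This technique cannot tell you the line number of the line it read, so you keep the offsets of the selected lines in the file to avoid repeats: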
#!/usr/bin/env python3
import os
import random

CHUNK_SIZE = 1000  # should be 3 or 4 times the typical line length
PATH = "/var/log/cron"


def pick_next_random_line(file, offset):
    """Return (line_offset, line) for the first full line after `offset`."""
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(b"\n")
    # Fewer than three fields means no complete line was read here
    if len(lines) < 3:
        return None, None
    line_offset = offset + chunk.find(b"\n") + 1
    # The file is opened in binary mode, so strip a trailing "\r" by hand
    return line_offset, lines[1].rstrip(b"\r").decode("utf-8", "replace")


def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    with open(path, "rb") as input_file:
        for _ in range(n):
            # Note: this never exits if the file has fewer than n distinct lines to pick
            while True:
                offset, line = pick_next_random_line(
                    input_file, random.randint(0, max(0, length - CHUNK_SIZE)))
                if offset is not None and offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results


if __name__ == "__main__":
    print(get_n_random_lines(PATH))
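
One practical caveat with the `while True` loop above: if the file holds fewer distinct lines than requested, it never exits. A minimal variant that caps the number of attempts might look like the sketch below. `sample_lines` and `max_attempts` are names invented for this sketch, and the temporary file only stands in for a real log.

```python
import os
import random
import tempfile


def sample_lines(path, n=5, chunk_size=1000, max_attempts=1000):
    """Collect up to n distinct lines by seeking to random byte offsets."""
    size = os.path.getsize(path)
    picked = {}  # byte offset of the line start -> line text
    with open(path, "rb") as f:
        for _ in range(max_attempts):
            if len(picked) >= n:
                break
            offset = random.randint(0, max(0, size - chunk_size))
            f.seek(offset)
            fields = f.read(chunk_size).split(b"\n")
            if len(fields) < 3:
                continue  # no complete line in this chunk, try another offset
            line_offset = offset + len(fields[0]) + 1
            text = fields[1].rstrip(b"\r").decode("utf-8", "replace")
            picked.setdefault(line_offset, text)
    return list(picked.values())


# Stand-in data: 200 numbered lines in a temporary file
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"".join(b"entry %d\n" % i for i in range(200)))
    demo_path = tmp.name

sample = sample_lines(demo_path, n=5, chunk_size=40)
os.remove(demo_path)
print(sample)
```

Because the technique always skips the field it lands in, the very first line of the file is never returned, and a line that follows a long line is more likely to be picked than one that follows a short line. If that bias matters, this approach is only an approximation of uniform sampling.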