Optimal HDF5 dataset chunk shape for reading rows

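First, a few general points. It is very important to know that each individual chunk can only be read or written as a whole. The chunk cache is what lets h5py avoid excessive disk I/O, but its default size is only 1 MB. In many cases it should be increased, which is discussed later.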
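To make the "whole chunk" rule concrete, here is a minimal sketch that counts how many chunks a rectangular selection overlaps and how many uncompressed bytes have to be moved for it. The helper names `chunks_touched` and `bytes_moved` are hypothetical (they are not part of h5py), and the shapes are the ones used in the code further down.

```python
import math

def chunks_touched(chunk_shape, selection):
    """Count the chunks overlapped by a rectangular selection.

    selection holds one (start, stop) pair per axis, stop exclusive.
    """
    count = 1
    for chunk_len, (start, stop) in zip(chunk_shape, selection):
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        count *= last - first + 1
    return count

def bytes_moved(chunk_shape, selection, itemsize=4):
    """Uncompressed bytes of all chunks the selection touches."""
    return chunks_touched(chunk_shape, selection) * math.prod(chunk_shape) * itemsize

shape = (639038, 1000)      # float32 dataset from the code below
chunk_shape = (100, 1000)

one_row = ((5, 6), (0, shape[1]))
row_chunks = chunks_touched(chunk_shape, one_row)
row_bytes = bytes_moved(chunk_shape, one_row)
useful_row_bytes = shape[1] * 4

one_column = ((0, shape[0]), (7, 8))
column_chunks = chunks_touched(chunk_shape, one_column)

print("row read:", row_chunks, "chunk,", row_bytes // useful_row_bytes, "x the requested bytes")
print("column access:", column_chunks, "chunks")
```

With this chunk shape a single-row read pulls in one chunk holding 100 rows, while a single-column access touches every chunk along the row axis.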

An example:

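In this case the read speed won't be too bad (although the chunk size is a little small), because we read only the data we actually use. But what happens when we write to that dataset? If we access a column, one floating-point number in each chunk is written. This means we actually write the whole dataset (25.5 GB) on every iteration and read the whole dataset every other time. The reason is that a chunk you modify has to be read first if it is not cached (I assume a chunk-cache size below 25.5 GB here).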
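The numbers behind that claim can be checked with a little arithmetic. This sketch assumes a float32 dataset of shape (639038, 10000), which is the shape commented out in the code below, and a chunk shape of (1, 10000). The chunk shape is an assumption inferred from "one floating-point number in each chunk"; it is not stated explicitly in the text.

```python
shape = (639038, 10000)
chunk_shape = (1, 10000)   # assumption: one row per chunk
itemsize = 4               # bytes per float32 value

dataset_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize

# Writing one column changes one value in every chunk along the row axis,
# and every changed chunk has to be written back as a whole.
chunks_per_column = -(-shape[0] // chunk_shape[0])   # ceiling division
rewritten_bytes = chunks_per_column * chunk_bytes
useful_bytes = shape[0] * itemsize
amplification = rewritten_bytes // useful_bytes

print("dataset size in GB:", dataset_bytes / 1e9)
print("rewritten per column in GB:", rewritten_bytes / 1e9)
print("write amplification:", amplification)
```

Under these assumptions every column write rewrites as many bytes as the whole dataset contains, just to store about 2.5 MB of new values.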

So what can we improve here? In a case like this we have to compromise between read/write speed and the memory used by the chunk cache.

A setup that I assume will give decent read and write speed:

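In general there is no single right chunk size or shape; it depends heavily on the task at hand. Never choose a chunk size or shape without thinking about the chunk cache as well. For random reads and writes, RAM is orders of magnitude faster than the fastest SSD.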
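One way to weigh chunk shape against the chunk cache is to estimate how much cache a column-wise write needs so that every touched chunk can stay in memory between iterations. The following is a rough sketch, not a tuning rule: `column_stripe_bytes` is a hypothetical helper, the candidate chunk shapes other than (100, 1000) are made up for comparison, and the budget is the `rdcc_nbytes` value used in the code below.

```python
def column_stripe_bytes(shape, chunk_shape, itemsize=4):
    """Uncompressed bytes of all chunks that a single-column write touches."""
    chunks_along_rows = -(-shape[0] // chunk_shape[0])   # ceiling division
    return chunks_along_rows * chunk_shape[0] * chunk_shape[1] * itemsize

shape = (639038, 10000)
cache_budget = 1024**2 * 4000   # same value as rdcc_nbytes in the code below

# Hypothetical candidates; only (100, 1000) appears in the answer's code.
candidates = [(1, 10000), (100, 1000), (1000, 100), (100, 10000)]

fits = {}
for chunk_shape in candidates:
    needed = column_stripe_bytes(shape, chunk_shape)
    fits[chunk_shape] = needed <= cache_budget
    print(chunk_shape, round(needed / 1e9, 2), "GB needed, fits:", fits[chunk_shape])
```

A shape that fits the budget for column writes is not automatically good for row reads: narrower chunks mean more chunks per row, so both access patterns need to be checked against the same cache.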

Regarding your problem: I would simply read the random rows. The improperly sized chunk cache is your real problem.

Compare the performance of the following code with your version:

import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5='Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape=(100, 1000)
    Array=np.array(np.random.rand(shape[0]),np.float32)

    # We are using about 4 GB of chunk-cache memory here ("rdcc_nbytes").
    # rdcc_nslots is passed as an int rather than the float 1e7.
    f = h5.File(File_Name_HDF5, 'w',rdcc_nbytes =1024**2*4000,rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape ,dtype=np.float32,chunks=chunk_shape,compression="lzf")

    #Writing columns
    t1=time.time()
    for i in range(0,shape[1]):
        d[:,i:i+1]=np.expand_dims(Array, 1)

    f.close()
    print(time.time()-t1)

    # Reading random rows
    # Reading one row actually reads a whole chunk (100 rows), but accessing
    # a row that is already in the cache gives a huge speed-up.
    f = h5.File(File_Name_HDF5,'r',rdcc_nbytes=1024**2*4000,rdcc_nslots=int(1e7))
    d = f["Test"]
    for j in range(0,639):
        t1=time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        # (high is exclusive, so shape[0] makes every row selectable)
        inds=np.random.randint(0, high=shape[0], size=1000)
        for i in range(0,inds.shape[0]):
            Array=np.copy(d[inds[i],:])
        print(time.time()-t1)
    f.close()

I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:

def Writing():
    File_Name_HDF5='Test.h5'

    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape=(100, 1000)
    Array=np.array(np.random.rand(shape[0]),np.float32)

    # Writing_1 normal indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w',rdcc_nbytes =1024**2*4000,rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape ,dtype=np.float32,chunks=chunk_shape,compression="lzf")

    # Writing columns with a slice, which keeps the column axis
    t1=time.time()
    for i in range(shape[1]):
        d[:,i:i+1]=np.expand_dims(Array, 1)

    f.close()
    print(time.time()-t1)

    # Writing_2 simplest form of fancy indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w',rdcc_nbytes =1024**2*4000,rdcc_nslots=int(1e7))
    d = f.create_dataset('Test', shape ,dtype=np.float32,chunks=chunk_shape,compression="lzf")

    # Writing columns with an integer index
    t1=time.time()
    for i in range(shape[1]):
        d[:,i]=Array

    f.close()
    print(time.time()-t1)

On my HDD this takes 34 seconds for the first version and 78 seconds for the second.

Other · 2022/1/1 18:25:44 · 483 views
