What is the fastest way to write HDF5 files with Python?

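I would avoid chunking the data and would store it as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise application I've been working on into HDF5, and was able to pack about 4.5 billion compound datatypes as 450,000 datasets, each containing an array of 10,000 values. Reads and writes now seem fairly instantaneous, but they were painfully slow when I initially tried to chunk the data.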

Just a thought!

Update:

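These are a couple of snippets lifted from my actual code (I'm coding in C rather than Python, but you should get the idea of what I'm doing) and modified for clarity. I'm just writing long unsigned integers into arrays (10,000 values per array) and reading them back when I need an actual value.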

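This is my typical writer code. In this case, I'm simply writing a sequence of long unsigned integers into a sequence of arrays and loading each array into HDF5 as it is created.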

//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;
//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];
//Create the SS data file.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
//NUMPERDATASET = 10,000, so we get a one-dimensional array of 10,000 values
hsize_t dsDim[1] = {NUMPERDATASET};
//Create the data space.
hid_t dspace = H5Screate_simple(1, dsDim, NULL);
//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    //Name the data set after its index. g_strdup_printf allocates, so free the name after use.
    gchar *dsName = g_strdup_printf("%lu", i);
    //Create the data set.
    hid_t dssSet = H5Dcreate2(ssdb, dsName, H5T_NATIVE_ULONG, dspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    g_free(dsName);
    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    //Close the data set.
    H5Dclose(dssSet);
}
//Release the data space
H5Sclose(dspace);
//Close the data files.
H5Fclose(ssdb);
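
The writer above names each dataset after its ordinal position. As a minimal sketch of that naming step on its own, the helper below formats the index into a caller-supplied buffer with `snprintf` instead of GLib's `g_strdup_printf`. This is a substitution for illustration, not the author's code, and `dataset_name` and `NAME_BUF_LEN` are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical buffer size: room for any 64-bit decimal number plus the terminator. */
#define NAME_BUF_LEN 24

/* Format dataset index i as its decimal name, e.g. 42 -> "42".
   Returns 0 on success, -1 if the name did not fit in buf. */
static int dataset_name(unsigned long i, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%lu", i);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a stack buffer there is nothing to free inside the loop, which may matter when the loop runs 450,000 times.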

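This is a slightly modified version of my reader code. There are more elegant ways of doing this (for example, I could use hyperslabs to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.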

unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    //MAXSSVALUE = 4,500,000,000; i takes the smaller of MAXSSVALUE-1 and nnValue
    //to avoid an index-out-of-range error
    unsigned long int i = MIN(MAXSSVALUE-1,nnValue);
    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    //Open the data set. Each dataset is an array of 10,000 unsigned long ints
    //and is named after the integer quotient of i divided by the number per
    //data set. g_strdup_printf allocates, so free the name after use.
    gchar *dsName = g_strdup_printf("%lu", i / NUMPERDATASET);
    hid_t dSet = H5Dopen2(db, dsName, H5P_DEFAULT);
    g_free(dsName);
    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    //Close the data set.
    H5Dclose(dSet);
    //Close the data file.
    H5Fclose(db);
    //Return the indexed value by using the modulus of i divided by the number per dataset
    return ssValue[i % NUMPERDATASET];
}

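The main takeaways are the inner loop in the writer code, and the integer division and modulo operations that yield the index of the dataset array and the index of the desired value within that array. Let me know if this is clear enough for you to put together something similar or better in h5py. In C this is dead simple, and it gave me significantly better read and write times than a chunked-dataset solution. Plus, since I can't use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.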
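
The integer division and modulo step can be looked at in isolation. The sketch below is not the author's code: `ss_location` and `locate` are hypothetical names, and it uses `uint64_t` rather than `unsigned long` on the assumption that the code might be built on a platform where `unsigned long` is 32 bits (64-bit Windows, for instance), where 4,500,000,000 would not fit.

```c
#include <stdint.h>

/* Values taken from the comments in the answer's code. */
#define NUMPERDATASET 10000ULL
#define MAXSSVALUE    4500000000ULL

/* Hypothetical helper type: which dataset holds a value, and where inside it. */
typedef struct {
    uint64_t dataset;  /* dataset number, which is also its name in the file */
    uint64_t offset;   /* position inside that dataset's 10,000-value array */
} ss_location;

static ss_location locate(uint64_t index)
{
    /* Clamp to the last valid index, as the reader code does with MIN(). */
    if (index > MAXSSVALUE - 1) {
        index = MAXSSVALUE - 1;
    }
    ss_location loc = { index / NUMPERDATASET, index % NUMPERDATASET };
    return loc;
}
```

For example, global index 12345 lands in dataset 1 at offset 2345, and the last valid index, 4,499,999,999, lands in dataset 449999 at offset 9999. The same two operations (`//` and `%`) carry over directly to Python.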

python 2022/1/1 18:30:38 202 views
