The HDF5 format is designed to address some of the limitations of the HDF4 library and to meet current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 Award. HDF5 simplifies the file structure to include only two major types of object: datasets and groups.
The HDF5 plugin software is provided for convenience and is composed of the following registered (compression) filters contributed by users: BZIP2, JPEG, LZF, BLOSC, MAFISC, LZ4, Bitshuffle, and ZFP. These registered third-party filters extend HDF5 to support compression and other filters not included in the HDF5 library itself.

HDF5 is a great mechanism for storing large numerical arrays of homogeneous type, for data models that can be organized hierarchically and that benefit from tagging datasets with arbitrary metadata. It is quite different from SQL-style relational databases. HDF5 files can be read and written in R using the rhdf5 package, which is part of the Bioconductor collection of packages. HDF4 files can also be handled via the rgdal package, but the process is more cumbersome.

HDFView is a Java-based tool for browsing and editing NCSA HDF4 and HDF5 files. Starting with a tree view of all top-level objects in a file's hierarchy, HDFView allows a user to browse through any HDF4 or HDF5 file, descend through the hierarchy, and navigate among the file's data objects.
hdf5.node is a Node.js module for reading and writing the HDF5 file format.
Documentation
API documentation is available at http://hdf-ni.github.io/hdf5.node
See http://hdf-ni.github.io/hdf5.node/doc/install-setup.html for the native requirements and details. If your native HDF5 libraries aren't in the default location, you can set the path with the --hdf5_home_linux switch on this project as well as on dependent projects.
For Mac and Windows the switches are --hdf5_home_mac and --hdf5_home_win.
To install with yarn, you first need to configure it so that it knows where the libraries are; a sketch follows below.
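A minimal sketch, assuming HDF5 is installed under /usr/local/hdf5; the path and the yarn invocation are assumptions, with npm shown for comparison using the documented switch:

```bash
# npm: pass the documented switch so the build can find the native libraries
npm install hdf5 --hdf5_home_linux=/usr/local/hdf5

# yarn: set the same variable in its config first (assumed mechanism)
yarn config set hdf5_home_linux /usr/local/hdf5
yarn add hdf5
```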
Note: If node-gyp isn't installed, install it globally first.
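The standard global install via npm:

```bash
# node-gyp is required to build the native addon; install it globally
npm install -g node-gyp
```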
Quick start to open and read from an h5 file
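A minimal sketch, following the API documented at http://hdf-ni.github.io/hdf5.node; the file path, group name, and dataset name are illustrative:

```javascript
var hdf5 = require('hdf5').hdf5;
var h5lt = require('hdf5').h5lt;
var Access = require('hdf5/lib/globals').Access;

// Open an existing file read-only (path is illustrative)
var file = new hdf5.File('/tmp/foo.h5', Access.ACC_RDONLY);

// Open a group and read one of its datasets
var group = file.openGroup('pmc');
var data = h5lt.readDataset(group.id, 'Refractive Index');
console.log(data.length);

group.close();
file.close();
```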
Notes on Recent Releases
Note: Release v0.3.5
- Strictly for building on Windows. Tested with VS 2017
Note: Release v0.3.4
- Reference attributes on datasets and groups are now available properties.
- Reserved properties such as type, rank, rows, etc. are now settable in options for dataset functions.
- Typescript definition files now available.
- For static native linking, a link_type command line switch is provided in binding.gyp (darwin and win untested).
- Added custom 64-bit signed (Int64) and unsigned (Uint64) integer attribute read/write, since these aren't yet supported by JavaScript.
- Added a file method enableSingleWriteMultiRead (a no-op if the native version is older than 1.10.x).
- Synchronous iterate and visit are now available on file and group children.
- Bug fixes on dimensioning have been made.
Note: Asynchronous I/O is coming, but not in this release.
Note: Release v0.3.3 contains a minor fix for fixed-length strings in arrays, handling the situation where the strings are contiguous without all having null bytes.
Note: Release v0.3.2 is tested with nodejs v6.11.2, v7.5.0, and v8.4.0. The code was changed to allow v8.4.0 to work while still supporting the earlier versions. It may work as far back as v4.2.1; let me know if you have a version in between that needs testing. Variable-length arrays of strings can now be read as regions.
Note: Release v0.3.1 is based on new V8 API changes coming with nodejs v7.
Note: Release v0.1.0 was built with nodejs v4.2.1. If you want nodejs v0.12.x, stay with release v0.0.20. npm releases will continue with the nodejs v4.x.x line, and any fixes or features needed by prior versions will come from github branches.
Note: Release v0.0.20 provides prebuilts with hdf5-1.8.15-patch1. If you want hdf5-1.8.14, stay with v0.0.19.
Philosophy
This module, hdf5.node, is intended to be a pure API for reading and writing HDF5 files. Graphical user interfaces or other layers should be implemented in separate modules.
Unlike wrappers of the HDF5 APIs in other languages, this interface takes advantage of the compatibility between V8 and HDF5. The result is a direct map to JavaScript behavior with the least amount of data copying and coding work for the user. Hopefully you won't need to write yet another layer in your code to accomplish your goals.
Other Feature Notes
node::Buffer and streams are being investigated so that the only destination of native HDF5 data is the client browser window, or the client in general.
Dimension Scales
Mostly implemented. Still missing is H5DSiterate_scales; a way to make callback functions from the native side has been found, and the plan is to finish this and apply the same technique to the other HDF5 iterators.
High-level Functions for Region References, Hyperslabs, and Bit-fields
An interface based on the standard HDF5 library is being written. Currently you can write and read a subset of a rank-two dataset; other ranks may work but are untested. See the tutorial at http://hdf-ni.github.io/hdf5.node/tut/subset_tutorial.html for an example applied to node Buffers, and the sketch below.
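A sketch of reading a region, assuming the start/stride/count option names from the subset tutorial; the file, group, and dataset names are illustrative:

```javascript
var hdf5 = require('hdf5').hdf5;
var h5lt = require('hdf5').h5lt;
var Access = require('hdf5/lib/globals').Access;

var file = new hdf5.File('/tmp/matrix.h5', Access.ACC_RDONLY);
var group = file.openGroup('data');

// Read a 2x3 hyperslab starting at row 1, column 0 of a rank-two dataset
// (option names follow the subset tutorial; treat them as assumptions)
var slab = h5lt.readDatasetAsBuffer(group.id, 'matrix', {
  start:  [1, 0],
  stride: [1, 1],
  count:  [2, 3]
});

group.close();
file.close();
```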
Filters and Compression
Filters and compression are being tested; the gzip filter is working. For some applications, retrieving the still-compressed data from the h5 file would reduce the number of compressions and decompressions. For example, an image could be sent to the client as-is instead of being unzipped and rezipped on the server side.
Third-party filters can be used. These require separately compiled libraries but are otherwise independent; the native HDF5 library picks them up from the HDF5_PLUGIN_PATH environment variable, as shown below.
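For instance, assuming the compiled filter libraries live under /usr/local/hdf5/lib/plugin (the path is illustrative):

```bash
# Point native HDF5 at the directory containing the compiled filter plugins
export HDF5_PLUGIN_PATH=/usr/local/hdf5/lib/plugin
```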
Image
The h5im namespace is being designed to meet the HDF5 Image Specification 1.2 (http://www.hdfgroup.org/HDF5/doc/ADGuide/ImageSpec.html). Hyperslabs/regions of images can now be read.
Contributors
- Christian Nienhaus (@NINI1988) added TypeScript definitions and contributed many pull requests and bug fixes for the native HDF5 calls.
- John Shumway (@shumway) refurbished the documentation when the project was split into an organization.
HDF5 Format Plugin
Introduced in release: 1.18.
Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data [1]. Originally developed at the National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF [2].
This plugin enables Apache Drill to query HDF5 files.
Configuring the HDF5 Format Plugin
This plugin has three configuration options, which are described in the table below.
Option | Default | Description
---|---|---
type | (none) | Set to "hdf5" to make use of this plugin.
extensions | ".h5" | A list of the file extensions used to identify HDF5 files. Typically HDF5 uses .h5 or .hdf5 as file extensions.
defaultPath | null | Defines which path Drill will query for data. Typically this should be left as null in the configuration file; its usage is explained below.
Example Configuration
For most uses, the configuration below will suffice to enable Drill to query HDF5 files.
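A minimal sketch of the format entry within the storage plugin configuration, using the defaults from the table above:

```json
"hdf5": {
  "type": "hdf5",
  "extensions": ["h5"],
  "defaultPath": null
}
```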
Usage
Since HDF5 can be viewed as a file system within a file, a single file can contain many datasets. For instance, if you have a simple HDF5 file, a star query will return the file metadata, with each dataset mapped to a nested column.
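The query takes the form below; the dfs.test workspace and the dset.h5 file name are illustrative and match the FLATTEN example that follows:

```sql
SELECT * FROM dfs.test.`dset.h5`;
```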
The actual data in this file is mapped to a column called int_data. In order to effectively access the data, you should use Drill's FLATTEN() function on the int_data column:

apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
Once the data is in this form, you can access it similarly to how you might access nested data in JSON or other files.
However, a better way to query the actual data in an HDF5 file is to use the defaultPath field in your query. If the defaultPath field is defined in the query, or via the plugin configuration, Drill will only return the data, rather than the file metadata.
Note
Once you have determined which data set you are querying, it is advisable to use this method to query HDF5 data.
Note
Datasets larger than 16MB will be truncated in the metadata view.
You can set the defaultPath variable either in the plugin configuration or at query time using the table() function, as shown in the example below:
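A sketch of the table() form; the path /dset is illustrative:

```sql
SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
```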
This query will return only the data found at the given path.
If the data in defaultPath is a column, the column name will be the last part of the path. If the data is multidimensional, the columns will get a name of <data_type>_col_n. Therefore a column of integers will be called int_col_1.
Attributes
Occasionally, HDF5 paths will contain attributes. Drill will map these to a map data structure called attributes, as shown in the query below.
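A sketch of such a query; the file name is illustrative:

```sql
SELECT path, attributes FROM dfs.test.`dset.h5`;
```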
You can access the individual fields within the attributes map by using the structure table.map.key, as shown below. Note that you will have to give the table an alias for this to work properly.
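For example, assuming an attribute named important (the attribute name, alias, and file name are illustrative):

```sql
SELECT t1.attributes.important FROM dfs.test.`dset.h5` AS t1;
```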
Known Limitations
There are several limitations of the HDF5 format plugin in Drill.
- Drill cannot read unsigned 64-bit integers. When the plugin encounters this data type, it will write an INFO message to the log.
- While Drill can read compressed HDF5 files, Drill cannot read individual compressed fields within an HDF5 file.
- HDF5 files can contain nested datasets of up to n dimensions. Since Drill works best with two-dimensional data, datasets with more than two dimensions are reduced to two dimensions.
- HDF5 has a COMPOUND data type. At present, Drill supports reading COMPOUND data types that contain multiple datasets, but does not support COMPOUND fields with multidimensional columns; Drill will ignore multidimensional columns within COMPOUND fields.
[1] https://en.wikipedia.org/wiki/Hierarchical_Data_Format
[2] https://www.hdfgroup.org