Load an HDF5 file in Python
HDF5 (Hierarchical Data Format version 5) is a highly flexible binary file format. Datasets stored with a contiguous layout sit in a single block of the file, which makes memory-mapping highly effective.
It’s frequently used for storing numerical data, and can be manipulated using the h5py library, as well as numpy and vaex.
Because hdf5 (.h5) files are so flexible, the processing logic necessarily changes from file to file. This tutorial shows how to inspect the layout of your hdf5 file and walks through a specific example of uploading its data.
Nominal is the leading platform for operationalizing test data.
This guide describes an automatable pipeline for uploading HDF5 files to Nominal.
Connect to Nominal
Get your Nominal API token from your User settings page.
See the Quickstart for more details on connecting to Nominal from Python.
Download sample data
For convenience and educational purposes, Nominal hosts test sample data on the Nominal Hugging Face account.
Let’s download a 3D hdf5 (h5) file containing x, y, and z coordinates.
We’ll use the hub to download our hdf5 file to disk.
Make sure to install the nominal hdf5 extra first: pip install 'nominal[hdf5]'
local_path now contains the path to our hdf5 file.
To understand our data a bit more, we will look at the layout of the file to find where the data is and its schema.
Using the local_path from the download command above, we will inspect the file.
The output shows the layout of the file: the data resides in data/1/meshes/B, in the “columns” x, y, and z. Each of these columns is a 3D array of shape (47, 47, 47). E also contains x, y, and z, but these columns are empty, so there is nothing to upload from it.
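As an illustration of this kind of inspection, here is a minimal sketch using h5py's visititems to list every dataset with its shape and dtype. It is not necessarily the command used in this guide. The helper name describe_layout is hypothetical, and the small stand-in file it builds only mimics the layout described above (placing E next to B under data/1/meshes is an assumption), so the example runs without the sample download.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_layout(path):
    """Return {dataset_path: (shape, dtype)} for every dataset in the file."""
    layout = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            layout[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return layout


# Stand-in file for demonstration only (assumed layout, much smaller shapes).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    for axis in ("x", "y", "z"):
        f.create_dataset(f"data/1/meshes/B/{axis}", data=np.zeros((4, 4, 4), dtype="float32"))
        # Empty datasets, mirroring the empty columns mentioned above.
        f.create_dataset(f"data/1/meshes/E/{axis}", shape=(0,), dtype="float32")

layout = describe_layout(demo_path)
for name, (shape, dtype) in sorted(layout.items()):
    print(name, shape, dtype)
```

Passing local_path instead of demo_path would list the datasets of the downloaded file; datasets whose shape contains a zero hold no data and can be skipped.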
Upload your data to Nominal
How you choose to upload this data depends on the use case. A benefit of hdf5, beyond flexibility, is its ease of indexing and slicing. By slicing in batches, we avoid bringing the entire dataset into memory, which allows us to upload files larger than memory.
We will upload our dataset in batches, flattening it into the columns x1, x2, x3, y1, y2, y3, z1, z2, z3.
We will:
- index the data into batches of a reasonable size
- create our flattened structure
- create a pandas DataFrame
- upload the batch with nm.upload_pandas
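The steps above can be sketched as follows. This is one possible interpretation, not the guide's own code: each batch slices rows along the first axis, and the remaining axes are flattened into numbered columns (x1, x2, ...). The helper name iter_batches and the synthetic timestamp column (the row index) are assumptions. The demo uses small numpy arrays in place of h5py datasets, which support the same slicing syntax and read only the requested rows from disk.

```python
import numpy as np
import pandas as pd


def iter_batches(components, batch_size):
    """Yield one DataFrame per batch of rows along the first axis.

    components maps a name ("x", "y", "z") to an array-like that supports
    slicing and .shape, such as an h5py dataset or a numpy array.
    """
    names = list(components)
    n_rows = components[names[0]].shape[0]
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Assumption: use the row index as a stand-in timestamp column.
        columns = {"timestamp": np.arange(start, stop)}
        for name in names:
            # Only rows start..stop are materialized in memory here.
            slab = np.asarray(components[name][start:stop])
            flat = slab.reshape(stop - start, -1)
            for col in range(flat.shape[1]):
                columns[f"{name}{col + 1}"] = flat[:, col]
        yield pd.DataFrame(columns)


# Demo data: 5 rows, trailing axes flatten to 3 values per component.
demo = {
    name: np.arange(15, dtype="float64").reshape(5, 1, 3) + offset
    for name, offset in (("x", 0), ("y", 100), ("z", 200))
}

batches = list(iter_batches(demo, batch_size=2))
for df in batches:
    print(df.shape, list(df.columns))
```

Each yielded DataFrame could then be passed to nm.upload_pandas; its arguments are not shown here because the exact signature is not confirmed in this guide. For the (47, 47, 47) sample data this flattening would produce a very wide table, so a different reshape may suit a real upload better.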
We can check out the uploaded dataset by running
That’s it!
After upload, navigate to Nominal’s Datasets page (login required). You’ll see your file at the top!
Acceptable values for timestamp_type include: iso_8601, epoch_days, epoch_hours, epoch_minutes, epoch_seconds, epoch_milliseconds, epoch_microseconds, epoch_nanoseconds, relative_days, relative_hours, relative_minutes, relative_seconds, relative_milliseconds, relative_microseconds, relative_nanoseconds
Timestamps in the form 2024-06-08T05:58:42.000Z will have a timestamp_type of iso_8601.
Timestamps in the form 1581933170999989 will most likely be epoch_microseconds.
epoch_ timestamps refer to timestamps in Unix format, counted from 1970-01-01T00:00:00Z.
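To make the two example formats concrete, the sketch below decodes both with the Python standard library and guesses the epoch unit from the number of digits. The helper names are hypothetical and are not part of the Nominal client. The digit-count thresholds are a rough heuristic for present-day timestamps, not a rule Nominal applies.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_iso_8601(text):
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def from_epoch_microseconds(value):
    # Integer timedelta arithmetic avoids float rounding on large values.
    return EPOCH + timedelta(microseconds=value)


def guess_epoch_type(value):
    """Heuristic: infer the unit of a present-day Unix timestamp from its length."""
    digits = len(str(abs(int(value))))
    if digits <= 11:
        return "epoch_seconds"
    if digits <= 14:
        return "epoch_milliseconds"
    if digits <= 17:
        return "epoch_microseconds"
    return "epoch_nanoseconds"


print(parse_iso_8601("2024-06-08T05:58:42.000Z").isoformat())
print(from_epoch_microseconds(1581933170999989).isoformat())
print(guess_epoch_type(1581933170999989))
```

The guess only helps choose a timestamp_type value; coarser units such as epoch_days or any relative_ type cannot be inferred this way and need to be known from the data source.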
For more information about Nominal timestamps in Python, see the nominal.ts docs page.