Load an HDF5 file in Python
HDF5 (Hierarchical Data Format version 5) is a flexible binary file format that can store datasets contiguously on disk, which makes memory-mapping effective.
It's frequently used for storing numerical data, and can be manipulated using the h5py library, as well as numpy and vaex.
Because HDF5 (.h5) files are so flexible, the processing logic necessarily changes from file to file. In this tutorial, we show how to inspect an HDF5 file and its layout, then walk through a specific example of uploading some data.
Nominal is the leading platform for operationalizing test data.
This guide describes an automatable pipeline for uploading HDF5 files to Nominal.
Connect to Nominal
Concepts
- Base URL: The URL through which the Nominal API is accessed (typically https://api.gov.nominal.io/api; shown under Settings → API keys).
- Workspace: A mechanism for isolating datasets; each user has one or more workspaces, and data in one cannot be seen from another. Note that a token / API key is attached to a user, and may access multiple workspaces.
- Profile: A combination of base URL, API key, and workspace.
There are two primary ways of authenticating the Nominal Client. The first is to use a profile stored on disk, and the second is to use a token directly.
Storing credentials to disk
Run the following in a terminal and follow on-screen prompts to set up a connection profile:
Here, “default” can be any name chosen to represent this profile (reminder: a profile represents a base URL, API key, and workspace).
The profile will be stored in ~/.config/nominal/config.yml, and can then be used to create a client:
If you previously used nom to store credentials, before profiles were available, you will need to migrate your old configuration file (~/.nominal.yml) to the new format (~/.config/nominal/config.yml).
You can do this with the following command:
Directly using credentials in your scripts
NOTE: you should never share your Nominal API key with anyone. We therefore recommend that you not save it in your code and/or scripts.
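One way to keep the key out of your scripts is to read it from an environment variable at run time. The sketch below uses only the Python standard library; the variable name NOMINAL_TOKEN and the helper load_nominal_token are hypothetical names chosen for illustration and are not part of the Nominal client.

```python
import os


def load_nominal_token(env_var="NOMINAL_TOKEN"):
    """Return the API key stored in an environment variable.

    The default variable name is an assumption made for this example.
    """
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable before running this script")
    return token


# Usage (after `export NOMINAL_TOKEN=...` in your shell):
#   token = load_nominal_token()
```

The returned string can then be passed to whichever token-based client constructor you use, so the key never appears in version control.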
Download sample data
For convenience and educational purposes, Nominal hosts test sample data on the Nominal Hugging Face account.
Let’s download a 3D hdf5 (h5) file containing x, y, and z coordinates.
We'll use the Hugging Face Hub to download our hdf5 file to disk.
You will need to make sure to install the nominal hdf5 extra: pip install 'nominal[hdf5]'
local_path now contains the path to our hdf5 file.
To understand our data a bit more, we will look at the layout of the file to find where the data is and its schema.
Using the local_path from the download command above, we will inspect the file.
We will see the layout of the file, with the data residing in data/1/meshes/B in "columns" x, y, and z. Each of these columns is a 3D array of shape (47, 47, 47). The group E also contains x, y, and z, but these datasets are empty, so there is nothing to upload.
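As a rough illustration of this kind of inspection, the sketch below walks an HDF5 file with h5py and lists every dataset with its shape and dtype. To stay self-contained it first writes a small stand-in file: the (4, 4, 4) shapes, the file name, the assumption that E sits next to B under data/1/meshes, and the helper describe_layout are all made up for the example. In practice you would pass local_path instead of demo_path.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file that loosely mimics the layout described above.
# Shapes and the location of group E are assumptions for illustration only.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_layout.h5")
with h5py.File(demo_path, "w") as f:
    for axis in ("x", "y", "z"):
        f.create_dataset(f"data/1/meshes/B/{axis}", data=np.zeros((4, 4, 4), dtype="float32"))
        f.create_dataset(f"data/1/meshes/E/{axis}", shape=(0,), dtype="float32")


def describe_layout(path):
    """Return (name, shape, dtype) for every dataset in the file."""
    found = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found


layout = describe_layout(demo_path)
for name, shape, dtype in layout:
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Datasets whose shape contains a zero, like the E entries here, hold no values and can be skipped before upload.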
Upload your data to Nominal
How you choose to upload this data depends on your use case. A benefit of hdf5, beyond flexibility, is its ease of indexing and slicing. By slicing in batches, we avoid bringing the entire dataset into memory, which lets us upload files larger than memory.
We will upload our dataset in batches, flattening the dataset into the columns x1, x2, x3, y1, y2, y3, z1, z2, z3.
We will:
- slice the data into batches of a reasonable size
- create our flattened structure
- create a pandas DataFrame
- upload the batch with nominal.thirdparty.pandas.upload_dataframe
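The first three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorial's exact code: it writes a small stand-in file, flattens each slab into long-form columns x, y, z (rather than the x1 ... z3 naming, whose exact construction is not spelled out here), and adds a placeholder timestamp column built from a running sample index. The helper iter_batches and the batch size are hypothetical, and the upload call itself is left out because its arguments depend on your client setup.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Stand-in file with a small (6, 4, 4) mesh; the real file uses (47, 47, 47).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_mesh.h5")
with h5py.File(demo_path, "w") as f:
    for i, axis in enumerate(("x", "y", "z")):
        f.create_dataset(f"data/1/meshes/B/{axis}", data=np.full((6, 4, 4), float(i), dtype="float32"))


def iter_batches(path, group="data/1/meshes/B", batch_size=2):
    """Yield one flattened DataFrame per slab along the first axis."""
    with h5py.File(path, "r") as f:
        mesh = f[group]
        n_rows = mesh["x"].shape[0]
        points_per_row = int(np.prod(mesh["x"].shape[1:]))
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            # Slicing an h5py dataset reads only this slab from disk.
            columns = {axis: mesh[axis][start:stop].reshape(-1) for axis in ("x", "y", "z")}
            frame = pd.DataFrame(columns)
            # Placeholder timestamp (an assumption): a running sample index.
            first = start * points_per_row
            frame.insert(0, "timestamp", np.arange(first, first + len(frame)))
            yield frame


batches = list(iter_batches(demo_path))
for frame in batches:
    # Each frame would be handed to the upload step here.
    print(frame.shape)
```

Because only one slab is resident at a time, peak memory is governed by batch_size rather than by the size of the file.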
We can take a look at the uploaded dataset by running:
That’s it!
After upload, navigate to Nominal’s Datasets page (login required). You’ll see your file at the top!
Acceptable values for timestamp_type include: iso_8601, epoch_days, epoch_hours, epoch_minutes, epoch_seconds, epoch_milliseconds, epoch_microseconds, epoch_nanoseconds, relative_days, relative_hours, relative_minutes, relative_seconds, relative_milliseconds, relative_microseconds, relative_nanoseconds
Timestamps in the form 2024-06-08T05:58:42.000Z will have a timestamp_type of iso_8601.
Timestamps in the form 1581933170999989 will most likely be epoch_microseconds.
epoch_ timestamps refer to timestamps in Unix format.
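To check which timestamp_type fits a column, it can help to decode a sample value and see whether the resulting date is plausible. The sketch below uses only the Python standard library on the two example values above; the helper names are hypothetical and unrelated to the Nominal client.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_iso_8601(text):
    """Parse an ISO 8601 string; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def from_epoch_microseconds(value):
    """Interpret an integer as microseconds since the Unix epoch (UTC)."""
    # Integer timedelta arithmetic avoids float rounding at microsecond precision.
    return UNIX_EPOCH + timedelta(microseconds=value)


iso_dt = from_iso_8601("2024-06-08T05:58:42.000Z")
micro_dt = from_epoch_microseconds(1581933170999989)
print(iso_dt.isoformat())
print(micro_dt.isoformat())
```

Decoding 1581933170999989 as microseconds gives a date in February 2020, which is why epoch_microseconds is the likely type for values of that magnitude.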
For more information about Nominal timestamps in Python, see the nominal.ts docs page.