Load an HDF5 file in Python

HDF5 (Hierarchical Data Format version 5) is a highly flexible binary file format that can store data contiguously on disk, making memory-mapping highly effective.

It’s frequently used for storing numerical data, and can be manipulated with the h5py library as well as numpy and vaex.
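As a quick illustration of how h5py exposes HDF5 datasets as numpy arrays, here is a minimal sketch (the file name sample.h5 and the dataset name temperature are hypothetical):

import h5py

# Open a file read-only and load one dataset fully into memory as a numpy array.
# "sample.h5" and "temperature" are placeholder names used only for illustration.
with h5py.File("sample.h5", "r") as f:
    temperature = f["temperature"][:]   # slicing returns a numpy.ndarray
    print(temperature.shape, temperature.dtype)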

Because HDF5 (.h5) files are so flexible, the processing logic will necessarily change from file to file. In this tutorial, we will show you how to understand your HDF5 file’s layout, and walk through a specific example of uploading its data.

Nominal is the leading platform for operationalizing test data.

This guide describes an automatable pipeline for uploading HDF5 files to Nominal.

Connect to Nominal

Concepts
  • Base URL: The URL through which the Nominal API is accessed (typically https://api.gov.nominal.io/api; shown under Settings → API keys).
  • Workspace: A mechanism for isolating datasets; each user has one or more workspaces, and data in one workspace cannot be seen from another. Note that a token / API key is attached to a user and may access multiple workspaces.
  • Profile: A combination of base URL, API key, and workspace.

There are two primary ways of authenticating the Nominal Client. The first is to use a profile stored on disk, and the second is to use a token directly.

Run the following in a terminal and follow on-screen prompts to set up a connection profile:

$ nom config profile add default

# Alternatively, if `nom` is missing from the path:
$ python -m nominal config profile add default

Here, “default” can be any name chosen to represent this profile (reminder: a profile represents a base URL, API key, and workspace).

The profile will be stored in ~/.config/nominal/config.yml, and can then be used to create a client:

from nominal.core import NominalClient

client = NominalClient.from_profile("default")

# Get details about the currently logged-in user to validate authentication
# Will display an object like: `User(display_name='your_email@your_company.com', ...)`
print(client.get_user())

If you have previously used nom to store credentials, prior to the availability of profiles, you will need to migrate your old configuration file (~/.nominal.yml) to the new format (~/.config/nominal/config.yml).

You can do this with the following command:

$ nom config migrate

# Or, if `nom` is missing from your path:
$ python -m nominal config migrate

Alternatively, to authenticate with a token directly, create the client from an API key:

from nominal.core import NominalClient

# Get an instance of the client using provided credentials
client = NominalClient.from_token("<insert api key>")

# Get details about the currently logged-in user to validate authentication
# Will display an object like: `User(display_name='your_email@your_company.com', ...)`
print(client.get_user())

NOTE: You should never share your Nominal API key with anyone. We therefore recommend not saving it in your code or scripts.

  • If you trust the computer you are on, use nom to store the credential to disk.
  • Otherwise, use a password manager such as 1password or bitwarden to keep your token safe.
If you’re not sure whether your company has a Nominal tenant, please reach out to us.
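One common pattern for keeping the key out of source code (general practice, not specific to Nominal) is to read it from an environment variable:

import os

from nominal.core import NominalClient

# NOMINAL_API_KEY is an arbitrary variable name chosen for this example;
# set it in your shell or secrets manager rather than hardcoding the key.
client = NominalClient.from_token(os.environ["NOMINAL_API_KEY"])
print(client.get_user())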

Download sample data

For convenience and educational purposes, Nominal hosts test sample data on the Nominal Hugging Face account.

Let’s download a 3D hdf5 (h5) file containing x, y, and z coordinates.

We’ll use the hub to download our hdf5 file to disk.

Make sure to install the Nominal hdf5 extra: pip install 'nominal[hdf5]'

from huggingface_hub import hf_hub_download

repo_id = "nominal-io/hdf5-sample"
filename = "3d.h5"

local_path = hf_hub_download(
    repo_id=repo_id,
    filename=filename,
    repo_type="dataset"
)

local_path now contains the path to our hdf5 file.

To understand our data a bit more, we will look at the layout of the file to find where the data is and its schema.

Using the local_path from the download command above, we will inspect the file.

import h5py

def print_structure(name, obj):
    if isinstance(obj, h5py.Group):
        print(f"Group: {name}")
    elif isinstance(obj, h5py.Dataset):
        print(f"Dataset: {name}, Shape: {obj.shape}, Type: {obj.dtype}")

with h5py.File(local_path, 'r') as f:
    f.visititems(print_structure)

We can see the layout of the file, with the data residing in data/1/meshes/B in “columns” x, y, and z. Each of these columns is a 3D matrix of shape (47, 47, 47). The E group also contains x, y, and z, but these are empty groups, so there is nothing to upload.

Group: data
Group: data/1
Group: data/1/meshes
Group: data/1/meshes/B
Dataset: data/1/meshes/B/x, Shape: (47, 47, 47), Type: float64
Dataset: data/1/meshes/B/y, Shape: (47, 47, 47), Type: float64
Dataset: data/1/meshes/B/z, Shape: (47, 47, 47), Type: float64
Group: data/1/meshes/E
Group: data/1/meshes/E/x
Group: data/1/meshes/E/y
Group: data/1/meshes/E/z
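To confirm the layout before designing the upload, we can open one of the B datasets and read just a small slab; h5py only pulls the requested slice from disk (a quick sketch using the paths and shapes from the listing above):

import h5py

with h5py.File(local_path, 'r') as f:
    x = f["data/1/meshes/B/x"]
    print(x.shape)        # (47, 47, 47)
    slab = x[0:5, :, 0]   # only this (5, 47) slab is read from disk
    print(slab.shape)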

Upload your data to Nominal

How we choose to upload this data depends on the use case. A benefit of HDF5, beyond flexibility, is its ease of indexing and slicing. By slicing in batches, we can avoid bringing the entire dataset into memory, which allows us to upload files larger than memory.

We will upload our dataset in batches, flattening it into columns x1, x2, x3, y1, y2, y3, z1, z2, z3.

We will:

  1. slice the data into reasonably sized batches
  2. create our flattened structure
  3. create a pandas DataFrame
  4. upload each batch with nominal.thirdparty.pandas.upload_dataframe

import pandas as pd
import h5py
import io

from nominal.core import NominalClient
from nominal.thirdparty.pandas import upload_dataframe


client = NominalClient.from_profile("default")


dataset = None
num_rows = 0
with h5py.File(local_path, 'r') as file:
    data = file["data"]["1"]["meshes"]["B"]

    batch_size = 5
    total_size = data['x'].shape[0]  # Assuming the same shape for 'x', 'y', and 'z'

    for start_idx in range(0, total_size, batch_size):
        end_idx = min(start_idx + batch_size, total_size)

        batch = {}
        for col in ["x", "y", "z"]:
            for col_idx in [0, 1, 2]:
                # Fetch the batch using slicing (start_idx:end_idx)
                batch[f"{col}{col_idx+1}"] = data[col][start_idx:end_idx, :, col_idx].ravel()

        df = pd.DataFrame(batch)
        df["ts"] = list(range(num_rows, num_rows + len(df)))
        if dataset is None:
            # First batch, create the dataset from the dataframe
            dataset = upload_dataframe(
                client,
                df,
                name="h5_flattened",
                timestamp_column="ts",
                timestamp_type="epoch_seconds",
                wait_until_complete=True
            )
        else:
            # Subsequent batches; convert the batch dataframe to a buffer and append to the dataset
            buffer = io.BytesIO()
            df.to_csv(buffer, index=False)
            buffer.seek(0)
            dataset.add_from_io(
                dataset=buffer,
                timestamp_column="ts",
                timestamp_type="epoch_seconds"
            )
        num_rows += len(df)

We can take a look at the uploaded dataset by running:

dataset.to_pandas()
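For example, a quick sanity check that the row count matches what we pushed (assuming the upload loop above completed):

uploaded = dataset.to_pandas()
print(len(uploaded))      # should equal num_rows from the upload loop
print(uploaded.head())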

That’s it!

After upload, navigate to Nominal’s Datasets page (login required). You’ll see your file at the top!

Acceptable values for timestamp_type include:

iso_8601, epoch_days, epoch_hours, epoch_minutes, epoch_seconds, epoch_milliseconds, epoch_microseconds, epoch_nanoseconds, relative_days, relative_hours, relative_minutes, relative_seconds, relative_milliseconds, relative_microseconds, relative_nanoseconds

Timestamps in the form 2024-06-08T05:58:42.000Z will have a timestamp_type of iso_8601.

Timestamps in the form 1581933170999989 will most likely be epoch_microseconds.

epoch_* timestamp types refer to Unix-style timestamps.
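As a sketch of how a different timestamp_type is used, here is the same upload_dataframe call with a hypothetical dataframe whose ts column holds ISO 8601 strings (the dataset name and values are placeholders):

import pandas as pd

from nominal.core import NominalClient
from nominal.thirdparty.pandas import upload_dataframe

client = NominalClient.from_profile("default")

# Hypothetical data: ISO 8601 timestamp strings instead of epoch seconds
df = pd.DataFrame({
    "ts": ["2024-06-08T05:58:42.000Z", "2024-06-08T05:58:43.000Z"],
    "value": [1.0, 2.0],
})

dataset = upload_dataframe(
    client,
    df,
    name="iso8601_example",   # hypothetical dataset name
    timestamp_column="ts",
    timestamp_type="iso_8601",
    wait_until_complete=True
)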

For more information about Nominal timestamps in Python, see the nominal.ts docs page.