Load an HDF5 file in Python

HDF5 (Hierarchical Data Format version 5) is a highly flexible binary file format. Datasets can be stored contiguously on disk, which makes memory-mapping and partial reads highly effective.

It’s frequently used for storing numerical data and can be manipulated with libraries such as h5py, NumPy, and vaex.
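As a quick illustration of the h5py workflow used later in this guide, here is a minimal round trip: write a small dataset to a new file, then read it back. The file name example.h5 is just for illustration.

```python
import h5py
import numpy as np

# Write a small 2x3 dataset to a new HDF5 file.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("coords/x", data=np.arange(6.0).reshape(2, 3))

# Read it back; HDF5 datasets slice like NumPy arrays.
with h5py.File("example.h5", "r") as f:
    x = f["coords/x"][:]

print(x.shape)  # (2, 3)
```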

Because HDF5 (.h5) files are so flexible, the processing logic necessarily changes from file to file. In this tutorial, we will show you how to understand your HDF5 file’s layout and walk through a specific example of uploading its data.

Nominal is the leading platform for operationalizing test data.

This guide describes an automatable pipeline for uploading HDF5 files to Nominal.

Connect to Nominal

Get your Nominal API token from your User settings page.

See the Quickstart for more details on connecting to Nominal from Python.

import nominal.nominal as nm

nm.set_token(
    base_url='https://api.gov.nominal.io/api',
    token='* * *'  # Replace with your Access Token from
                   # https://app.gov.nominal.io/settings/user?tab=tokens
)
If you’re not sure whether your company has a Nominal tenant, please reach out to us.

Download sample data

For convenience and educational purposes, Nominal hosts test sample data on the Nominal Hugging Face account.

Let’s download a 3D hdf5 (h5) file containing x, y, and z coordinates.

We’ll use the hub to download our hdf5 file to disk.

Make sure to install the nominal hdf5 extra: pip install 'nominal[hdf5]'

from huggingface_hub import hf_hub_download

repo_id = "nominal-io/hdf5-sample"
filename = "3d.h5"

local_path = hf_hub_download(
    repo_id=repo_id,
    filename=filename,
    repo_type="dataset"
)

local_path now contains the path to our hdf5 file.

To understand our data a bit more, we will inspect the file’s layout to find where the data lives and what its schema looks like, using the local_path from the download command above.

import h5py

def print_structure(name, obj):
    if isinstance(obj, h5py.Group):
        print(f"Group: {name}")
    elif isinstance(obj, h5py.Dataset):
        print(f"Dataset: {name}, Shape: {obj.shape}, Type: {obj.dtype}")

with h5py.File(local_path, 'r') as f:
    f.visititems(print_structure)

This prints the layout of the file: the data resides in data/1/meshes/B in “columns” x, y, and z, each a 3D array of shape (47, 47, 47). Group E also contains x, y, and z, but these are empty groups, so there is nothing to upload.

Group: data
Group: data/1
Group: data/1/meshes
Group: data/1/meshes/B
Dataset: data/1/meshes/B/x, Shape: (47, 47, 47), Type: float64
Dataset: data/1/meshes/B/y, Shape: (47, 47, 47), Type: float64
Dataset: data/1/meshes/B/z, Shape: (47, 47, 47), Type: float64
Group: data/1/meshes/E
Group: data/1/meshes/E/x
Group: data/1/meshes/E/y
Group: data/1/meshes/E/z
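To sketch how this layout maps to h5py access, the snippet below builds a synthetic file with the same shapes (only the shapes match the real data; the values here are arbitrary). Slash-separated paths reach nested datasets directly, and indexing reads only the requested slice from disk.

```python
import h5py
import numpy as np

# Synthetic stand-in for the downloaded file; values are arbitrary zeros.
with h5py.File("layout_demo.h5", "w") as f:
    for col in ("x", "y", "z"):
        f.create_dataset(f"data/1/meshes/B/{col}", data=np.zeros((47, 47, 47)))

with h5py.File("layout_demo.h5", "r") as f:
    # A slash-separated path is equivalent to f["data"]["1"]["meshes"]["B"]["x"].
    x = f["data/1/meshes/B/x"]
    first_plane = x[0]  # reads only the first 47x47 plane from disk

print(first_plane.shape)  # (47, 47)
```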

Upload your data to Nominal

How we choose to upload this data depends on the use case. A benefit of HDF5, beyond flexibility, is its ease of indexing and slicing. By slicing in batches, we avoid bringing the entire dataset into memory, which lets us upload files larger than memory.
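The batched-slicing idea can be sketched on a synthetic dataset of the same shape: each loop iteration reads one slice from disk, so only a single batch is resident in memory at a time.

```python
import h5py
import numpy as np

# Synthetic (47, 47, 47) dataset standing in for the real file.
with h5py.File("batch_demo.h5", "w") as f:
    f.create_dataset("x", data=np.zeros((47, 47, 47), dtype=np.float64))

batch_size = 5
rows_read = 0
with h5py.File("batch_demo.h5", "r") as f:
    dset = f["x"]
    for start in range(0, dset.shape[0], batch_size):
        end = min(start + batch_size, dset.shape[0])
        chunk = dset[start:end]  # only this slice is read into memory
        rows_read += chunk.shape[0]

print(rows_read)  # 47
```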

We will upload our dataset in batches, flattening it into columns x1, x2, x3, y1, y2, y3, z1, z2, z3.

We will:

  1. index the data into a reasonable size
  2. create our flattened structure
  3. create a pandas df
  4. upload the batch with nm.upload_pandas
import nominal as nm
import pandas as pd
import h5py
import io

dataset = None
num_rows = 0
with h5py.File(local_path, 'r') as file:
    data = file["data"]["1"]["meshes"]["B"]

    batch_size = 5
    total_size = data['x'].shape[0]  # Assuming the same shape for 'x', 'y', and 'z'

    for start_idx in range(0, total_size, batch_size):
        end_idx = min(start_idx + batch_size, total_size)

        batch = {}
        for col in ["x", "y", "z"]:
            for col_idx in [0, 1, 2]:
                # Fetch the batch using slicing (start_idx:end_idx)
                batch[f"{col}{col_idx+1}"] = data[col][start_idx:end_idx, :, col_idx].ravel()

        df = pd.DataFrame(batch)
        df["ts"] = list(range(num_rows, num_rows + len(df)))
        if dataset is None:
            dataset = nm.upload_pandas(df, "h5_flattened", timestamp_column="ts", timestamp_type="epoch_seconds")
        else:
            buffer = io.BytesIO()
            df.to_csv(buffer, index=False)
            buffer.seek(0)
            dataset.add_to_dataset_from_io(dataset=buffer, timestamp_column="ts", timestamp_type="epoch_seconds")
        num_rows += len(df)

We can check out the uploaded dataset by running

dataset.to_pandas()

That’s it!

After upload, navigate to Nominal’s Datasets page (login required). You’ll see your file at the top!

Acceptable values for timestamp_type include:

iso_8601, epoch_days, epoch_hours, epoch_minutes, epoch_seconds, epoch_milliseconds, epoch_microseconds, epoch_nanoseconds, relative_days, relative_hours, relative_minutes, relative_seconds, relative_milliseconds, relative_microseconds, relative_nanoseconds

Timestamps in the form 2024-06-08T05:58:42.000Z will have a timestamp_type of iso_8601.

Timestamps in the form 1581933170999989 will most likely be epoch_microseconds.

epoch_* timestamps refer to Unix-style timestamps, i.e. time elapsed since 1970-01-01 UTC.
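A quick way to sanity-check a unit choice is to convert a sample value with pandas and see whether the resulting date is plausible. Below are the two example values from above:

```python
import pandas as pd

# ISO 8601 string -> timestamp_type of iso_8601.
iso_ts = pd.to_datetime("2024-06-08T05:58:42.000Z")

# Large integer: interpreting it as microseconds since the Unix epoch
# lands on a believable recent date, suggesting epoch_microseconds.
us_ts = pd.to_datetime(1581933170999989, unit="us")

print(iso_ts.year, us_ts.date())  # 2024 2020-02-17
```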

For more information about Nominal timestamps in Python, see the nominal.ts docs page.