Use Neptune in HPO jobs

When running a hyperparameter optimization job, you can use Neptune Scale to track all the metadata from the study and each trial.

The Training setup section contains a sample model configuration. You can then choose between two approaches:

  • Option A: Log metadata from multiple trials to a single run. Useful for analyzing end results of the overall study, not individual trials.

    • Faster, as no Neptune syncing is needed after each trial.
    • Convenient display of an entire study's metadata in a single-run dashboard.
    • Trials can't be compared across different studies.
  • Option B: Log each trial to its own run. Useful for comparing multiple trials across multiple studies.

    • Trials can be compared across different studies.
    • Slower, as Neptune needs to sync after each trial.
    • Since trial metadata is spread across different runs, it can't be displayed in a single-run dashboard.

Before you start

  • Configure your Neptune API token and project. For details, see Get started.

  • Install Neptune Scale and dependencies:

    pip install -U neptune-scale torch torchvision tqdm "numpy<2.0"

Training setup

Import libraries:

from neptune_scale import Run
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from tqdm.auto import trange, tqdm
import math
from datetime import datetime

ALLOWED_DATATYPES = [int, float, str, datetime, bool, list, set]

Set the hyperparameters and search space:

parameters = {
    "batch_size": 128,
    "input_size": (1, 28, 28),
    "n_classes": 10,
    "epochs": 3,
    "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
}

input_size = math.prod(parameters["input_size"])

learning_rates = [0.025, 0.05, 0.075] # learning rate choices

Set up the model and dataset:

class BaseModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super(BaseModel, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, num_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x


criterion = nn.CrossEntropyLoss()

data_tfms = {
    "train": transforms.Compose(
        [
            transforms.ToTensor(),
        ]
    )
}

trainset = datasets.MNIST(
    root="mnist",
    train=True,
    download=True,
    transform=data_tfms["train"],
)

trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=parameters["batch_size"],
    shuffle=True,
    num_workers=0,
)

Initialize the model:

model = BaseModel(
    input_size,
    parameters["n_classes"],
).to(parameters["device"])

Option A: Log all trials to a single run

In this approach, we create a global Neptune run that tracks metadata from all trials:

from random import random

run = Run(run_id=f"hpo-{random()}")

The run identifier must be unique within the project. The above is one way to generate a unique ID.
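
For example, a UUID is a collision-resistant alternative; the same idea is used for the sweep ID in Option B below. A minimal sketch:

import uuid

run = Run(run_id=f"hpo-{uuid.uuid4()}")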

Passing credentials without environment variables

Although not recommended, you can also pass your Neptune API token and project name directly in the code:

from neptune_scale import Run

run = Run(
    project="team-alpha/project-x",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6...Y2MifQ==",  # your API token here
)

Instead, consider setting your API token and project path to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.
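
These variables are typically exported in your shell or CI configuration before the script runs. As a minimal sketch, you can also set them from Python before constructing the run; the values below are placeholders:

import os

os.environ["NEPTUNE_API_TOKEN"] = "YOUR_API_TOKEN"  # placeholder
os.environ["NEPTUNE_PROJECT"] = "team-alpha/project-x"  # placeholder

run = Run(run_id="my-unique-id")  # project and token are read from the environment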

Let's log the configuration that's common across all trials:

for key in parameters:
    if type(parameters[key]) not in ALLOWED_DATATYPES:
        run.log_configs({f"config/{key}": str(parameters[key])})
    else:
        run.log_configs({f"config/{key}": parameters[key]})

This creates a config namespace and, inside it, an attribute for each hyperparameter.

Next, we define a training loop:

# Initialize attributes for best values across all trials
best_acc = None

for trial, lr in tqdm(
    enumerate(learning_rates),
    total=len(learning_rates),
    desc="Trials",
):
    # Log trial hyperparameters
    run.log_configs({f"trials/{trial}/parameters/lr": lr})

    optimizer = optim.SGD(model.parameters(), lr=lr)

    step = 0

    for epoch in trange(parameters["epochs"], desc=f"Trial {trial} - lr: {lr}"):
        run.log_metrics(data={f"trials/{trial}/epochs": epoch}, step=epoch)

        for x, y in trainloader:
            x, y = x.to(parameters["device"]), y.to(parameters["device"])
            optimizer.zero_grad()
            x = x.view(x.size(0), -1)
            outputs = model(x)
            loss = criterion(outputs, y)

            _, preds = torch.max(outputs, 1)
            acc = (torch.sum(preds == y.data)) / len(x)

            # Log trial metrics
            run.log_metrics(
                data={
                    f"trials/{trial}/metrics/batch/loss": float(loss),
                    f"trials/{trial}/metrics/batch/acc": float(acc),
                },
                step=step,
            )

            # Log best values across all trials
            if best_acc is None or acc > best_acc:
                best_acc = acc
                run.log_configs(
                    {
                        "best/trial": trial,
                        "best/metrics/loss": float(loss),
                        "best/metrics/acc": float(acc),
                        "best/parameters/lr": lr,
                    }
                )

            loss.backward()
            optimizer.step()

            step += 1

Finally, close the run when the study is complete:

run.close()
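
Alternatively, since a run can be used as a context manager (the trial-level runs in Option B rely on this), you can wrap the study in a with block so the run is closed automatically. A minimal sketch:

with Run(run_id=f"hpo-{random()}") as run:
    ...  # log the configuration and run the trials as shown above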

To explore the logged metadata, open your project in the Neptune app and navigate to the Runs section.

In the run metadata, the best namespace contains the best trial, with its metrics and parameters. The trials namespace contains metadata across all trials.

Option B: Log each trial to a separate run

Create a sweep-level identifier:

import uuid

sweep_id = str(uuid.uuid4())

Create a sweep-level Neptune run:

sweep_run = Run(run_id=f"sweep-{sweep_id}")

sweep_run.add_tags(["sweep"])

To connect the sweep-level and trial-level runs, add the sweep ID as a tag:

sweep_run.add_tags([sweep_id], group_tags=True)

Log the configuration that's common across all trials:

for key in parameters:
    if type(parameters[key]) not in ALLOWED_DATATYPES:
        sweep_run.log_configs({f"config/{key}": str(parameters[key])})
    else:
        sweep_run.log_configs({f"config/{key}": parameters[key]})

This creates a config namespace and, inside it, an attribute for each hyperparameter.

Define the training loop:

# Initialize attributes for best values across all trials
best_acc = None

for trial, lr in tqdm(
    enumerate(learning_rates),
    total=len(learning_rates),
    desc="Trials",
):
    # Create a trial-level run
    with Run(run_id=f"trial-{sweep_id}-{trial}") as trial_run:
        trial_run.add_tags(["trial"])

        # Add sweep_id to the trial-level run
        trial_run.add_tags([sweep_id], group_tags=True)

        # Log trial number and hyperparams
        trial_run.log_configs({"trial_num": trial, "parameters/lr": lr})

        optimizer = optim.SGD(model.parameters(), lr=lr)

        step = 0

        for epoch in trange(parameters["epochs"], desc=f"Trial {trial} - lr: {lr}"):
            trial_run.log_metrics(data={"epochs": epoch}, step=epoch)

            for x, y in trainloader:
                x, y = x.to(parameters["device"]), y.to(parameters["device"])
                optimizer.zero_grad()
                x = x.view(x.size(0), -1)
                outputs = model(x)
                loss = criterion(outputs, y)

                _, preds = torch.max(outputs, 1)
                acc = (torch.sum(preds == y.data)) / len(x)

                # Log trial metrics
                trial_run.log_metrics(
                    data={
                        "metrics/batch/loss": float(loss),
                        "metrics/batch/acc": float(acc),
                    },
                    step=step,
                )

                # Log best values across all trials to the sweep-level run
                if best_acc is None or acc > best_acc:
                    best_acc = acc
                    sweep_run.log_configs(
                        {
                            "best/trial": trial,
                            "best/metrics/loss": float(loss),
                            "best/metrics/acc": float(acc),
                            "best/parameters/lr": lr,
                        }
                    )

                loss.backward()
                optimizer.step()

                step += 1

Each trial-level run is automatically stopped upon exiting the context. To stop the sweep-level run, use:

sweep_run.close()

To explore the logged metadata, open your project in the Neptune app and navigate to the Runs section.

The best trial, with its metrics and parameters, is available in the best namespace of the sweep-level run.