App version: 3.4.4

Neptune API error handling

Python package: neptune-scale

If Neptune detects metadata inconsistencies or other failures, it triggers the default on_error_callback, which terminates the training process. This can happen, for example, if the steps of a logged metric don't increase.

To override the default behavior, you can provide custom callbacks for handling various error scenarios.

Should we run Neptune in a separate process?

No need, as there is a retry mechanism within the Neptune client. The client is safe to use from the main process, as its only function is to pass the data to the subprocess responsible for sending the data to the server.

If either the main process or the worker process is terminated, you get notified.

If the worker is terminated or encounters a fatal error, the main process is notified and the error callback is called.
If the main process is terminated, the child process exits, so no zombie processes remain.

The parent-child process setup lets us work around Python's GIL for better performance. Non-fatal errors are retried automatically, so any error that reaches a callback is either fatal or the result of exhausted retries.
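The retry-then-report behavior can be pictured with a minimal sketch. This is not Neptune's implementation: `send_with_retries`, `TransientError`, and the attempt limit are hypothetical names chosen for illustration. The sketch only shows why a callback sees an error either immediately (fatal) or after the last retry has failed.

```python
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical stand-in for a recoverable failure, such as a timeout."""


def send_with_retries(
    send: Callable[[], None],
    on_error: Callable[[BaseException, Optional[float]], None],
    max_attempts: int = 3,
) -> bool:
    """Call send() up to max_attempts times; report to on_error only when giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return True
        except TransientError as exc:
            if attempt == max_attempts:
                on_error(exc, None)  # retries exhausted, so the error is reported
                return False
            # otherwise fall through and try again
        except Exception as exc:
            on_error(exc, None)  # fatal: reported immediately, no retry
            return False
    return False
```

With this shape, code in the callback never needs to retry on its own: by the time it runs, retrying has already been tried or would not help.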

Should we retry in case we encounter errors?

Neptune uses a subprocess to send data asynchronously. The subprocess has its own retry logic and propagates the errors it can't recover from to its parent, which can handle them using the callbacks described on this page.

Exceptions raised in Neptune API calls

When using the Neptune Scale Python client library (Neptune API), we recommend surrounding each call with exception handling for specific situations, with the base Exception as a last-resort catch-all.

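Example
```
import logging

# Import path assumed; `run` is an already initialized Run object
from neptune_scale.exceptions import NeptuneSeriesStepNonIncreasing

logger = logging.getLogger(__name__)

try:
    run.log_metrics(...)
except NeptuneSeriesStepNonIncreasing:  # Specific exception takes priority
    logger.warning("Ignoring non-increasing step")
except Exception as e:  # Base exception is the last resort
    logger.error("Encountered error: %s", e)
```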

Run initialization

The Run(...) call may raise NeptuneProjectNotProvided or NeptuneApiTokenNotProvided if the client library can't find its configuration via explicit arguments or environment variables.

For configuration help, see API tokens and Projects.
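When the configuration is expected to come from environment variables, a small preflight check can surface the problem before Run(...) is called. The sketch below is an illustration, not part of the client library: `missing_neptune_config` is a hypothetical helper, and the variable names `NEPTUNE_PROJECT` and `NEPTUNE_API_TOKEN` are assumed here because this page doesn't name them.

```python
import os
from typing import List, Mapping


def missing_neptune_config(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of the assumed configuration variables that are unset or empty."""
    required = ("NEPTUNE_PROJECT", "NEPTUNE_API_TOKEN")  # assumed variable names
    return [name for name in required if not env.get(name)]
```

A training script could call `missing_neptune_config()` at startup and exit with a readable message if the returned list isn't empty. The check doesn't apply if you pass the project and token to Run(...) as explicit arguments.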

Logging data

When using the log_configs() or log_metrics() logging methods, the following validation errors can occur:

TypeError when argument types are mismatched, such as passing a string instead of a float to log_metrics().
ValueError in case of malformed arguments, such as paths that are empty or too long.
NeptuneSeriesStepNonIncreasing when client-side validation finds that the steps for a given metric are not strictly increasing.
NeptuneFloatValueNanInfUnsupported when logging a NaN or infinity value and the NEPTUNE_SKIP_NON_FINITE_METRICS environment variable is set to False. By default, NaN and Inf are skipped with a warning.
Operating-system level errors. These are usually non-recoverable and leave the Run object in an undefined state. For example:
- Failure to enqueue logging operations in the logging methods.
- Failure to update a variable that's shared between processes and used to track in-flight operation status.
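If you'd rather drop problematic points yourself than rely on exceptions, the two metric-related checks above can be mirrored in your own code before logging. The following is a minimal sketch under that assumption: `MetricGuard` is a hypothetical helper, not part of the Neptune API, and it only approximates the client-side validation described here.

```python
import math
from typing import Dict


class MetricGuard:
    """Tracks the last accepted step per metric and filters values before logging."""

    def __init__(self) -> None:
        self._last_step: Dict[str, float] = {}

    def filter(self, data: Dict[str, float], step: float) -> Dict[str, float]:
        """Return only the entries that are finite and whose step strictly increases."""
        accepted: Dict[str, float] = {}
        for name, value in data.items():
            if not math.isfinite(value):
                continue  # NaN or +/-Inf: skip without recording the step
            last = self._last_step.get(name)
            if last is not None and step <= last:
                continue  # step is not strictly increasing for this metric
            self._last_step[name] = step
            accepted[name] = value
        return accepted
```

The filtered dictionary can then be passed to log_metrics() together with the same step. Note that the guard tracks steps per metric name, so one metric falling behind doesn't block the others.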

Closing a run

Closing a run can fail when an OS-level error occurs, such as failure to terminate a process or clean up resources.

Error handling in callbacks

The Run object exposes four parameters that serve as callbacks for various error or warning scenarios:

`on_network_error_callback`

Handles low-level network errors that occur during HTTP requests. These errors include:

Read, write, and connect timeouts
Malformed requests
Connection failures

Note: This callback is called only when the retry mechanism fails.

`on_warning_callback`

Called in a few specific scenarios:

You're creating a run with an ID that already exists.
You're trying to fork a run that doesn't exist.
You're sending a point that's identical to the latest point of the same metric.

`on_error_callback`

Umbrella callback for various issues. Includes a mix of error classes:

API authorization errors. For example, you don't have permissions to write to a project.
Errors in the lifecycle of the local process synchronizing data to Neptune. For example, the process exited unexpectedly.
Semantic errors, such as:
- You're trying to write to a run that doesn't exist.
- You're trying to fork a run, but its parent doesn't exist.
- You're trying to create a run, but the creation parameters are invalid.
- You're trying to write a point to a metric with non-increasing step or timestamp.

`on_queue_full_callback` (unused)

This additional parameter is currently unused.

Overriding the default error handling

We recommend always providing error handling callbacks explicitly. There are two main directions to optimize for:

A) Optimize for never stopping the training process, even at the cost of some data not appearing in Neptune.

In this case, we recommend setting all callbacks to something like:
```
def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
    logger.warning(f"Encountered {exc} error")
```
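To apply one tolerant callback everywhere, it can help to collect the four callback parameters in a single mapping. This is a sketch rather than a prescribed setup: `_log_and_continue` and `CALLBACK_KWARGS` are hypothetical names, and the second callback argument is treated as an opaque optional float because its meaning isn't described on this page.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

Callback = Callable[[BaseException, Optional[float]], None]


def _log_and_continue(exc: BaseException, ts: Optional[float]) -> None:
    """Log the problem and return, so the training process keeps running."""
    logger.warning("Encountered %r (ts=%s)", exc, ts)


# The four callback parameters named in "Error handling in callbacks".
CALLBACK_KWARGS: Dict[str, Callback] = {
    "on_network_error_callback": _log_and_continue,
    "on_warning_callback": _log_and_continue,
    "on_error_callback": _log_and_continue,
    "on_queue_full_callback": _log_and_continue,
}

# Assumed usage (requires the neptune-scale package and valid configuration):
# run = Run(**CALLBACK_KWARGS)
```

Keeping the mapping in one place makes it easy to swap a single entry later, for example to give on_error_callback stricter handling while the others stay tolerant.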

B) Optimize for data correctness or completeness, even at the cost of stopping the training process.

This scenario is more complex and requires handling exceptions on a case-by-case basis.

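Example

```
import logging
from typing import Optional

# Import path assumed
from neptune_scale.exceptions import (
    NeptuneFloatValueNanInfUnsupported,
    NeptuneSynchronizationStopped,
)

logger = logging.getLogger(__name__)


def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
    # `run` is the Run object this callback is attached to
    if isinstance(exc, NeptuneSynchronizationStopped):
        # The process synchronizing logged data to Neptune exited
        run.terminate()
    elif isinstance(exc, NeptuneFloatValueNanInfUnsupported):
        # We're trying to log NaN/Inf, which is currently not supported
        logger.warning("Failed to log NaN/Inf metric value")
    # ... handle other exception types here ...
    else:
        run.terminate()
```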

For the full set of possible exceptions, see the source code on GitHub.
