Neptune API error handling
Python package: neptune-scale
In case of metadata inconsistencies or other failures, Neptune triggers the default on_error_callback
that terminates the training process. This can happen, for example, if logged metric steps don't increase.
To override the default behavior, you can provide custom callbacks for handling various error scenarios.
Should we run Neptune in a separate process?
No need, as there is a retry mechanism within the Neptune client. The client is safe to use from the main process, as its only function is to pass data to the subprocess responsible for sending it to the server.
If either the main process or the worker process is terminated, you get notified.
- If the worker is terminated or encounters a fatal error, the main process is notified and the error callback is called.
- If the main process is terminated, the child process exits, so no zombie processes remain.
The parent-child process setup lets us work around Python's GIL for better performance. Non-fatal errors are retried automatically, so anything reported through the error callback can be treated as either fatal or as having exhausted its retries.
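The notification flow above can be sketched with a watchdog pattern. This is an illustrative stand-in, not Neptune's implementation: it uses a thread and a queue for simplicity, whereas Neptune uses a separate process, and the `worker` and `run_with_watchdog` names are made up for the example.

```python
import queue
import threading


def worker(inbox: queue.Queue, errors: queue.Queue) -> None:
    """Drains queued operations; reports fatal errors to the parent instead of dying silently."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel: parent asked us to shut down
            return
        try:
            if item < 0:  # stand-in for a fatal error while sending data
                raise ValueError(f"bad operation: {item}")
            # ... send `item` to the server here ...
        except Exception as exc:
            errors.put(exc)  # propagate to the parent
            return


def run_with_watchdog(items, on_error_callback) -> None:
    inbox: queue.Queue = queue.Queue()
    errors: queue.Queue = queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, errors), daemon=True)
    t.start()
    for item in items:
        inbox.put(item)
    inbox.put(None)  # shutdown sentinel
    t.join()
    # Parent checks whether the worker reported a fatal error
    if not errors.empty():
        on_error_callback(errors.get())


seen = []
run_with_watchdog([1, 2, -3], seen.append)
# The ValueError raised in the worker reaches the parent's callback
```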
Should we retry in case we encounter errors?
Neptune uses a subprocess to send data asynchronously. This subprocess propagates errors to its parent, which can handle the errors using the callbacks described on this page. The subprocess includes retry logic.
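The retry behavior can be pictured as exponential backoff around the send operation. The helper below is a generic sketch, not Neptune's actual retry code; `max_attempts` and the delay values are made up for illustration.

```python
import time


def send_with_retry(send, max_attempts: int = 5, base_delay: float = 0.01):
    """Retry a call that may fail transiently; re-raise once retries are exhausted."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries: only now is it reported as an error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


# Usage: a flaky send that succeeds on the third try
calls = {"n": 0}

def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

result = send_with_retry(flaky_send)
```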
Exceptions raised in Neptune API calls
When using the Neptune Scale Python client library (Neptune API), we recommend wrapping each call in exception handling for specific situations, with the base Exception class as a last-resort catch-all.
```python
try:
    run.log_metrics(...)
except NeptuneSeriesStepNonIncreasing:  # Specific exception takes priority
    logger.warning("Ignoring non-increasing step")
except Exception as e:  # Base exception is the last resort
    logger.error("Encountered error: %s", e)
```
Run initialization
The Run(...) call may throw NeptuneProjectNotProvided or NeptuneApiTokenNotProvided if the client library can't find its configuration via explicit arguments or environment variables.
For configuration help, see API tokens and Projects.
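The configuration lookup can be sketched as: explicit arguments win, then the NEPTUNE_PROJECT and NEPTUNE_API_TOKEN environment variables, else an error. The `resolve_config` helper and its `MissingConfig` error below are illustrative stand-ins for the client's internal checks and its NeptuneProjectNotProvided / NeptuneApiTokenNotProvided exceptions, and the token value is a placeholder.

```python
import os


class MissingConfig(ValueError):
    """Stand-in for NeptuneProjectNotProvided / NeptuneApiTokenNotProvided."""


def resolve_config(project=None, api_token=None):
    # Explicit arguments take priority; environment variables are the fallback
    project = project or os.environ.get("NEPTUNE_PROJECT")
    api_token = api_token or os.environ.get("NEPTUNE_API_TOKEN")
    if project is None:
        raise MissingConfig("pass project= or set NEPTUNE_PROJECT")
    if api_token is None:
        raise MissingConfig("pass api_token= or set NEPTUNE_API_TOKEN")
    return project, api_token


# Usage: fall back to the environment when no arguments are given
os.environ["NEPTUNE_PROJECT"] = "my-workspace/my-project"
os.environ["NEPTUNE_API_TOKEN"] = "placeholder-token"
config = resolve_config()
```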
Logging data
When using the log_configs() or log_metrics() logging methods, the following validation errors can occur:
- TypeError when argument types are mismatched, such as passing a string instead of a float to log_metrics().
- ValueError in case of malformed arguments, such as paths that are empty or too long.
- NeptuneSeriesStepNonIncreasing indicates a failure in client-side validation when the steps for a given metric are not strictly increasing.
- NeptuneFloatValueNanInfUnsupported when logging a NaN or infinity value and the NEPTUNE_SKIP_NON_FINITE_METRICS environment variable is set to False. By default, NaN and Inf are skipped with a warning.
- NeptuneUnableToLogData, raised if the main process gets stuck, when configured with the NEPTUNE_LOG_FAILURE_ACTION environment variable.
- Operating-system level errors. These are usually non-recoverable and leave the Run object in an undefined state. For example:
  - Failure to enqueue logging operations in the logging methods.
  - Failure to update a variable that's shared between processes and used to track in-flight operation status.
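The NaN/Inf handling described above can be sketched as a pre-filter: non-finite values are either skipped with a warning (the default) or rejected, depending on the environment variable. The `filter_metrics` helper and its `NonFiniteValueError` class are illustrative stand-ins, not the client's code.

```python
import math
import os
import warnings


class NonFiniteValueError(ValueError):
    """Stand-in for NeptuneFloatValueNanInfUnsupported."""


def filter_metrics(metrics: dict) -> dict:
    # Skipping is the default; setting the variable to "False" makes non-finite values an error
    skip = os.environ.get("NEPTUNE_SKIP_NON_FINITE_METRICS", "True") != "False"
    out = {}
    for name, value in metrics.items():
        if not math.isfinite(value):
            if not skip:
                raise NonFiniteValueError(f"{name} = {value}")
            warnings.warn(f"Skipping non-finite value for metric {name}")
            continue
        out[name] = value
    return out


# Usage: the NaN value is dropped with a warning, the finite value passes through
filtered = filter_metrics({"loss": 0.25, "grad_norm": float("nan")})
```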
Closing a run
Closing a run can fail when an OS-level error occurs, such as failure to terminate a process or clean up resources.
Error handling in callbacks
The Run object exposes four parameters that serve as callbacks for various error or warning scenarios:
on_network_error_callback
Handles low-level network errors that occur during HTTP requests. These errors include:
- Read, write, and connect timeouts
- Malformed requests
- Connection failures
Note: This callback is called only when the retry mechanism fails.
on_warning_callback
Called in a few specific scenarios:
- You're creating a run with an ID that already exists.
- You're trying to fork a run that doesn't exist.
- You're sending a point to a metric that is exactly the same as the latest point in that metric.
on_error_callback
Umbrella callback for various issues. Includes a mix of error classes:
- API authorization errors. For example, you don't have permissions to write to a project.
- Errors in the lifecycle of the local process synchronizing data to Neptune. For example, the process exited unexpectedly.
- Semantic errors, such as:
- You're trying to write to a run that doesn't exist.
- You're trying to fork a run, but its parent doesn't exist.
- You're trying to create a run, but the creation parameters are invalid.
- You're trying to write a point to a metric with non-increasing step or timestamp.
on_queue_full_callback (unused)
This additional parameter is currently unused.
Overriding the default error handling
We recommend always providing error-handling callbacks explicitly. There are two main directions to optimize for:
- A) Optimize for never stopping the training process, even at the cost of some data not appearing in Neptune. In this case, we recommend setting all callbacks to something like:

  ```python
  def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
      logger.warning(f"Encountered {exc} error")
  ```

- B) Optimize for data correctness or completeness, even at the cost of stopping the training process. This scenario is more complex and requires handling exceptions on a case-by-case basis.
Example
```python
def _my_callback(exc: BaseException, ts: Optional[float]) -> None:
    if isinstance(exc, NeptuneSynchronizationStopped):
        # The process synchronizing logged data to Neptune exited
        run.terminate()
    elif isinstance(exc, NeptuneFloatValueNanInfUnsupported):
        # We're trying to log NaN/Inf, which is currently not supported
        logger.warning("Failed to log NaN/Inf metric value")
        ...
    else:
        run.terminate()
```
For the full set of possible exceptions, see the source code on GitHub.