Crash Detector#

Overview#

The CrashDetector callback monitors for process crashes during training and logs detailed error information including PID, rank, error messages, and stack traces.

Usage#

Basic usage:

from fkat.pytorch.callbacks.monitoring import CrashDetector
import lightning as L

callback = CrashDetector()
trainer = L.Trainer(callbacks=[callback])

Custom tags:

callback = CrashDetector(
    error_tag="training_error",
    crash_info_tag="crash_details"
)

Features#

Process monitoring: Monitors the main training process for crashes
Detailed crash info: Captures PID, rank, exit code, signal, and timestamp
Exception handling: Logs full stack traces for exceptions
MLflow artifacts: Automatically logs crash info to MLflow artifacts (if MLflow logger is configured)
Rank-aware: Only runs on local rank 0
Queue-based: Uses multiprocessing queue for crash reporting
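
The queue-based reporting idea can be illustrated with a small sketch. This is not CrashDetector's actual implementation; `run_and_report` and `drain` are hypothetical helpers, and the record layout is an assumption. The worker wraps the training function, puts a crash record on a queue if it raises, and the monitoring side collects whatever records are waiting.

```python
import queue
import traceback
from datetime import datetime, timezone


def run_and_report(crash_queue, rank, target):
    """Run target(); on failure, put a crash record on crash_queue and re-raise.

    Hypothetical helper: crash_queue is any object with put(), for example
    a multiprocessing.Queue shared with a monitoring process.
    """
    try:
        target()
    except Exception as exc:
        crash_queue.put({
            "rank": rank,
            "error": str(exc),
            "stacktrace": traceback.format_exc(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        raise  # the process should still fail after reporting


def drain(crash_queue):
    """Collect every record currently waiting on the queue without blocking."""
    records = []
    while True:
        try:
            records.append(crash_queue.get_nowait())
        except queue.Empty:  # multiprocessing.Queue raises this as well
            return records
```

With a real `multiprocessing.Queue`, the worker would typically be started as `multiprocessing.Process(target=run_and_report, args=(q, rank, train_fn))`. Items put on such a queue can take a moment to become visible, so a monitor would usually poll with `q.get(timeout=...)` rather than rely on a single non-blocking read.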

Crash Information#

When a crash is detected, the following information is logged:

pid: Process ID of the crashed process
rank: Global rank of the process
exit_code: Exit code of the process
signal: Signal that terminated the process (if any)
error: Error message (for exceptions)
stacktrace: Full stack trace (for exceptions)
timestamp: UTC timestamp of the crash
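
As a rough illustration of how a record with these fields might be assembled, here is a minimal sketch. `build_crash_info` is a hypothetical helper, not part of the library, and the exact value formats (signal name as a string, ISO 8601 timestamp) are assumptions.

```python
import json
import signal
import traceback
from datetime import datetime, timezone


def build_crash_info(pid, rank, exit_code=None, exc=None):
    """Assemble a crash record with the fields listed above (hypothetical helper)."""
    info = {
        "pid": pid,
        "rank": rank,
        "exit_code": exit_code,
        "signal": None,
        "error": None,
        "stacktrace": None,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Assumption: a negative exit code means "terminated by signal -exit_code",
    # the convention multiprocessing and subprocess use on POSIX.
    if exit_code is not None and exit_code < 0:
        try:
            info["signal"] = signal.Signals(-exit_code).name
        except ValueError:
            info["signal"] = str(-exit_code)  # unknown signal number
    # error and stacktrace are only filled in when an exception is available.
    if exc is not None:
        info["error"] = str(exc)
        info["stacktrace"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return info


def to_json(info):
    """Serialize a crash record for logging."""
    return json.dumps(info, indent=2)
```

A process killed by a signal would yield a record with `signal` set and `error`/`stacktrace` left as `None`, while a Python exception would yield the reverse.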

MLflow Integration#

If an MLflow logger is configured, crash information is automatically logged as an artifact in the crashes/ directory. This allows you to review crash details in the MLflow UI alongside other training metrics and artifacts.
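
A sketch of what uploading a crash record under crashes/ could look like is shown below. It is an illustration under assumptions, not the callback's own code: `log_crash_artifact` and the file naming scheme are hypothetical, and `client` is expected to behave like `mlflow.tracking.MlflowClient`, whose `log_artifact(run_id, local_path, artifact_path)` method uploads a local file into a run's artifact directory.

```python
import json
import os
import tempfile


def log_crash_artifact(client, run_id, crash_info):
    """Write crash_info to a JSON file and upload it under crashes/.

    Hypothetical helper; client is assumed to expose
    log_artifact(run_id, local_path, artifact_path) like MlflowClient.
    """
    # Assumed naming scheme: rank and PID keep files from different processes apart.
    name = f"crash_rank{crash_info['rank']}_pid{crash_info['pid']}.json"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(crash_info, f, indent=2)
        # Upload while the temporary file still exists.
        client.log_artifact(run_id, path, artifact_path="crashes")
    return f"crashes/{name}"
```

The returned string is the artifact path relative to the run, which is where the file would be expected to appear in the MLflow UI.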
