Crash Detector#
Overview#
The CrashDetector callback monitors for process crashes during training and logs detailed error information including PID, rank, error messages, and stack traces.
Usage#
Basic usage:
from fkat.pytorch.callbacks.monitoring import CrashDetector
import lightning as L
callback = CrashDetector()
trainer = L.Trainer(callbacks=[callback])
Custom tags:
callback = CrashDetector(
error_tag="training_error",
crash_info_tag="crash_details"
)
Features#
Process monitoring: Monitors main training process for crashes
Detailed crash info: Captures PID, rank, exit code, signal, and timestamp
Exception handling: Logs full stack traces for exceptions
MLflow artifacts: Automatically logs crash info to MLflow artifacts (if MLflow logger is configured)
Rank-aware: Only runs on local rank 0
Queue-based: Uses multiprocessing queue for crash reporting
Crash Information#
When a crash is detected, the following information is logged:
pid: Process ID of the crashed processrank: Global rank of the processexit_code: Exit code of the processsignal: Signal that terminated the process (if any)error: Error message (for exceptions)stacktrace: Full stack trace (for exceptions)timestamp: UTC timestamp of the crash
MLflow Integration#
If an MLflow logger is configured, crash information is automatically logged as an artifact in the crashes/ directory. This allows you to review crash details in the MLflow UI alongside other training metrics and artifacts.