run
- fkat.utils.cuda.preflight.run.check() → None
Executes the current script using the system Python interpreter.
Intended as a CLI entry point for basic validation or debugging.
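A minimal usage sketch; the standalone-script context is an assumption, since the docs only say check() re-executes the current script:
```python
from fkat.utils.cuda.preflight.run import check

# Re-runs this script with the system Python interpreter; useful as a
# quick CLI-level sanity check of the environment.
check()
```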
- fkat.utils.cuda.preflight.run.fetch_node_info() → tuple[bool | str, fkat.utils.cuda.preflight.health_check.helpers.UniqueID, fkat.utils.cuda.preflight.health_check.helpers.InstanceStats, str]
Gathers the metadata needed for preflight health checking; a usage sketch follows the return fields below.
- Returns:
- A tuple containing:
the fetch success status (a bool, or an error message string on failure),
a UniqueID object,
an InstanceStats object,
the job-level MLflow run ID.
- Return type:
tuple[bool | str, UniqueID, InstanceStats, str]
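A hedged usage sketch; treating a value of True as a successful fetch is an assumption, since the docs only say "bool or error message":
```python
from fkat.utils.cuda.preflight.run import fetch_node_info

status, unique_id, instance_stats, mlflow_run_id = fetch_node_info()
if status is not True:
    # Assumption: on failure, status carries the error message instead of a bool.
    raise RuntimeError(f"Preflight metadata fetch failed: {status}")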
- fkat.utils.cuda.preflight.run.isolate_bad_node() → None
Reads the current instance's health status from DynamoDB and isolates the instance if it is unhealthy.
This function:
1. Retrieves the GPU hash ID and instance metadata.
2. Queries the health status record from DynamoDB using the GPU hash ID.
3. Raises an error if the instance was never scanned, since this indicates unexpected behavior.
4. Puts the process into an infinite sleep if the instance is unhealthy, preventing further participation.
5. Sleeps for 15 minutes if the instance is healthy, allowing other nodes to complete their isolation logic.
This function is typically used in orchestration flows to quarantine failed nodes; an illustrative sketch follows the Raises entry below.
- Raises:
RuntimeError – If the instance health record is missing in the database.
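An illustrative sketch of the isolation flow above, not the library's source; get_gpu_hash_id and query_health_record are hypothetical stand-ins for the real metadata and DynamoDB lookups:
```python
import time

def isolate_bad_node_sketch() -> None:
    gpu_hash_id = get_gpu_hash_id()            # hypothetical helper: step 1
    record = query_health_record(gpu_hash_id)  # hypothetical helper: step 2
    if record is None:
        # Step 3: a node that was never scanned indicates unexpected behavior.
        raise RuntimeError("Instance health record missing in database")
    if not record["healthy"]:
        # Step 4: unhealthy nodes sleep forever so they never rejoin the job.
        while True:
            time.sleep(3600)
    # Step 5: healthy nodes wait 15 minutes for peers to finish isolating.
    time.sleep(15 * 60)
```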
- fkat.utils.cuda.preflight.run.log_preflight_results(all_check_result: dict[str, Any], unique_id: UniqueID, instance_stats: InstanceStats) → None
Logs the health check results to both MLflow and DynamoDB.
This function only runs on local_rank == 0; a usage sketch follows the parameter list.
- Parameters:
all_check_result (dict) – Health check results keyed by test name.
unique_id (UniqueID) – Cluster context and rank information.
instance_stats (InstanceStats) – Node-level configuration and test results.
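A usage sketch under stated assumptions: the result keys shown are made up, and the explicit LOCAL_RANK guard is illustrative (the function itself may already no-op on other ranks):
```python
import os
from fkat.utils.cuda.preflight.run import fetch_node_info, log_preflight_results

_, unique_id, instance_stats, _ = fetch_node_info()
# Hypothetical test names; the real keys come from the preflight checks.
all_check_result = {"gpu_stress_test": True, "nvlink_single_node": True}
if int(os.environ.get("LOCAL_RANK", "0")) == 0:
    log_preflight_results(all_check_result, unique_id, instance_stats)
```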
- fkat.utils.cuda.preflight.run.preflight_health_check() → None
Performs a preflight diagnostic to validate whether the current instance is suitable for distributed training; a usage sketch follows this entry.
Steps performed:
1. Gathers instance metadata, the GPU hash ID, and cluster information.
2. Runs a GPU stress test to verify core GPU functionality.
3. Executes a single-node NVLink test to validate intra-node GPU connectivity.
4. Conditionally runs a multi-node NVLink test for inter-node GPU connectivity (if the node count is even and greater than 1).
5. Aggregates all test results and determines the node's overall health.
6. Logs the test results and health status to MLflow and DynamoDB.
7. Cleans up any distributed process groups and MLflow state.
- Side Effects:
Updates the instance health status in MLflow and DynamoDB.
Logs diagnostic outputs and results.
Delays execution based on rank and test coordination logic.
Note
This function must be called within a properly initialized distributed environment with expected env vars: RANK, LOCAL_RANK, WORLD_SIZE, LOCAL_WORLD_SIZE, GROUP_WORLD_SIZE.
- Raises:
Nothing directly; if any test fails, the failure is logged and the instance is marked unhealthy rather than raising.
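A minimal launch sketch, assuming a torchrun-style launcher has already set the required environment variables listed in the note above:
```python
import os
from fkat.utils.cuda.preflight.run import preflight_health_check

required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE", "GROUP_WORLD_SIZE")
missing = [name for name in required if name not in os.environ]
if missing:
    raise RuntimeError(f"Missing launcher-provided env vars: {missing}")

# Does not raise on test failure; unhealthy results are logged to MLflow/DynamoDB.
preflight_health_check()
```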