gpu_connection_test
- fkat.utils.cuda.preflight.health_check.gpu_connection_test.run_gpu_connection_test(mlflow_run_id: str, result_queue: Queue, dim_items: int, loops: int, master_addr: str, master_port: str, world_size: int, rank: int, device_id: Optional[int] = None, mode: str = 'single') → None
Runs a GPU connectivity and communication benchmark using NCCL and logs performance metrics to MLflow.
This function initializes a distributed process group with NCCL, performs a warm-up all_reduce to exclude setup costs, then repeatedly runs all_reduce operations to measure GPU communication latency. It records per-iteration timing statistics, logs them to MLflow, and places the results in the provided queue (see the sketch below for an outline of these steps).
- Parameters:
mlflow_run_id (str) – The ID of the MLflow run to log metrics under.
result_queue (Queue) – A multiprocessing-safe queue where timing results are pushed.
dim_items (int) – The dimension of the square tensor used for the all_reduce operation.
loops (int) – The number of all_reduce iterations to run for benchmarking.
master_addr (str) – The address of the master node used to initialize the process group.
master_port (str) – The port used for the process group rendezvous.
world_size (int) – The number of processes expected in the process group.
rank (int) – The rank of the current process in the process group.
device_id (Optional[int], optional) – The CUDA device ID to use for testing. If None, defaults to the current device.
mode (str, optional) – Mode label to tag the MLflow metrics. Defaults to “single”.
- Returns:
None
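The actual implementation lives in the fkat source; the following is a minimal sketch of the steps described above, assuming a standard PyTorch/NCCL setup. The function name `_gpu_connection_test_sketch` and the MLflow metric key `all_reduce_latency_{mode}_rank{rank}` are illustrative choices, not part of the documented API.

```python
import os
import time
from multiprocessing import Queue
from typing import Optional

import mlflow
import torch
import torch.distributed as dist


def _gpu_connection_test_sketch(
    mlflow_run_id: str,
    result_queue: Queue,
    dim_items: int,
    loops: int,
    master_addr: str,
    master_port: str,
    world_size: int,
    rank: int,
    device_id: Optional[int] = None,
    mode: str = "single",
) -> None:
    # Point the default env:// rendezvous at the master node before init.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port

    idx = device_id if device_id is not None else torch.cuda.current_device()
    device = torch.device("cuda", idx)
    torch.cuda.set_device(device)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Square tensor of side dim_items that every rank contributes to all_reduce.
    tensor = torch.ones(dim_items, dim_items, device=device)

    # Warm-up all_reduce so NCCL communicator setup is excluded from timings.
    dist.all_reduce(tensor)
    torch.cuda.synchronize(device)

    timings = []
    for _ in range(loops):
        start = time.perf_counter()
        dist.all_reduce(tensor)
        torch.cuda.synchronize(device)  # wait for the collective to complete
        timings.append(time.perf_counter() - start)

    # Log per-iteration latency under the given MLflow run, tagged by mode.
    with mlflow.start_run(run_id=mlflow_run_id):
        for i, t in enumerate(timings):
            mlflow.log_metric(f"all_reduce_latency_{mode}_rank{rank}", t, step=i)

    result_queue.put({"rank": rank, "timings": timings})
    dist.destroy_process_group()
```

In this sketch, one such function runs per rank (e.g. spawned via torch.multiprocessing), with all ranks sharing the same master_addr, master_port, and world_size; the parent process then drains result_queue to collect per-rank timings.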