gpu_stress_test
- fkat.utils.cuda.preflight.health_check.gpu_stress_test.run_gpu_stress_test(mlflow_run_id: str, result_queue: Queue, gpu_mem: int, max_runtime: int) -> None
Performs a multi-GPU stress test by executing repeated matrix multiplications and inter-GPU memory transfers.
This function:
  - Allocates large tensors on each GPU (assuming 8 GPUs).
  - Performs repeated matmul operations to stress GPU compute.
  - Copies results across GPUs to test memory transfer integrity.
  - Verifies data correctness after each transfer.
  - Logs metrics to MLflow regarding correctness and loop iterations.
  - Puts a dictionary summarizing the health of each GPU on the result queue (the function itself returns None).
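The steps above can be illustrated with a minimal sketch. It is not the fkat implementation: `matrix_dim_for_budget` and `stress_devices` are hypothetical helpers, the mapping from a memory budget to a matrix size is an assumption, MLflow logging is left out, and a failed check marks both devices involved in the transfer because the sketch cannot tell which one corrupted the data.

```python
import math
import time


def matrix_dim_for_budget(gpu_mem_gb: int, n_tensors: int = 4,
                          bytes_per_elem: int = 4) -> int:
    """Side length n such that n_tensors square matrices fit in the budget.

    Assumption: float32 elements (4 bytes) and a fixed number of live
    tensors per device. The real sizing rule is not documented.
    """
    budget_bytes = gpu_mem_gb * 1024**3
    return math.isqrt(budget_bytes // (n_tensors * bytes_per_elem))


def stress_devices(devices: list[str], dim: int,
                   max_runtime: float) -> dict[str, bool]:
    """Matmul on each device, copy to the next device and back, then compare."""
    # Imported lazily so the sizing helper works without PyTorch installed.
    import torch

    healthy = {d: True for d in devices}
    operands = {
        d: (torch.randn(dim, dim, device=d), torch.randn(dim, dim, device=d))
        for d in devices
    }
    deadline = time.monotonic() + max_runtime
    while time.monotonic() < deadline:
        for i, src in enumerate(devices):
            dst = devices[(i + 1) % len(devices)]
            a, b = operands[src]
            product = a @ b                # compute stress
            copied = product.to(dst)       # device-to-device transfer
            # A round trip must reproduce the original values exactly.
            if not torch.equal(copied.to(src), product):
                healthy[src] = False
                healthy[dst] = False
    return healthy
```

On a machine with PyTorch and two GPUs, `stress_devices(["cuda:0", "cuda:1"], dim=matrix_dim_for_budget(1), max_runtime=5.0)` would run the loop for about five seconds; passing `["cpu"]` exercises the same code path without a GPU.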
- Parameters:
mlflow_run_id (str) – The MLflow run ID under which metrics are logged.
result_queue (Queue) – A multiprocessing-safe queue on which the GPU health results are placed.
gpu_mem (int) – Approximate GPU memory (in GB) to target when allocating stress test tensors.
max_runtime (int) – Maximum time (in seconds) for which the stress test runs.
- Returns:
None. The health summary is delivered through result_queue rather than as a return value.
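Because the function returns None and reports through `result_queue`, a caller would typically run it in a separate process and read the queue afterwards. The sketch below shows that pattern with a hypothetical stand-in, `fake_stress_test`, which only mirrors the documented signature. The shape of the health dictionary (GPU index mapped to a boolean) is an assumption, as are the helpers `collect_result` and `failed_gpus`.

```python
import multiprocessing as mp
import queue


def fake_stress_test(mlflow_run_id: str, result_queue, gpu_mem: int,
                     max_runtime: int) -> None:
    """Hypothetical stand-in for run_gpu_stress_test (same parameters)."""
    # Assumed result shape: GPU index -> True if the GPU passed every check.
    result_queue.put({"0": True, "1": True})


def collect_result(result_queue, timeout: float):
    """Return the health dictionary, or None if nothing arrives in time."""
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        return None


def failed_gpus(health: dict) -> list:
    """GPU identifiers whose health flag is False."""
    return sorted(gpu for gpu, ok in health.items() if not ok)


if __name__ == "__main__":
    results = mp.Queue()
    # In a real setup the target would be
    # fkat.utils.cuda.preflight.health_check.gpu_stress_test.run_gpu_stress_test.
    worker = mp.Process(target=fake_stress_test,
                        args=("example-run-id", results, 1, 5))
    worker.start()
    # Read the queue before joining so the worker is never blocked on a full pipe.
    health = collect_result(results, timeout=30.0)
    worker.join()
    print("no result" if health is None else failed_gpus(health))
```

The timeout on `get` keeps the caller from waiting forever if the worker dies before it can report, which is the kind of failure a preflight check exists to catch.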