gpu_stress_test


fkat.utils.cuda.preflight.health_check.gpu_stress_test.run_gpu_stress_test(mlflow_run_id: str, result_queue: Queue, gpu_mem: int, max_runtime: int) → None

Performs a multi-GPU stress test by executing repeated matrix multiplications and inter-GPU memory transfers.

This function:

  • Allocates large tensors on each GPU (assuming 8 GPUs).

  • Performs repeated matmul operations to stress GPU compute.

  • Copies results across GPUs to test memory transfer integrity.

  • Verifies data correctness after each transfer.

  • Logs metrics to MLflow regarding correctness and loop iterations.

  • Returns a dictionary summarizing the health of each GPU via the result queue.
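The control flow of the steps above can be sketched as a time-bounded compute, transfer, and verify loop. This is not the fkat implementation: the GPU work is hidden behind hypothetical callables (`compute`, `transfer`) so the sketch runs without a GPU, `max_iterations` is an added cap that keeps the demo deterministic, and marking both ends of a failed transfer as unhealthy is an assumption about how faults might be attributed.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence


def stress_loop(
    devices: Sequence[int],
    compute: Callable[[int], List[float]],
    transfer: Callable[[List[float], int, int], List[float]],
    max_runtime: float,
    max_iterations: Optional[int] = None,  # hypothetical cap, for the demo only
) -> Dict[str, object]:
    """Run compute -> transfer -> verify rounds until the deadline passes."""
    healthy = {device: True for device in devices}
    iterations = 0
    deadline = time.monotonic() + max_runtime
    while time.monotonic() < deadline:
        if max_iterations is not None and iterations >= max_iterations:
            break
        for index, src in enumerate(devices):
            # Send each device's result to its neighbour, wrapping around.
            dst = devices[(index + 1) % len(devices)]
            expected = compute(src)
            received = transfer(expected, src, dst)
            if received != expected:
                # Assumption: a mismatch makes both ends of the copy suspect.
                healthy[src] = False
                healthy[dst] = False
        iterations += 1
    return {"iterations": iterations, "healthy": healthy}


# Stand-ins for the real GPU operations (hypothetical).
def fake_matmul(device: int) -> List[float]:
    return [float(device), 1.0]


def clean_copy(data: List[float], src: int, dst: int) -> List[float]:
    return list(data)


def faulty_copy(data: List[float], src: int, dst: int) -> List[float]:
    copied = list(data)
    if dst == 2:  # simulate corruption on every copy into device 2
        copied[0] += 1.0
    return copied


clean = stress_loop([0, 1, 2], fake_matmul, clean_copy, 5.0, max_iterations=3)
faulty = stress_loop([0, 1, 2], fake_matmul, faulty_copy, 5.0, max_iterations=3)
```

In a PyTorch-based test, `compute` would typically wrap `torch.matmul` on tensors placed on the source device, and `transfer` would wrap a `Tensor.to(...)` copy to the destination device.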

Parameters:
  • mlflow_run_id (str) – The MLflow run ID under which metrics are logged.

  • result_queue (Queue) – A multiprocessing-safe queue to place GPU health results.

  • gpu_mem (int) – Approximate GPU memory (in GB) to target when allocating stress test tensors.

  • max_runtime (int) – Maximum time (in seconds) the stress test is allowed to run.
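The documentation does not say how gpu_mem is converted into tensor shapes. As a rough illustration only, the sketch below assumes three square float32 matrices per GPU (two operands and one result) and 1 GB = 1024³ bytes; `square_matrix_dim` and the constants are hypothetical names, not part of fkat.

```python
import math

BYTES_PER_GB = 1024 ** 3
BYTES_PER_FLOAT32 = 4
NUM_MATRICES = 3  # assumption: two matmul operands plus one result


def square_matrix_dim(gpu_mem: int) -> int:
    """Largest n such that NUM_MATRICES n-by-n float32 matrices fit in gpu_mem GB."""
    budget_bytes = gpu_mem * BYTES_PER_GB
    # n * n * BYTES_PER_FLOAT32 * NUM_MATRICES <= budget_bytes
    return math.isqrt(budget_bytes // (NUM_MATRICES * BYTES_PER_FLOAT32))


n = square_matrix_dim(16)
```

A real allocator would also need headroom for workspace buffers and the CUDA context, which is one reason the parameter is described as approximate.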

Returns:

None. The per-GPU health dictionary is placed on result_queue rather than returned to the caller.
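Since nothing is returned directly, the caller reads the outcome from result_queue. The following is a minimal sketch of the consuming side. `fake_stress_test` is a hypothetical stand-in with the same parameters so that the example runs without GPUs or MLflow, the shape of the health dictionary is assumed, and a `queue.Queue` with a thread replaces the multiprocessing queue and worker process for simplicity (`multiprocessing.Queue.get` accepts the same `timeout` argument and also raises `queue.Empty`).

```python
import queue
import threading
from typing import Any, Dict, Optional


def fake_stress_test(
    mlflow_run_id: str, result_queue: Any, gpu_mem: int, max_runtime: int
) -> None:
    """Hypothetical stand-in for run_gpu_stress_test; reports 8 healthy GPUs."""
    # Assumed result shape: GPU index -> health flag.
    result_queue.put({gpu: True for gpu in range(8)})


def wait_for_result(result_queue: Any, timeout: float) -> Optional[Dict[Any, Any]]:
    """Return the health dictionary, or None if nothing arrives in time."""
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        return None


results: "queue.Queue[Dict[int, bool]]" = queue.Queue()
worker = threading.Thread(
    target=fake_stress_test,
    args=("example-run-id", results, 16, 60),  # placeholder argument values
)
worker.start()
health = wait_for_result(results, timeout=5.0)
worker.join()
```

With the real function, the wait timeout would reasonably be set somewhat above max_runtime so that a worker that hangs is treated as a failed check instead of blocking the caller indefinitely.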