helpers
- class fkat.utils.cuda.preflight.health_check.helpers.InstanceStats(instance_metadata: InstanceMetadata, gpu_info: dict[str | int, dict[str, Any]])
- class fkat.utils.cuda.preflight.health_check.helpers.UniqueID(rank: int, world_size: int, local_rank: int, num_nodes: int, node_rank: int, gpu_per_node: int, gpu_hash_id: str, master_addr: str)
- gpu_hash_id: str
- gpu_per_node: int
- local_rank: int
- master_addr: str
- node_rank: int
- num_nodes: int
- rank: int
- world_size: int
- fkat.utils.cuda.preflight.health_check.helpers.checkfunction_timeout_manager(func: Callable[[...], None], kwargs: dict[str, Any]) -> Any
Monitor and enforce a timeout for executing a function within a separate process.
This function runs a specified function (func) in a separate process with the provided arguments (kwargs). It continuously monitors the execution time and terminates the process if it exceeds a defined timeout (HEALTH_CHECK_TIMEOUT_SECS).
The function result is returned via a multiprocessing queue. If the timeout is reached, a TimeoutError is raised.
- Parameters:
func (Callable) – The target function to be executed in a separate process. It must accept mlflow_run_id and result_queue as its first two arguments, followed by additional kwargs.
kwargs (dict) – The keyword arguments to be passed to the function being monitored.
- Returns:
The result returned by the func via the multiprocessing queue.
- Return type:
Any
- Raises:
TimeoutError – If the function exceeds the allowed timeout (HEALTH_CHECK_TIMEOUT_SECS).
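The pattern described above (run the target in a child process, poll a queue for its result, terminate on timeout) can be sketched roughly as follows. This is a minimal illustration, not the fkat implementation: `run_with_timeout`, the `mlflow_run_id` default, the `poll_interval` parameter, and the timeout value are all hypothetical, and the `fork` start method is chosen only to keep the sketch simple (it is POSIX-only).

```python
import multiprocessing as mp
import queue
import time
from typing import Any, Callable

# Hypothetical value for illustration; the real HEALTH_CHECK_TIMEOUT_SECS is defined in fkat.
HEALTH_CHECK_TIMEOUT_SECS = 5.0


def run_with_timeout(
    func: Callable[..., None],
    kwargs: dict[str, Any],
    mlflow_run_id: str = "example-run",  # assumption: the source does not say where this comes from
    timeout: float = HEALTH_CHECK_TIMEOUT_SECS,
    poll_interval: float = 0.05,
) -> Any:
    """Run func(mlflow_run_id, result_queue, **kwargs) in a child process with a deadline."""
    # "fork" avoids pickling func, which keeps the sketch short; it is not available on Windows.
    ctx = mp.get_context("fork")
    result_queue = ctx.Queue()
    process = ctx.Process(target=func, args=(mlflow_run_id, result_queue), kwargs=kwargs)
    process.start()
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            try:
                # Wait briefly for a result, then loop to re-check the deadline.
                return result_queue.get(timeout=poll_interval)
            except queue.Empty:
                if not process.is_alive() and result_queue.empty():
                    raise RuntimeError("child process exited without producing a result")
        raise TimeoutError(f"{func.__name__} exceeded {timeout} seconds")
    finally:
        # Always clean up the child, whether we returned, timed out, or failed.
        if process.is_alive():
            process.terminate()
        process.join()
```

The `finally` clause makes sure the child process is terminated and joined on every exit path, so a timed-out check does not leave a stray process behind.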
- fkat.utils.cuda.preflight.health_check.helpers.destroy_process_group_if_initialized() -> None
Safely destroys the PyTorch distributed process group if it is initialized.
This function checks if the torch.distributed process group is both available and initialized. If so, it calls destroy_process_group() and logs success. Otherwise, it logs a warning. Any exceptions during the process are caught and logged as errors.
- fkat.utils.cuda.preflight.health_check.helpers.fetch_gpu_info() -> tuple[dict[int | str, dict[str, Any]], str]
Retrieve GPU information from the current EC2 instance using NVIDIA Management Library (NVML).
This function initializes NVML to gather GPU details available to PyTorch, including GPU UUIDs and Serial Numbers. Additionally, it generates a hash ID representing all GPUs’ UUIDs for easier identification.
The function logs relevant information and gracefully handles errors, shutting down NVML in all scenarios.
- Returns:
A tuple (gpu_info, instance_gpu_hash_id), where:
gpu_info (dict): A dictionary whose keys are PyTorch device indices (int) and whose values are dictionaries with the following keys:
- 'uuid' (str): The UUID of the GPU.
- 'serial' (str): The Serial Number of the GPU.
instance_gpu_hash_id (str): A hash string representing the combined UUIDs of all GPUs.
- Return type:
tuple
- Raises:
NVMLError – If there’s an issue retrieving GPU information from NVML.
- fkat.utils.cuda.preflight.health_check.helpers.generate_gpu_uuid_hash(uuid_list: list[str]) -> str
Concatenates the UUIDs, computes a SHA-256 hash, and returns the first 17 hex characters.
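As a rough illustration of that description, the hash could be computed along these lines. The function name `gpu_uuid_hash`, the UTF-8 encoding, and the absence of a separator or sorting step are assumptions; the source only states that the UUIDs are concatenated, hashed with SHA-256, and truncated to 17 hex characters.

```python
import hashlib


def gpu_uuid_hash(uuid_list: list[str], length: int = 17) -> str:
    """Hash the concatenated UUIDs with SHA-256 and keep the first `length` hex characters."""
    # Assumption: plain concatenation in the given order, with no separator and no sorting.
    combined = "".join(uuid_list)
    digest = hashlib.sha256(combined.encode("utf-8")).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    # Made-up UUIDs, for illustration only.
    print(gpu_uuid_hash(["GPU-aaaa", "GPU-bbbb"]))
```

Because the input is a plain concatenation, the result in this sketch depends on the order of `uuid_list`; whether the real helper normalizes the order first is not stated in the source.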
- fkat.utils.cuda.preflight.health_check.helpers.generate_random_string(length: int) -> str
Generate a random string of specified length containing uppercase letters, lowercase letters, and digits.
- Parameters:
length (int) – The desired length of the generated string.
- Returns:
A randomly generated string of the specified length.
- Return type:
str
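A function with this behavior might look like the sketch below. The name `random_string` and the use of `random.choices` are assumptions; the source does not say which random source the helper uses.

```python
import random
import string


def random_string(length: int) -> str:
    """Return `length` characters drawn from uppercase letters, lowercase letters, and digits."""
    alphabet = string.ascii_letters + string.digits
    # random.choices samples with replacement, so characters may repeat.
    return "".join(random.choices(alphabet, k=length))


if __name__ == "__main__":
    print(random_string(6))
```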
- fkat.utils.cuda.preflight.health_check.helpers.generate_test_folder_name() -> str
Generate a unique test folder name using the current timestamp and a random string.
The folder name is constructed by combining the current date and time (formatted as ‘YYYYMMDD_HHMMSS’) with a randomly generated string of 6 characters consisting of uppercase letters, lowercase letters, and digits.
- Returns:
A unique test folder name.
- Return type:
str
Example
>>> generate_test_folder_name()
'20250324_153045_A3bX7z'
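The naming scheme can be reproduced with a short sketch such as the following. The function name `make_test_folder_name` and the optional `now` parameter (added so the timestamp can be fixed when experimenting) are hypothetical and are not part of the documented signature.

```python
import random
import string
from datetime import datetime
from typing import Optional


def make_test_folder_name(now: Optional[datetime] = None) -> str:
    """Build '<YYYYMMDD_HHMMSS>_<6 random characters>'."""
    # Assumption: local time is used; the source does not specify a time zone.
    timestamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(random.choices(alphabet, k=6))
    return f"{timestamp}_{suffix}"


if __name__ == "__main__":
    print(make_test_folder_name())
```

The timestamp keeps folder names sortable by creation time, while the random suffix makes collisions unlikely when several folders are created within the same second.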
- fkat.utils.cuda.preflight.health_check.helpers.make_requests(url: str, token: str) -> str
Retrieve instance metadata from the AWS EC2 Instance Metadata Service (IMDSv2).
- Parameters:
url (str) – The URL endpoint of the IMDSv2 metadata service.
token (str) – The authentication token required for IMDSv2 requests.
- Returns:
The retrieved instance metadata as a string.
- Return type:
str
- Raises:
requests.exceptions.RequestException – If the request fails due to network issues, an invalid URL, or a failed response status.
- fkat.utils.cuda.preflight.health_check.helpers.strip_aws_batch_id(aws_batch_id: str) -> str
Strip the AWS Batch ID to remove any additional node information.
- Parameters:
aws_batch_id (str) – The original AWS Batch ID, which may include a node index suffix.
- Returns:
The stripped AWS Batch ID without any node index suffix.
- Return type:
str
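One plausible way to strip such a suffix is sketched below. It assumes the `<job-id>#<node-index>` form that AWS Batch uses for multi-node parallel child jobs; the function name `strip_node_suffix` is hypothetical, and whether the real helper parses the ID this way is not confirmed by the source.

```python
def strip_node_suffix(aws_batch_id: str) -> str:
    """Drop a trailing '#<node-index>' from an AWS Batch job ID, if one is present."""
    # Assumption: the node index is separated from the job ID by '#'.
    return aws_batch_id.split("#", 1)[0]


if __name__ == "__main__":
    # Made-up job ID, for illustration only.
    print(strip_node_suffix("1a2b3c4d-example#3"))
```

IDs without a suffix pass through unchanged, so the function is safe to call on single-node job IDs as well.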