logger
- fkat.utils.cuda.preflight.health_check.logger.create_instance_level_mlflow_run(unique_id: UniqueID, job_level_mlflow_run_name: str, instance_stats: InstanceStats) → None
Creates a job-level MLflow run and logs instance metadata.
This function should be called once per job (typically by the global rank 0 process). It starts an MLflow run with the provided name and logs instance metadata such as type, region, and batch job information. All non-zero rank processes will wait for 2 seconds to ensure the run is created before proceeding.
- Parameters:
unique_id (UniqueID) – ID of the instance.
job_level_mlflow_run_name (str) – The name to assign to the MLflow run.
instance_stats (InstanceStats) – An object containing metadata about the instance, including type, region, and scan timestamp.
- Returns:
None
- fkat.utils.cuda.preflight.health_check.logger.create_job_level_mlflow_run(job_level_mlflow_run_name: str, instance_stats: InstanceStats) → None
Creates the job-level MLflow run, identified by the batch ID for a batch job or by a local name for a local job. The run is created only once per job, by the rank 0 process. All other processes wait for 5 seconds.
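The create-once-then-wait behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `create_run_once`, `start_run`, and `sleep` are hypothetical names, and the delay is a parameter because the exact wait length is an implementation detail.

```python
import time
from typing import Callable


def create_run_once(
    rank: int,
    run_name: str,
    start_run: Callable[[str], None],
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Create the run on rank 0 and make every other rank wait.

    Returns True if this process created the run.
    `start_run` is a hypothetical callback standing in for the real run creation.
    """
    if rank == 0:
        # Only the global rank 0 process creates the job-level run.
        start_run(run_name)
        return True
    # Other ranks pause so the run is likely to exist before they use it.
    sleep(wait_seconds)
    return False
```

A fixed sleep is a best-effort delay rather than a synchronization barrier, so a caller that joins the run afterwards should still handle the case where the run does not exist yet.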
- fkat.utils.cuda.preflight.health_check.logger.end_all_mlflow_active_runs() → None
Ends all active MLflow runs.
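One way to end every active run is to loop until no run is active, since nested runs form a stack. The sketch below is an assumption about the approach, not the function's actual source. `FakeTracker` is a hypothetical in-memory stand-in so the example runs without a tracking server; in principle the `mlflow` module could be passed instead, because it exposes `mlflow.active_run()` and `mlflow.end_run()`.

```python
from typing import Any, List, Optional


class FakeTracker:
    """Hypothetical stand-in for a tracker that keeps a stack of active runs."""

    def __init__(self, run_ids: List[str]) -> None:
        self._stack = list(run_ids)
        self.ended: List[str] = []

    def active_run(self) -> Optional[str]:
        # The innermost (most recently started) run is the active one.
        return self._stack[-1] if self._stack else None

    def end_run(self) -> None:
        if self._stack:
            self.ended.append(self._stack.pop())


def end_all_active_runs(tracker: Any) -> int:
    """End runs until none is active; return how many were ended."""
    ended = 0
    while tracker.active_run() is not None:
        tracker.end_run()
        ended += 1
    return ended
```

With a job-level run and a nested instance-level run active, the loop ends the instance-level run first and the job-level run second.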
- fkat.utils.cuda.preflight.health_check.logger.get_parent_mlflow_id() → str
Initializes a two-layer MLflow run structure for organized metric and artifact tracking.
This function sets up the MLflow tracking URI and experiment based on the instance’s region. It then creates:
1. A job-level run identified by the AWS Batch Job ID (or a local fallback).
2. An instance-level run identified by the instance’s GPU hash ID.
The job-level run is created once by the global rank 0 process. The instance-level run is created by local rank 0 processes per node. All other local ranks on a node join the corresponding instance-level run.
- Parameters:
node_rank (int) – The global rank of the current node (used for job-level run creation).
instance_gpu_hash_id (str) – A unique identifier for the current instance’s GPU setup.
instance_stats (InstanceStats) – Object containing instance metadata such as type, region, and scan time.
- Returns:
The MLflow run ID of the job-level (parent) run.
- Return type:
str
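The division of work between ranks described above can be sketched as a small decision function. This is an illustration only: `RunRole`, `run_role`, `global_rank`, and `local_rank` are hypothetical names that do not appear in the module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRole:
    creates_job_run: bool       # global rank 0 only
    creates_instance_run: bool  # local rank 0 on each node
    joins_instance_run: bool    # every other local rank on the node


def run_role(global_rank: int, local_rank: int) -> RunRole:
    """Decide what a process does in the two-layer run structure."""
    return RunRole(
        creates_job_run=global_rank == 0,
        creates_instance_run=local_rank == 0,
        joins_instance_run=local_rank != 0,
    )
```

For example, on a hypothetical job with 8 processes per node, global rank 8 is local rank 0 of the second node: it creates that node's instance-level run but not the job-level run.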
- fkat.utils.cuda.preflight.health_check.logger.initialize_mlflow(unique_id: UniqueID, instance_stats: InstanceStats) → str
Initializes MLflow. The MLflow runs are organized in two layers, indexed as follows:
1. batch_run_id or “local_********”
2. instance_gpu_hash_id
This structure keeps metrics, parameters, and artifacts better organized.
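The two-layer indexing can be illustrated by deriving the pair of run names. The sketch below rests on assumptions that the source does not confirm: that the batch ID comes from the `AWS_BATCH_JOB_ID` environment variable (which AWS Batch sets inside job containers), and that the local name is `local_` followed by a short random suffix. `two_layer_run_names` and `local_suffix` are hypothetical names.

```python
import uuid
from typing import Mapping, Optional, Tuple


def two_layer_run_names(
    instance_gpu_hash_id: str,
    env: Mapping[str, str],
    local_suffix: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (job_level_name, instance_level_name) for the two layers."""
    # Assumption: the batch job ID is read from AWS_BATCH_JOB_ID.
    batch_job_id = env.get("AWS_BATCH_JOB_ID")
    if batch_job_id:
        job_level = batch_job_id
    else:
        # Assumption: local jobs get a "local_" prefix plus a short suffix.
        suffix = local_suffix or uuid.uuid4().hex[:8]
        job_level = f"local_{suffix}"
    # The second layer is indexed by the instance's GPU hash ID.
    return job_level, instance_gpu_hash_id
```

In real use, `os.environ` could be passed as `env`; a plain dictionary keeps the example easy to test.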
- fkat.utils.cuda.preflight.health_check.logger.search_join_mlflow_run(run_name: str) → None
Searches for the most recent active MLflow run with the specified run name and joins it.
This function looks for an active MLflow run matching the given run_name within the current region’s configured experiment. If a match is found, it starts logging to that run. If no run is found, it raises a RuntimeError.
- Parameters:
run_name (str) – The name of the MLflow run to search for.
- Returns:
The MLflow run ID of the matched run.
- Return type:
str
- Raises:
RuntimeError – If no active MLflow run with the specified name is found.
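The search-and-join logic can be sketched over an in-memory list of run records. This is a simplified model, not the function's source: `find_latest_active_run` and the record keys (`run_id`, `run_name`, `status`, `start_time`) are hypothetical, and a real implementation would query the tracking server for the configured experiment instead of filtering a list.

```python
from typing import Any, Dict, List


def find_latest_active_run(runs: List[Dict[str, Any]], run_name: str) -> str:
    """Return the ID of the most recently started active run with this name.

    Raises RuntimeError if no active run matches.
    """
    # Keep only runs that match the name and are still running.
    matches = [
        run for run in runs
        if run["run_name"] == run_name and run["status"] == "RUNNING"
    ]
    if not matches:
        raise RuntimeError(f"No active MLflow run named {run_name!r} was found.")
    # "Most recent" is taken to mean the largest start time.
    latest = max(matches, key=lambda run: run["start_time"])
    return str(latest["run_id"])
```

Finished runs are ignored even when they started later than every active run, which matches the description of joining only an active run.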