logger

fkat.utils.cuda.preflight.health_check.logger.create_instance_level_mlflow_run(unique_id: UniqueID, job_level_mlflow_run_name: str, instance_stats: InstanceStats) → None

Creates a job-level MLflow run and logs instance metadata.

This function should be called once per job (typically by the global rank 0 process). It starts an MLflow run with the provided name and logs instance metadata such as type, region, and batch job information. All non-zero rank processes will wait for 2 seconds to ensure the run is created before proceeding.

Parameters:
  • unique_id (UniqueID) – ID of the instance.

  • job_level_mlflow_run_name (str) – The name to assign to the MLflow run.

  • instance_stats (InstanceStats) – An object containing metadata about the instance, including type, region, and scan timestamp.

Returns:

None

fkat.utils.cuda.preflight.health_check.logger.create_job_level_mlflow_run(job_level_mlflow_run_name: str, instance_stats: InstanceStats) → None

Creates the job-level MLflow run, identified by the batch ID for a batch job or by a local identifier for a local job. The run is created only once per job, by the process with rank 0. All other processes wait for 5 seconds.
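The "rank 0 creates, everyone else waits" pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `create_once_per_job` and `create_run` are hypothetical names, and the sleep function is injectable only so the sketch can be exercised without actually pausing.

```python
import time
from typing import Callable, Optional


def create_once_per_job(
    rank: int,
    create_run: Callable[[], str],
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Create the run on rank 0; every other rank only waits.

    `create_run` is a hypothetical stand-in for the code that starts the
    job-level MLflow run and returns its run ID.
    """
    if rank == 0:
        # Only the rank-0 process creates the job-level run.
        return create_run()
    # Other ranks pause so the run has time to appear before they continue.
    sleep(wait_seconds)
    return None
```

A fixed delay is simple, but it only gives rank 0 time to finish; it does not confirm that the run exists, so code that later looks the run up still needs to handle a miss.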

fkat.utils.cuda.preflight.health_check.logger.end_all_mlflow_active_runs() → None

Ends all active MLflow runs.
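One way to end every active run is to loop until MLflow reports that none is left, since nested runs are kept on a stack and `mlflow.end_run()` closes only the innermost one. The sketch below is an assumption about how such a helper might look, not the module's code; `end_all_active_runs` is a hypothetical name, and the `mlflow` module is passed in as an argument so the loop can be tried with a stand-in object.

```python
from typing import Any


def end_all_active_runs(mlflow_module: Any) -> int:
    """End runs until none is active and return how many were ended.

    `mlflow_module` is expected to expose `active_run()` and `end_run()`,
    as the real `mlflow` package does.
    """
    ended = 0
    # active_run() returns None once the stack of active runs is empty.
    while mlflow_module.active_run() is not None:
        mlflow_module.end_run()
        ended += 1
    return ended
```

With MLflow installed, this would be called as `end_all_active_runs(mlflow)` after `import mlflow`.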

fkat.utils.cuda.preflight.health_check.logger.get_parent_mlflow_id() → str

Initializes a two-layer MLflow run structure for organized metric and artifact tracking.

This function sets up the MLflow tracking URI and experiment based on the instance’s region. It then creates:

  1. A job-level run identified by the AWS Batch Job ID (or a local fallback).

  2. An instance-level run identified by the instance’s GPU hash ID.

The job-level run is created once by the global rank 0 process. The instance-level run is created by local rank 0 processes per node. All other local ranks on a node join the corresponding instance-level run.

Parameters:
  • node_rank (int) – The global rank of the current node (used for job-level run creation).

  • instance_gpu_hash_id (str) – A unique identifier for the current instance’s GPU setup.

  • instance_stats (InstanceStats) – Object containing instance metadata such as type, region, and scan time.

Returns:

The MLflow run ID of the job-level (parent) run.

Return type:

str
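The division of work between ranks described above can be summarized in a small decision function. This is only a sketch of the stated rules; the function name `mlflow_actions` and the action labels are hypothetical and do not come from the module.

```python
from typing import List


def mlflow_actions(global_rank: int, local_rank: int) -> List[str]:
    """Return the MLflow steps a process performs, given its ranks."""
    actions = []
    if global_rank == 0:
        # Exactly one process in the whole job creates the job-level run.
        actions.append("create_job_run")
    if local_rank == 0:
        # One process per node creates that node's instance-level run.
        actions.append("create_instance_run")
    else:
        # Remaining processes on the node join the existing instance-level run.
        actions.append("join_instance_run")
    return actions
```

For example, the first process on the first node performs both creation steps, while a process with local rank 1 on any node only joins.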

fkat.utils.cuda.preflight.health_check.logger.initialize_mlflow(unique_id: UniqueID, instance_stats: InstanceStats) → str

Initializes MLflow. The MLflow run has two layers, indexed by:

  1. batch_run_id or “local_********”

  2. instance_gpu_hash_id

This layering keeps metrics, parameters, and artifacts better organized.

fkat.utils.cuda.preflight.health_check.logger.search_join_mlflow_run(run_name: str) → None

Searches for the most recent active MLflow run with the specified run name and joins it.

This function looks for an active MLflow run matching the given run_name within the current region’s configured experiment. If a match is found, it starts logging to that run. If no run is found, it raises a RuntimeError.

Parameters:

run_name (str) – The name of the MLflow run to search for.

Returns:

The MLflow run ID of the matched run.

Return type:

str

Raises:

RuntimeError – If no active MLflow run with the specified name is found.
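The selection rule described above (most recent active run with a matching name, otherwise an error) can be illustrated without a tracking server. The sketch below works on plain records rather than real MLflow results; `RunRecord` and `find_latest_active_run` are hypothetical names, and treating the `RUNNING` status as "active" is an assumption about how the module filters runs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunRecord:
    run_id: str
    run_name: str
    status: str      # MLflow run status, e.g. "RUNNING" or "FINISHED"
    start_time: int  # start time in epoch milliseconds


def find_latest_active_run(runs: Iterable[RunRecord], run_name: str) -> RunRecord:
    """Pick the most recently started active run with the given name."""
    candidates = [
        r for r in runs if r.run_name == run_name and r.status == "RUNNING"
    ]
    if not candidates:
        # Mirrors the documented behavior: no match is an error, not None.
        raise RuntimeError(f"No active MLflow run named {run_name!r} was found")
    return max(candidates, key=lambda r: r.start_time)
```

Raising instead of returning `None` makes a missing parent run fail loudly at join time, rather than letting a process log metrics to the wrong place.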