ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: Only the following fields in TrainingJobDefinition are allowed to change #3693

Closed

timxieICN opened this issue Mar 2, 2023 · 2 comments

timxieICN commented Mar 2, 2023

Describe the bug
It's related to several other bug reports.

I have already installed the latest versions (sagemaker==2.135.0 and boto3==1.26.81). However, I'm still having trouble passing custom environment variables from the Estimator to the HyperparameterTuner on my custom ECR image.

Now it's failing in the parameter validation:

ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: Only the following fields in TrainingJobDefinition are allowed to change: [algorithmSpecification, inputDataConfig, outputDataConfig, staticHyperParameters, roleArn, resourceConfig, stoppingCondition, vpcConfig, enableManagedSpotTraining, checkpointConfig].
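For context, the error indicates that a warm-started tuning job may only differ from its parent job in the listed TrainingJobDefinition fields. A rough plain-Python sketch of that restriction (the helper names here are hypothetical, not actual SageMaker service or SDK code; the field list is taken from the error text):

```python
# Sketch of the server-side warm-start validation implied by the error above.
# ALLOWED_TO_CHANGE mirrors the field list in the error message; the helper
# functions are hypothetical and not part of SageMaker or its SDK.
ALLOWED_TO_CHANGE = {
    "algorithmSpecification", "inputDataConfig", "outputDataConfig",
    "staticHyperParameters", "roleArn", "resourceConfig", "stoppingCondition",
    "vpcConfig", "enableManagedSpotTraining", "checkpointConfig",
}


def changed_fields(parent_def: dict, new_def: dict) -> set:
    """Top-level fields whose values differ between the two definitions."""
    return {
        key
        for key in set(parent_def) | set(new_def)
        if parent_def.get(key) != new_def.get(key)
    }


def validate_warm_start(parent_def: dict, new_def: dict) -> None:
    """Raise if any field outside ALLOWED_TO_CHANGE differs from the parent."""
    disallowed = changed_fields(parent_def, new_def) - ALLOWED_TO_CHANGE
    if disallowed:
        raise ValueError(
            "Only the following fields in TrainingJobDefinition are allowed "
            f"to change: {sorted(ALLOWED_TO_CHANGE)} (changed: {sorted(disallowed)})"
        )
```

Under this model, adding an environment block to the training job definition changes a field outside the allowed list, which is consistent with the failure above when environment variables are passed to a warm-started tuner.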

To reproduce

# Imports assumed by this snippet (SageMaker Python SDK v2)
import os

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter
from sagemaker.workflow.steps import ProcessingStep, TuningStep

script_preprocess = ScriptProcessor(
    image_uri=CUSTOM_BUILD_IMAGE,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/mau-conversion-preprocess",
    sagemaker_session=sagemaker_session,
    role=role,
    env={
          "STAGE": "ALPHA",
    },
)
step_process = ProcessingStep(
    name="PreprocessMauConversionData",
    processor=script_preprocess,
    inputs=[
        ProcessingInput(
            input_name="input_training",
            destination="/opt/ml/processing/input/training",
            source=f"s3://{datalake_bucket}/{base_job_prefix}/"
            f"preprocessed/training/2022-10-27/2022-10-27T03_00_29.206816+00_00.csv.gz",
            s3_data_type="S3Prefix",
            s3_input_mode="File",
            s3_data_distribution_type="FullyReplicated",
            s3_compression_type="None",
            app_managed=False,
        ),
        ProcessingInput(
            input_name="input_prediction",
            destination="/opt/ml/processing/input/prediction",
            source=f"s3://{datalake_bucket}/{base_job_prefix}/"
            f"preprocessed/prediction/2022-10-27/2022-10-27T03_00_29.206816+00_00.csv.gz",
            s3_data_type="S3Prefix",
            s3_input_mode="File",
            s3_data_distribution_type="FullyReplicated",
            s3_compression_type="None",
            app_managed=False,
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            s3_upload_mode="EndOfJob",
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/output/validation",
            s3_upload_mode="EndOfJob",
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            s3_upload_mode="EndOfJob",
        ),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
)

# Tuning step for xgboost model
xgb_tune = Estimator(
    image_uri=CUSTOM_BUILD_IMAGE,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    volume_size=5,
    output_path=f"s3://{default_bucket}/{base_job_prefix}/MauConversionTuneXGB",
    base_job_name=f"{base_job_prefix}/mau-conversion-base-tune-xgb",
    sagemaker_session=sagemaker_session,
    role=role,
    entry_point=os.path.join(BASE_DIR, "preprocess.py"),
    environment={
        "STAGE": "ALPHA",
    },
    metric_definitions=[
        {
            "Name": "validation:accuracy",
            "Regex": r".*validation-accuracy:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*"
        }
    ],
)
xgb_tune.set_hyperparameters(
    verbosity=1,
    objective="binary:logistic",
    scale_pos_weight=0.5,
    rate_drop=0.2,
)
tuner_xgb_train = HyperparameterTuner(
    estimator=xgb_tune,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(min_value=3, max_value=10, scaling_type="Linear"),
        "min_child_weight": IntegerParameter(min_value=1, max_value=5, scaling_type="Linear"),
        "eta": ContinuousParameter(min_value=0.01, max_value=0.2, scaling_type="Logarithmic"),
        "gamma": ContinuousParameter(min_value=0.1, max_value=0.3, scaling_type="Logarithmic"),
        "subsample": ContinuousParameter(min_value=0.75, max_value=1.0, scaling_type="Linear"),
    },
    strategy="Bayesian",
    objective_type="Maximize",
    max_jobs=6,
    max_parallel_jobs=2,
    base_tuning_job_name=f"{base_job_prefix}/mau-conversion-hyper-tune-xgb",
    warm_start_config={},
    metric_definitions=[
        {
            "Name": "validation:accuracy",
            "Regex": r".*validation-accuracy:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*"
        }
    ],
)
step_xgb_tune = TuningStep(
    name="TuneMauConversionXbgModel",
    tuner=tuner_xgb_train,
    description="This is tuning XGBoost model",
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
    depends_on=[step_process],
)

Screenshots or logs
Screenshot "Screen Shot 2023-03-02 at 11 49 41 AM" (image not included here)

System information

  • SageMaker Python SDK version: 2.135.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): custom framework on a custom ECR image
  • Framework version: custom framework on a custom ECR image
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Python Package

boto3==1.26.81
botocore==1.29.81
sagemaker==2.135.0
timxieICN added the bug label Mar 2, 2023

timxieICN (Author) commented:
I also tested with a pre-built image provided by AWS, and the error goes away.
So it's quite possible that the previous PR fix does not handle custom images well.

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

timxieICN (Author) commented:

OK - it turns out the issue is caused by the use of warm_start_config in HyperparameterTuner. After upgrading sagemaker and boto3, the previous training job can no longer be used as a warm-start parent. When I start the HPO job fresh, the error is gone.
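For anyone hitting the same thing: note that the repro above passes warm_start_config={}, whereas starting fresh means omitting the argument entirely. A small plain-Python sketch of that distinction (build_tuner_kwargs is a hypothetical helper, not part of the SageMaker SDK):

```python
# Sketch of the workaround: only attach warm_start_config when there is a
# real, compatible parent-job configuration; omit it entirely for a fresh
# start. build_tuner_kwargs is a hypothetical helper, not SageMaker SDK API.
def build_tuner_kwargs(base_kwargs: dict, warm_start_config=None) -> dict:
    """Return HyperparameterTuner kwargs, dropping empty warm-start configs.

    Passing an empty value ({} or None) is not the same as a valid warm-start
    configuration, so it is simply left out.
    """
    kwargs = dict(base_kwargs)
    if warm_start_config:  # only a real config, never {} or None
        kwargs["warm_start_config"] = warm_start_config
    return kwargs
```

With this shape, upgrading the SDK and losing warm-start compatibility with the old parent job just means calling the helper without a warm-start argument.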
