ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: Only the following fields in TrainingJobDefinition are allowed to change #3693

Closed

timxieICN opened this issue Mar 2, 2023 · 2 comments

timxieICN commented Mar 2, 2023

Describe the bug
It's related to several other bug reports.

I have already installed the latest versions (sagemaker==2.135.0 and boto3==1.26.81). However, I'm still having trouble passing custom environment variables from the Estimator to the HyperparameterTuner on my custom ECR image.

Now it's failing in the parameter validation:

ClientError: Failed to invoke sagemaker:CreateHyperParameterTuningJob. Error Details: Only the following fields in TrainingJobDefinition are allowed to change: [algorithmSpecification, inputDataConfig, outputDataConfig, staticHyperParameters, roleArn, resourceConfig, stoppingCondition, vpcConfig, enableManagedSpotTraining, checkpointConfig].
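For context, the error indicates that a warm-started tuning job may only differ from its parent job in the listed TrainingJobDefinition fields. A rough plain-Python sketch of that restriction (the helper names here are hypothetical, not actual SageMaker service or SDK code; the field list is taken from the error text):

```python
# Sketch of the server-side warm-start validation implied by the error above.
# ALLOWED_TO_CHANGE mirrors the field list in the error message; the helper
# functions are hypothetical and not part of SageMaker or its SDK.
ALLOWED_TO_CHANGE = {
    "algorithmSpecification", "inputDataConfig", "outputDataConfig",
    "staticHyperParameters", "roleArn", "resourceConfig", "stoppingCondition",
    "vpcConfig", "enableManagedSpotTraining", "checkpointConfig",
}


def changed_fields(parent_def: dict, new_def: dict) -> set:
    """Top-level fields whose values differ between the two definitions."""
    return {
        key
        for key in set(parent_def) | set(new_def)
        if parent_def.get(key) != new_def.get(key)
    }


def validate_warm_start(parent_def: dict, new_def: dict) -> None:
    """Raise if any field outside ALLOWED_TO_CHANGE differs from the parent."""
    disallowed = changed_fields(parent_def, new_def) - ALLOWED_TO_CHANGE
    if disallowed:
        raise ValueError(
            "Only the following fields in TrainingJobDefinition are allowed "
            f"to change: {sorted(ALLOWED_TO_CHANGE)} (changed: {sorted(disallowed)})"
        )
```

Under this model, adding an environment block to the training job definition changes a field outside the allowed list, which is consistent with the failure above when environment variables are passed to a warm-started tuner.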

To reproduce

# Imports assumed by this snippet (SageMaker Python SDK v2)
import os

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter
from sagemaker.workflow.steps import ProcessingStep, TuningStep

script_preprocess = ScriptProcessor(
    image_uri=CUSTOM_BUILD_IMAGE,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/mau-conversion-preprocess",
    sagemaker_session=sagemaker_session,
    role=role,
    env={
          "STAGE": "ALPHA",
    },
)
step_process = ProcessingStep(
    name="PreprocessMauConversionData",
    processor=script_preprocess,
    inputs=[
        ProcessingInput(
            input_name="input_training",
            destination="/opt/ml/processing/input/training",
            source=f"s3://{datalake_bucket}/{base_job_prefix}/"
            f"preprocessed/training/2022-10-27/2022-10-27T03_00_29.206816+00_00.csv.gz",
            s3_data_type="S3Prefix",
            s3_input_mode="File",
            s3_data_distribution_type="FullyReplicated",
            s3_compression_type="None",
            app_managed=False,
        ),
        ProcessingInput(
            input_name="input_prediction",
            destination="/opt/ml/processing/input/prediction",
            source=f"s3://{datalake_bucket}/{base_job_prefix}/"
            f"preprocessed/prediction/2022-10-27/2022-10-27T03_00_29.206816+00_00.csv.gz",
            s3_data_type="S3Prefix",
            s3_input_mode="File",
            s3_data_distribution_type="FullyReplicated",
            s3_compression_type="None",
            app_managed=False,
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/output/train",
            s3_upload_mode="EndOfJob",
        ),
        ProcessingOutput(
            output_name="validation",
            source="/opt/ml/processing/output/validation",
            s3_upload_mode="EndOfJob",
        ),
        ProcessingOutput(
            output_name="test",
            source="/opt/ml/processing/output/test",
            s3_upload_mode="EndOfJob",
        ),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
)

# Tuning step for xgboost model
xgb_tune = Estimator(
    image_uri=CUSTOM_BUILD_IMAGE,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    volume_size=5,
    output_path=f"s3://{default_bucket}/{base_job_prefix}/MauConversionTuneXGB",
    base_job_name=f"{base_job_prefix}/mau-conversion-base-tune-xgb",
    sagemaker_session=sagemaker_session,
    role=role,
    entry_point=os.path.join(BASE_DIR, "preprocess.py"),
    environment={
        "STAGE": "ALPHA",
    },
    metric_definitions=[
        {
            "Name": "validation:accuracy",
            "Regex": r".*validation-accuracy:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*"
        }
    ],
)
xgb_tune.set_hyperparameters(
    verbosity=1,
    objective="binary:logistic",
    scale_pos_weight=0.5,
    rate_drop=0.2,
)
tuner_xgb_train = HyperparameterTuner(
    estimator=xgb_tune,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(min_value=3, max_value=10, scaling_type="Linear"),
        "min_child_weight": IntegerParameter(min_value=1, max_value=5, scaling_type="Linear"),
        "eta": ContinuousParameter(min_value=0.01, max_value=0.2, scaling_type="Logarithmic"),
        "gamma": ContinuousParameter(min_value=0.1, max_value=0.3, scaling_type="Logarithmic"),
        "subsample": ContinuousParameter(min_value=0.75, max_value=1.0, scaling_type="Linear"),
    },
    strategy="Bayesian",
    objective_type="Maximize",
    max_jobs=6,
    max_parallel_jobs=2,
    base_tuning_job_name=f"{base_job_prefix}/mau-conversion-hyper-tune-xgb",
    warm_start_config={},
    metric_definitions=[
        {
            "Name": "validation:accuracy",
            "Regex": r".*validation-accuracy:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?).*"
        }
    ],
)
step_xgb_tune = TuningStep(
    name="TuneMauConversionXbgModel",
    tuner=tuner_xgb_train,
    description="This is tuning XGBoost model",
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
    depends_on=[step_process],
)

Screenshots or logs
Screenshot "Screen Shot 2023-03-02 at 11 49 41 AM" (image not included here)

System information

  • SageMaker Python SDK version: 2.135.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): custom framework on a custom ECR image
  • Framework version: custom framework on a custom ECR image
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Python Package

boto3==1.26.81
botocore==1.29.81
sagemaker==2.135.0
timxieICN added the bug label Mar 2, 2023

timxieICN (Author) commented:
I also tested with a pre-built image provided by AWS, and the error goes away.
So it's quite possible that the previous PR fix does not handle custom images well.

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

timxieICN (Author) commented:

OK - it turns out the issue is caused by the use of warm_start_config in HyperparameterTuner. After upgrading sagemaker and boto3, the previous training job can no longer be used as a warm-start parent. When I start the HPO job fresh, the error is gone.
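For anyone hitting the same thing: note that the repro above passes warm_start_config={}, whereas starting fresh means omitting the argument entirely. A small plain-Python sketch of that distinction (build_tuner_kwargs is a hypothetical helper, not part of the SageMaker SDK):

```python
# Sketch of the workaround: only attach warm_start_config when there is a
# real, compatible parent-job configuration; omit it entirely for a fresh
# start. build_tuner_kwargs is a hypothetical helper, not SageMaker SDK API.
def build_tuner_kwargs(base_kwargs: dict, warm_start_config=None) -> dict:
    """Return HyperparameterTuner kwargs, dropping empty warm-start configs.

    Passing an empty value ({} or None) is not the same as a valid warm-start
    configuration, so it is simply left out.
    """
    kwargs = dict(base_kwargs)
    if warm_start_config:  # only a real config, never {} or None
        kwargs["warm_start_config"] = warm_start_config
    return kwargs
```

With this shape, upgrading the SDK and losing warm-start compatibility with the old parent job just means calling the helper without a warm-start argument.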
