Unknown parameter in TrainingJobDefinition: "Environment" #3627

Closed
DougTrajano opened this issue Feb 2, 2023 · 4 comments

DougTrajano commented Feb 2, 2023

Describe the bug

This is related to bugs I previously reported in #3598 and #3614.

Now it fails during parameter validation:

ParamValidationError: Parameter validation failed:
Unknown parameter in TrainingJobDefinition: "Environment", must be one of: DefinitionName, TuningObjective, HyperParameterRanges, StaticHyperParameters, AlgorithmSpecification, RoleArn, InputDataConfig, VpcConfig, OutputDataConfig, ResourceConfig, StoppingCondition, EnableNetworkIsolation, EnableInterContainerTrafficEncryption, EnableManagedSpotTraining, CheckpointConfig, RetryStrategy, HyperParameterTuningResourceConfig
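
The message is raised client-side: botocore validates the request against the service model bundled with the installed version before anything is sent to SageMaker. Below is a minimal sketch of that kind of check, using the allowed-key list copied from the error above. The helper `unknown_keys` and the trimmed sample request are hypothetical, and this is an illustration rather than botocore's actual implementation.

```python
# Allowed keys copied verbatim from the error message above.
ALLOWED_KEYS = {
    "DefinitionName", "TuningObjective", "HyperParameterRanges",
    "StaticHyperParameters", "AlgorithmSpecification", "RoleArn",
    "InputDataConfig", "VpcConfig", "OutputDataConfig", "ResourceConfig",
    "StoppingCondition", "EnableNetworkIsolation",
    "EnableInterContainerTrafficEncryption", "EnableManagedSpotTraining",
    "CheckpointConfig", "RetryStrategy", "HyperParameterTuningResourceConfig",
}


def unknown_keys(training_job_definition):
    """Return the keys that the bundled service model does not recognise."""
    return sorted(k for k in training_job_definition if k not in ALLOWED_KEYS)


# Hypothetical, heavily trimmed TrainingJobDefinition for illustration only.
sample_definition = {
    "RoleArn": "arn:aws:iam::123456789012:role/example",
    "Environment": {"MLFLOW_FLATTEN_PARAMS": "True"},
}

print(unknown_keys(sample_definition))
```

Because the list of known keys ships with the client library, an older boto3 can reject a field that the SageMaker SDK already sends.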

To reproduce

import mlflow  # needed for mlflow.active_run() in the environment dict below
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

checkpoint_s3_uri = f"s3://{bucket_name}/{prefix}/checkpoints"

instance_type = "ml.g4dn.xlarge" 
# 4 vCPUs, 16 GB RAM, 1 x NVIDIA T4 16GB GPU - $ 0.736 per hour

estimator = PyTorch(
    entry_point="train.py",
    source_dir="ml",
    role=params.sagemaker_execution_role_arn,
    sagemaker_session=sagemaker_session,
    py_version="py38",
    framework_version="1.12.0",
    instance_count=1,
    instance_type=instance_type,
    use_spot_instances=True,
    max_wait=10800,
    max_run=10800,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path="/opt/ml/checkpoints",
    environment={
        "MLFLOW_TRACKING_URI": params.mlflow_tracking_uri,
        "MLFLOW_EXPERIMENT_NAME": params.mlflow_experiment_name,
        "MLFLOW_TRACKING_USERNAME": params.mlflow_tracking_username,
        "MLFLOW_TRACKING_PASSWORD": params.mlflow_tracking_password,
        "MLFLOW_TAGS": params.mlflow_tags,
        "MLFLOW_RUN_ID": mlflow.active_run().info.run_id,
        "MLFLOW_FLATTEN_PARAMS": "True"
    },
    hyperparameters={
        ## If you want to test the code, uncomment the following lines to use smaller datasets
        # "max_train_samples": 100,
        # "max_val_samples": 100,
        # "max_test_samples": 100,
        "num_train_epochs": params.num_train_epochs,
        "early_stopping_patience": params.early_stopping_patience,
        "eval_dataset": "validation",
        "batch_size": params.batch_size,
        "seed": params.seed
    }
)

tuner = HyperparameterTuner(
    estimator,
    max_jobs=18,
    max_parallel_jobs=3,
    objective_type="Maximize",
    objective_metric_name="eval_f1",
    metric_definitions=[
        {
            "Name": "eval_f1",
            "Regex": "eval_f1: ([0-9\\.]+)"
        }
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-3),
        "weight_decay": ContinuousParameter(0.0, 0.1),
        "adam_beta1": ContinuousParameter(0.8, 0.999),
        "adam_beta2": ContinuousParameter(0.8, 0.999),
        "adam_epsilon": ContinuousParameter(1e-8, 1e-6),
        "label_smoothing_factor": ContinuousParameter(0.0, 0.1),
        "optim": CategoricalParameter(
            [
                "adamw_hf",
                "adamw_torch",
                "adamw_apex_fused",
                "adafactor"
            ]
        )
    }
)

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.131.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.12.0
  • Python version: 3.8
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N
DougTrajano added the bug label Feb 2, 2023

repushko (Collaborator) commented Feb 3, 2023

Hello Douglas!
Could you please share which version of the boto3 SDK you are using?

DougTrajano (Author) commented

> Hello Douglas! Could you please share which version of the boto3 SDK you are using?

boto3==1.26.32

repushko (Collaborator) commented Feb 6, 2023

@DougTrajano according to the boto3 changelog, this functionality is supported in versions >= 1.26.53. Could you try updating your boto3 version?

DougTrajano (Author) commented

> @DougTrajano according to the boto3 changelog, this functionality is supported in versions >= 1.26.53. Could you try updating your boto3 version?

yeah! I tested with the latest version of boto3 and it worked. :)
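
For anyone hitting the same error, the fix above can be turned into a small pre-flight check. This is a minimal sketch that assumes the 1.26.53 threshold quoted from the changelog in this thread is correct. The helper names (`parse_version`, `supports_tuner_environment`, `installed_boto3_version`) are made up for illustration and are not part of boto3 or the SageMaker SDK.

```python
import re
from importlib import metadata

# Minimum boto3 version reported in this thread to accept "Environment"
# in a tuning job's TrainingJobDefinition (an assumption taken from the thread).
MIN_BOTO3 = (1, 26, 53)


def parse_version(text):
    """Turn a string like '1.26.32' into a 3-tuple of ints such as (1, 26, 32)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '1.27'
        parts.append(0)
    return tuple(parts)


def supports_tuner_environment(boto3_version):
    """True if the given boto3 version is at least MIN_BOTO3."""
    return parse_version(boto3_version) >= MIN_BOTO3


def installed_boto3_version():
    """Return the installed boto3 version string, or None if it is missing."""
    try:
        return metadata.version("boto3")
    except metadata.PackageNotFoundError:
        return None


version = installed_boto3_version()
if version is None:
    print("boto3 is not installed")
elif not supports_tuner_environment(version):
    print(f"boto3 {version} is older than 1.26.53; try: pip install --upgrade boto3")
else:
    print(f"boto3 {version} should accept Environment in tuning jobs")
```

Running this before creating the `HyperparameterTuner` surfaces the version mismatch early instead of as a `ParamValidationError` later on.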
