Recommand · October 22, 2021

How to test a function that is running many models on SageMaker?


We built a wrapper for training ML models on AWS SageMaker and we are not sure how to test it.

Even instructions for manual testing would be great.

I am not sure what test cases I should define.

Basic Background

We are a development team supplying tools for a few research teams.
xgboost is an ML library for training ML models.
Some of our models are a composition of xgboost models.
The user wants to be able to train a model that is potentially composed of 200 basic xgboost models by calling a single function.
They also want the training to be fast.

What we built

The user’s interface is something like this:

composed_model = ...
token: str = composed_model.train_remotely(features_pandas_dataframe, targets_pandas_dataframe)

# wait a few hours
description: Description = Description.load(token)
description.get_status()  # returns the job's status

model = load_model(token)  # This call succeeds when training completes and the status is "Done"
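For manual testing, the status-polling part of this interface can be exercised in isolation. Below is a minimal sketch of such a polling helper; the `FakeDescription` stub is hypothetical and only mimics the `get_status()` method shown above, so the helper can run without a real training job:

```python
import time

# Hypothetical stub standing in for the wrapper's Description class;
# it replays a fixed sequence of statuses, one per get_status() call.
class FakeDescription:
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)

def wait_for_training(description, poll_seconds=0.0, max_polls=10):
    """Poll get_status() until the job reports "Done" or "Failed"."""
    for _ in range(max_polls):
        status = description.get_status()
        if status in ("Done", "Failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training did not finish within the polling budget")

# Example: a job that goes InProgress -> InProgress -> Done.
desc = FakeDescription(["InProgress", "InProgress", "Done"])
print(wait_for_training(desc))  # prints "Done"
```

Against the real wrapper, the same loop would take the `Description.load(token)` object instead of the stub.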

Extra Background

  • Implementation-wise, we train our xgboost models on SageMaker, a managed AWS service, which we access through an API.
    Upon request, the service starts a machine that downloads the data from S3, trains a basic xgboost model, and uploads the trained model back to S3.

  • Under the hood, our composite model is composed of many bagging regressors (average of xgboost models).

  • Training of models on SageMaker is done with Spot instances or on-demand instances.
    For on-demand instances, we are charged the full cost.
    With Spot training, we are charged about 30% of the cost – BUT AWS may interrupt your training job, pausing it in the middle,
    and resume it later when there are enough free resources again.
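To make the Spot/on-demand distinction concrete, here is roughly how a SageMaker training-job request differs between the two modes. The dict below follows the shape of boto3's `create_training_job` call; all ARNs, URIs, and the instance type are placeholders, and the exact values our wrapper uses are not shown in the question:

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               s3_input, s3_output, use_spot=True):
    """Assemble a create_training_job request (boto3 SageMaker shape).

    Spot training requires EnableManagedSpotTraining plus a wait budget
    (MaxWaitTimeInSeconds >= MaxRuntimeInSeconds) and, to survive
    interruptions, a checkpoint location.
    """
    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,  # e.g. the built-in xgboost image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_input,
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "EnableManagedSpotTraining": use_spot,
    }
    if use_spot:
        # Spot jobs can be paused and resumed, so allow waiting longer
        # than the pure runtime and checkpoint progress to S3.
        request["StoppingCondition"]["MaxWaitTimeInSeconds"] = 7200
        request["CheckpointConfig"] = {"S3Uri": s3_output + "/checkpoints"}
    return request
```

A real client would pass such a dict to `sagemaker_client.create_training_job(**request)`; building it as a plain function also makes it easy to unit-test the request shape without touching AWS.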

What we have come up with so far:

  • Testing with empty dataframes.
  • Stopping the call to "train_remotely()" halfway through and checking what happens to the training jobs that were already submitted.
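Both of those cases can be turned into fast unit tests by running the wrapper's control flow against a fake backend instead of live SageMaker. A sketch, assuming the wrapper submits one job per sub-model; everything here beyond the names in the question (`FakeBackend`, `submit`, the simplified `train_remotely`) is hypothetical:

```python
# Hypothetical fake backend: records submitted jobs instead of calling AWS.
class FakeBackend:
    def __init__(self, fail_after=None):
        self.submitted = []
        self._fail_after = fail_after  # simulate an interruption mid-way

    def submit(self, job_name):
        if self._fail_after is not None and len(self.submitted) >= self._fail_after:
            raise ConnectionError("simulated interruption")
        self.submitted.append(job_name)

def train_remotely(backend, n_models, data):
    """Simplified stand-in for the wrapper: submit one job per sub-model."""
    if len(data) == 0:
        raise ValueError("empty training data")
    for i in range(n_models):
        backend.submit(f"xgb-{i}")
    return "token-123"

# Case 1: empty data is rejected before any job is submitted.
backend = FakeBackend()
try:
    train_remotely(backend, n_models=3, data=[])
except ValueError:
    assert backend.submitted == []

# Case 2: stopping halfway leaves the already-submitted jobs in place,
# which the wrapper (or a cleanup routine) must then account for.
backend = FakeBackend(fail_after=2)
try:
    train_remotely(backend, n_models=5, data=[1, 2, 3])
except ConnectionError:
    pass
assert backend.submitted == ["xgb-0", "xgb-1"]
```

The same fake can be extended to simulate Spot interruptions (a job that pauses and later resumes) to test the status-reporting path without paying for real instances.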