Solving Common Issues in Distributed Hyperparameter Optimization

Easily identify issues in distributed hyperparameter optimization with SigOpt’s Monitor Dashboard.

You’ve spent time gathering data and engineering features, and now you’re ready to pick a model and maximize its performance. The next step is to optimize your model’s hyperparameters. You’ve chosen Bayesian optimization for hyperparameter optimization because it has been shown to achieve better results than grid search and random search, while also requiring fewer evaluations of your model [1].

Since you probably want to complete hyperparameter optimization in a reasonable amount of time, you’re considering some method of distributing the total work across a cluster. If you’re considering distributing the model itself over the cluster, Netflix has a great blog post about the tradeoffs involved in that decision [2]. For now, let’s assume that model training is contained on a single machine, and individual workers train new models for each set of hyperparameters, in a setup similar to the one below.

Distributed hyperparameter optimization using the SigOpt API to request Suggestions (suggested sets of hyperparameters) and report Observations (the value of the model on a given set of hyperparameters). From SigOpt’s docs on parallelization.

However, you still have a few problems:

  • Distributed compute management and monitoring (progress monitoring)
  • Manual interventions and control (custom hyperparameters)

We’ll show you how you can use SigOpt’s API for Bayesian optimization to manage your cluster and solve both of these problems.

Problem #1: Progress Monitoring

The performance benefits of Bayesian optimization come at the cost of additional bookkeeping. Unlike grid search or random search, a master node will also need to store the model performance associated with each set of hyperparameters. Because of this added complexity, the development of a distributed Bayesian hyperparameter optimization cluster requires simultaneously debugging both your job scheduler and the health of your cluster. Spark and AWS offer cluster monitoring tools for the health of your machines, but you need an at-a-glance answer to the question “Is my hyperparameter optimization running smoothly?”

SigOpt’s answer is visualization. In your hyperparameter optimization cluster, each node will repeatedly execute the following steps:

  1. Receive a suggested configuration of hyperparameters
  2. Evaluate the model using the suggested hyperparameters
  3. Report back an observation on the model’s performance

When using SigOpt, steps 1 and 3 are both simple API calls. We built a dashboard of useful graphs on top of information that we infer from API calls, such as the time an API call is made.
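In code, the three-step worker loop looks roughly like the sketch below. It assumes SigOpt’s Python client (`pip install sigopt`); `evaluate_model`, the token, and the experiment ID are placeholders you would replace with your own training code and credentials.

```python
# Sketch of one worker's loop, assuming SigOpt's Python client.
# evaluate_model() and the IDs below are illustrative placeholders.
try:
    from sigopt import Connection  # pip install sigopt
except ImportError:
    Connection = None  # client not installed; the loop below still shows the shape

def evaluate_model(assignments):
    # Placeholder: train your model with the suggested hyperparameters
    # and return the metric you are optimizing.
    return -(assignments["x"] - 3.0) ** 2

def run_worker(conn, experiment_id, budget=20):
    for _ in range(budget):
        # Step 1: receive a suggested hyperparameter configuration
        suggestion = conn.experiments(experiment_id).suggestions().create()
        # Step 2: evaluate the model with those hyperparameters
        value = evaluate_model(suggestion.assignments)
        # Step 3: report an observation of the model's performance
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id, value=value
        )

# On each node in the cluster:
#   conn = Connection(client_token="YOUR_API_TOKEN")
#   run_worker(conn, experiment_id="YOUR_EXPERIMENT_ID")
```

Because every node runs this same loop, adding capacity is just a matter of starting another worker; SigOpt’s API handles the coordination.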

SigOpt’s web dashboards visualize the progress of your model’s performance, so you can quickly diagnose potential system failings. If a classifier’s accuracy was 80% with default hyperparameters but stayed flat at 50% after 20 model evaluations, you might have a problem.

Model evaluation progress for well-running optimization (top) versus poorly-running optimization (bottom).

In addition, SigOpt’s web dashboards visualize the number of performance observations from step 3 that have been reported. Our API allows you to report failed performance for any reason, and we include this data in our visualization. You can also use metadata to slice your data by hostname. It’s easy to glance at a chart, see red in the column for host-2, and know that something is wrong.
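A hedged sketch of how a worker might report results this way, assuming SigOpt’s Python client; the `build_observation` helper and its failure-handling shape are our own illustration, not a prescribed pattern:

```python
# Sketch: report success or failure, tagged with the reporting host so
# the dashboard can slice observations by hostname. Assumes SigOpt's
# Python client; build_observation() is an illustrative helper.
import socket

def build_observation(suggestion_id, value=None, failed=False, hostname=None):
    body = dict(
        suggestion=suggestion_id,
        metadata=dict(hostname=hostname or socket.gethostname()),
    )
    if failed:
        body["failed"] = True  # a failed observation carries no value
    else:
        body["value"] = value
    return body

def report(conn, experiment_id, suggestion, value=None, failed=False):
    conn.experiments(experiment_id).observations().create(
        **build_observation(suggestion.id, value=value, failed=failed)
    )
```

Wrapping your model evaluation in a try/except and calling `report(..., failed=True)` on exceptions is one simple way to surface per-host failures on the dashboard.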

Histogram of performance observations tagged by hostname for well-running optimization (top) versus poorly-running optimization (bottom).

Problem #2: Custom Hyperparameters

Now, let’s say you’ve got one particular set of hyperparameters that you’re dying to try out because your intuition tells you it’s going to be great. The problem is, you’ve already started the distributed hyperparameter optimization and you don’t want to shut anything down. At SigOpt, we have designed our API to act as a distributed job scheduler. We manage storing, queueing, and serving hyperparameter configurations, allowing you to run the same optimization code on every node, whether in the sequential or parallel workflow. This centralized design allowed us to easily build a feature for queuing up custom hyperparameter configurations.

From the website, navigate to an experiment’s monitoring dashboard to queue a suggestion. It will automatically be included in the distributed hyperparameter jobs that are sent out via the API, with no code changes necessary on your end. It even works on mobile.
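If you would rather queue a configuration from code than from the website, the sketch below assumes the SigOpt Python client’s queued suggestions endpoint; the hyperparameter names and values are made up for illustration.

```python
# Sketch: queue a hand-picked hyperparameter configuration, assuming
# the SigOpt Python client's queued suggestions endpoint. Parameter
# names and values here are illustrative, not from a real experiment.
try:
    from sigopt import Connection  # pip install sigopt
except ImportError:
    Connection = None  # client not installed; the call below shows the shape

# The configuration your intuition says will be great:
custom_assignments = dict(learning_rate=0.01, max_depth=6)

# On any machine with your API token:
#   conn = Connection(client_token="YOUR_API_TOKEN")
#   conn.experiments("YOUR_EXPERIMENT_ID").queued_suggestions().create(
#       assignments=custom_assignments,
#   )
```

The queued configuration is then served to the next worker that requests a Suggestion, with no changes to the worker loop itself.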

Queue up Suggestions from the desktop or mobile website.

Conclusion

SigOpt’s API for hyperparameter optimization leaves us well-positioned to build exciting features for anyone who wants to perform Bayesian hyperparameter optimization in parallel. We offer an easy-to-use REST API with R, Java, and Python clients, and all data that is passed through the API is available on both the desktop and mobile websites. We’re constantly building new features and visualizations for monitoring your progress and exploring your data.

[1] Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, George Ke. A Stratified Analysis of Bayesian Optimization Methods. 2016 (PDF)

[2] Alex Chen, Justin Basilico, and Xavier Amatriain. Distributed Neural Networks with GPUs in the AWS Cloud. 2014 (Link)

Alexandra works on everything from infrastructure to product features to blog posts. Previously, she worked on growth, APIs, and recommender systems at Polyvore (acquired by Yahoo). She majored in computer science at Carnegie Mellon University with a minor in discrete mathematics and logic, and during the summers she A/B tested recommendations at internships with Facebook and Rent the Runway.
Alexandra Johnson, Software Engineer