Automated Machine Learning – Institute of Artificial Intelligence

What is Hyperparameter Optimization?

Hyperparameters encode the design decisions we make when building models and pipelines. These decisions drive model performance, yet they are often hand-picked and set heuristically. Hyperparameter Optimization (HPO) aims to find a well-performing configuration for the dataset at hand, covering the machine learning model, its hyperparameters and other data processing steps, in an automated, efficient and data-driven manner.

Why does Hyperparameter Optimization matter in practice?

It is well established that good hyperparameters can boost the performance of our models. More importantly, HPO frees the human expert from a tedious, expensive and error-prone manual tuning process.

What do we offer in Hyperparameter Optimization?

Aside from our substantial and continuous contributions to this field, we actively develop SMAC3, a versatile package for hyperparameter optimization, low-dimensional continuous global optimization and algorithm configuration.

Bayesian Optimization

The loss landscape of an HPO problem is typically unknown (i.e., we need to optimize a black-box function) and expensive to evaluate. Bayesian Optimization (BO) is designed as a global optimization strategy for expensive black-box functions, aimed at navigating the search space efficiently. To do so, BO first fits a surrogate model to the configurations evaluated so far to estimate the shape of the target loss landscape, and then uses this model to suggest the configuration to be evaluated in the next iteration. Because it trades off exploitation and exploration based on the surrogate model, BO is well known for its sample efficiency.
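
The loop described above can be illustrated with a minimal sketch. It is not how SMAC3 implements BO: the surrogate here is a crude kernel-weighted average with a distance-based uncertainty (swapped in for a Gaussian process or random forest to stay dependency-free), the acquisition function is a lower confidence bound, and `objective` is a hypothetical stand-in for an expensive black-box loss.

```python
import math
import random


def objective(x):
    # Hypothetical stand-in for an expensive black-box loss (minimum at x = 0.3).
    return (x - 0.3) ** 2


def surrogate(x, history, bandwidth=0.1):
    # Kernel-weighted mean of the observed losses, plus a crude uncertainty
    # that grows with the distance to the nearest evaluated point.
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi, _ in history]
    total = sum(weights)
    if total < 1e-12:
        mean = sum(yi for _, yi in history) / len(history)
    else:
        mean = sum(w * yi for w, (_, yi) in zip(weights, history)) / total
    uncertainty = min(abs(x - xi) for xi, _ in history)
    return mean, uncertainty


def lower_confidence_bound(x, history, kappa=1.0):
    # Low predicted loss (exploitation) and high uncertainty (exploration)
    # both make a candidate more attractive.
    mean, uncertainty = surrogate(x, history)
    return mean - kappa * uncertainty


def bayesian_optimization(n_init=3, n_iter=20, seed=0):
    rng = random.Random(seed)
    # Start from a few random evaluations.
    history = [(x, objective(x)) for x in (rng.random() for _ in range(n_init))]
    candidates = [i / 200 for i in range(201)]
    for _ in range(n_iter):
        # Suggest the candidate that minimizes the acquisition function.
        x_next = min(candidates, key=lambda x: lower_confidence_bound(x, history))
        history.append((x_next, objective(x_next)))
    # Return the best (configuration, loss) pair observed.
    return min(history, key=lambda pair: pair[1])
```

Calling `bayesian_optimization()` returns the best configuration found and its loss. The parameter `kappa` controls how strongly the search favors unexplored regions over regions the surrogate already predicts to be good.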

Combined Algorithm Selection and Hyperparameter Optimization (CASH)

An AutoML system needs to select not only the optimal hyperparameter configuration of a given model but also which model to use. This problem can be regarded as a single HPO problem with a hierarchical configuration space, where the top-level hyperparameter decides which algorithm to choose and all other hyperparameters depend on this one. To deal with such complex and structured configuration spaces, we apply, for example, random forests as surrogate models in Bayesian Optimization.
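
A hierarchical configuration space of this kind can be sketched in a few lines. The algorithm names, hyperparameter names and ranges below are hypothetical and chosen only for illustration; a real system would typically use a dedicated configuration space library.

```python
import random

# Hypothetical search space: each algorithm owns its own hyperparameter ranges.
SEARCH_SPACE = {
    "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
    "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 20)},
}


def sample_configuration(rng):
    # Top-level hyperparameter: which algorithm to use.
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    config = {"algorithm": algorithm}
    # Child hyperparameters are only active for the chosen algorithm.
    for name, (low, high) in SEARCH_SPACE[algorithm].items():
        if isinstance(low, int):
            config[name] = rng.randint(low, high)
        else:
            config[name] = rng.uniform(low, high)
    return config
```

Every sampled configuration contains only the hyperparameters of the selected algorithm, which is the conditional structure a surrogate model for CASH has to handle.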

Multi-Fidelity HPO

The increasing data size and model complexity make it even harder to find a reasonable configuration within a limited computational or time budget. Multi-fidelity techniques aim to reduce the cost of the entire HPO process substantially by approximating the performance of an expensive “black box” model with a cheap (but possibly noisy) evaluation proxy. For example, we can use a small subset of the dataset or train a DNN for only a few epochs. Reducing the cost of each evaluation allows us to explore the hyperparameter landscape more extensively, but requires carefully considering the noise and bias incurred by the approximation. We contributed to the field, e.g., with MASIF, where we learn to jointly interpret multiple cheap approximations in the form of learning curves from past experiences with a set of algorithms. Similarly, we introduced AutoPyTorch, which allows optimizing ML pipelines and their hyperparameters, and helps in portfolio construction and ensembling using multi-fidelity.
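
One widely used multi-fidelity scheme is successive halving, sketched below as an illustration; it is not the method used in MASIF or AutoPyTorch. The `evaluate` function is a hypothetical proxy whose noise shrinks as the budget (e.g., epochs or subset size) grows.

```python
import random


def evaluate(config, budget, rng, noise):
    # Hypothetical cheap proxy: the true loss plus noise that shrinks with budget.
    true_loss = (config - 0.7) ** 2
    return true_loss + rng.gauss(0.0, noise / budget)


def successive_halving(configs, min_budget=1, eta=2, noise=0.1, seed=0):
    rng = random.Random(seed)
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Evaluate all survivors at the current (cheap) budget and rank them.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget, rng, noise))
        # Keep the best 1/eta fraction and give them a larger budget.
        survivors = ranked[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]
```

Most configurations are discarded after cheap, noisy evaluations, and only the promising ones receive larger budgets. With high noise at low budgets, a good configuration can be eliminated early, which is the bias the paragraph above warns about.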

HPO Benchmarks

AutoML research crucially depends on reliably testing the search strategies. The main challenge to establishing the significance and practical relevance of a new method is the cost of running repeated HPO experiments, where each model evaluation is already expensive. To establish reliable and reproducible research with an increased turn-around speed, we develop benchmarks like HPOBench that reduce the computational burden on researchers.

What is Neural Architecture Search?

Neural Architecture Search (NAS) automates the architecture design process of neural networks. NAS approaches optimize the topology of the networks, including how to connect nodes and which operators to choose. User-defined optimization metrics can include accuracy, model size or inference time to arrive at an optimal architecture for specific applications. Due to the extremely large search space, traditional AutoML algorithms based on evolution or reinforcement learning tend to be computationally expensive. Hence, recent research on the topic has focused on exploring more efficient ways for NAS. In particular, recently developed gradient-based and multi-fidelity methods have provided a promising path and boosted research in these directions.

Why is Neural Architecture Search important?

How a neural network is wired and what kind of operations it uses changes its capacity and efficiency in learning particular functions. For instance, using convolutional operations instead of fully connected layers enables a neural network to learn efficiently from locally similar data like images or audio. Self-attention produces very capable models for language or time-series data. The discovery of such inductive biases, tailored either to a specific problem or to a domain, has been a major driver of well-performing models. While many of these inventions have been handcrafted, Neural Architecture Search holds the promise of discovering them in a principled manner.

What do we offer in Neural Architecture Search?

NAS Search Space

Unlike traditional machine learning methods that typically optimize the hyperparameters of a pre-developed machine learning model, NAS aims at optimizing the topology of the network and has the potential to find new architectures. Since the NAS search space contains all candidate architectures that a NAS algorithm can find, a small search space might simplify the search procedure. However, it might also introduce human bias and prevent the NAS framework from discovering novel architectures. A typical search space is a chain-structured search space. However, skip connections have shown great success in modern deep neural network designs, which inspired the introduction of multi-branch search spaces. Finally, users can stack the same motifs (or cells) multiple times to construct a new architecture, which shrinks the search space to a single motif instead of the entire architecture.
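
The effect of a cell-based search space can be made concrete with a small sketch. The operation names and the encoding (each node picks one earlier node as input and one operation on that edge) are assumptions made for illustration and do not correspond to a specific published search space.

```python
import random

# Hypothetical set of candidate operations.
OPERATIONS = ["conv3x3", "conv5x5", "max_pool", "skip_connect"]


def sample_cell(num_nodes, rng):
    # Each node (except the input node 0) picks one earlier node as its
    # input and one operation applied on that edge.
    cell = []
    for node in range(1, num_nodes):
        source = rng.randrange(node)
        cell.append((source, node, rng.choice(OPERATIONS)))
    return cell


def build_architecture(cell, num_cells):
    # The full architecture stacks the same cell several times.
    return [cell] * num_cells


def search_space_size(num_nodes):
    # Node i has i possible inputs and len(OPERATIONS) possible operations.
    size = 1
    for node in range(1, num_nodes):
        size *= node * len(OPERATIONS)
    return size
```

Under these assumptions, `search_space_size(4)` is 384. Stacking the cell 8 times still leaves only 384 candidates to search over, whereas choosing each of the 8 cells independently would give 384**8 candidates.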

NAS Search Strategy

Given a NAS search space and a series of pre-evaluated architectures, a NAS search strategy provides a new candidate architecture to be evaluated in the next step. Similar to the HPO process, a NAS search strategy needs to trade off exploration and exploitation. Therefore, the optimization algorithms used in HPO can also be applied as NAS search strategies, e.g., Bayesian Optimization, Evolutionary Algorithms or Reinforcement Learning. Additionally, since network weights can be optimized with gradient descent, we can apply the same approach to architecture parameters by relaxing the discrete parameters into continuous variables and optimizing them with gradient descent.
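
The continuous relaxation mentioned above can be sketched on a toy problem. The candidate operations are scalar functions rather than network layers, the gradients are approximated by finite differences to avoid a dependency on an autodiff library, and all names are hypothetical. The discrete choice of one operation is replaced by a softmax-weighted mixture whose weights are trained by gradient descent.

```python
import math

# Hypothetical candidate operations on a single edge.
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def softmax(alphas):
    peak = max(alphas)
    exps = [math.exp(a - peak) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas):
    # Continuous relaxation: a weighted sum of all candidate operations.
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))


def loss(alphas, data):
    return sum((mixed_op(x, alphas) - y) ** 2 for x, y in data) / len(data)


def search(data, steps=200, lr=0.5, eps=1e-5):
    alphas = [0.0] * len(OPS)  # architecture parameters, one per operation
    for _ in range(steps):
        base = loss(alphas, data)
        grads = []
        for i in range(len(alphas)):
            shifted = list(alphas)
            shifted[i] += eps
            # Forward finite difference as a stand-in for backpropagation.
            grads.append((loss(shifted, data) - base) / eps)
        alphas = [a - lr * g for a, g in zip(alphas, grads)]
    # Discretize: keep the operation with the largest architecture parameter.
    names = list(OPS)
    best = max(range(len(alphas)), key=alphas.__getitem__)
    return names[best], alphas
```

For data generated by `y = 2 * x`, e.g. `search([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])`, the architecture parameter of the `double` operation grows until it dominates the mixture, and the final discretization step selects it.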

NAS Evaluation Methods

In the early phase, NAS researchers treated neural networks like any other traditional machine learning model and evaluated the performance of an architecture only after training it fully or via multi-fidelity evaluation. However, the weights of one architecture can be “inherited” by another architecture by copying the weights of the existing network to the new network. Approaches that follow this idea include network morphisms, which update the architecture of a network without changing the function that the network represents, and the one-shot model, which contains multiple sub-networks whose relative ranking is determined using the weights they inherit from the super network.
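
A function-preserving change of architecture can be illustrated with a deliberately simplified sketch: a purely linear network (no activation functions, an assumption made to keep the example short) is deepened by inserting an identity layer, while all existing weights are inherited unchanged.

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def forward(layers, x):
    # Purely linear network: apply one weight matrix after another.
    for weights in layers:
        x = matvec(weights, x)
    return x


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def deepen(layers, position):
    # Insert an identity layer after layers[position]; the existing weights
    # are reused, so the child network computes the same function.
    width = len(layers[position])
    return layers[: position + 1] + [identity(width)] + layers[position + 1 :]
```

For a parent network `parent = [[[1.0, 2.0], [0.5, -1.0]], [[1.0, 1.0]]]`, the child `deepen(parent, 0)` has one more layer but produces the same output for every input, so it can continue training from the parent's performance instead of starting from scratch.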

What is Algorithm Configuration?

Similarly to HPO, the algorithm configuration problem is to determine a well-performing parameter configuration of a given algorithm, but it is concerned with finding such a configuration across a given set of problem instances such as tasks or datasets.
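
The difference from single-task HPO lies in the objective: a configuration is scored by its aggregated cost over all instances. The following sketch uses plain random search and a hypothetical `run_algorithm` cost function in which each instance has its own optimum; it is not how SMAC3 works internally.

```python
import random


def run_algorithm(param, instance):
    # Hypothetical cost of running the algorithm with `param` on `instance`;
    # each instance has its own optimal parameter value.
    return (param - instance) ** 2


def aggregate_cost(param, instances):
    # AC objective: mean cost across the whole instance set.
    return sum(run_algorithm(param, inst) for inst in instances) / len(instances)


def configure(instances, n_samples=200, seed=0):
    rng = random.Random(seed)
    candidates = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
    return min(candidates, key=lambda p: aggregate_cost(p, instances))
```

For `instances = [0.2, 0.4, 0.9]`, the configuration returned by `configure(instances)` lies close to 0.5. It is not optimal for any single instance, but it performs best on average across the set.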

Why is Algorithm Configuration (AC) important?

In practice, we have often applied many algorithms to a battery of problems and have developed some intuition about which algorithm works well when, and maybe even a faint idea of which hyperparameters might suit a new problem. AC takes this task off our hands and can suggest an algorithm and its hyperparameters for a new problem on the fly.

What do we offer in Algorithm Configuration?

Classical Algorithm Configuration

With SMAC3, we provide and maintain a state-of-the-art algorithm configurator that is able to optimize hyperparameters across a set of tasks or problem instances. Furthermore, it implements special techniques that make it very efficient when runtime is the optimization metric.

Dynamic Algorithm Configuration

Algorithm Configuration is a powerful approach to achieve the best performance across a set of instances. However, classical approaches to this problem ignore the iterative nature of many algorithms. Dynamic Algorithm Configuration (DAC) generalizes over prior optimization approaches and handles hyperparameters that need to be adjusted across multiple time steps as well as across instances. To use this framework, we need to move from the classical view of algorithms as black boxes to a gray-box or even white-box view in order to unleash the full potential of AI algorithms with DAC.
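
The difference between a static configuration and a dynamic policy can be shown on a toy iterative algorithm. In this sketch, the target algorithm is subgradient descent on f(x) = |x|, the hyperparameter is the step size, and both policies are hand-written for illustration; in DAC the dynamic policy would be learned rather than specified by hand.

```python
def target_algorithm(policy, x0=5.3, steps=20):
    # Toy iterative algorithm: subgradient descent on f(x) = |x|.
    x = x0
    for t in range(steps):
        # Gray-box view: the algorithm exposes its internal state.
        state = {"step": t, "loss": abs(x)}
        step_size = policy(state)  # hyperparameter chosen anew at every step
        gradient = (x > 0) - (x < 0)  # sign of x
        x -= step_size * gradient
    return abs(x)


def static_policy(state):
    # Classical configuration: one value for the entire run.
    return 1.0


def dynamic_policy(state):
    # Dynamic configuration: the step size adapts to the observed state.
    return 0.5 * state["loss"]
```

With the static step size, the iterate ends up oscillating around the optimum, while the state-dependent policy keeps shrinking the step and drives the loss close to zero: `target_algorithm(dynamic_policy)` is far smaller than `target_algorithm(static_policy)`.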