Hyper-parameter optimization and fine-tuning have evolved into prevalent, standard topics in both academia and industry, helping to lower the technical barriers that common users face when implementing deep neural networks. Hyper-parameter optimization is a predominant constituent of deep learning, discovering the optimum hyper-parameters for a neural network's structure and its training process.
Selecting the best hyper-parameter configuration for a deep learning model has a great impact on its performance in any application. Hyper-parameter optimization has recently become increasingly necessary owing to the growing pace of deep learning development, which spans both larger neural networks built for improved accuracy and lightweight models that reach acceptable accuracy with fewer weights and parameters.
Hyper-parameter tuning plays a significant role when a model has a complicated structure, which means more hyper-parameters to tune, or a strictly designed structure, which means every hyper-parameter must be tuned within a stringent range to reproduce the reported accuracy. Beyond research, many practical applications of deep learning demand an automated hyper-parameter tuning process to resolve the complications of manual approaches.
Common Hyper-parameters in Deep Neural Networks: Hyper-parameters governing model structure and training receive particular attention in hyper-parameter optimization because of their powerful effect on the weights learned during training. A few notable hyper-parameters in deep neural networks are discussed below.
Learning Rate - The learning rate is one of the most essential hyper-parameters in deep learning models. Choosing the optimal learning rate, or an optimal schedule for it, is challenging because it varies from task to task. A small learning rate leads to slow convergence, while a large one can prevent the model from converging at all. A suitable learning rate lets the objective function converge to a global minimum within an acceptable time.
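As a toy illustration of these regimes, consider plain gradient descent on f(x) = x²; the specific rates below are illustrative choices, not recommendations:

```python
# Effect of the learning rate on plain gradient descent for f(x) = x^2,
# whose gradient is 2x. Too small a rate converges slowly; a suitable
# rate converges quickly; too large a rate overshoots and diverges.

def gradient_descent(lr, steps=50, x0=5.0):
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return x

small = gradient_descent(lr=0.01)  # converges, but slowly
good = gradient_descent(lr=0.1)    # converges quickly toward 0
large = gradient_descent(lr=1.1)   # each update overshoots: |x| grows
```

With lr = 1.1 the update multiplies x by (1 - 2.2) = -1.2 each step, so the iterate oscillates with growing magnitude instead of converging.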
Optimizer - Optimizers play a critical role in both accuracy and training speed. Optimizer-related hyper-parameters include the choice of optimizer itself, the mini-batch size, the momentum coefficient, and more.
• Momentum speeds up gradient vectors in the right directions, yielding faster convergence and resolving the problem of oscillation.
• Root mean square propagation (RMSprop) is one of the most broadly used optimizers for training deep neural networks. It accelerates gradient descent in a manner similar to Adagrad and Adadelta, with superior performance.
• Adaptive moment estimation (Adam) rapidly attains good results on most neural network architectures. It combines gradient descent with momentum and RMSprop, adding bias correction and a momentum term to the latter, which lets it slightly outperform RMSprop in the late stage of optimization.
• Mini-batch gradient descent splits the training dataset into small batches that are used to calculate model error and update model coefficients; it also accelerates the training process for deep learning models.
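The Adam update described above can be sketched in a few lines; the hyper-parameter values below are the commonly cited defaults, and the quadratic objective is an illustrative stand-in for a real loss:

```python
import numpy as np

# Sketch of the Adam update rule: a momentum-style first-moment
# estimate m, an RMSprop-style second-moment estimate v, and bias
# correction of both before the parameter update.

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop-style term
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) with repeated Adam steps.
w, m, v = np.array(2.0), 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Because the first-moment-to-second-moment ratio is roughly ±1 while gradients keep a consistent sign, the effective step size stays near lr until the iterate approaches the minimum, after which both moments shrink.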
Model design-oriented hyper-parameters - The number of hidden layers is an important hyper-parameter that establishes the overall structure of a neural network and influences the final outcome. The number of neurons in each layer should also be examined attentively: too few neurons in the hidden layers cause underfitting because the model lacks complexity, whereas too many may cause overfitting and increase training time.
• Activation functions are prominent in deep learning for introducing nonlinearity into the outputs of neurons, which is conducive to representing complicated features of the data. Activation functions must be differentiable so that weight gradients can be computed and backpropagation performed.
• Dropout is a technique that makes a deep neural network less sensitive to specific weights by randomly deactivating neurons with a stated probability.
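A minimal sketch of dropout, assuming the common "inverted" formulation in which surviving activations are rescaled by 1/(1-p) so their expected value is unchanged:

```python
import numpy as np

# Inverted dropout: zero each activation with probability p at training
# time and scale the survivors by 1/(1-p), so the expected activation
# matches the no-dropout case and nothing changes at inference time.

def dropout(activations, p, rng):
    mask = rng.random(activations.shape) >= p  # keep with probability 1-p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)            # stand-in for a layer's activations
h_dropped = dropout(h, p=0.5, rng=rng)
# roughly half the entries are zero; the mean stays close to 1
```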
Hyper-parameter Optimization Methods: Hyper-parameter optimization techniques have emerged to solve many optimization problems in learning models, namely grid and random search, trial and error, gradient-based optimization, multi-fidelity optimization, and Bayesian optimization. Some recently used methods are discussed here.
• Grid search - Grid search is the most widely used and straightforward search algorithm for hyper-parameter optimization, owing to its mathematical simplicity, and it yields accurate results as long as sufficient resources are provided. It performs an exhaustive search over a specified hyper-parameter set, so it is practical only for a small number of hyper-parameters with a finite search space.
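A minimal grid-search sketch; the grid values and the validation_score proxy are hypothetical stand-ins for training and validating a real model:

```python
import itertools

# Exhaustive grid search: evaluate every combination of the listed
# hyper-parameter values and keep the best-scoring configuration.

def validation_score(lr, batch_size):
    # hypothetical proxy for validation accuracy; peaks at (0.01, 32)
    return -abs(lr - 0.01) - abs(batch_size - 32) / 1000

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}
best = max(
    itertools.product(grid["lr"], grid["batch_size"]),
    key=lambda cfg: validation_score(*cfg),
)
# best == (0.01, 32); the 3 x 3 grid required 9 full evaluations
```

The cost grows multiplicatively with each added hyper-parameter, which is why grid search is confined to small, finite search spaces.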
• Random search - Random search improves on grid search by sampling hyper-parameters at random from specified distributions over the feasible values. Random search generally needs more time and computational resources than guided search algorithms.
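Random search can be sketched analogously; the sampling distributions and scoring proxy are illustrative assumptions:

```python
import random

# Random search: draw each trial configuration from per-parameter
# distributions instead of enumerating a fixed grid.

def validation_score(lr, dropout_p):
    return -abs(lr - 0.01) - abs(dropout_p - 0.3)  # hypothetical proxy

rng = random.Random(0)
best_cfg, best_score = None, float("-inf")
for _ in range(50):  # trial budget
    cfg = (10 ** rng.uniform(-4, -1),  # log-uniform learning rate
           rng.uniform(0.0, 0.5))      # uniform dropout probability
    score = validation_score(*cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
```

Sampling the learning rate log-uniformly is a common choice when plausible values span several orders of magnitude.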
• Bayesian optimization - Bayesian optimization has become popular for global optimization problems. It aims to discover the global optimum in the minimum number of trials via a sequential model-based approach, balancing exploration and exploitation to avoid being trapped in a local optimum. Gaussian processes, the expected improvement and other acquisition functions, random forests, and tree-structured Parzen estimators appear in its recent variants.
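A compact sketch of the sequential model-based loop, assuming a Gaussian-process surrogate with an RBF kernel and the expected-improvement acquisition function on a 1-D toy objective; the kernel, noise level, and candidate grid are simplifying assumptions, not a production setup:

```python
import math
import numpy as np

def objective(x):
    return (x - 0.3) ** 2  # stand-in for an expensive evaluation

def rbf(a, b, length=0.2):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    # Gaussian-process posterior mean and std at the query points.
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf(x_train, x_query)
    mu = k_star.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    # Expected improvement below best_y (we are minimizing).
    z = (best_y - mu) / sigma
    cdf = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best_y - mu) * cdf + sigma * pdf

candidates = np.linspace(0.0, 1.0, 101)
x_obs = np.array([0.0, 1.0])        # two initial evaluations
y_obs = objective(x_obs)
for _ in range(8):                  # sequential model-based loop
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
```

Each iteration fits the surrogate to all evaluations so far and picks the candidate that maximizes expected improvement, trading off low predicted loss (exploitation) against high predictive uncertainty (exploration).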
• Multi-fidelity optimization - Multi-fidelity optimization methods are widely used to cope with limited time and resources by evaluating on a subset of the original dataset or a subset of the features. They span low-fidelity evaluations, high-fidelity evaluations, and combinations of both for practical deep learning applications. Bandit-based algorithms are categorized as multi-fidelity optimization; recent examples are successive halving and Hyperband, which have proved successful on deep learning optimization problems.
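Successive halving can be sketched as follows; the configurations and the budget-dependent loss are hypothetical stand-ins for real training runs:

```python
# Successive halving: start many configurations on a small budget,
# keep the best fraction at each rung, and multiply the budget for
# the survivors until one configuration remains.

def loss(config, budget):
    # hypothetical: loss falls as the budget grows; lower config is better
    return config / (1 + budget)

def successive_halving(configs, min_budget=1, eta=2):
    budget = min_budget
    while len(configs) > 1:
        scores = {c: loss(c, budget) for c in configs}
        keep = max(1, len(configs) // eta)        # keep the best 1/eta
        configs = sorted(configs, key=scores.get)[:keep]
        budget *= eta                             # grow the budget
    return configs[0]

best = successive_halving(configs=list(range(1, 17)))
# 16 -> 8 -> 4 -> 2 -> 1 survivors; with this loss, best == 1
```

Hyperband wraps this routine in an outer loop that varies the trade-off between the number of starting configurations and the minimum budget each one receives.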
Open Issues, Challenges, and Future Research Perspectives:
Even though many hyper-parameter optimization algorithms and practical frameworks have evolved, some hindrances still need to be contended with, and various aspects of this field must be improved. Some challenges and future research directions are discussed below to further advance hyper-parameter optimization for deep learning implementations.
• Costly objective function evaluations - hyper-parameter optimization algorithms must reduce the evaluation time on large-scale datasets.
• Complex search space - hyper-parameter optimization methods should lower execution time over the high-dimensional search spaces that large numbers of hyper-parameters create.
• Strong anytime performance - hyper-parameter optimization techniques should detect optimal or near-optimal, globally optimum hyper-parameters under both very limited and sufficient budgets.
• Comparability - a standard set of benchmarks must exist so that various hyper-parameter optimization algorithms can be evaluated and compared equitably.
• Over-fitting and generalization - the optimal hyper-parameters determined by hyper-parameter optimization methods should generalize, constructing effective deep learning models on unseen data.
• Randomness and scalability - hyper-parameter optimization techniques must reduce the randomness of the obtained outcomes and be scalable to multiple platforms.
• Continuous updating capability - hyper-parameter optimization algorithms must consider their capacity to monitor and update optimal hyper-parameter combinations on continuously updated data.