Recently, I read Lindauer and Hutter (2020) [Best Practices for Scientific Research on Neural Architecture Search. JMLR 21(243): 1-18]. It seems I missed the pre-print on arXiv last year; the paper itself was published only recently. They also released a NAS checklist.




Introduction

Their title is “Best Practices for Scientific Research on Neural Architecture Search”. Basically, they aim to provide something similar to Joelle Pineau’s Machine Learning Reproducibility Checklist, focusing on neural architecture search instead of general ML (ironically, Joelle Pineau seems to be the editor of this paper). In general, I would consider the peer-review and scientific publication process close to broken beyond repair - not just in machine learning. Further, reproducibility is a problem in almost all fields of science, but computer science might be the field in which this problem can be fixed/mitigated most easily.

As a nice consequence of my AutoML series (mainly from 2018) I ended up implementing AutoML solutions and providing guidance to customers. Therefore, I have a slightly different perspective on this matter and I want to share it with you. Further, I would like to add that in practice there is no such clear distinction between “AutoML” and “NAS” as there is in a purely academic setting. It is about building systems with satisfactory performance.

For simplicity, I’ll simply follow the structure of the paper. NB: some of my comments relate more generally to machine learning and not necessarily to neural architecture search only.

Best Practices for Releasing Code

Reproducing someone else’s NAS experiments is often next to impossible without code

Full acknowledgement!

In general, you would expect industrial research groups to be less likely to release code than universities and, if they do, you would expect somewhat strict licensing. Currently, the opposite seems to be true. Industry seems to be more likely to release code (and datasets) under free licenses (free as in the BSD sense). Nevertheless, I have failed more than once to get released code running and to reproduce anything even close to the published performance - especially when training from scratch. A final remark regarding code released by industry researchers: a very subjective impression is that its quality is a lot higher than that of code (accompanying a paper) released by academic institutions. It might also simply be a signal-to-noise-ratio issue because so much nonsense is published.

A general remark on the two main points of releasing code for (training) pipelines and NAS methods: this is a lot trickier than it sounds. In practice you may not want to fix random_state, in order to actually use some randomness. Ideally, data is shuffled after each epoch and augmentation libraries (e.g. albumentations) apply an augmentation with a certain probability. In that sense, a full record would have to be provided that allows for fully deterministic replay (e.g. the exact input image that is fed into the first layer of a NN etc.).
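To illustrate, here is a minimal sketch (assuming PyTorch and albumentations; the values are illustrative, not taken from the paper) of what pinning down the obvious sources of randomness might look like - and why a fixed seed alone still does not give you a record of the exact tensor that reaches the first layer:

```python
import random
import numpy as np
import torch
import albumentations as A

def seed_everything(seed: int = 42) -> None:
    """Fix the obvious sources of randomness for a (mostly) deterministic replay."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True  # may slow training down
    torch.backends.cudnn.benchmark = False

seed_everything(42)

# Augmentations are applied with a probability, so the exact image fed into the
# network depends on the RNG state at call time, the shuffle order, worker
# scheduling, etc. - all of which a "full record" would have to capture.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])
```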

Release Code for the Training Pipeline(s) you use

Yes, indeed it would be helpful if the full training pipeline were released. Please read my general remarks regarding this matter in the section above.

Therefore, the final performance results of paper A and paper B are incomparable unless they use the same training pipeline. Releasing your training pipeline ensures that others can meaningfully compare against your results

Yes, and no. I should mention that Lindauer and Hutter seem to include items such as the choice of optimizer and activation functions, as well as a few other things, in the training pipeline. These are certainly items I would include in the NAS method since, ideally, NAS selects the most suitable ones or even comes up with, e.g., activation functions on its own. However, I acknowledge the importance of properly documenting the training pipeline and, ideally, making it available as code.

Especially the training pipeline for a dataset like CIFAR-10 should be trivial to make available, since this routinely consists of a single file relying only on open-source Tensorflow, Pytorch or MXNet code.

Ah, here is problem number one: toy problems! Overfitting, or should I say over-engineering, a solution to a specific benchmark dataset. To be fair, the authors mention this in section 5 (The Need for Proper NAS Benchmarks).

Release Code for Your NAS Method

releasing the code for your NAS method allows others to also use it on new datasets.

Yes. Like I wrote above, this is trickier than it sounds in practice. You’ll hardly ever find a paper cracking down on all the junk published - it certainly wouldn’t pass any review process.

As an additional motivation next to following good scientific practice: papers with available source code tend to have far more impact and receive more citations than those without, because other researchers can build upon your code base.

The number of citations hardly correlates with quality - a few “standard/reference” papers excepted, and those usually took a long time to take off/be accepted (my very personal opinion).

Don’t Wait Until You’ve Cleaned up the Code; That Time May Never Come

We encourage anyone who can do so to simply put a copy of the code online as it was used, appropriately labeled as prototype research code

Just make sure you remove all traces that pose a cyber-security risk or could cause harm otherwise (e.g. AWS tokens, passwords etc.).
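A minimal sketch of what such a pre-release check could look like (the patterns and file handling below are illustrative assumptions of mine, not an exhaustive secret scanner; dedicated tools exist for this):

```python
import re
from pathlib import Path

# Illustrative patterns only - real secret scanners cover far more cases.
SUSPICIOUS_PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private key header": re.compile(r"-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----"),
    "password assignment": re.compile(r"password\s*=\s*['\"].+['\"]", re.IGNORECASE),
}

def scan_repo(root: str = ".") -> None:
    """Grep-style scan over the files of a repository before publishing research code."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(text):
                print(f"{path}: possible {name}")

if __name__ == "__main__":
    scan_repo(".")
```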

Best Practices for Comparing NAS Methods

Use the Same NAS Benchmarks, not Just the Same Datasets

As long as only toy datasets are used (e.g. CIFAR-10), it is pointless to have strict boundaries for everything else (e.g. the search space). To be fair, the authors mention this in section 5 (The Need for Proper NAS Benchmarks).

However, the proposed NAS benchmarks certainly impose some restrictions that make it easier to compare NAS methods. Let’s have a look at the definition of a NAS benchmark provided by the authors.

A NAS benchmark consists of a dataset (with a pre-defined training-test split), a search space, and available runnable code with pre-defined hyperparameters for training the architectures.

IMHO this restricts the possible output of a NAS. Imagine you’re training a neural network with a certain architecture that seems to benefit from Adam as the optimizer of choice rather than, e.g., SGD. I highly recommend reading this blog post. If we consider the optimizer (and its hyperparameters) as hyperparameters in the context of NAS, then we are likely to restrict the possible outcomes of a NAS by restricting its inputs. In other words, the input restrictions should cause convergence to a few possible outcomes that could be considered somewhat optimal given the search space provided. It is unclear whether these are caused by getting stuck in a local maximum or whether the search converges to a global maximum.
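To make the point concrete, here is a hypothetical sketch (plain Python, no particular NAS library assumed; the dimensions and values are my own) of a search space in which the optimizer and its learning rate are part of what the search may decide, rather than being fixed by the benchmark:

```python
import random

# Hypothetical search space: architecture choices *and* training choices.
SEARCH_SPACE = {
    "num_layers": [4, 8, 12, 16],
    "activation": ["relu", "swish", "gelu"],
    "optimizer": ["sgd", "adam", "adamw"],      # dropped if the benchmark fixes the training pipeline
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],  # likewise
}

def sample_configuration(seed=None) -> dict:
    """Draw one candidate configuration uniformly at random from the space."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

print(sample_configuration(seed=0))
```

A benchmark that pins down the training pipeline effectively removes the "optimizer" and "learning_rate" dimensions, which is exactly the restriction on possible outcomes discussed above.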

Run Ablation Studies

[…] we should understand why the final results are better than before.

ML literature in general lacks a proper analysis of experimental results. There is nothing wrong with simply publishing experimental results, but at some point a follow-up publication should (try to) explain previous results obtained by pure experimentation.

Use the Same Evaluation Protocol for the Methods Being Compared

No comment.

Evaluate Performance as a Function of Compute Resources

This is certainly a very important factor (see Practicality section).

This is highly underrated. I would like to see a simple grid search (perhaps with some Bayesian aspects) added to this comparison as a baseline.
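For reference, such a baseline can be as simple as the following sketch (the parameter grid and the train_and_evaluate stand-in are illustrative placeholders of my own, not from the paper):

```python
import itertools

# Hypothetical grid - in a real comparison this would mirror the NAS search space.
param_grid = {
    "depth": [8, 14, 20],
    "width_multiplier": [0.5, 1.0, 2.0],
    "learning_rate": [1e-3, 1e-2],
}

def train_and_evaluate(config: dict) -> float:
    """Stand-in for an actual training run; returns a dummy score so the sketch runs."""
    return -abs(config["learning_rate"] - 5e-3) - 0.01 * abs(config["depth"] - 14)

best_config, best_score = None, float("-inf")
for values in itertools.product(*param_grid.values()):
    config = dict(zip(param_grid.keys(), values))
    score = train_and_evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```

Reporting the wall-clock time and number of evaluations of such a baseline alongside the NAS method is exactly the kind of compute-resource comparison the authors ask for.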

Perform Multiple Runs with Different Seeds

No further comment (see above).

Use Tabular or Surrogate Benchmarks If Possible

No comment.

Control Confounding Factors

Library version differences may cause major performance differences. Let me simply point to the CUDA 11.x performance gains even for “older” GPUs such as the RTX 2080 Ti. I would like to add that it is somewhat funny to read this blog post and compare it to the fact that, until quite recently, many CUDA-based libraries were only available for 10.x (e.g. parts of PyTorch, CuPy or RAPIDS - hardly anyone compiles these libraries themselves).
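One low-effort way to control (or at least report) this confounding factor is to log the exact software stack next to every result. A minimal sketch, assuming PyTorch is the framework in use:

```python
import json
import platform
import torch

def environment_report() -> dict:
    """Collect the library and CUDA versions that can confound benchmark results."""
    return {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda": torch.version.cuda,            # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

# Store this alongside the raw results of every run.
print(json.dumps(environment_report(), indent=2))
```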

Further, we have to ask ourselves whether the performance differences observed are actually significant.

Best Practices for Reporting Important Details

I’m not commenting on each best practice proposed in this section. Let me simply say that what they request is something I would consider the bare minimum requirement for scientific research published in a journal.

Things they haven’t covered…

… but came to my mind rather quickly when reading the paper.

Dataset sizes

Is a performance difference of let’s say 4-5% actually statistically significant given the dataset size and sample distribution (and similarity of samples)?
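As a rough, back-of-the-envelope illustration of my own (not from the paper): a simple two-proportion z-test already shows how the test-set size drives whether such a gap means anything. It treats the two evaluations as independent, which they are not when both models share the same test set, so treat it only as a sanity check:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(acc_a: float, acc_b: float, n: int) -> float:
    """Two-sided p-value for the difference between two accuracies on test sets of size n."""
    p_pooled = (acc_a + acc_b) / 2
    se = sqrt(2 * p_pooled * (1 - p_pooled) / n)
    z = (acc_a - acc_b) / se
    return 2 * (1 - norm.cdf(abs(z)))

# The same 4% gap on 100 test samples vs. on a 10,000-sample test set (CIFAR-10 sized).
print(two_proportion_z_test(0.92, 0.88, n=100))     # ~0.35: not significant
print(two_proportion_z_test(0.92, 0.88, n=10_000))  # far below 0.05: significant
```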

Dataset difficulty

To be fair, the authors mention this in section 5 (The Need for Proper NAS Benchmarks).

In general, most people use toy benchmarks. Harder benchmarks such as the Numenta Anomaly Benchmark (NAB) or vastly different datasets in the field of computer vision are hardly ever used. Personally, I have observed quite a lot of overfitting to CIFAR, ImageNet and COCO.

Practicality

This is a general criticism of many AutoML and NAS benchmarks. What run time do we allow to find an “optimal solution”? Does it require significantly more compute resources than simply implementing “best practices”? While the situation is a bit more challenging in the fields of NLP and computer vision, it is a lot different for applications with structured data such as tabular data. With tabular datasets of the sizes you’ll typically find in the UCI ML repo, I would consider something like 60 to 300 minutes a practical run-time limit. Within such a budget, hardly any brute-force or best-practice approach will fail to come up with an acceptable solution. Infinite compute time does not exist in practice.
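Enforcing such a run-time limit is straightforward in any search loop. A minimal sketch (the candidate generator and evaluation function are hypothetical stand-ins, in the spirit of the earlier sketches):

```python
import random
import time

def sample_configuration(rng: random.Random) -> dict:
    """Hypothetical stand-in for a candidate generator."""
    return {"learning_rate": rng.choice([1e-3, 1e-2]), "depth": rng.choice([8, 14, 20])}

def train_and_evaluate(config: dict) -> float:
    """Hypothetical stand-in for a full training run; returns a dummy score."""
    return -abs(config["learning_rate"] - 5e-3)

def run_search(budget_seconds: float):
    """Propose and evaluate candidates until the wall-clock budget is spent."""
    rng = random.Random(0)
    start = time.monotonic()
    best_config, best_score = None, float("-inf")
    while time.monotonic() - start < budget_seconds:
        config = sample_configuration(rng)
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

print(run_search(budget_seconds=1.0))  # in practice: e.g. 60 * 60 to 300 * 60 seconds
```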

In NLP and CV we’re probably facing the issue that hardly any research institution, whether industrial or academic, is likely to have the resources to really run extensive NAS (let’s exclude DeepMind, Google Brain, FAIR etc. here).

Moreover, there is the issue of finding a model that is fast enough for deployment (real-time requirements, edge device requirements, etc.).