What is the purpose of a theory? To explain why something works. But what good is a theory (i.e. VC theory) that is totally useless in practice? A good theory makes predictions.

Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near-Universal power law behavior. That is, for a given $N\times M$ weight matrix $\mathbf{W}$, we form the correlation matrix

$$\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$$

and compute its eigenvalues $\lambda$. We call the histogram of eigenvalues the Empirical Spectral Density (ESD), and it can nearly always be fit to a power law form:

$$\rho(\lambda)\sim\lambda^{-\alpha}$$

We call the Power Law Universal because 80-90% of the exponents $\alpha$ lie in the range $[2,4]$.

For fully connected (FC) layers, we just take $\mathbf{W}$ as is. For Conv2D layers with shape $N\times M\times k\times k$, we consider all $k\times k$ of the 2D feature maps, each of shape $N\times M$. For any large, modern, pretrained DNN, this can give a large number of eigenvalues.
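To make this concrete, here is a minimal sketch of the fit for a single layer. It assumes numpy and the open-source powerlaw package; the function names are illustrative, not the notebook's actual code.

```python
import numpy as np
import powerlaw  # pip install powerlaw

def esd(W):
    """Eigenvalues of the correlation matrix X = (1/N) W^T W."""
    if W.shape[0] < W.shape[1]:
        W = W.T  # orient W so that N >= M and X is M x M
    N = W.shape[0]
    return np.linalg.eigvalsh((W.T @ W) / N)

def conv2d_feature_maps(W):
    """Split an (N, M, k, k) Conv2D tensor into its k*k feature maps,
    each an N x M matrix. FC weight matrices are used as-is."""
    return [W[:, :, i, j]
            for i in range(W.shape[2])
            for j in range(W.shape[3])]

def fit_alpha(evals):
    """Fit the ESD to a power law rho(lambda) ~ lambda^(-alpha)."""
    fit = powerlaw.Fit(evals[evals > 0])
    return fit.power_law.alpha

# Illustrative only: a synthetic heavy-tailed matrix stands in
# for a trained layer weight matrix.
W = np.random.standard_t(df=3, size=(1000, 500))
print(fit_alpha(esd(W)))
```

Note that powerlaw.Fit chooses the cutoff xmin automatically, so only the tail of the ESD is fit to the power law.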
The results on Conv2D layers have not yet been published, except on my blog on Power Laws in Deep Learning, but they are very easy to reproduce with this notebook. As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4, although, compared to the FC layers, we do see more Conv2D exponents fall outside this range. We will discuss the details and these results in a future paper.

And while Universality is very interesting theoretically, a more practical question is:

Are power law exponents correlated with better generalization accuracies? … YES they are!

We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including:

- The sequence of (larger) Resnet models, including Resnet18, 34, 50, 101, & 152.
- 2 other ResNet implementations, CaffeResnet101 and FbResnet152.

To compare these model versions, we can simply compute the average power law exponent $\langle\alpha\rangle$, averaged across all FC weight matrices and Conv2D feature maps (a sketch of this computation appears at the end of this post). This is similar in spirit to computing the product norm, which has been used to test VC-like bounds for small NNs.

In nearly every case, smaller $\langle\alpha\rangle$ is correlated with better test accuracy. There are exceptions:

- The VGG models behave very differently, showing exactly the reverse trend!
- The smaller ResNet models (ResNet10, 18, …) also show the reverse trend.

Predicting the test accuracy is a complicated task, and, IMHO, simple theories with loose bounds are unlikely to be useful in practice. Here, however, we see that as Test Accuracy increases, the average power law exponent $\langle\alpha\rangle$ generally decreases.

The Inception models show similar behavior: InceptionV3 has a smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 $\langle\alpha\rangle$ is larger than the InceptionV4 one.

Now consider the Resnet models, which are increasing in size and have more architectural differences between them. Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect: the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger $\langle\alpha\rangle$ than FbResnet152, but they are very close.

Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller $\langle\alpha\rangle$ across a wide range of architectures. These results are easily reproduced with this notebook.

You can think of the power law exponent as a kind of information metric: the smaller $\alpha$ is, the more information is in the layer weight matrix. Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe that, by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better, without peeking at the test data.
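Here is a hedged sketch of that comparison, assuming torchvision's pretrained models and the same powerlaw package; avg_alpha and the small-matrix filter are my own illustrative choices, not the notebook's exact code.

```python
import numpy as np
import powerlaw  # pip install powerlaw
import torch
import torchvision.models as models

def layer_matrices(model, min_dim=50):
    """Yield FC weight matrices as-is, plus every k x k feature map
    (an out_channels x in_channels matrix) of each Conv2D layer.
    Very small matrices give noisy power law fits, so skip them."""
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            W = m.weight.detach().numpy()
            if min(W.shape) >= min_dim:
                yield W
        elif isinstance(m, torch.nn.Conv2d):
            W = m.weight.detach().numpy()  # shape (out, in, k, k)
            if min(W.shape[:2]) >= min_dim:
                for i in range(W.shape[2]):
                    for j in range(W.shape[3]):
                        yield W[:, :, i, j]

def alpha(W):
    """Power law exponent of the ESD of X = (1/N) W^T W."""
    if W.shape[0] < W.shape[1]:
        W = W.T
    evals = np.linalg.eigvalsh((W.T @ W) / W.shape[0])
    return powerlaw.Fit(evals[evals > 0]).power_law.alpha

def avg_alpha(model):
    return float(np.mean([alpha(W) for W in layer_matrices(model)]))

# Compare two versions of the same architecture, no test data needed.
# Newer torchvision releases use weights=... instead of pretrained=True.
for name in ("resnet18", "resnet152"):
    model = getattr(models, name)(pretrained=True).eval()
    print(name, avg_alpha(model))
```

If the trend above holds, the better-performing ResNet should report the smaller average exponent, and nothing in the comparison ever touches the test data.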