CNN Architectures Summary

This article summarizes the development of CNN architectures, a topic I am interested in.

  1. ImageNet: It is not a network but a visual database, developed by Fei-Fei Li, which boosted the development of computer vision. The ImageNet project runs a contest called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), whose name makes the content of the contest easy to guess.
  2. Below is a graph summarizing the famous models presented in this contest. We can see that AlexNet was the first to use a deep network, with 8 layers; VGG uses 19 layers; GoogLeNet uses 22 layers; and ResNet, the best, uses 152 layers!

3. LeNet-5(1998)

This is a pioneering 7-level convolutional network developed by LeCun et al. in 1998. LeNet-5 classifies digits and was applied by several banks to recognize hand-written numbers on checks, digitized into 32x32-pixel greyscale input images. The ability to process higher-resolution images requires more and larger convolutional layers, so this technique is constrained by the availability of computing resources. The architecture is shown below:
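As a rough illustration, the spatial sizes through LeNet-5 can be traced with the standard convolution output formula. This is only a sketch of the commonly cited configuration (6 and 16 feature maps, 5x5 kernels, 2x2 subsampling), not code from the paper:

```python
def conv_out(n, k, stride=1, pad=0):
    """Output side length of a square convolution/pooling layer."""
    return (n + 2 * pad - k) // stride + 1

# LeNet-5 on a 32x32 greyscale input:
n = 32
n = conv_out(n, 5)            # C1: 6 maps of 28x28
n = conv_out(n, 2, stride=2)  # S2: subsampling -> 14x14
n = conv_out(n, 5)            # C3: 16 maps of 10x10
n = conv_out(n, 2, stride=2)  # S4: subsampling -> 5x5
n = conv_out(n, 5)            # C5: 120 maps of 1x1
print(n)  # 1
```

Note how a 5x5 kernel on a 5x5 input collapses to 1x1, which is why C5 behaves like a fully-connected layer.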

4. AlexNet(2012)

It is similar to LeNet; the main differences are that it was deeper, with more filters per layer, and with stacked convolutional layers. It consisted of 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum. It attached ReLU activations after every convolutional and fully-connected layer.
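As a small sketch of these ingredients, the snippet below traces AlexNet's first 11x11, stride-4 convolution (assuming the commonly cited 227x227 input size) and the ReLU activation applied after each layer:

```python
def conv_out(n, k, stride=1, pad=0):
    """Output side length of a square convolution layer."""
    return (n + 2 * pad - k) // stride + 1

def relu(x):
    # ReLU: pass positive values through, clamp negatives to zero
    return max(0.0, x)

print(conv_out(227, 11, stride=4))  # 55: side of the first conv layer's output
print(relu(-3.2), relu(1.5))        # 0.0 1.5
```

The stride of 4 is what shrinks the 227x227 input to 55x55 in a single layer.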

It reduced the top-5 error rate from 25.8% to 16.4%.

AlexNet was trained for 6 days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two pipelines. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. See the picture below:

5. ZFNet(2013)

It was mostly an achievement of adjusting the hyper-parameters of AlexNet while maintaining the same structure, with additional deep-learning elements.

6. GoogLeNet/Inception(2014)

See: Going Deeper with Convolutions

The name Inception comes from the meme "we need to go deeper". The best thing in this paper is its use of Inception modules: this cascaded cross-channel parametric pooling structure (built from 1x1 convolutions) allows complex and learnable interactions of cross-channel information.
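A back-of-the-envelope calculation shows why the 1x1 convolutions also save parameters: they reduce the channel count before an expensive larger convolution. The channel counts below are illustrative, not taken from the paper:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a kxk convolution mapping c_in -> c_out channels (bias omitted)."""
    return k * k * c_in * c_out

c_in, c_out, reduced = 192, 32, 16  # illustrative channel counts

direct = conv_params(5, c_in, c_out)                          # 5x5 applied directly
bottleneck = conv_params(1, c_in, reduced) + conv_params(5, reduced, c_out)

print(direct, bottleneck)  # 153600 15872
```

Squeezing 192 channels down to 16 before the 5x5 convolution cuts the weights by roughly a factor of ten, which is how the whole network stays at a few million parameters.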

GoogLeNet is a special case of the Inception architecture. It looks like this:


The winner of the ILSVRC 2014 competition was GoogLeNet (a.k.a. Inception V1) from Google. It achieved an error rate of 6.67%! This was very close to human-level performance, which the organizers of the challenge were now forced to evaluate. As it turns out, this was actually rather hard to do and required some human training in order to beat GoogLeNet's accuracy. After a few days of training, the human expert (Andrej Karpathy) was able to achieve an error rate of 5.1% (single model) and 3.6% (ensemble). The network used a CNN inspired by LeNet but implemented a novel element called an inception module. It used batch normalization, image distortions, and RMSprop. The inception module is based on several very small convolutions in order to drastically reduce the number of parameters. The architecture is a 22-layer deep CNN, but it reduced the number of parameters from 60 million (AlexNet) to 4 million.

A rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being memory usage.

7. VGGNet(2014)

The runner-up at the ILSVRC 2014 competition is called VGGNet by the community and was developed by Simonyan and Zisserman. VGGNet consists of 16 convolutional layers and is very appealing because of its very uniform architecture. Similar to AlexNet, it uses only 3x3 convolutions, but with lots of filters. It was trained on 4 GPUs for 2-3 weeks. It is currently the most preferred choice in the community for extracting features from images. The weight configuration of VGGNet is publicly available and has been used in many other applications and challenges as a baseline feature extractor. However, VGGNet consists of 138 million parameters, which can be a bit challenging to handle.
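One reason the uniform 3x3 design works: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer weights. A quick sketch (the channel count C is illustrative, kept constant through the stack):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a kxk convolution mapping c_in -> c_out channels (bias omitted)."""
    return k * k * c_in * c_out

C = 256  # illustrative channel count

one_5x5 = conv_params(5, C, C)      # single 5x5 layer
two_3x3 = 2 * conv_params(3, C, C)  # two stacked 3x3 layers, same receptive field

print(one_5x5, two_3x3)  # 1638400 1179648
```

The stacked version is cheaper (18 vs. 25 weights per channel pair) and inserts an extra non-linearity between the two layers.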

8. ResNet(2015)

At last, at the ILSVRC 2015, the so-called Residual Neural Network (ResNet) by Kaiming He et al. introduced a novel architecture with "skip connections" and heavy use of batch normalization. Such skip connections resemble the gating mechanisms (e.g., gated recurrent units) that have been applied successfully in RNNs. Thanks to this technique, they were able to train a network with 152 layers while still having lower complexity than VGGNet. It achieved an error rate of 3.57%, which beats human-level performance on this dataset.
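The core idea of a skip connection is that the block learns a residual F(x) and outputs F(x) + x. A minimal numeric sketch, where the transform `f` is a stand-in for the block's convolutional layers:

```python
def residual_block(x, f):
    """Apply a transform f and add the input back (the skip connection)."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# If the residual branch outputs zeros, the block is an identity mapping --
# which is why very deep stacks of such blocks remain trainable.
zero_branch = lambda x: [0.0] * len(x)
print(residual_block([1.0, -2.0, 3.0], zero_branch))  # [1.0, -2.0, 3.0]
```

Because the identity mapping is the easy default, extra layers can only refine the signal rather than degrade it, which is what allowed depth to jump to 152 layers.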

See: Deep Residual Learning for Image Recognition

9. Summary

10. References


2. Functional API - Keras Chinese documentation

3. Why does the Inception module in GoogLeNet use 1*1 convolutions?

4. Going Deeper with Convolutions

5. Deep Residual Learning for Image Recognition








LeNet was born in 1994 and is one of the earliest convolutional neural networks; it propelled the development of deep learning. After many successful iterations starting in 1988, this pioneering work by Yann LeCun was named LeNet-5. The LeNet-5 architecture is based on the insight that image features in particular are distributed across the entire image, and that convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters. At that time there were no GPUs to help with training, and even CPUs were slow. Therefore, being able to economize on parameters and computation was a key advance. This stands in contrast to using each pixel as a separate input to a large multi-layer neural network. LeNet-5 showed that individual pixels should not be used in the first layer, because images are highly spatially correlated, and using isolated pixels of the image as separate input features would fail to exploit these correlations.


AlexNet is the name of a convolutional neural network, originally run with GPU support using CUDA. AlexNet was designed by Alex Krizhevsky and won the 2012 ImageNet competition. The network greatly reduced the error rate, reaching 15.3%, more than 10.8 percentage points ahead of the runner-up. AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever.