Benchmarking Deep Learning Frameworks on CPU with VGG-16 and MobileNet

This project compares the time each deep learning framework needs for a single iteration of the same models (VGG-16 and MobileNet) on CPU. The author provides all the test code, so readers can run the benchmarks themselves and help refine the results.

Project address: https://github.com/peisuke/DeepLearningSpeedComparison

In this project, the author benchmarks how long popular deep learning frameworks take to run the same models on CPU, using VGG-16 and MobileNet as the test models. All the test code ships with Dockerfiles, so the test environment is easy to set up. The parameters of both networks are randomly generated, because the benchmark only measures the time it takes for input data to pass through the network. The final results are not guaranteed to be exact, but the author hopes others will run the tests and help update them.

The benchmark covers the following deep learning frameworks:

  • Caffe
  • Caffe2
  • Chainer
  • MxNet
  • PyTorch
  • TensorFlow
  • NNabla

For each framework, the author prepared several installation configurations, e.g. with or without MKL, installed via pip or built from source.

Installation and Running

Since all the experiment code and environments are packaged as Dockerfiles, we only need to download a Dockerfile and run it:

  $ docker build -t {NAME} .
  $ docker run -it --rm {NAME}

Inside the created Docker container, clone the GitHub repository and run the test code:

  # git clone https://github.com/peisuke/dl_samples.git
  # cd dl_samples/{FRAMEWORK}/vgg16
  # python3 (or python) predict.py

Current Results

The current results still carry considerable uncertainty. Each entry is an estimate of the model's per-iteration time under a given framework: the first number is the sample mean of the single-iteration run time, and the second is its standard deviation.

From the test code, each reported run time is the mean over 20 iterations, and each iteration feeds a batch of size 1. Every image is a randomly generated sample of shape (224, 224, 3) whose elements are drawn from a normal distribution. Combined with the randomly generated weights, the test therefore measures only how long each framework takes to run the same model on CPU.

Below are the mean run times and standard deviations over 20 iterations (admittedly few); whether each configuration uses MKL or another CPU acceleration library is indicated in the results.
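The measurement procedure described above can be sketched in a framework-agnostic way; `run_model` here is a hypothetical stand-in for a single forward pass through whichever framework is being tested:

```python
import time
import numpy as np

def benchmark(run_model, nb_itr=20):
    """Time nb_itr single-image forward passes; report mean and std in seconds."""
    timings = []
    for _ in range(nb_itr):
        # batch of 1, shape (224, 224, 3), elements drawn from a normal distribution
        data = np.random.randn(1, 224, 224, 3).astype(np.float32)
        start = time.time()
        run_model(data)
        timings.append(time.time() - start)
    t = np.array(timings)
    return t.mean(), t.std()

# dummy "model" standing in for a real network: a fixed random projection
w = np.random.randn(224 * 224 * 3, 10).astype(np.float32)
mean, sd = benchmark(lambda x: x.reshape(1, -1) @ w)
print('%10s : %f (sd %f)' % ('dummy', mean, sd))
```

Note that only the forward pass is inside the timed region; data generation happens before `time.time()` is called, matching the author's scripts.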

  caffe(openblas, 1.0)
  caffe-vgg-16 : 13.900894 (sd 0.416803)
  caffe-mobilenet : 0.121934 (sd 0.007861)

  caffe(mkl, 1.0)
  caffe-vgg-16 : 3.005638 (sd 0.129965)
  caffe-mobilenet : 0.044592 (sd 0.010633)

  caffe2(1.0)
  caffe2-vgg-16 : 1.351302 (sd 0.053903)
  caffe2-mobilenet : 0.069122 (sd 0.003914)

  caffe2(mkl, 1.0)
  caffe2-vgg-16 : 0.526263 (sd 0.026561)
  caffe2-mobilenet : 0.041188 (sd 0.007531)

  mxnet(0.11)
  mxnet-vgg-16 : 0.896940 (sd 0.258074)
  mxnet-mobilenet : 0.209141 (sd 0.060472)

  mxnet(mkl)
  mxnet-vgg-16 : 0.176063 (sd 0.239229)
  mxnet-mobilenet : 0.022441 (sd 0.018798)

  pytorch
  pytorch-vgg-16 : 0.477001 (sd 0.011902)
  pytorch-mobilenet : 0.094431 (sd 0.008181)

  nnabla
  nnabla-vgg-16 : 1.472355 (sd 0.040928)
  nnabla-mobilenet : 3.984539 (sd 0.018452)

  tensorflow(pip, r1.3)
  tensorflow-vgg-16 : 0.275986 (sd 0.009202)
  tensorflow-mobilenet : 0.029405 (sd 0.004876)

  tensorflow(opt, r1.3)
  tensorflow-vgg-16 : 0.144360 (sd 0.009217)
  tensorflow-mobilenet : 0.022406 (sd 0.007655)

  tensorflow(opt, XLA, r1.3)
  tensorflow-vgg-16 : 0.151689 (sd 0.006856)
  tensorflow-mobilenet : 0.022838 (sd 0.007777)

  tensorflow(mkl, r1.0)
  tensorflow-vgg-16 : 0.163384 (sd 0.011794)
  tensorflow-mobilenet : 0.034751 (sd 0.011750)

  chainer(2.0)
  chainer-vgg-16 : 0.497946 (sd 0.024975)
  chainer-mobilenet : 0.120230 (sd 0.013276)

  chainer(2.1, numpy with mkl)
  chainer-vgg-16 : 0.329744 (sd 0.013079)
  chainer-mobilenet : 0.078193 (sd 0.017298)

The chart below shows each framework's average run time for VGG-16 on CPU; TensorFlow's single-iteration (batch size 1) time is among the fastest:

Below is the average single-iteration time for MobileNet:

The following compares the speed with and without CPU acceleration libraries such as MKL; the frameworks that use MKL show a clear reduction in average iteration time.
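The MKL speedups can be computed directly from the VGG-16 numbers reported above, for example:

```python
# per-iteration mean times in seconds, copied from the results above
vgg16 = {
    'caffe(openblas)': 13.900894, 'caffe(mkl)': 3.005638,
    'caffe2': 1.351302, 'caffe2(mkl)': 0.526263,
    'mxnet': 0.896940, 'mxnet(mkl)': 0.176063,
}

print('caffe  speedup: %.1fx' % (vgg16['caffe(openblas)'] / vgg16['caffe(mkl)']))  # 4.6x
print('caffe2 speedup: %.1fx' % (vgg16['caffe2'] / vgg16['caffe2(mkl)']))          # 2.6x
print('mxnet  speedup: %.1fx' % (vgg16['mxnet'] / vgg16['mxnet(mkl)']))            # 5.1x
```

Note that the mxnet(mkl) numbers have a standard deviation larger than the mean, so that ratio in particular should be read as a rough estimate.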

Below is the corresponding comparison for MobileNet; surprisingly, using the MKL CPU acceleration library actually increased TensorFlow's average single-iteration time.

The above are the author's results for running and testing each deep learning framework on CPU, including configurations with CPU acceleration libraries such as MKL. Below is the code the author used to benchmark VGG-16 and MobileNet in each framework.

Caffe2/VGG-16

  import numpy as np
  import tqdm
  import os
  import shutil
  import time
  import caffe2.python.predictor.predictor_exporter as pe
  from caffe2.python import core, model_helper, net_drawer, workspace, visualize, brew

  core.GlobalInit(['caffe2', '--caffe2_log_level=0'])

  def AddLeNetModel(model, data):
      conv1_1 = brew.conv(model, data, 'conv1_1', dim_in=3, dim_out=64, kernel=3, pad=1)
      conv1_1 = brew.relu(model, conv1_1, conv1_1)
      conv1_2 = brew.conv(model, conv1_1, 'conv1_2', dim_in=64, dim_out=64, kernel=3, pad=1)
      conv1_2 = brew.relu(model, conv1_2, conv1_2)
      pool1 = brew.max_pool(model, conv1_2, 'pool1', kernel=2, stride=2)
      conv2_1 = brew.conv(model, pool1, 'conv2_1', dim_in=64, dim_out=128, kernel=3, pad=1)
      conv2_1 = brew.relu(model, conv2_1, conv2_1)
      conv2_2 = brew.conv(model, conv2_1, 'conv2_2', dim_in=128, dim_out=128, kernel=3, pad=1)
      conv2_2 = brew.relu(model, conv2_2, conv2_2)
      pool2 = brew.max_pool(model, conv2_2, 'pool2', kernel=2, stride=2)
      conv3_1 = brew.conv(model, pool2, 'conv3_1', dim_in=128, dim_out=256, kernel=3, pad=1)
      conv3_1 = brew.relu(model, conv3_1, conv3_1)
      conv3_2 = brew.conv(model, conv3_1, 'conv3_2', dim_in=256, dim_out=256, kernel=3, pad=1)
      conv3_2 = brew.relu(model, conv3_2, conv3_2)
      conv3_3 = brew.conv(model, conv3_2, 'conv3_3', dim_in=256, dim_out=256, kernel=3, pad=1)
      conv3_3 = brew.relu(model, conv3_3, conv3_3)
      pool3 = brew.max_pool(model, conv3_3, 'pool3', kernel=2, stride=2)
      conv4_1 = brew.conv(model, pool3, 'conv4_1', dim_in=256, dim_out=512, kernel=3, pad=1)
      conv4_1 = brew.relu(model, conv4_1, conv4_1)
      conv4_2 = brew.conv(model, conv4_1, 'conv4_2', dim_in=512, dim_out=512, kernel=3, pad=1)
      conv4_2 = brew.relu(model, conv4_2, conv4_2)
      conv4_3 = brew.conv(model, conv4_2, 'conv4_3', dim_in=512, dim_out=512, kernel=3, pad=1)
      conv4_3 = brew.relu(model, conv4_3, conv4_3)
      pool4 = brew.max_pool(model, conv4_3, 'pool4', kernel=2, stride=2)
      conv5_1 = brew.conv(model, pool4, 'conv5_1', dim_in=512, dim_out=512, kernel=3, pad=1)
      conv5_1 = brew.relu(model, conv5_1, conv5_1)
      conv5_2 = brew.conv(model, conv5_1, 'conv5_2', dim_in=512, dim_out=512, kernel=3, pad=1)
      conv5_2 = brew.relu(model, conv5_2, conv5_2)
      conv5_3 = brew.conv(model, conv5_2, 'conv5_3', dim_in=512, dim_out=512, kernel=3, pad=1)
      conv5_3 = brew.relu(model, conv5_3, conv5_3)
      pool5 = brew.max_pool(model, conv5_3, 'pool5', kernel=2, stride=2)
      fc6 = brew.fc(model, pool5, 'fc6', dim_in=25088, dim_out=4096)
      fc6 = brew.relu(model, fc6, fc6)
      fc7 = brew.fc(model, fc6, 'fc7', dim_in=4096, dim_out=4096)
      fc7 = brew.relu(model, fc7, fc7)
      pred = brew.fc(model, fc7, 'pred', 4096, 1000)
      softmax = brew.softmax(model, pred, 'softmax')
      return softmax

  model = model_helper.ModelHelper(name="vgg", init_params=True)
  softmax = AddLeNetModel(model, "data")
  workspace.RunNetOnce(model.param_init_net)
  data = np.zeros([1, 3, 224, 224], np.float32)
  workspace.FeedBlob("data", data)
  workspace.CreateNet(model.net)

  nb_itr = 20
  timings = []
  for i in tqdm.tqdm(range(nb_itr)):
      data = np.random.randn(1, 3, 224, 224).astype(np.float32)
      start_time = time.time()
      workspace.FeedBlob("data", data)
      workspace.RunNet(model.net.Proto().name)
      ref_out = workspace.FetchBlob("softmax")
      timings.append(time.time() - start_time)
  print('%10s : %f (sd %f)' % ('caffe2-vgg-16', np.array(timings).mean(), np.array(timings).std()))

MXNet/MobileNet

  import numpy as np
  import os
  import gzip
  import struct
  import time
  import tqdm
  from collections import namedtuple
  import mxnet as mx

  def conv_bn(inputs, oup, stride, name):
      conv = mx.symbol.Convolution(name=name, data=inputs, num_filter=oup, pad=(1, 1), kernel=(3, 3), stride=(stride, stride), no_bias=True)
      conv_bn = mx.symbol.BatchNorm(name=name+'_bn', data=conv, fix_gamma=False, eps=0.000100)
      out = mx.symbol.Activation(name=name+'relu', data=conv_bn, act_type='relu')
      return out

  def conv_dw(inputs, inp, oup, stride, name):
      conv_dw = mx.symbol.Convolution(name=name+'_dw', data=inputs, num_filter=inp, pad=(1, 1), kernel=(3, 3), stride=(stride, stride), no_bias=True, num_group=inp)
      conv_dw_bn = mx.symbol.BatchNorm(name=name+'dw_bn', data=conv_dw, fix_gamma=False, eps=0.000100)
      out1 = mx.symbol.Activation(name=name+'_dw', data=conv_dw_bn, act_type='relu')
      conv_sep = mx.symbol.Convolution(name=name+'_sep', data=out1, num_filter=oup, pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
      conv_sep_bn = mx.symbol.BatchNorm(name=name+'_sep_bn', data=conv_sep, fix_gamma=False, eps=0.000100)
      out2 = mx.symbol.Activation(name=name+'_sep', data=conv_sep_bn, act_type='relu')
      return out2

  def create_network():
      data = mx.sym.Variable('data')
      net = conv_bn(data, 32, stride=2, name='conv_bn')
      net = conv_dw(net, 32, 64, stride=1, name='conv_ds_2')
      net = conv_dw(net, 64, 128, stride=2, name='conv_ds_3')
      net = conv_dw(net, 128, 128, stride=1, name='conv_ds_4')
      net = conv_dw(net, 128, 256, stride=2, name='conv_ds_5')
      net = conv_dw(net, 256, 256, stride=1, name='conv_ds_6')
      net = conv_dw(net, 256, 512, stride=2, name='conv_ds_7')
      net = conv_dw(net, 512, 512, stride=1, name='conv_ds_8')
      net = conv_dw(net, 512, 512, stride=1, name='conv_ds_9')
      net = conv_dw(net, 512, 512, stride=1, name='conv_ds_10')
      net = conv_dw(net, 512, 512, stride=1, name='conv_ds_11')
      net = conv_dw(net, 512, 512, stride=1, name='conv_ds_12')
      net = conv_dw(net, 512, 1024, stride=2, name='conv_ds_13')
      net = conv_dw(net, 1024, 1024, stride=1, name='conv_ds_14')
      net = mx.symbol.Pooling(data=net, global_pool=True, kernel=(7, 7), pool_type='avg', name='pool1')
      return mx.sym.softmax(net)

  mlp = create_network()
  mod = mx.mod.Module(symbol=mlp, context=mx.cpu(), label_names=None)
  mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
  mod.init_params(initializer=mx.init.Xavier(magnitude=2.))

  Batch = namedtuple('Batch', ['data'])
  nb_itr = 20
  timings = []
  for i in tqdm.tqdm(range(nb_itr)):
      data = np.random.randn(1, 3, 224, 224).astype(np.float32)
      start_time = time.time()
      batch = Batch([mx.nd.array(data)])
      mod.forward(batch)
      prob = mod.get_outputs()[0].asnumpy()
      timings.append(time.time() - start_time)
  print('%10s : %f (sd %f)' % ('mxnet-mobilenet', np.array(timings).mean(), np.array(timings).std()))

PyTorch/MobileNet

  import numpy as np
  import tqdm
  import time
  import torch
  import torch.nn as nn
  import torch.nn.functional as F
  import torch.optim as optim
  from torchvision import datasets, transforms
  from torch.autograd import Variable

  def conv_bn(inp, oup, stride):
      return nn.Sequential(
          nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
          nn.BatchNorm2d(oup),
          nn.ReLU(inplace=True)
      )

  def conv_dw(inp, oup, stride):
      return nn.Sequential(
          nn.Conv2d(inp, inp, 3, stride, 1, groups=inp, bias=False),
          nn.BatchNorm2d(inp),
          nn.ReLU(inplace=True),
          nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
          nn.BatchNorm2d(oup),
          nn.ReLU(inplace=True),
      )

  class MobileNet(nn.Module):
      def __init__(self):
          super(MobileNet, self).__init__()
          self.model = nn.Sequential(
              conv_bn(  3,  32, 2),
              conv_dw( 32,  64, 1),
              conv_dw( 64, 128, 2),
              conv_dw(128, 128, 1),
              conv_dw(128, 256, 2),
              conv_dw(256, 256, 1),
              conv_dw(256, 512, 2),
              conv_dw(512, 512, 1),
              conv_dw(512, 512, 1),
              conv_dw(512, 512, 1),
              conv_dw(512, 512, 1),
              conv_dw(512, 512, 1),
              conv_dw(512, 1024, 2),
              conv_dw(1024, 1024, 1),
              nn.AvgPool2d(7),
          )
          self.fc = nn.Linear(1024, 1000)

      def forward(self, x):
          x = self.model(x)
          x = x.view(-1, 1024)
          x = self.fc(x)
          return F.softmax(x)

  model = MobileNet()
  model.eval()

  nb_itr = 20
  timings = []
  for i in tqdm.tqdm(range(nb_itr)):
      data = np.random.randn(1, 3, 224, 224).astype(np.float32)
      data = torch.from_numpy(data)
      start_time = time.time()
      data = Variable(data)
      output = model(data)
      timings.append(time.time() - start_time)
  print('%10s : %f (sd %f)' % ('pytorch-mobilenet', np.array(timings).mean(), np.array(timings).std()))

Structure of the MobileNet model:

Two helper functions are defined first:

conv_bn: convolution, batch normalization, ReLU;

conv_dw: depthwise convolution, batch normalization, ReLU, pointwise convolution, batch normalization, ReLU;

The input then passes through 1 conv_bn block, 13 conv_dw blocks and 1 average pooling layer, and the output is produced with a softmax function.
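As a sanity check on this structure, the stride-2 layers (one in conv_bn, plus the conv_dw strides 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1) reduce the 224×224 input to exactly the 7×7 feature map consumed by the final average pooling:

```python
# stride of conv_bn followed by the strides of the 13 conv_dw blocks
strides = [2] + [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]
size = 224
for s in strides:
    size //= s  # with 'same' padding, a stride-2 conv halves the spatial size
print(size)  # 7, matching AvgPool2d(7)
```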

TensorFlow/VGG-16

  # -*- coding: utf-8 -*-
  import tensorflow as tf
  import numpy as np
  import tqdm
  import time

  def vgg(x):
      conv1_1 = tf.layers.conv2d(x, 64, 3, padding='same', activation=tf.nn.relu)
      conv1_2 = tf.layers.conv2d(conv1_1, 64, 3, padding='same', activation=tf.nn.relu)
      pool1 = tf.layers.max_pooling2d(conv1_2, 2, 2)
      conv2_1 = tf.layers.conv2d(pool1, 128, 3, padding='same', activation=tf.nn.relu)
      conv2_2 = tf.layers.conv2d(conv2_1, 128, 3, padding='same', activation=tf.nn.relu)
      pool2 = tf.layers.max_pooling2d(conv2_2, 2, 2)
      conv3_1 = tf.layers.conv2d(pool2, 256, 3, padding='same', activation=tf.nn.relu)
      conv3_2 = tf.layers.conv2d(conv3_1, 256, 3, padding='same', activation=tf.nn.relu)
      conv3_3 = tf.layers.conv2d(conv3_2, 256, 3, padding='same', activation=tf.nn.relu)
      pool3 = tf.layers.max_pooling2d(conv3_3, 2, 2)
      conv4_1 = tf.layers.conv2d(pool3, 512, 3, padding='same', activation=tf.nn.relu)
      conv4_2 = tf.layers.conv2d(conv4_1, 512, 3, padding='same', activation=tf.nn.relu)
      conv4_3 = tf.layers.conv2d(conv4_2, 512, 3, padding='same', activation=tf.nn.relu)
      pool4 = tf.layers.max_pooling2d(conv4_3, 2, 2)
      conv5_1 = tf.layers.conv2d(pool4, 512, 3, padding='same', activation=tf.nn.relu)
      conv5_2 = tf.layers.conv2d(conv5_1, 512, 3, padding='same', activation=tf.nn.relu)
      conv5_3 = tf.layers.conv2d(conv5_2, 512, 3, padding='same', activation=tf.nn.relu)
      pool5 = tf.layers.max_pooling2d(conv5_3, 2, 2)
      flat5 = tf.contrib.layers.flatten(pool5)
      d1 = tf.layers.dense(flat5, 4096)
      d2 = tf.layers.dense(d1, 4096)
      out = tf.layers.dense(d2, 1000)
      return tf.nn.softmax(out)

  # tf Graph input
  X = tf.placeholder("float", [None, 224, 224, 3])
  Y = vgg(X)

  init = tf.initialize_all_variables()
  config = tf.ConfigProto()
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
  sess = tf.Session(config=config)
  sess.run(init)

  nb_itr = 20
  timings = []
  for i in tqdm.tqdm(range(nb_itr)):
      batch_xs = np.random.randn(1, 224, 224, 3).astype(np.float32)
      start_time = time.time()
      ret = sess.run(Y, feed_dict={X: batch_xs})
      timings.append(time.time() - start_time)
  print('%10s : %f (sd %f)' % ('tensorflow-vgg-16', np.array(timings).mean(), np.array(timings).std()))

Structure of the VGG-16 model:

2 convolutional layers, 1 pooling layer;

2 convolutional layers, 1 pooling layer;

3 convolutional layers, 1 pooling layer;

3 convolutional layers, 1 pooling layer;

3 convolutional layers, 1 pooling layer;

1 flatten layer;

then a 3-layer fully connected network;

finally a softmax output.

All activation functions are ReLU.
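Since each of the five stages halves the spatial size, a 224×224 input reaches the flatten layer as a 7×7×512 volume; this is where the 25088-dimensional input of the first fully connected layer (`dim_in=25088` in the Caffe2 code) comes from:

```python
size, channels = 224, 3
# (number of 'same'-padded 3x3 convs, output channels) per stage
for n_convs, out_ch in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
    channels = out_ch  # convs change channels but keep the spatial size
    size //= 2         # each 2x2/stride-2 max pool halves it
print(size, channels, size * size * channels)  # 7 512 25088
```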

Compared with TensorFlow, PyTorch does not require defining a computation graph in advance and feels much closer to ordinary Python: the function definitions and model computation are noticeably more concise, and the code layout is clearer.
