
Deep Learning with Custom GoogleNet and ResNet in Keras and Xilinx Vitis AI

Current status

  1. Tested with TensorFlow 1.15 within Vitis AI 1.4 on an Ubuntu 18.04.5 desktop PC.

  2. Tested in hardware on the ZCU102 and ZCU104 boards, with the xilinx-zcu102-dpu-v2021.1-v1.4.0.img.gz and xilinx-zcu104-dpu-v2021.1-v1.4.0.img.gz SD card images, respectively.

  3. Tested in hardware on the VCK190 ES1 board with the xilinx-vck190-dpu-v2020.2-v1.4.0.img.gz SD card image.

Date: 4 October 2021

1 Introduction

In this Deep Learning (DL) tutorial, you will quantize some custom Convolutional Neural Networks (CNNs) to fixed point and deploy them on the Xilinx® ZCU102, ZCU104 and VCK190 boards using Vitis AI, which is a set of optimized IP, tools, libraries, models and example designs valid for AI inference on both Xilinx edge devices and Alveo cards (see the Vitis AI Product Page for more information).

This tutorial deals with:

  • four custom CNNs, from the simplest LeNet and miniVggNet to the intermediate miniGoogleNet and the more complex miniResNet, as described in the custom_cnn.py file;
  • two different datasets, Fashion-MNIST and CIFAR-10, each one with 10 classes of objects.

Once the selected CNN has been correctly trained in Keras, the HDF5 file of weights is converted into a TF checkpoint and inference graph file. This floating-point frozen graph is then quantized by the Vitis AI Quantizer, which creates an 8-bit (INT8) fixed-point graph, from which the Vitis AI Compiler generates the xmodel file of micro-instructions for the Deep Learning Processor Unit (DPU) of the Vitis AI platform. The final C++ application is executed at run time on the ZCU102 target board, which is the default one adopted in this tutorial (the whole flow also works transparently for the ZCU104 and VCK190 boards). The top-1 accuracy of the predictions computed at run time is measured and compared with the simulation results.

2 Prerequisites

1. "Vitis AI Overview" in Chapter 1 with DPU naming and guidelines to download the tools container available from [docker hub](https://hub.docker.com/r/xilinx/vitis-ai/tags) and the Runtime Package for edge (MPSoC) devices.
2. "Installation and Setup" instructions of Chapter 2 for both host and target;
3. "Quantizing the Model" in Chapter 3 and "Compiling the Model" in Chapter 4.
4. "Programming with VART" APIs in Chapter 5.
5. "Setting Up the Target" board as described in [Vitis-AI/demo/VART](https://github.com/Xilinx/Vitis-AI/blob/master/demo/VART/README.md).  
  • A Vitis AI target board such as either:

  • Familiarity with Deep Learning principles.

Dos-to-Unix Conversion

If you get strange errors during the execution of the scripts, pre-process (just once) all the *.sh shell scripts and the *.py Python scripts with the dos2unix utility. In that case, run the following commands from your Ubuntu host PC (outside the Vitis AI docker images):

sudo apt-get install dos2unix
cd <WRK_DIR> # your working directory
for file in $(find . -name "*.sh" -o -name "*.py"); do
  dos2unix ${file}
done

Working Directory

In the rest of this document, it is assumed that you have installed Vitis AI 1.4 somewhere in your file system and that this is your working directory <WRK_DIR>; for example, in my case <WRK_DIR> is set to ~/ML/VAI1.4. It is also assumed that you have created a folder named tutorials under <WRK_DIR>, copied this tutorial there, and renamed it VAI-KERAS-CUSTOM-GOOGLENET-RESNET:

VAI1.4   # your WRK_DIR
.
├── code_vaiq
│   └── tools
├── data
├── demo
│   ├── VART
│   ├── Vitis-AI-Library
│   └── Whole-App-Acceleration
├── docs
├── dsa
├── examples
├── external
├── models
│   └── AI-Model-Zoo
├── setup
├── tools
│   ├── AKS
│   ├── Vitis-AI-Library
│   ├── Vitis-AI-Profiler
│   ├── Vitis-AI-Quantizer
│   └── Vitis-AI-Runtime
└── tutorials # created by you
    └── VAI-KERAS-CUSTOM-GOOGLENET-RESNET # this repo
        ├── files
        ...

3 Before starting with Vitis AI 1.4

You have to know a few things about Docker in order to run Vitis AI smoothly in your host environment.

3.1 Installing Docker Client/Server

To install docker client/server for Ubuntu, execute the following commands:

sudo apt-get remove docker docker-engine docker.io containerd runc
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo docker run hello-world
docker version

Once done, in my case, I could see the following:

Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:18:15 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:16:12 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

3.2 Build the Docker GPU Image

Download Vitis AI 1.4 and execute the docker_build_gpu.sh script.

Once that is done, to list the currently available docker images, run:

docker images # to list the current docker images available in the host pc

and you should see something like in the following text:

REPOSITORY            TAG                               IMAGE ID       CREATED         SIZE
xilinx/vitis-ai-gpu   latest                            7623d3de1f4d   6 hours ago     27.9GB

Note that docker does not have an automatic garbage collection system as of now. You can use this command to do a manual garbage collection:

docker rmi -f $(docker images -f "dangling=true" -q)

3.3 Launch the Docker Image

To launch the docker container with Vitis AI tools, execute the following commands from the <WRK_DIR> folder:

cd <WRK_DIR> # you are now in Vitis_AI subfolder
./docker_run.sh xilinx/vitis-ai-gpu:latest
conda activate vitis-ai-tensorflow
cd /workspace/tutorials/
cd VAI-KERAS-CUSTOM-GOOGLENET-RESNET/files #your working directory

Note that the container maps the folder of the host PC from which you launch the above command (<WRK_DIR> in your case) to the shared folder /workspace. This shared folder enables you to transfer files from the host PC to the docker container and vice versa.

The docker container does not have any graphic editor, so it is recommended that you work with two terminals pointing to the same folder: in one terminal you run the docker container commands, and in the other you open any graphic editor you like.

4 The Main Flow

The main flow is composed of seven major steps. The first five steps are executed from the tools container on the host PC by launching the script run_all.sh, which contains several functions. The sixth and seventh steps can be executed directly on the target board. Here is an overview of each step.

  1. Organize the data into folders, such as train for training, val for validation during the training phase, test for testing during the inference/prediction phase, and cal for calibration during the quantization phase, for each dataset. See Organize the Data for more information.

  2. Train the CNNs in Keras and generate the HDF5 weights model. See Train the CNN for more information.

  3. Convert into TF checkpoints and inference graphs. See Create TF Inference Graphs from Keras Models for more information.

  4. Freeze the TF graphs to evaluate the CNN prediction accuracy as the reference starting point. See Freeze the TF Graphs for more information.

  5. Quantize from 32-bit floating point to 8-bit fixed point and evaluate the prediction accuracy of the quantized CNN. See Quantize the Frozen Graphs for more information.

  6. Run the compiler to generate the xmodel file for the target board from the quantized pb file. See Compile the Quantized Models for more information.

  7. Use either the VART C++ or Python APIs to write the hybrid application for the ARM CPU, then compile it. The application is called "hybrid" because the ARM CPU executes some software routines while the DPU hardware accelerator runs the FC, CONV, ReLU, and BN layers of the CNN that were coded in the xmodel file. Assuming you have archived the target_zcu102 folder and transferred the related target_zcu102.tar archive from the host to the target board with the scp utility, you can now run the hybrid application. See Build and Run on the ZCU102 Target Board for more information.

All explanations in the following sections are based only on the CIFAR-10 dataset; the commands for the Fashion-MNIST dataset are very similar: just replace the sub-string "cifar10" with "fmnist".

Step 2, training, is the longest process and requires GPU support. In order to save storage space in this repository, and at the same time allow you to skip the training process itself, you can follow the flow by launching the script run_miniVggNet.sh (instead of run_all.sh), which works on the available miniVggNet floating-point model (trained only with the CIFAR-10 dataset).

4.1 Organize the Data

As Deep Learning deals with image data, you have to organize your data in appropriate folders and apply some pre-processing to adapt the images to the hardware features of the Vitis AI platform. The first lines of the script run_all.sh call other Python scripts that create the sub-folders train, val, test, and cal inside the dataset/fashion-mnist and dataset/cifar10 directories, and fill them with 50000 images for training, 5000 images for validation, 5000 images for testing (taken from the 10000 images of the original test dataset), and 1000 images for the calibration process (copied from the training images), as illustrated by the sketch below.

All the images are 32x32x3 in size, so that the same CNNs can be applied to both datasets.
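For reference, here is a minimal Python sketch of that split for the CIFAR-10 case; it is only an illustration of the idea, since the actual scripts invoked by run_all.sh may name and organize the files differently.

import os
import cv2
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()  # RGB, uint8, 32x32x3

def dump(images, labels, folder):
    os.makedirs(folder, exist_ok=True)
    for i, (img, lab) in enumerate(zip(images, labels)):
        # cifar10.load_data() returns RGB, while cv2.imwrite() expects BGR: swap the channels
        cv2.imwrite(os.path.join(folder, "%05d_class%d.png" % (i, int(lab))), img[:, :, ::-1])

dump(x_train,        y_train,        "dataset/cifar10/train")  # 50000 training images
dump(x_test[:5000],  y_test[:5000],  "dataset/cifar10/val")    # 5000 validation images
dump(x_test[5000:],  y_test[5000:],  "dataset/cifar10/test")   # 5000 test images
dump(x_train[:1000], y_train[:1000], "dataset/cifar10/cal")    # 1000 calibration images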

4.1.1 Fashion MNIST

The MNIST dataset is considered the "hello world" of DL because it is widely used as a first test to check the deployment flow of a vendor's DL solution. This small dataset makes training any CNN relatively quick. However, due to the poor content of its images, even the most shallow CNN can easily achieve 98% to 99% top-1 accuracy in image classification.

To solve this problem, the Fashion-MNIST dataset has been recently created for the paper Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. It is identical to the MNIST dataset in terms of training set size, testing set size, number of class labels, and image dimensions, but it is more challenging in terms of achieving high top-1 accuracy values.

Usually, the size of the images is 28x28x1 (gray-level), but in this case they have been converted to 32x32x3 ("false" RGB images) to be compatible with the "true" RGB format of CIFAR-10.
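As a purely illustrative sketch, one possible way to perform this conversion in Python is shown below; the actual conversion done by the scripts in this repository may differ in its details.

import cv2
import numpy as np
from keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()  # (N, 28, 28) uint8

def to_false_rgb(batch):
    out = np.zeros((batch.shape[0], 32, 32, 3), dtype=np.uint8)
    for i, img in enumerate(batch):
        img = cv2.resize(img, (32, 32), interpolation=cv2.INTER_LINEAR)  # 28x28 -> 32x32
        out[i] = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)                   # replicate the single channel 3 times
    return out

x_train = to_false_rgb(x_train)
x_test  = to_false_rgb(x_test)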

4.1.2 CIFAR-10

The CIFAR-10 dataset is composed of 10 classes of objects to be classified. It contains 60000 labeled RGB images that are 32x32 in size and thus, this dataset is more challenging than the MNIST and Fashion-MNIST datasets. The CIFAR-10 dataset was developed for the paper Learning Multiple Layers of Features from Tiny Images.

4.2 Train the CNN

Irrespective of the CNN type, the data is pre-processed with the following Python code, which first scales it to the range [0, 1] and then normalizes it to [-1, 1]. This code has to be mirrored in the C++ application that runs on the ARM® CPU of the target board.

# scale data to the range of [0, 1]
x_train = x_train.astype("float32") / cfg.NORM_FACTOR
x_test  = x_test.astype("float32") / cfg.NORM_FACTOR

# normalize to the range of [-1, 1]
x_train = x_train - 0.5
x_train = x_train * 2
x_test  = x_test  - 0.5
x_test  = x_test  * 2

4.2.1 LeNet

The model scheme of LeNet has 6,409,510 parameters as shown in the following figure:

figure

For more details about this custom CNN and its training procedure, read the "Starter Bundle" of the Deep Learning for Computer Vision with Python books by Dr. Adrian Rosebrock.

4.2.2 miniVggNet

miniVggNet is a less deep version of the original VGG16 CNN customized for the smaller Fashion-MNIST dataset instead of the larger ImageNet-based ILSVRC. For more information on this custom CNN and its training procedure, read Adrian Rosebrock's post from the PyImageSearch Keras Tutorials. miniVggNet is also explained in the "Practitioner Bundle" of the Deep Learning for CV with Python books.

The model scheme of miniVggNet has 2,170,986 parameters as shown in the following figure:

figure

4.2.3 miniGoogleNet

miniGoogleNet is a customization of the original GoogleNet CNN. It is suitable for the smaller Fashion-MNIST dataset, instead of the larger ImageNet-based ILSVRC.

For more information on miniGoogleNet, read the "Practitioner Bundle" of the Deep Learning for CV with Python books by Dr. Adrian Rosebrock.

The model scheme of miniGoogleNet has 1,656,250 parameters, as shown in the following figure:

figure

4.2.4 miniResNet

miniResNet is a customization of the original ResNet-50 CNN. It is suitable for the smaller Fashion-MNIST dataset, instead of the larger ImageNet-based ILSVRC.

For more information on miniResNet, read the "Practitioner Bundle" of the Deep Learning for CV with Python books.

The model scheme of miniResNet has 886,102 parameters, as shown in the following figure:

figure

4.3 Create TF Inference Graphs from Keras Models

The function 2_cifar10_Keras2TF() gets the computation graph of the TF backend representing the Keras model, which includes the forward pass and training-related operations.

The output files of this process, infer_graph.pb and float_model.chkpt.*, will be stored in the folder tf_chkpts. For example, in the case of miniVggNet, the TF input and output names that will be needed for Freeze the TF Graphs are named conv2d_1_input and activation_6/Softmax respectively.
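Just as an illustration, a minimal sketch of this conversion with the TF 1.x Keras backend could look like the following code; the file names and paths are placeholders, and the actual 2_cifar10_Keras2TF() routine in this repository may differ.

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

K.set_learning_phase(0)                                   # inference mode
model = load_model("keras_model/cifar10/miniVggNet.h5")   # HDF5 weights from the training step (placeholder path)

print("TF input node :", model.inputs)                    # e.g. conv2d_1_input
print("TF output node:", model.outputs)                   # e.g. activation_6/Softmax

sess = K.get_session()
saver = tf.compat.v1.train.Saver()
saver.save(sess, "tf_chkpts/float_model.chkpt")           # TF checkpoint
tf.io.write_graph(sess.graph_def, "tf_chkpts", "infer_graph.pb", as_text=False)  # inference graph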

4.4 Freeze the TF Graphs

The inference graph created in Create TF Inference Graphs from Keras Models is first converted to a GraphDef protocol buffer, then cleaned so that the subgraphs that are not necessary to compute the requested outputs, such as the training operations, can be removed. This process is called "freezing the graph".

The routines 3a_cifar10_freeze() and 3b_cifar10_evaluate_frozen_graph() generate the frozen graph and use it to evaluate the accuracy of the CNN by making predictions on the images in the test folder.

It is important to apply the correct input node and output node names in all the shell scripts, as shown in the following example of parameters for the miniVggNet case study:

--input_node  conv2d_1_input --output_node activation_6/Softmax

This information can be captured with the following Python code:

# Check the input and output name
print ("\n TF input node name:")
print(model.inputs)
print ("\n TF output node name:")
print(model.outputs)
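For illustration only, the freezing step can be sketched with the freeze_graph tool shipped with TF 1.x, as shown below; the paths are placeholders and the node names are those of the miniVggNet example above, so the actual 3a_cifar10_freeze() routine may differ.

from tensorflow.python.tools import freeze_graph

freeze_graph.freeze_graph(
    input_graph="tf_chkpts/infer_graph.pb",         # inference graph from the previous step
    input_saver="",
    input_binary=True,                              # True if the graph was written as a binary protobuf
    input_checkpoint="tf_chkpts/float_model.chkpt",
    output_node_names="activation_6/Softmax",       # miniVggNet output node
    restore_op_name="save/restore_all",
    filename_tensor_name="save/Const:0",
    output_graph="freeze/frozen_graph.pb",          # floating-point frozen graph
    clear_devices=True,
    initializer_nodes="")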

4.5 Quantize the Frozen Graphs

The routines 4a_cifar10_quant() and 4b_cifar10_evaluate_quantized_graph() generate the quantized graph and use it to evaluate the accuracy of the CNN by making predictions on the images in the test folder.
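The Vitis AI TensorFlow quantizer (vai_q_tensorflow) performs the calibration by calling a user-supplied Python input function that feeds batches of images taken from the cal folder. The following is only a minimal sketch of such a function, with a hypothetical file name and paths and the miniVggNet input node name; the actual input function used by 4a_cifar10_quant() may differ.

# graph_input_fn.py (hypothetical name): calibration input function for vai_q_tensorflow
import os
import cv2
import numpy as np

CALIB_DIR  = "dataset/cifar10/cal"   # assumed location of the 1000 calibration images
BATCH_SIZE = 32
IMAGES     = sorted(os.listdir(CALIB_DIR))

def calib_input(iter):
    batch = []
    for i in range(BATCH_SIZE):
        name = IMAGES[(iter * BATCH_SIZE + i) % len(IMAGES)]
        img = cv2.imread(os.path.join(CALIB_DIR, name))  # BGR, uint8, 32x32x3
        img = img.astype(np.float32) / 255.0             # scale to [0, 1]
        img = (img - 0.5) * 2.0                          # normalize to [-1, 1], as in training
        batch.append(img)
    # the dictionary key must match the graph's input node name
    return {"conv2d_1_input": np.stack(batch)}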

4.6 Compile the Quantized Models

The 5_cifar10_vai_compile_zcu102() routine generates the xmodel file for the embedded system composed of the ARM CPU and the DPU accelerator on the ZCU102 board.

This file has to be loaded at run time by the C++ (or Python) application running directly in the target board OS environment. For example, in the case of LeNet for Fashion-MNIST, the xmodel file is named LeNet.xmodel. A similar naming scheme applies to the other CNNs.

Note that the Vitis AI Compiler reports the names of the input and output nodes of the CNN that will be effectively implemented as a kernel in the DPU; any layer that falls outside those nodes has to be executed on the ARM CPU as a software kernel. For example, in the case of the LeNet CNN:

Input Node(s)             (H*W*C)
conv2d_2_convolution(0) : 32*32*3

Output Node(s)      (H*W*C)
dense_2_MatMul(0) : 1*1*10
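If you want to double-check which part of the CNN actually runs on the DPU, one possible way, sketched below with an illustrative file path, is to inspect the compiled xmodel with the xir Python bindings: subgraphs not mapped to the DPU are the ones executed on the ARM CPU.

import xir

graph = xir.Graph.deserialize("LeNet.xmodel")  # path to the compiled model (illustrative)
for sg in graph.get_root_subgraph().toposort_child_subgraph():
    device = sg.get_attr("device") if sg.has_attr("device") else "unknown"
    print(sg.get_name(), "->", device)         # "DPU" subgraphs run on the accelerator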

4.7 Build and Run on ZCU102 Target Board

You can compile the application directly on the SD card once the target board is turned on; in fact, this is what the script run_all_cifar10_target.sh does when you launch it from the target board.
Make an archive with the following commands:

cd <WRK_DIR>/tutorials/VAI-KERAS-CUSTOM-GOOGLENET-RESNET/files
tar -cvf target_zcu102.tar ./target_zcu102 # to be copied on the SD card

Assuming you have transferred the target_zcu102.tar archive from the host to the target board with the scp utility, you can now run the following command directly on the target board:

tar -xvf target_zcu102.tar
cd target_zcu102
bash ./run_all_cifar10_target.sh

4.7.1 The C++ Application with VART APIs

The C++ code for image classification main.cc is independent of the CNN type, thanks to the abstraction done by the VART APIs; it was derived from the Vitis AI resnet50 VART demo.

It is very important that the C++ code for pre-processing the images executes the same operations that you applied in the Python code of the training procedure. This is illustrated in the following C++ code fragments:

/* image pre-processing: scale to [0, 1], normalize to [-1, 1], then scale to int8 for the DPU */
Mat image2 = cv::Mat(inHeight, inWidth, CV_8SC3);
resize(image, image2, Size(inHeight, inWidth), 0, 0, INTER_NEAREST);
for (int h = 0; h < inHeight; h++) {
  for (int w = 0; w < inWidth; w++) {
    for (int c = 0; c < 3; c++) {
      imageInputs[i * inSize + h * inWidth * 3 + w * 3 + c]       = (int8_t)( ((image2.at<Vec3b>(h, w)[c] / 255.0f) - 0.5f) * 2.0f * input_scale ); // if you use BGR
      //imageInputs[i * inSize + h * inWidth * 3 + w * 3 + 2 - c] = (int8_t)( ((image2.at<Vec3b>(h, w)[c] / 255.0f) - 0.5f) * 2.0f * input_scale ); // if you use RGB
    }
  }
}

📌 NOTE The VART-based application uses OpenCV functions to read the image files (png, jpg, or any other supported format), therefore the images are seen as BGR and not as native RGB. All the training and inference steps in this tutorial treat images as BGR, which is also true for the above C++ normalization routine. A mismatch at this level would prevent the computation of correct predictions at run time on the target board.
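To make the point concrete, here is a tiny Python reminder (with an illustrative file name) of the channel order returned by OpenCV, which is the convention that the calibration and run-time pre-processing above must agree on.

import cv2

img = cv2.imread("some_test_image.png")        # OpenCV always returns BGR, uint8
# keep the BGR order for this tutorial's flow; convert only if a model really expects RGB:
# img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)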

4.7.2 Running the four CNNs

Turn on your target board and establish a serial connection with a PuTTY terminal from Ubuntu or with a TeraTerm terminal from your Windows host PC.

Ensure that you have an Ethernet point-to-point cable connection with the correct IP addresses to enable ssh communication, so you can quickly transfer files to the target board with scp from Ubuntu or pscp.exe from a Windows host PC. For example, you can set the IP address of the target board to 192.168.1.100 and that of the host PC to 192.168.1.101, as shown in the following figure:

figure

Once a tar file of the target_zcu102 folder has been created, copy it from the host PC to the target board. For example, in the case of an Ubuntu PC, use the following command:

scp target_zcu102.tar root@192.168.1.100:~/

From the target board terminal, run the following commands:

tar -xvf target_zcu102.tar
cd target_zcu102
bash -x ./run_all_fmnist_target.sh
bash -x ./run_all_cifar10_target.sh

With these commands, the fmnist_test.tar file with the 5000 test images is uncompressed. The single-thread application based on the VART C++ APIs is built with the build_app.sh script and then launched for each CNN; the effective top-5 classification accuracy is checked by a Python script such as check_runtime_top5_fmnist.py, whose core computation is sketched below.
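Such a top-k check boils down to a few lines of Python; the following sketch (not the repository's own script) just shows the idea.

import numpy as np

def top_k_accuracy(scores, labels, k=5):
    # scores: (N, num_classes) array of predictions; labels: (N,) ground-truth class indices
    topk = np.argsort(scores, axis=1)[:, -k:]              # indices of the k highest scores per image
    hits = np.any(topk == labels.reshape(-1, 1), axis=1)   # True where the true label is among the top k
    return float(hits.mean())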

Another script, such as fps_fmnist.sh, launches the multi-thread application based on the VART Python APIs to measure the effective frames per second (fps).
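For reference, a single-thread skeleton of the VART Python flow that such scripts build on could look like the sketch below; the model path is illustrative, and the dtype and scaling of the input buffer must match what the runner's input tensor expects (as done by the C++ fragment above).

import numpy as np
import xir
import vart

graph = xir.Graph.deserialize("LeNet.xmodel")   # compiled model (illustrative path)
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_sg = [s for s in subgraphs
          if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

runner = vart.Runner.create_runner(dpu_sg, "run")
in_dims  = tuple(runner.get_input_tensors()[0].dims)    # e.g. (batch, 32, 32, 3)
out_dims = tuple(runner.get_output_tensors()[0].dims)   # e.g. (batch, 10)

input_data  = [np.zeros(in_dims,  dtype=np.int8, order="C")]  # fill with pre-processed, scaled images
output_data = [np.zeros(out_dims, dtype=np.int8, order="C")]

job_id = runner.execute_async(input_data, output_data)
runner.wait(job_id)
prediction = np.argmax(output_data[0][0])   # top-1 class of the first image in the batch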

5 Summary

The following Excel table summarizes the CNN features for each dataset and for each network in terms of:

  • elapsed CPU time for the training process
  • number of CNN parameters and number of epochs for the training process
  • TensorFlow output node names
  • top-1 accuracies estimated for the TF frozen graph and the quantized graph
  • top-1 accuracies measured on the ZCU102 at run time
  • frames per second (fps), measured on the ZCU102 at run time, including the time to read the images with OpenCV functions on the ARM CPU (in a real-life application the images would be stored in DDR memory, so their access time should be negligible as seen from the DPU IP core).

figure

Note that in the case of the CIFAR-10 dataset, which is more sophisticated than Fashion-MNIST, the top-1 accuracies of the four CNNs are quite different, with miniResNet being the most accurate.

To save storage space, the folder target_zcu102 contains only the xmodel files for the CIFAR-10 dataset, which is more challenging and interesting than the Fashion-MNIST dataset.

6 References

Appendix

A1 PyImageSearch Permission

From: Adrian at PyImageSearch [mailto:a.rosebrock@pyimagesearch.com]
Sent: Thursday, February 20, 2020 12:47 PM
To: Daniele Bagni <danieleb@xilinx.com>
Subject: Re: URGENT: how to cite / use your code in my new DL tutorials

EXTERNAL EMAIL
Hi Daniele,

Yes, the MIT license is perfectly okay to use. Thank you for asking :-)

All the best,


From: Adrian at PyImageSearch <a.rosebrock@pyimagesearch.com>
Sent: Friday, April 12, 2019 4:25 PM
To: Daniele Bagni
Cc: danny.baths@gmail.com

Subject: Re: how to cite / use your code in my new DL tutorials

Hi Daniele,
Thanks for reaching out, I appreciate it! And yes, please feel free to use the code in your project.
If you could attribute the code to the book that would be perfect :-)
Thank you!
--
Adrian Rosebrock
Chief PyImageSearcher

On Sat, Apr 6, 2019 at 6:23 AM EDT, Daniele Bagni <danieleb@xilinx.com> wrote:

Hi Adrian.

...

Can I use part of your code in my tutorials?
In case of positive answer, what header do you want to see in the python files?

...


With kind regards,
Daniele Bagni
DSP / ML Specialist for EMEA
Xilinx Milan office (Italy)

Copyright © 2020 Xilinx