- Tested with Vitis-AI 1.4.1 (do not use Vitis-AI 1.4)
- Tested on ZCU102 evaluation board and Alveo U50 accelerator card
The Vitis-AI Optimizer can optimize convolutional neural networks (CNN) by exploiting redundancies and near-zero parameters to reduce the number of mathematical operations required to execute the network. This process is often known as 'pruning'. The reduction of mathematical operations leads to several benefits:
- Increased throughput.
- Reduced latency.
- Reduced memory footprint of the compiled model.
- Reduced number of accesses to memory.
This tutorial will show you how to use Xilinx's AI Optimizer for TensorFlow to prune an AlexNet CNN by 80% whilst maintaining the original accuracy.
The scripts provided in this design will allow users to either run a complete pruning flow or to run a 'baseline' design without pruning. Having the baseline design is useful to provide a comparison of performance and accuracy with the pruned design.
The `run_no_prune.sh` shell script creates the baseline design, whereas `run_prune.sh` will prune the AlexNet design by approximately 80%. Users are invited to open both scripts with a text editor to get an idea of the differences between the two flows.
The remainder of this tutorial is dedicated to running the pruning flow.
Pruning a CNN is an iterative process: the trained CNN is first analysed using the test dataset, and then pruned in steps. The following diagrams summarize the complete TensorFlow AI Optimizer flow:
- Flow diagram 1: Steps 0 to 4
- Flow diagram 2: Step 5
- Flow diagram 3: Steps 6 to 12
The numbering of the steps in the above flow diagrams corresponds to the numbering of the paragraphs below and to the numbering of the shell scripts. For example, the 'step 4 - prune analysis' step in the diagram corresponds to the Step 4 - Run Pruning Analysis paragraph and to the `source 4_prune_analysis.sh` shell script.
The host machine has several requirements that need to be met before we begin. You will need:
- A license for the AI Optimizer - contact your Xilinx sales representative to obtain one.
- An x86 host machine with a supported OS and either the CPU or GPU versions of the Vitis-AI docker installed - see System Requirements.
- The host machine will require Docker to be installed and the Vitis-AI CPU or GPU docker image to be built - see Getting Started.
- A GPU card suitable for ML training - a GPU with at least 8GB of memory is recommended.
- If you plan to use the ZCU102 evaluation board, it should be prepared with the board image as per the Step 2: Setup the Target instructions. Hints on how to connect the various cables to the ZCU102 are also available here.
- For the Alveo U50, follow the Setup Alveo Accelerator Card instructions.
For more details, refer to the latest version of the Vitis AI User Guide (UG1414).
This tutorial assumes the user is familiar with Python3, TensorFlow and has some knowledge of machine learning principles.
- Copy the repository by doing either of the following:

  - Download the repository as a ZIP file to the host machine, and then unzip the archive.
  - From a terminal, use the `git clone` command.

- Download the Kaggle dogs-vs-cats dataset.

  - Go to the Kaggle website and register a new account if necessary.
  - Download the dataset.
  - Move dogs-vs-cats.zip into the `files` folder in the design repository, which is the same folder that contains the python (`.py`) and shell (`.sh`) scripts.

  The Kaggle dogs-vs-cats dataset consists of 25,000 images of varying dimensions, divided into two classes: cat and dog. Each image is intrinsically labelled or classified in its filename (for example, `cat.12.jpg`). There is a set of unlabelled images which were part of the original Kaggle dogs-vs-cats challenge, but we will not use it in this tutorial. Only the 25,000 images contained in the `train.zip` archive will be used.
- Place a copy of the AI Optimizer license file on your host machine - it needs to be placed under the `files` folder so that it is visible to the Vitis-AI docker. The exact location is defined in the `0_setenv_pr.sh` script:

      # Xilinx pruning licenses
      export XILINXD_LICENSE_FILE=AI_OPTIMIZER.lic

- Open a linux terminal, `cd` to the repository folder, and then `cd` to the `files` folder.

- Start the Vitis AI GPU docker:
      # navigate to the tutorial's files folder
      cd <path_to_densenet_design>/files

      # to start GPU docker container
      ./docker_run.sh xilinx/vitis-ai-gpu:latest
The docker container will start and after accepting the license agreement, you should see something like this in the terminal:
```shell
==========================================
__ ___ _ _ _____
\ \ / (_) | (_) /\ |_ _|
\ \ / / _| |_ _ ___ ______ / \ | |
\ \/ / | | __| / __|______/ /\ \ | |
\ / | | |_| \__ \ / ____ \ _| |_
\/ |_|\__|_|___/ /_/ \_\_____|
==========================================
Docker Image Version: latest
Build Date: 2021-08-04
VAI_ROOT: /opt/vitis_ai
For TensorFlow 1.15 Workflows do:
conda activate vitis-ai-tensorflow
For Caffe Workflows do:
conda activate vitis-ai-caffe
For Neptune Workflows do:
conda activate vitis-ai-neptune
For PyTorch Workflows do:
conda activate vitis-ai-pytorch
For TensorFlow 2.3 Workflows do:
conda activate vitis-ai-tensorflow2
For Darknet Optimizer Workflows do:
conda activate vitis-ai-optimizer_darknet
For Caffe Optimizer Workflows do:
conda activate vitis-ai-optimizer_caffe
For TensorFlow 1.15 Optimizer Workflows do:
conda activate vitis-ai-optimizer_tensorflow
Vitis-AI /workspace >
```
💡 If you get a "Permission Denied" error when starting the docker container, it is almost certainly because the `docker_run.sh` script is not set to be executable. You can fix this by running the following command:
chmod +x docker_run.sh
The remainder of this tutorial shows a step-by-step pruning flow; however, users can run the complete flow with `source run_prune.sh` if they wish. The baseline flow can be run with `source run_no_prune.sh`.
Use the `setenv` scripts to set up all the environment variables. To set up the environment variables used in the pruning flow, run the environment setup script: `source 0_setenv_pr.sh`.

The `0_setenv_np.sh` script is only used in the non-pruning flow.

The `0_setenv_common.sh` script is called by both `0_setenv_pr.sh` and `0_setenv_np.sh` and defines all the paths and names of files and directories, which the user can edit as required. It also defines the GPUs to be used, which might require modification to match the user's system, such as:
export CUDA_VISIBLE_DEVICES="0"
`0_setenv_common.sh` also defines some of the hyperparameters, which users can modify as required:
export BATCHSIZE=250
export INIT_LR=0.001
export TRAIN_EPOCHS=200
Run `source 1_create_dataset.sh`, which will take some time to complete.

The `1_create_dataset.sh` script calls the `create_dataset.py` Python script, which in turn will do the following (a minimal sketch of the per-image pre-processing is shown after the list):

- Unzip the `dogs-vs-cats.zip` archive.
- Split the full set of images into training and test subsets.
- Resize and center crop the images to the size specified by the `--input_height` and `--input_width` arguments.
- Normalize the images, so that pixel values are in the range 0 to 1.
- Pack the resized, normalized train and test images and their associated labels into numpy files, which will be used as input for training and test in TensorFlow.
- Put all the test images in a separate folder (`test_images`), so that they can be used later to create the files and folders to be run on the ZCU102 evaluation board.
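The following is a minimal sketch of those pre-processing steps, not the actual `create_dataset.py` code; the OpenCV calls and the `.npz` key names ('x', 'y') are illustrative assumptions:

```python
# Minimal sketch of the per-image pre-processing: resize keeping aspect
# ratio, center-crop to 224x224, normalize to 0..1, then pack into a .npz.
import math
import cv2
import numpy as np

def preprocess(image_path, height=224, width=224):
    img = cv2.imread(image_path)                           # BGR uint8 array
    h, w = img.shape[:2]
    scale = max(height / h, width / w)                     # shortest side -> target
    img = cv2.resize(img, (math.ceil(w * scale), math.ceil(h * scale)))
    y0 = (img.shape[0] - height) // 2                      # center crop offsets
    x0 = (img.shape[1] - width) // 2
    img = img[y0:y0 + height, x0:x0 + width]
    return img.astype(np.float32) / 255.0                  # normalize to 0..1

# the label comes from the filename, e.g. 'cat.12.jpg' -> class 0, 'dog.*' -> 1
x = preprocess('dataset/cat.12.jpg')
np.savez('trainData.npz', x=np.expand_dims(x, 0), y=np.array([0]))
```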
The `create_dataset.py` Python script has three arguments:

| Argument | Data Type | Default | Description |
|---|---|---|---|
| `--dataset_dir` or `-d` | string | ./dataset | Path to folder where .npz files and test images will be created |
| `--input_height` or `-ih` | integer | 224 | All images will be resized to this height |
| `--input_width` or `-iw` | integer | 224 | All images will be resized to this width |
After this script has completed, you should find two numpy files, `trainData.npz` and `testData.npz`, in the dataset folder, and a subfolder called `test_images` that contains the resized test images.
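If you want to quickly confirm what was written, the archives can be inspected with numpy (a small check, not part of the flow):

```python
# Quick sanity check of the generated archives. No key names are assumed here:
# npz.files lists whatever array names create_dataset.py used.
import numpy as np

npz = np.load('./dataset/trainData.npz')
print(npz.files)                                  # names of the stored arrays
x_train = npz[npz.files[0]]
print(x_train.shape, x_train.dtype, x_train.min(), x_train.max())
```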
Run `source 2_train.sh`, which calls the `train_ft.py` Python script.

The `train_ft.py` script is used at two different points in the complete flow - initial training of the original network and also fine-tuning of the pruned checkpoints.

If an input checkpoint is specified using the `--input_ckpt` argument, the script will assume that it is fine-tuning a pruned checkpoint and sets the pruning mode. For fine-tuning, it is necessary to call the `tf.set_pruning_mode()` API, which enables a 'sparse training' mode that keeps the weights of pruned channels at a zero value during fine-tuning:
# if an input checkpoint is specified, we are doing pruning fine-tune
if (input_ckpt!=''):
tf.set_pruning_mode()
Then the script will restore that checkpoint in the TensorFlow session.
# if input checkpoint specified, restore it
if (input_ckpt!=''):
saver.restore(sess, input_ckpt)
There is a minimal amount of image augmentation done during training; the images are randomly flipped from left to right around the vertical axis.
# random flip
if (random.randint(0,1)==1):
    x_batch[j] = np.fliplr(x_batch[j])
The initial learning rate is set by the `INIT_LR` environment variable, which you will find in `0_setenv_common.sh`:
export INIT_LR=0.001
This learning rate will be used for the first half of the training and will then be divided by 10 for the second half of the training.
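As a simple illustration of that schedule (this is a sketch, not code from `train_ft.py`):

```python
# Illustrative two-phase schedule: INIT_LR is used for the first half of
# training and INIT_LR/10 for the second half.
import os

init_lr = float(os.environ.get('INIT_LR', '0.001'))
epochs = int(os.environ.get('TRAIN_EPOCHS', '200'))

def learning_rate(epoch):
    return init_lr if epoch < epochs // 2 else init_lr / 10.0

print(learning_rate(0), learning_rate(epochs - 1))    # 0.001 0.0001
```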
At the end of each training epoch, the script evaluates the accuracy of the network using the test dataset. A checkpoint will be saved at the end of each epoch only if the accuracy improves over the current best score. The final accuracy should be approximately 92%.
After training has completed, TensorBoard can be launched using the command reported in the console and log file.
The `train_ft.py` Python script has the following arguments:
| Argument | Data Type | Default | Description |
|---|---|---|---|
| `--target_acc` or `-a` | float | 1.0 | Minimum accuracy level that causes training/fine-tuning to exit |
| `--dataset_dir` or `-d` | string | './dataset' | Path to folder containing testData.npz and trainData.npz files |
| `--input_ckpt` or `-i` | string | '' | Path to input checkpoint for fine-tuning. Empty string means run initial training |
| `--epochs` or `-e` | integer | 1 | Number of epochs for training or fine-tuning |
| `--batchsize` or `-b` | integer | 100 | Batch size for training or fine-tuning |
| `--init_lr` or `-il` | float | 0.1 | Initial learning rate for optimizer |
| `--output_ckpt` or `-o` | string | output.ckpt | Path to output checkpoint |
| `--tboard_logs` or `-tb` | string | './tb_logs' | Path to folder where TensorBoard event logs will be saved |
| `--input_height` or `-ih` | integer | 224 | Input height size |
| `--input_width` or `-iw` | integer | 224 | Input width size |
| `--input_chan` or `-ic` | integer | 3 | Number of input channels |
| `--gpu` or `-g` | string | '0' | List of GPUs to be used for training or fine-tuning |
The `--target_acc` option can be used to make the training/fine-tuning exit early if the specified minimum accuracy is reached. This is mainly intended for use during fine-tuning - one strategy for fine-tuning would be to set `--target_acc` to the same accuracy value obtained during initial training.

The training/fine-tuning will run until either the target accuracy or the maximum number of epochs (`--epochs`) is reached.

For initial training, `--target_acc` would normally be left at the default value of 1.0 (100% accuracy target).
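The early-exit behaviour can be pictured with the following sketch; `train_one_epoch()` and `evaluate()` are hypothetical stand-ins for the real training and evaluation code in `train_ft.py`:

```python
# Illustrative early-exit logic for --target_acc: save a checkpoint only when
# the accuracy improves, and stop as soon as the target is reached.
import random

epochs, target_acc = 200, 0.92
def train_one_epoch(): pass
def evaluate(): return random.uniform(0.85, 0.95)     # dummy accuracy

best_acc = 0.0
for epoch in range(epochs):
    train_one_epoch()
    acc = evaluate()
    if acc > best_acc:
        best_acc = acc          # the real script saves a checkpoint here
    if acc >= target_acc:
        print('target accuracy reached at epoch', epoch)
        break
```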
Run `source 3_export_inf.sh`, which calls the `export_inf_graph.py` Python script.

An inference graph is required for pruning. The `export_inf_graph.py` Python script will write out an inference graph in the text protobuf format. The AlexNet is instantiated with the dropout rate set to 0 and the `is_training` argument set to False:
network = alexnet(net_in=input_tensor,classes=2,drop_rate=0.0,is_training=False)
Note: This inference graph is not the same as a frozen graph because it does not contain any values.
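A minimal sketch of how such a graph can be written out is shown below. The input node name, shape and output path are assumptions, and the `alexnet()` call is shown commented out because the builder lives in this design's scripts:

```python
# Sketch of exporting an inference graph as a text protobuf (structure only).
import tensorflow as tf

tf.compat.v1.reset_default_graph()
images_in = tf.compat.v1.placeholder(tf.float32, [None, 224, 224, 3],
                                     name='images_in')      # assumed node name
# network = alexnet(net_in=images_in, classes=2, drop_rate=0.0, is_training=False)
graph_def = tf.compat.v1.get_default_graph().as_graph_def()
tf.io.write_graph(graph_def, './build_pr', 'inference_graph.pbtxt',
                  as_text=True)    # text protobuf - no weights are stored
```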
The `export_inf_graph.py` Python script has the following arguments:

| Argument | Data Type | Default | Description |
|---|---|---|---|
| `--output_file` or `-o` | string | '' | Full path name of inference graph file to be created |
| `--input_nodes` or `-i` | string | '' | List of input nodes |
Before moving on, ensure that the output file has been created.
Run `source 4_prune_analysis.sh`, which will take some time to complete.

The pruning analysis command analyzes the trained checkpoint and inference graph and writes the analysis results into a file named `.ana` in the folder indicated by the `--workspace` argument.

The `--eval_fn_path` argument must point to a Python file that contains a function named `model_fn()`. This function will be called once for every analysis batch; the number of analysis batches is set by the `--max_num_batches` argument.

The `model_fn()` function must instantiate the network (for inference, not training) and provide input data (usually the test dataset) and an evaluation metric - in this case, accuracy is calculated.
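The sketch below illustrates the general shape of such a file. It assumes an interface where `model_fn()` builds the inference graph, feeds the test data and returns a dict of TensorFlow metrics; the `.npz` handling and the stand-in dense layer are illustrative only - refer to the evaluation file shipped with this design for the exact form:

```python
# Hedged sketch of a file that --eval_fn_path could point to.
import numpy as np
import tensorflow as tf

def model_fn():
    data = np.load('./dataset/testData.npz')
    # assumes images are stored first and labels second - check data.files
    x_test = data[data.files[0]][:100].astype(np.float32)
    y_test = data[data.files[1]][:100].astype(np.int64)

    images = tf.constant(x_test)                           # one analysis batch
    labels = tf.constant(y_test)
    # logits = alexnet(net_in=images, classes=2, drop_rate=0.0,
    #                  is_training=False)                  # the real network
    logits = tf.compat.v1.layers.dense(
        tf.reshape(images, [-1, 224 * 224 * 3]), 2)        # stand-in network
    predictions = tf.argmax(logits, axis=1)
    # dict of metric_name -> (value, update_op), evaluated per analysis batch
    return {'accuracy': tf.compat.v1.metrics.accuracy(labels, predictions)}
```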
After the script completes, and before proceeding to the next step, ensure that you have a file named `.ana` in the folder indicated by the `--workspace` argument.

Note: You may need to enable viewing of hidden files to see the `.ana` file, as Linux usually treats any file that begins with '.' as a hidden file.
Run `source 5_pruning.sh`.

The `5_pruning.sh` shell script invokes the commands necessary to iteratively prune the network. The pruning loop runs eight pruning iterations, each of which reduces the network by approximately 10%.

Inside the pruning loop, the sparsity value (that is, the amount of pruning) is incremented by 0.1 on each loop iteration. Pruning is run using the sparsity value, and then fine-tuning (essentially re-training) is run on the pruned checkpoint.

For the first seven fine-tuning iterations, the target accuracy is set to 90% to speed up the flow; the eighth and last iteration has a target accuracy of 92% to match the accuracy obtained after initial training.
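The loop structure can be summarized as follows; `prune_to()` and `finetune()` are hypothetical stand-ins for the `vai_p_tensorflow` prune command and for `train_ft.py` respectively:

```python
# Illustrative outline of the loop in 5_pruning.sh: sparsity ramps from 0.1
# to 0.8 in eight steps, and each pruned checkpoint is fine-tuned before the
# next step.
def prune_to(sparsity):
    print('pruning to sparsity', sparsity)

def finetune(target_acc):
    print('fine-tuning towards accuracy', target_acc)

for step in range(1, 9):
    sparsity = round(0.1 * step, 1)            # 0.1, 0.2, ... 0.8
    prune_to(sparsity)
    target = 0.92 if step == 8 else 0.90       # last step matches baseline
    finetune(target)
```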
Run `source 6_transform.sh`.

This step converts the sparse checkpoint created by the pruning/fine-tuning iterations into a dense checkpoint.

The checkpoint created in step 5 contains exactly the same number of parameters as the original model, but many of them are now 'zeroed out'. The `vai_p_tensorflow --action transform` command called in the `6_transform.sh` script removes the zero values to reduce the checkpoint.
Run `source 7_freeze.sh`.

The Vitis AI tools require a TensorFlow frozen graph as the input to the quantization stage. The `7_freeze.sh` shell script creates the frozen graph from the dense checkpoint. The frozen graph is in the binary protobuf format and gets its name because all variables are converted into constants and graph nodes associated with training, such as the optimizer and loss function, are stripped out.
Run `source 8_eval_frozen.sh`.

This is an optional step because the frozen graph is still in floating-point format and should give almost identical accuracy results to the evaluation done during the training phase (step 2). All images of the test set are passed through the frozen model and the accuracy is calculated.
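For reference, evaluating a frozen graph generally looks like the sketch below; the node names, tensor names and paths are assumptions, and the actual logic lives in `eval_graph.py`:

```python
# Hedged sketch: import a frozen .pb, feed the test images in batches and
# compare predictions with the labels.
import numpy as np
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('./build_pr/freeze/frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

data = np.load('./dataset/testData.npz')
x_test, y_test = data[data.files[0]], data[data.files[1]]

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    images_in = graph.get_tensor_by_name('images_in:0')    # assumed input node
    logits = graph.get_tensor_by_name('logits:0')          # assumed output node
    with tf.compat.v1.Session(graph=graph) as sess:
        preds = []
        for i in range(0, len(x_test), 100):                # simple batching
            out = sess.run(logits, {images_in: x_test[i:i + 100]})
            preds.append(np.argmax(out, axis=1))
        preds = np.concatenate(preds)
        print('accuracy:', float((preds == y_test).mean()))
```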
Run `source 9_quant.sh`.

The DPU accelerator IP executes all calculations in 8-bit integer format, so we must quantize our floating-point frozen graph. This is done by the Vitis AI tools, in particular by the `vai_q_tensorflow quantize` command. This command can be seen in the `9_quant.sh` script and has several arguments that you must provide values for:
| Argument | Description |
|---|---|
| `--input_frozen_graph` | Path and name of the input .pb frozen graph |
| `--input_fn` | Name of input function used in calibration pre-processing |
| `--output_dir` | Name of the output folder where the quantized models are saved |
| `--input_nodes` | Name(s) of the input nodes |
| `--output_nodes` | Name(s) of the output nodes |
| `--input_shapes` | Shape(s) of the input nodes |
| `--calib_iter` | Number of calibration iterations |
| `--gpu` | List of CUDA devices to be used for quantization |
Note: Any error messages relating to `./bin/ptxas` can be ignored.
Most of the arguments are self-explanatory, but special mention needs to be made of the `--input_fn` and `--calib_iter` arguments.

You must use a sample set of data to calibrate the quantization process. This data will be passed through the model, so it must be pre-processed in exactly the same way as the data is pre-processed in training. The function pointed to by the `--input_fn` argument needs to contain all of the pre-processing steps.

The `image_input_fn.py` Python script contains a single function called `calib_input` (the `--input_fn` argument is set to `image_input_fn.calib_input` in the `9_quant.sh` shell script), which unpacks the training dataset from its numpy format and then creates a list of numpy arrays. The number of arrays in the list is given by `calib_batch_size * calib_iter` and should be at least 1000.
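A calibration input function of this kind typically looks like the sketch below; the input node name `images_in`, the batch size and the `.npz` key are assumptions rather than the exact contents of `image_input_fn.py`:

```python
# Hedged sketch of a calibration input function: for each calibration
# iteration it returns a dict mapping the graph's input node name to one
# batch of pre-processed images.
import numpy as np

calib_batch_size = 50
_data = np.load('./dataset/trainData.npz')
_images = _data[_data.files[0]]           # already resized and normalized

def calib_input(iter):
    start = iter * calib_batch_size
    return {'images_in': _images[start:start + calib_batch_size]}
```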
After quantization has completed, you will have the quantized deployment model (`deploy_model.pb`) and the evaluation model (`quantize_eval_model.pb`) in the `./files/build_pr/quantize` folder.
Run `source 10_eval_quant.sh`.
This step is optional but highly recommended. The conversion from a floating-point model where the values can have a very wide dynamic range to an 8-bit model where values can only have one of 256 values almost inevitably leads to a small loss of accuracy. You use the quantized evaluation model to see exactly how much impact the quantization process has had.
To ensure consistency, the same Python script, `eval_graph.py`, that was used to evaluate the frozen graph is also used to evaluate the quantized model.
For the ZCU102, run `source 11_compile_zcu102.sh`.

For the Alveo U50, run `source 11_compile_u50.sh`.
The DPU is a soft-core IP whose only function is to accelerate the execution of convolutional neural networks. It acts as a co-processor to the host processor and has its own instruction set; those instructions are passed to the DPU in the .xmodel file format.

The Vitis AI compiler will convert, and optimize where possible, the quantized model into a set of micro-instructions and then output them to an .xmodel file.

The output of the compile scripts is written into a folder called `files/build_pr/compile_<board_name>`.
For the ZCU102, run `source 12_make_target_zcu102.sh`.

For the Alveo U50, run `source 12_make_target_u50.sh`.

This final step will copy all the files required for running on the chosen target board into the `files/build_pr/target_<board_name>` folder.
Ensure that the ZCU102 SD card has been flashed with the correct version of the image file and boots correctly before proceeding.

The entire `files/build_pr/target_zcu102` folder will need to be copied to the `/home/root` folder of the ZCU102's SD card. This can be done in one of several ways:
- Direct copy to SD card:

  - If the host machine has an SD card slot, insert the flashed SD card; when it is recognized, you will see two volumes, BOOT and ROOTFS.
  - Navigate into ROOTFS and then into the `/home` folder. Make the `./root` folder writeable by issuing the command `sudo chmod -R 777 root`, and then copy the entire target folder from the host machine into the `/home/root` folder of the SD card.
  - Unmount both the BOOT and ROOTFS volumes from the host machine, and then eject the SD card from the host machine.

- With the scp command:

  - If the ZCU102 is connected to a network and reachable by the host machine, the target folder can be copied using the scp command. If you connect directly from your host machine to the ZCU102 using ethernet, you might need to set up static IP addresses.
  - The command will be something like `scp -r ./build_pr/target_zcu102 root@192.168.1.227:~/.`, assuming that the ZCU102 IP address is 192.168.1.227. Adjust this and the path to the target folder as appropriate for your system.
  - If a password is asked for, enter `root`.
With the target folder copied to the SD card and the ZCU102 booted, you can issue the command to launch the application.
Note: This process is done on the ZCU102 board, not the host machine, so it requires a connection to the ZCU102, such as a serial connection to the UART or an SSH connection via Ethernet.
The application can be started on the ZCU102 by navigating into the `target_zcu102` folder (`cd target_zcu102`) and then issuing the command `python3 app_mt.py -m model_dir/alexnet_pr.xmodel`. The application will start and, after a few seconds, show the throughput (in frames/sec) and the accuracy:
root@xilinx-zcu102-2020_2:~/target_zcu102# python3 app_mt.py -m model_dir/alexnet_pr.xmodel
-----------------------------------------------
Command line options:
--image_dir : images
--threads : 1
--model : model_dir/alexnet_pr.xmodel
-----------------------------------------------
Found 5000 images - processing 2500 of them
Found 1 subgraphs in model_dir/alexnet_pr.xmodel
Pre-processing 2500 images...
Starting 1 threads...
-----------------------------------------------
Correct:2307, Wrong:193, Accuracy:0.9228
-----------------------------------------------
FPS: 161.02, total frames: 2500, total time: 15.526 seconds
-----------------------------------------------
The throughput can be increased by using more than one thread with the `--threads` option. For example, to use four threads:
root@xilinx-zcu102-2020_2:~/target_zcu102# python3 app_mt.py -m model_dir/alexnet_pr.xmodel --threads 4
-----------------------------------------------
Command line options:
--image_dir : images
--threads : 4
--model : model_dir/alexnet_pr.xmodel
-----------------------------------------------
Found 5000 images - processing 2500 of them
Found 1 subgraphs in model_dir/alexnet_pr.xmodel
Pre-processing 2500 images...
Starting 4 threads...
-----------------------------------------------
Correct:2307, Wrong:193, Accuracy:0.9228
-----------------------------------------------
FPS: 309.76, total frames: 2500, total time: 8.071 seconds
-----------------------------------------------
Note that the U50 will need to have been flashed with the correct deployment shell - this should have been done in the 'Preparing the host machine and target boards' section above.
The following steps should be run from inside the Vitis-AI Docker container:
- Ensure that Vitis-AI's TensorFlow conda environment is enabled (if not, run `conda activate vitis-ai-tensorflow`).

- Run `source setup.sh DPUCAHX8H`, which sets environment variables to point to the correct overlay for the U50.

The complete steps to run are as follows:
conda activate vitis-ai-tensorflow
source setup.sh DPUCAHX8H
cd build_pr/target_u50
/usr/bin/python3 app_mt.py -m model_dir/alexnet_pr.xmodel
You should see something like this:
Vitis-AI /workspace/build_pr/target_u50 > /usr/bin/python3 app_mt.py -m model_dir/alexnet_pr.xmodel
-----------------------------------------------
Command line options:
--image_dir : images
--threads : 1
--model : model_dir/alexnet_pr.xmodel
-----------------------------------------------
Found 5000 images - processing 2500 of them
Found 1 subgraphs in model_dir/alexnet_pr.xmodel
Pre-processing 2500 images...
Starting 1 threads...
-----------------------------------------------
Correct:2312, Wrong:188, Accuracy:0.9248
-----------------------------------------------
FPS: 471.33, total frames: 2500, total time: 5.304 seconds
-----------------------------------------------
As with the ZCU102, the performance can be increased by using more threads:
Vitis-AI /workspace/build_pr/target_u50 > /usr/bin/python3 app_mt.py -m model_dir/alexnet_pr.xmodel --threads 4
-----------------------------------------------
Command line options:
--image_dir : images
--threads : 4
--model : model_dir/alexnet_pr.xmodel
-----------------------------------------------
Found 5000 images - processing 2500 of them
Found 1 subgraphs in model_dir/alexnet_pr.xmodel
Pre-processing 2500 images...
Starting 4 threads...
-----------------------------------------------
Correct:2312, Wrong:188, Accuracy:0.9248
-----------------------------------------------
FPS: 2959.87, total frames: 2500, total time: 0.845 seconds
-----------------------------------------------
Copyright © 2020 Xilinx