This is the Pytorch agent for MLModelScope, an open-source, framework- and hardware-agnostic, extensible, and customizable platform for evaluating and profiling ML models across datasets, frameworks, and systems, and within AI application pipelines.
Currently it has most of the models from the Pytorch Model Zoo built in, plus many models acquired from public repositories. Although the agent supports different modalities, including Object Detection and Image Enhancement, most of the built-in models are for Image Classification. More built-in models are coming. One can evaluate the ~50 models on any system of interest with either a local Pytorch installation or Pytorch docker images.
Check out MLModelScope; contributions are welcome.
We first discuss a bare minimum pytorch-agent installation without the tracing and profiling capabilities. To make this work, you will need to have the following system libraries preinstalled in your system.
- The CUDA library (required)
- The CUPTI library (required)
- The Pytorch C++ (libtorch) library (required)
- The libjpeg-turbo library (optional, but preferred)
Please refer to Nvidia CUDA library installation on this. Find the location of your local CUDA installation, which is typically at /usr/local/cuda/, and set up the path to the libcublas.so library. Place the following in either your ~/.bashrc or ~/.zshrc file:
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/lib64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
Please refer to Nvidia CUPTI library installation on this. Find the location of your local CUPTI installation, which is typically at /usr/local/cuda/extras/CUPTI, and set up the path to the libcupti.so library. Place the following in either your ~/.bashrc or ~/.zshrc file:
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
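The export lines above append unconditionally, so an empty variable ends up with a leading colon (an empty path entry), and re-sourcing the file repeats entries. As a minimal sketch, assuming a POSIX-style shell, a small helper can guard against both. The name append_path is hypothetical and not part of MLModelScope.

```shell
# append_path VAR DIR: append DIR to the colon-separated variable named VAR.
# Skips the append when DIR is already listed and avoids a leading colon when
# VAR is empty or unset. Hypothetical helper, not part of MLModelScope.
append_path() {
    var_name=$1
    dir=$2
    # Read the current value of the variable whose name is in var_name.
    eval "current=\${$var_name:-}"
    case ":$current:" in
        *":$dir:"*) ;;                              # already present
        *) current="${current:+$current:}$dir" ;;   # append without a leading colon
    esac
    eval "export $var_name=\"\$current\""
}

# Possible use in ~/.bashrc, with the default paths from this section:
#   append_path LIBRARY_PATH    /usr/local/cuda/lib64
#   append_path LD_LIBRARY_PATH /usr/local/cuda/lib64
#   append_path LIBRARY_PATH    /usr/local/cuda/extras/CUPTI/lib64
#   append_path LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64
```

The plain export lines work as well; the helper only matters when the variables start out empty or the file is sourced more than once.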
The Pytorch C++ library is required for our Pytorch Go package. If you want to use Pytorch Docker Images (e.g. NVIDIA GPU CLOUD (NGC)) instead, skip this step for now and refer to our later section on this.
You can download a pre-built Pytorch C++ (libtorch) library from Pytorch. Choose Pytorch Build = Stable (1.3), Your OS = <fill>, Package = LibTorch, Language = C++, and CUDA = <fill>. Download the Pre-cxx11 ABI or cxx11 ABI version based on your local gcc/g++ version.
Extract the downloaded archive to /opt/libtorch/.
tar -C /opt/libtorch -xzf (downloaded file)
Configure the linker environment variables, since the Pytorch C++ library is extracted to a non-system directory. Place the following in either your ~/.bashrc or ~/.zshrc file:
Linux
export LIBRARY_PATH=$LIBRARY_PATH:/opt/libtorch/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/libtorch/lib
macOS
export LIBRARY_PATH=$LIBRARY_PATH:/opt/libtorch/lib
export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/opt/libtorch/lib
You can test the installed Pytorch C++ library using an example C++ program, although we suggest running an example in github.com/rai-project/go-pytorch as per its documentation to confirm library installation.
To build the Pytorch C++ library from source, refer to https://github.com/pytorch/pytorch#installation and the code for building go-pytorch dockerfiles.
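Before running the go-pytorch example, a quick file-level check can catch a wrong path early. The following is a minimal sketch for sh or bash, assuming the default install locations used in this document. The helper names has_lib and report_lib are hypothetical, and the library stems in the usage comments (libcublas, libcupti, libtorch) are assumptions about the file names on your system.

```shell
# has_lib DIR STEM: succeed if DIR contains a shared library whose file name
# is STEM.so, STEM.so.<version>, or STEM.dylib. Hypothetical helper.
has_lib() {
    for candidate in "$1/$2".so "$1/$2".so.* "$1/$2".dylib; do
        # An unmatched glob stays literal in sh/bash, so -e is simply false.
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# report_lib DIR STEM: print a one-line status for the library.
report_lib() {
    if has_lib "$1" "$2"; then
        echo "found   $2 in $1"
    else
        echo "missing $2 in $1"
    fi
}

# Possible use with the default locations from this document:
#   report_lib /usr/local/cuda/lib64 libcublas
#   report_lib /usr/local/cuda/extras/CUPTI/lib64 libcupti
#   report_lib /opt/libtorch/lib libtorch
```

This only confirms that the files exist. It does not show that they link correctly, so running the go-pytorch example is still the real test.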
libjpeg-turbo is a JPEG image codec that uses SIMD instructions (MMX, SSE2, AVX2, NEON, AltiVec) to accelerate baseline JPEG compression and decompression. It outperforms libjpeg by a significant amount.
You need libjpeg installed.
sudo apt-get install libjpeg-dev
libjpeg-turbo is used by default. To opt out, use the build tag nolibjpeg.
To install libjpeg-turbo, refer to libjpeg-turbo.
Linux
export TURBO_VER=2.0.2
cd /tmp
wget https://cfhcable.dl.sourceforge.net/project/libjpeg-turbo/${TURBO_VER}/libjpeg-turbo-official_${TURBO_VER}_amd64.deb
sudo dpkg -i libjpeg-turbo-official_${TURBO_VER}_amd64.deb
macOS
brew install jpeg-turbo
Since we use go for MLModelScope development, you need to have go installed on your system before proceeding.
Please follow Installing Go Compiler to install go.
Download and install the MLModelScope Pytorch Agent by running the following command in any location, assuming you have installed go following the above instructions.
go get -v github.com/rai-project/pytorch
You can then install the dependency packages through go get.
cd $GOPATH/src/github.com/rai-project/pytorch
go get -u -v ./...
Alternatively, install the dependency packages with Dep.
dep ensure -v
This installs the dependencies in vendor/.
The CGO interface passes Go pointers to the C API, which the CGO runtime pointer check reports as an error. Disable the check by placing
export GODEBUG=cgocheck=0
in your ~/.bashrc or ~/.zshrc file, and then run either source ~/.bashrc or source ~/.zshrc.
Build the Pytorch agent with GPU enabled
cd $GOPATH/src/github.com/rai-project/pytorch/pytorch-agent
go build
Build the Pytorch agent without GPU or libjpeg-turbo
cd $GOPATH/src/github.com/rai-project/pytorch/pytorch-agent
go build -tags="nogpu nolibjpeg"
If everything is successful, you should have an executable pytorch-agent binary in the current directory.
To run the agent, you need to set up the correct configuration file for the agent. Some of the settings may not make sense for every testing scenario, but they are required and will be needed for later-stage testing. Some of the port numbers specified below can be changed depending on how you later set up those services.
For now, let's set them up as is and worry about the detailed configuration parameter values later.
You must have a carml config file called .carml_config.yml under your home directory. An example config file, carml_config.yml.example, is in github.com/rai-project/MLModelScope. You can move it to ~/.carml_config.yml.
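Putting the example config in place can be scripted so that an existing configuration is never overwritten. This is a minimal sketch that copies the file rather than moving it. The helper name install_config is hypothetical, and the source path in the usage comment assumes the MLModelScope repository was fetched under $GOPATH, which this document does not state.

```shell
# install_config SRC DEST: copy SRC to DEST unless DEST already exists.
# Hypothetical helper, not part of MLModelScope.
install_config() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        echo "keeping existing $dest"
        return 0
    fi
    if [ ! -f "$src" ]; then
        echo "example config not found: $src" >&2
        return 1
    fi
    cp "$src" "$dest" && echo "installed $dest"
}

# Possible use (the source path is an assumption about where the repo lives):
#   install_config \
#     "$GOPATH/src/github.com/rai-project/MLModelScope/carml_config.yml.example" \
#     "$HOME/.carml_config.yml"
```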
The following configuration file can be placed in $HOME/.carml_config.yml or can be specified via the --config="path" option.
app:
  name: carml
  debug: true
  verbose: true
  tempdir: ~/data/carml
registry:
  provider: consul
  endpoints:
    - localhost:8500
  timeout: 20s
  serializer: jsonpb
database:
  provider: mongodb
  endpoints:
    - localhost
tracer:
  enabled: true
  provider: jaeger
  endpoints:
    - localhost:9411
  level: FULL_TRACE
logger:
  hooks:
    - syslog
With the configuration and the above bare minimum installation, you should be ready to test the installation and see how things work.
Here are a few examples. First, make sure we are in the right location:
cd $GOPATH/src/github.com/rai-project/pytorch/pytorch-agent
To see the help message
./pytorch-agent -h
To see a list of models that we can run with this agent
./pytorch-agent info models
To run an inference using the default DNN model alexnet with a default input image
./pytorch-agent predict urls --model_name TorchVision_Alexnet --profile=false --publish=false
The --profile=false --publish=false flags above tell the agent that we do not want to profile the run or publish the results, since we have not yet installed the MongoDB database to store profiling data or the tracer service to accept tracing information.
We now discuss how to install a few external services that make the agent fully useful in terms of collecting tracing and profiling data.
MLModelScope relies on a few external services. These services provide tracing, registry, and database servers.
These services can be installed and enabled in different ways. Below we use docker to show how this can be done. You can also skip docker and install those services directly from either binaries or source code.
Refer to Install Docker.
On Ubuntu, an easy way is to use the convenience script:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
On macOS, install Docker Desktop.
The tracing service (Jaeger) is required.
- On x86 (e.g. intel) machines, start jaeger by
docker run -d -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 -p5775:5775/udp -p6831:6831/udp -p6832:6832/udp \
-p5778:5778 -p16686:16686 -p14268:14268 -p9411:9411 jaegertracing/all-in-one:latest
- On ppc64le (e.g. minsky) machines, start jaeger by
docker run -d -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 -p5775:5775/udp -p6831:6831/udp -p6832:6832/udp \
-p5778:5778 -p16686:16686 -p14268:14268 -p9411:9411 carml/jaeger:ppc64le-latest
The trace server runs on http://localhost:16686
The registry service (Consul) is not required if you use pytorch-agent for local evaluation.
- On x86 (e.g. intel) machines, start consul by
docker run -p 8500:8500 -p 8600:8600 -d consul
- On ppc64le (e.g. minsky) machines, start consul by
docker run -p 8500:8500 -p 8600:8600 -d carml/consul:ppc64le-latest
The registry server runs on http://localhost:8500
The database service (MongoDB) is not required if you do not publish evaluation results to a database.
- On x86 (e.g. intel) machines, start mongodb by
docker run -p 27017:27017 --restart always -d mongo:3.0
You can also mount the database volume to a local directory using
docker run -p 27017:27017 --restart always -d -v $HOME/data/carml/mongo:/data/db mongo:3.0
You must have a carml config file called .carml_config.yml under your home directory. An example ~/.carml_config.yml is discussed above. If you choose different ports for the external services above, update the port numbers in the config file accordingly.
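Once the containers are started, it can help to confirm that the ports in the config file are actually reachable. The following is a minimal sketch, assuming bash (it relies on bash's /dev/tcp pseudo-device) and the default ports used in this document. The helper names port_open and check_services are hypothetical.

```shell
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
# Relies on bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# check_services: report on the default local endpoints from this document:
# Jaeger UI (16686), Jaeger's Zipkin-compatible collector (9411),
# Consul (8500), and MongoDB (27017).
check_services() {
    for endpoint in jaeger-ui:16686 jaeger-zipkin:9411 consul:8500 mongodb:27017; do
        name=${endpoint%%:*}
        port=${endpoint##*:}
        if port_open localhost "$port"; then
            echo "$name (port $port): reachable"
        else
            echo "$name (port $port): not reachable"
        fi
    done
}

# Possible use: run check_services after starting the containers.
```

An open port only shows that something is listening there. It does not confirm that the expected service is the one answering.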
The testing steps are very similar to those discussed above, except that you can now safely use both the profiling and publishing services.
Instead of using a local Pytorch library to install the MLModelScope pytorch-agent, we can also use a Pytorch docker image.
You need to follow procedures similar to those above to set up go and get all the related rai-project projects in your local go development environment.
go get -v github.com/rai-project/pytorch
cd $GOPATH/src/github.com/rai-project/pytorch
go get -u -v ./...
You also need the .carml_config.yml configuration file, as discussed above, placed under $HOME.
You can also set up all the external services, as discussed above, on the host machine where you plan to use the Pytorch Docker container.
Continue if you have done all of the following; otherwise, read the sections above:
- installed all the dependencies
- downloaded carml_config_example.yml to $HOME as .carml_config.yml
- launched the docker external services on the host machine of the docker container you are going to use
Assuming you want to use the NGC Pytorch docker image, here is an example of how to do this:
docker run --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it --privileged=true --network host \
-v $GOPATH:/workspace/go1.12/global \
-v $GOROOT:/workspace/go1.12_root \
-v ~/.carml_config.yml:/root/.carml_config.yml \
nvcr.io/nvidia/pytorch:20.01-py3
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be insufficient for PyTorch. NVIDIA recommends the use of the following flags:
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
Within the container, set up the environment so that the agent can find the Pytorch C++ library.
export GOPATH=/workspace/go1.12/global
export GOROOT=/workspace/go1.12_root
export PATH=$GOROOT/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.6/site-packages/torch/lib:$LD_LIBRARY_PATH
ln -s /usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /opt/conda/lib/python3.6/site-packages/torch/lib/libnvidia-ml.so.1
export CGO_LDFLAGS="${CGO_LDFLAGS} -L /opt/conda/lib/python3.6/site-packages/torch/lib"
export CGO_CFLAGS="${CGO_CFLAGS} -I /opt/conda/lib/python3.6/site-packages/torch/include -I /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include"
export CGO_CXXFLAGS="${CGO_CXXFLAGS} -I /opt/conda/lib/python3.6/site-packages/torch/include -I /opt/conda/lib/python3.6/site-packages/torch/include/torch/csrc/api/include"
export PATH=$PATH:$(go env GOPATH)/bin
export GODEBUG=cgocheck=0
Build the Pytorch agent with GPU enabled
cd $GOPATH/src/github.com/rai-project/pytorch/pytorch-agent
go build
Build the Pytorch agent without GPU or libjpeg-turbo
cd $GOPATH/src/github.com/rai-project/pytorch/pytorch-agent
go build -tags="nogpu nolibjpeg"
Use the Agent with the MLModelScope Web UI
./pytorch-agent serve -l -d -v
Refer to [TODO] to run the web UI to interact with the agent.
Run ./pytorch-agent -h to list the available commands.
Run ./pytorch-agent info models to list the available models.
Run ./pytorch-agent predict to evaluate a model. This runs the default evaluation. ./pytorch-agent predict -h shows the available flags you can set.
An example run is
./pytorch-agent predict urls --model_name TorchVision_Alexnet --profile=false --publish=false
We have pre-built docker images on Dockerhub. The images are carml/pytorch-agent:amd64-cpu-latest, carml/pytorch-agent:amd64-gpu-latest, and carml/pytorch-agent:amd64-gpu-ngc-latest. The entrypoint is set to pytorch-agent, so these images behave like the command line above.
An example run is
docker run --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --privileged=true \
--network host \
-v ~/.carml_config.yml:/root/.carml_config.yml \
-v ~/results:/go/src/github.com/rai-project/pytorch/results \
carml/pytorch-agent:amd64-gpu-latest predict urls --model_name TorchVision_Alexnet --profile=false --publish=false
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be insufficient for PyTorch. NVIDIA recommends the use of the following flags:
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
NOTE: To run with GPU, you need to meet the following requirements:
- Docker >= 19.03 with nvidia-container-toolkit (otherwise, use nvidia-docker)
- CUDA >= 10.1 (10.2 for NGC)
- NVIDIA Driver >= 418.39 (440.33 for NGC)
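The minimum versions above can be checked from a script. The following is a minimal sketch, assuming a sort that supports version ordering with -V (GNU coreutils and recent BSD sort do). The helper name version_ge is hypothetical.

```shell
# version_ge HAVE NEED: succeed if dotted version HAVE is >= NEED.
# Sorts the two versions with `sort -V` and checks that NEED comes first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Possible use with the thresholds listed above:
#   docker_version=$(docker version --format '{{.Server.Version}}')
#   version_ge "$docker_version" 19.03 || echo "Docker is older than 19.03"
#   driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$driver_version" 418.39 || echo "NVIDIA driver is older than 418.39"
```

For the NGC image, compare against 440.33 for the driver instead.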