
bert.cpp

BertForSequenceClassification inference in C/C++, binding Rust, with no dependencies.

The motivation of this project is to speed up BERT inference when deploying on CPU, compared with PyTorch in Python, while also supporting C++ projects.

Get Started

Here's a blog post (written in Chinese) describing a real case of using this project to optimize inference speed.

For Python

  • Make sure Rust is installed (curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh, then check with rustc -V)
  • Convert the PyTorch checkpoint file and configs to a ggml file:
cd scripts/
python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  • Make sure tokenizer.json exists; otherwise generate it (a sketch of what this involves follows after this list):
cd scripts/
python generate_tokenizer_json.py ${dir_hf_model}
  • Build the dynamic library (libbert_shared.so):
git submodule update --init --recursive
mkdir build
cd build/
cmake ..
make
  • Refer to examples/sample_dylib.py and replace your PyTorch inference with it; a hypothetical ctypes sketch is shown below.
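On the tokenizer.json step: generate_tokenizer_json.py produces the file from a Hugging Face checkpoint; conceptually this amounts to exporting the fast (Rust-backed) tokenizer, roughly as sketched below (the script's actual logic may differ):

from transformers import AutoTokenizer

# load the fast (Rust-backed) tokenizer and export its tokenizer.json
tok = AutoTokenizer.from_pretrained("/path/to/hf_model", use_fast=True)
tok.backend_tokenizer.save("/path/to/hf_model/tokenizer.json")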
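The exported C API is demonstrated in examples/sample_dylib.py; the symbol names and signatures below are assumptions for illustration only. A minimal ctypes sketch:

import ctypes

# load the library built in the previous step
lib = ctypes.CDLL("./build/libbert_shared.so")

# hypothetical symbols -- check examples/sample_dylib.py for the real interface
lib.bert_load.argtypes = [ctypes.c_char_p, ctypes.c_int]
lib.bert_load.restype = ctypes.c_void_p
lib.bert_classify.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.bert_classify.restype = ctypes.c_int
lib.bert_free.argtypes = [ctypes.c_void_p]

ctx = lib.bert_load(b"/path/to/ggml_model.bin", 4)  # model path, n_thread
label_id = lib.bert_classify(ctx, "some input sentence".encode("utf-8"))
print("predicted class id:", label_id)
lib.bert_free(ctx)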

For C++

  • Make sure Rust is installed (curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh, then check with rustc -V)
  • Convert the PyTorch checkpoint file and configs to a ggml file:
cd scripts/
python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  • Make sure tokenizer.json exists; otherwise generate it:
cd scripts/
python generate_tokenizer_json.py ${dir_hf_model}
  • Add this project as a submodule, include it via add_subdirectory in your CMake project, and enable C++17 support; you can then link the library, as sketched below.
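For example, a consumer CMakeLists.txt could look like the following sketch; the linked target name bert_shared is an assumption, so check this project's CMakeLists.txt for the actual exported target:

cmake_minimum_required(VERSION 3.14)
project(my_app CXX)

set(CMAKE_CXX_STANDARD 17)              # this project needs C++17
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# bert.cpp added as a git submodule under third_party/
add_subdirectory(third_party/bert.cpp)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE bert_shared)   # hypothetical target name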

Performance

Tokenizer performance

Tokenizer type                          Cost time
transformers.BertTokenizer (Python)     734 ms
tokenizers-cpp (binding Rust)           3 ms

Single sentence inference performance

Type                                           Cost time
Python (w/o model loading)                     248 ms
C++ & Rust (w/o model loading, n_thread=4)     2 ms
Python (w/ model loading)                      1104 ms
C++ & Rust (w/ model loading, n_thread=4)      19 ms

Batch inference performance

Type                                       Cost time
Python (batch_size=50)                     260 ms
C++ & Rust (batch_size=50, n_thread=8)     23 ms

Note that ggml performance degrades as sentence length increases.

Python inference through the C++ dynamic library

Type                                                Cost time
Python & C++ & Rust (batch_size=50, n_thread=8)     26 ms

Future Work

  • Use broadcasting instead of ggml_repeat (WIP); see the sketch below.
  • Update the ggml file format to GGUF.
  • Implement a Python binding instead of the dynamic library.
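To illustrate the first item: newer ggml revisions let ggml_add broadcast a smaller tensor over a larger one, avoiding the intermediate copy that ggml_repeat materializes. A minimal sketch, assuming a recent ggml:

#include "ggml.h"

int main(void) {
    // small scratch context just for building the example ops
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // activations: [hidden = 768, tokens = 128]; bias: [hidden = 768, 1]
    struct ggml_tensor * x    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 128);
    struct ggml_tensor * bias = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 1);

    // current approach: materialize a repeated copy of bias, then add
    struct ggml_tensor * y_repeat = ggml_add(ctx, x, ggml_repeat(ctx, bias, x));

    // planned approach: let ggml_add broadcast bias across the token dimension
    struct ggml_tensor * y_bcast = ggml_add(ctx, x, bias);

    (void) y_repeat; (void) y_bcast;
    ggml_free(ctx);
    return 0;
}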

Acknowledgements

Thanks to the projects we rely on or refer to.
