
bert.cpp

BertForSequenceClassification inference in C/C++, binding Rust, with no dependencies.

The motivation of this project is to speed up BERT inference when deploying on CPU, compared with PyTorch in Python, while also supporting C++ projects.

Get Started

Here's a blog post (written in Chinese) describing a real case of using this project to optimize inference speed.

For Python

  • Make sure Rust is installed (curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh, then check with rustc -V)
  • Convert the PyTorch checkpoint file and configs to a ggml file:
cd scripts/
python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  • Make sure tokenizer.json exists; otherwise generate it (a sketch of what this involves follows after this list):
cd scripts/
python generate_tokenizer_json.py ${dir_hf_model}
  • Build the dynamic library (libbert_shared.so):
git submodule update --init --recursive
mkdir build
cd build/
cmake ..
make
  • Refer to examples/sample_dylib.py and replace your PyTorch inference with it; a hypothetical ctypes sketch is shown below.
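On the tokenizer.json step: generate_tokenizer_json.py produces the file from a Hugging Face checkpoint; conceptually this amounts to exporting the fast (Rust-backed) tokenizer, roughly as sketched below (the script's actual logic may differ):

from transformers import AutoTokenizer

# load the fast (Rust-backed) tokenizer and export its tokenizer.json
tok = AutoTokenizer.from_pretrained("/path/to/hf_model", use_fast=True)
tok.backend_tokenizer.save("/path/to/hf_model/tokenizer.json")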
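The exported C API is demonstrated in examples/sample_dylib.py; the symbol names and signatures below are assumptions for illustration only. A minimal ctypes sketch:

import ctypes

# load the library built in the previous step
lib = ctypes.CDLL("./build/libbert_shared.so")

# hypothetical symbols -- check examples/sample_dylib.py for the real interface
lib.bert_load.argtypes = [ctypes.c_char_p, ctypes.c_int]
lib.bert_load.restype = ctypes.c_void_p
lib.bert_classify.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
lib.bert_classify.restype = ctypes.c_int
lib.bert_free.argtypes = [ctypes.c_void_p]

ctx = lib.bert_load(b"/path/to/ggml_model.bin", 4)  # model path, n_thread
label_id = lib.bert_classify(ctx, "some input sentence".encode("utf-8"))
print("predicted class id:", label_id)
lib.bert_free(ctx)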

For C++

  • Make sure Rust is installed (curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh, then check with rustc -V)
  • Convert the PyTorch checkpoint file and configs to a ggml file:
cd scripts/
python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  • Make sure tokenizer.json exists; otherwise generate it:
cd scripts/
python generate_tokenizer_json.py ${dir_hf_model}
  • Add this project as a submodule, include it via add_subdirectory in your CMake project, and enable C++17 support; you can then link the library, as sketched below.
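For example, a consumer CMakeLists.txt could look like the following sketch; the linked target name bert_shared is an assumption, so check this project's CMakeLists.txt for the actual exported target:

cmake_minimum_required(VERSION 3.14)
project(my_app CXX)

set(CMAKE_CXX_STANDARD 17)              # this project needs C++17
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# bert.cpp added as a git submodule under third_party/
add_subdirectory(third_party/bert.cpp)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE bert_shared)   # hypothetical target name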

Performance

Tokenizer performance

Tokenizer type                          Cost time
transformers.BertTokenizer (Python)     734 ms
tokenizers-cpp (binding Rust)           3 ms

Single sentence inference performance

Type                                           Cost time
Python (w/o model loading)                     248 ms
C++ & Rust (w/o model loading, n_thread=4)     2 ms
Python (w/ model loading)                      1104 ms
C++ & Rust (w/ model loading, n_thread=4)      19 ms

Batch inference performance

Type                                       Cost time
Python (batch_size=50)                     260 ms
C++ & Rust (batch_size=50, n_thread=8)     23 ms

Note that ggml performance degrades as sentence length increases.

Python inference through the C++ dynamic library

Type                                                Cost time
Python & C++ & Rust (batch_size=50, n_thread=8)     26 ms

Future Work

  • Use broadcasting instead of ggml_repeat (WIP); see the sketch below.
  • Update the ggml file format to GGUF.
  • Implement a Python binding instead of the dynamic library.
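To illustrate the first item: newer ggml revisions let ggml_add broadcast a smaller tensor over a larger one, avoiding the intermediate copy that ggml_repeat materializes. A minimal sketch, assuming a recent ggml:

#include "ggml.h"

int main(void) {
    // small scratch context just for building the example ops
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // activations: [hidden = 768, tokens = 128]; bias: [hidden = 768, 1]
    struct ggml_tensor * x    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 128);
    struct ggml_tensor * bias = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 1);

    // current approach: materialize a repeated copy of bias, then add
    struct ggml_tensor * y_repeat = ggml_add(ctx, x, ggml_repeat(ctx, bias, x));

    // planned approach: let ggml_add broadcast bias across the token dimension
    struct ggml_tensor * y_bcast = ggml_add(ctx, x, bias);

    (void) y_repeat; (void) y_bcast;
    ggml_free(ctx);
    return 0;
}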

Acknowledgements

Thanks to the projects we rely on or refer to.
