We can process the dataset for pre-training step 1 using the datatools scripts, which produce pre-training data of the specified sequence length.
```bash
bash pretrain01-wo-score.sh
```
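The core of producing data "of the specified length" is packing tokenized documents into fixed-length training sequences. The sketch below illustrates that idea only; the function name, the toy token lists, and the drop-the-remainder policy are assumptions, not the actual behavior of the datatools scripts.

```python
# Sketch: pack tokenized documents into fixed-length chunks for
# pre-training. Real pipelines tokenize text and stream files; here we
# use small integer lists so the packing logic is easy to follow.
def pack_to_length(docs, seq_len):
    """Concatenate token lists and split them into chunks of seq_len tokens."""
    buffer = []
    chunks = []
    for tokens in docs:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            chunks.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    # The trailing partial buffer is dropped in this sketch; a real
    # script might pad it or carry it into the next shard instead.
    return chunks

docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(pack_to_length(docs, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing across document boundaries like this avoids wasting tokens on padding, at the cost of occasionally splitting a document between two training sequences.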
For the labeled data in pre-training step 2, we use new scripts that retain the required scores during encoding and packing.
```bash
bash pretrain02-w-score.sh
```
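To illustrate what "retaining scores during encoding" might look like, here is a minimal sketch that carries a quality score alongside the encoded tokens. The field names, the stand-in tokenizer, and the output schema are all assumptions for illustration, not the actual format used by pretrain02-w-score.sh.

```python
import json

# Sketch: encode one labeled record while keeping its quality score,
# assuming each input line is a JSON object with "text" and "score"
# fields (assumed names, not the scripts' real schema).
def encode_with_score(line, tokenize):
    record = json.loads(line)
    return {
        "input_ids": tokenize(record["text"]),  # encoded text
        "score": record["score"],               # score kept for later filtering
    }

# Stand-in tokenizer: one token per character, for demonstration only.
toy_tokenize = lambda text: [ord(c) for c in text]

out = encode_with_score('{"text": "ab", "score": 0.9}', toy_tokenize)
print(out)  # {'input_ids': [97, 98], 'score': 0.9}
```

Keeping the score next to the encoded sample is what makes the later score-based filtering step possible without re-reading the raw text.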
If further data filtering is needed, we provide a script for filtering by score (to select the highest-quality data) and a script for filtering by text length (to ensure the data is long enough for adapting the model's long-text capabilities).
```bash
bash pretrain03-filter-by-score.sh
bash pretrain04-filter-by-length.sh
```
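The two filters can be sketched as simple predicates over records. The thresholds and field names below are illustrative assumptions; the provided scripts define the actual criteria.

```python
# Sketch of the two filtering steps: keep records whose score clears a
# threshold, or whose text is long enough. Both thresholds are made up.
def filter_by_score(records, min_score=0.5):
    """Keep only records at or above the quality-score threshold."""
    return [r for r in records if r.get("score", 0.0) >= min_score]

def filter_by_length(records, min_chars=10):
    """Keep only records whose text is long enough for long-context training."""
    return [r for r in records if len(r.get("text", "")) >= min_chars]

data = [
    {"text": "short", "score": 0.9},
    {"text": "a much longer passage of text", "score": 0.2},
]
print(len(filter_by_score(data)))   # 1 (only the high-score record)
print(len(filter_by_length(data)))  # 1 (only the long record)
```

Note that the two filters select different subsets here, which is why they are provided as separate scripts: quality and length are independent criteria.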
We also provide the code for using FastText to perform quality annotation on large-scale pre-training text data. Our classification model focuses on four aspects: Safety, Reasoning, Quality, and Knowledge.
```bash
# train
python scoring01-train.py
# test
python scoring02-test.py
# batch inference
python scoring03-batch_inference.py --data_folder input/data/path --output_folder output/data/path
```
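For context on the training step, fastText's supervised mode expects one example per line with the label prefixed by `__label__` (fastText's default label prefix). The sketch below only writes a file in that format; the label names and example texts are made up, and the real training data comes from whatever scoring01-train.py is pointed at.

```python
import os
import tempfile

# Sketch: write a tiny training file in fastText's supervised format.
# Labels use the default "__label__" prefix; these label names are
# illustrative, not the four aspects' actual label strings.
lines = [
    "__label__high_quality A clear, well-sourced explanation of photosynthesis.",
    "__label__low_quality free prizes click now !!!",
]

path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# Every training line should start with the label prefix.
print(all(line.startswith("__label__") for line in lines))  # True
```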
When you use `scoring03-batch_inference.py`, please ensure there is a readable JSONL file under the path `input/data/path`. Each piece of data should contain a `"text"` field, which will be used for inference.
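A minimal sketch of preparing such an input file follows: one JSON object per line, each with a `"text"` field. The file location and sample texts are placeholders; in practice the file goes under the folder passed to `--data_folder`.

```python
import json
import os
import tempfile

# Sketch: write a minimal JSONL file of the shape the batch-inference
# script expects — one JSON object per line, each with a "text" field.
samples = ["First document to score.", "Second document to score."]

path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}) + "\n")

# Read it back to confirm every line parses and has the required field.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(all("text" in r for r in rows))  # True
```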