We can process the dataset for pre-training step 1 using the datatools scripts, which produce pre-training data of the specified sequence length.
```bash
bash pretrain01-wo-score.sh
```
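The core of producing data "of the specified length" is packing tokenized documents into fixed-length training sequences. The sketch below illustrates that idea only; the function name, the toy token lists, and the drop-the-remainder policy are assumptions, not the actual behavior of the datatools scripts.

```python
# Sketch: pack tokenized documents into fixed-length chunks for
# pre-training. Real pipelines tokenize text and stream files; here we
# use small integer lists so the packing logic is easy to follow.
def pack_to_length(docs, seq_len):
    """Concatenate token lists and split them into chunks of seq_len tokens."""
    buffer = []
    chunks = []
    for tokens in docs:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            chunks.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    # The trailing partial buffer is dropped in this sketch; a real
    # script might pad it or carry it into the next shard instead.
    return chunks

docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(pack_to_length(docs, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing across document boundaries like this avoids wasting tokens on padding, at the cost of occasionally splitting a document between two training sequences.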
For the labeled data in pre-training step 2, we use new scripts that retain the required scores during encoding and packing.
```bash
bash pretrain02-w-score.sh
```
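To illustrate what "retaining scores during encoding" might look like, here is a minimal sketch that carries a quality score alongside the encoded tokens. The field names, the stand-in tokenizer, and the output schema are all assumptions for illustration, not the actual format used by pretrain02-w-score.sh.

```python
import json

# Sketch: encode one labeled record while keeping its quality score,
# assuming each input line is a JSON object with "text" and "score"
# fields (assumed names, not the scripts' real schema).
def encode_with_score(line, tokenize):
    record = json.loads(line)
    return {
        "input_ids": tokenize(record["text"]),  # encoded text
        "score": record["score"],               # score kept for later filtering
    }

# Stand-in tokenizer: one token per character, for demonstration only.
toy_tokenize = lambda text: [ord(c) for c in text]

out = encode_with_score('{"text": "ab", "score": 0.9}', toy_tokenize)
print(out)  # {'input_ids': [97, 98], 'score': 0.9}
```

Keeping the score next to the encoded sample is what makes the later score-based filtering step possible without re-reading the raw text.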
If further data filtering is needed, we provide a script for filtering by score (to select the highest-quality data) and a script for filtering by text length (to ensure the data is long enough for adapting the model's long-text capabilities).
```bash
bash pretrain03-filter-by-score.sh
bash pretrain04-filter-by-length.sh
```
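The two filters can be sketched as simple predicates over records. The thresholds and field names below are illustrative assumptions; the provided scripts define the actual criteria.

```python
# Sketch of the two filtering steps: keep records whose score clears a
# threshold, or whose text is long enough. Both thresholds are made up.
def filter_by_score(records, min_score=0.5):
    """Keep only records at or above the quality-score threshold."""
    return [r for r in records if r.get("score", 0.0) >= min_score]

def filter_by_length(records, min_chars=10):
    """Keep only records whose text is long enough for long-context training."""
    return [r for r in records if len(r.get("text", "")) >= min_chars]

data = [
    {"text": "short", "score": 0.9},
    {"text": "a much longer passage of text", "score": 0.2},
]
print(len(filter_by_score(data)))   # 1 (only the high-score record)
print(len(filter_by_length(data)))  # 1 (only the long record)
```

Note that the two filters select different subsets here, which is why they are provided as separate scripts: quality and length are independent criteria.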
We also provide the code for using FastText to perform quality annotation on large-scale pre-training text data. Our classification model focuses on four aspects: Safety, Reasoning, Quality, and Knowledge.
```bash
# train
python scoring01-train.py
# test
python scoring02-test.py
# batch inference
python scoring03-batch_inference.py --data_folder input/data/path --output_folder output/data/path
```
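For context on the training step, fastText's supervised mode expects one example per line with the label prefixed by `__label__` (fastText's default label prefix). The sketch below only writes a file in that format; the label names and example texts are made up, and the real training data comes from whatever scoring01-train.py is pointed at.

```python
import os
import tempfile

# Sketch: write a tiny training file in fastText's supervised format.
# Labels use the default "__label__" prefix; these label names are
# illustrative, not the four aspects' actual label strings.
lines = [
    "__label__high_quality A clear, well-sourced explanation of photosynthesis.",
    "__label__low_quality free prizes click now !!!",
]

path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# Every training line should start with the label prefix.
print(all(line.startswith("__label__") for line in lines))  # True
```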
When you use `scoring03-batch_inference.py`, please ensure there is a readable JSONL file under the path `input/data/path`. Each piece of data should contain a `"text"` field, which will be used for inference.
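A minimal sketch of preparing such an input file follows: one JSON object per line, each with a `"text"` field. The file location and sample texts are placeholders; in practice the file goes under the folder passed to `--data_folder`.

```python
import json
import os
import tempfile

# Sketch: write a minimal JSONL file of the shape the batch-inference
# script expects — one JSON object per line, each with a "text" field.
samples = ["First document to score.", "Second document to score."]

path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for text in samples:
        f.write(json.dumps({"text": text}) + "\n")

# Read it back to confirm every line parses and has the required field.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(all("text" in r for r in rows))  # True
```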