How can I correctly use a CPU to run inference on a quantized model? #1374

neavo · 2024-09-27T17:18:57Z

I made some attempts, such as:

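import torch
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig, pipeline

self.model = AutoModelForTokenClassification.from_pretrained(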
    "resource/kg_ner_gpu",
    device_map = "cpu",
    torch_dtype = torch.float16,
    quantization_config = BitsAndBytesConfig(
        load_in_4bit = True,
        load_in_8bit = False,
        bnb_4bit_quant_type = "nf4",
        bnb_4bit_compute_dtype = torch.float16,
    ),
    local_files_only = True,
    low_cpu_mem_usage = True,
)

self.classifier = pipeline(
    "token-classification",
    model = self.model,
    tokenizer = self.tokenizer,
    aggregation_strategy = "simple",
)

However, inference is very slow. Is there a correct code snippet I can use as an example?
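For comparison, here is a minimal sketch of a different technique that is designed for CPU execution: PyTorch dynamic int8 quantization. This is not bitsandbytes NF4, and it is not confirmed anywhere in this thread as the recommended fix. The small `nn.Sequential` below is a hypothetical stand-in for the token-classification model, chosen so the sketch runs without the `resource/kg_ner_gpu` checkpoint. It assumes the real model would be loaded in float32 without a `BitsAndBytesConfig`, since float16 kernels are often poorly optimized on CPUs. Any speedup depends on the hardware and the model, so it should be measured rather than assumed.

```python
import torch
from torch import nn

# Hypothetical stand-in for the real model. In practice this would be the
# float32 model returned by AutoModelForTokenClassification.from_pretrained(...)
# loaded with device_map="cpu" and no quantization_config.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 9),
).eval()

# Dynamic quantization: nn.Linear weights are converted to int8 once,
# and activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8,
)

# Run a forward pass on CPU without tracking gradients.
with torch.inference_mode():
    logits = quantized(torch.randn(4, 64))
```

The quantized module keeps the same call interface as the original one, so under that assumption it could be passed to `pipeline(...)` in place of `self.model`.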

Replies: 0 comments
