diff --git a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
index 2c4e05f29c..bda1aea38a 100644
--- a/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
+++ b/chapter_attention-mechanisms-and-transformers/large-pretraining-transformers.md
@@ -270,7 +270,7 @@ as its backbone :cite:`Radford.Narasimhan.Salimans.ea.2018`.
 Following the autoregressive language model training
 as described in :numref:`subsec_partitioning-seqs`,
 :numref:`fig_gpt-decoder-only` illustrates
-GPT pretraining with a Transformer encoder,
+GPT pretraining with a Transformer decoder,
 where the target sequence is the input sequence shifted by one token.
 Note that the attention pattern in the Transformer decoder enforces
 that each token can only attend to its past tokens