
parent-issue-for-generative-ai-model #193

@david-thrower

Description


A comprehensive list of the issues we need to solve to develop a generative loop from our text classification model.

Don't create a branch or merge from this. The individual sub-issues will be used to create issues specific to each deliverable.

Problems to solve (Milestone 1):

  • Persistence of models that is not adversely affected by custom components (e.g., the iRoPE embedding) ... This, along with the broader talent base familiar with PyTorch, leads us to want to transpile this to PyTorch.
  • Translation to PyTorch:
    • Main issue that complicates this: in cerebros, the layers are instantiated sequentially as part of the NAS class instance, and a tf.keras.Model() is composed after the last layer is instantiated, using the Keras functional API:
    • In PyTorch, the layers of a model must be instantiated within a torch.nn.Module object; the graph connectivity is then established within model.forward().
      • I am grappling with how to make this work.

In units/units.py line 458:

            self.neural_network_layer =\
                tf.keras.layers.Dense(
                    self.n_neurons,
                    self.activation,
                    name=f"{self.name}_dns_{rn_5}")(merged_neural_network_layer_input)

THEN

In neuralnetworkfuture/neural_network_future.py line 316:

            self.materialized_neural_network =\
                tf.keras.Model(inputs=materialized_neural_network_inputs,
                               outputs=materialized_neural_network_outputs,

... This issue is under construction.

Milestone 2: Training loop

  • Write an ETL pipeline that prepares text data and feeds it through the model in an auto-regressive manner (see the sketch at the end of this list):

Takes one batch of text samples (a 1-D list of strings): list[str]

[
"Jesus is the way.",
"Jesus is the truth.",
"Jesus is the life",
"And that is the way it is."
]

Step 1: Tokenizes it with padding to max_sequence_length (returns a 2-D list of int) -> list[list[int]]:

[
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22]  # 22 is the padding token
 , ... 3 more rows like this
]

Step 2: Expands each sample into n-1 pairs of sample and label, where n is the number of tokens in that sample:

[
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22]  <-----<<< This single row becomes what you see below
 , ...
]

... becomes

[
# First sample: only the token for the first word, then padding tokens
 [95, 22, 22, 22, 22, 22, 22, 22, 22, ..., 22],  # 82 is the label for this sample
# Then
 [95, 82, 22, 22, 22, 22, 22, 22, 22, ..., 22],  # 23 is the label for this sample
# Then
 [95, 82, 23, 22, 22, 22, 22, 22, 22, ..., 22],  # 72 is the label for this sample
# Then
 [95, 82, 23, 72, 22, 22, 22, 22, 22, ..., 22],  # 21 is the label for this sample
# Then
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22],  # 22 is the label for this sample and happens to also be the padding token
# Because the padding token is the label for this sample, we stop expanding this one and move on to the next text sample
, ...  # Apply the same process to the next original row
]
  • A training loop that takes a list of text samples, passes them through the aforementioned ETL pipeline, and feeds the result through a cerebros text classification model (the model from the final result of the text classification example phishing_email_detection_gpt2.py).
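
A minimal sketch of the ETL step described above, assuming a right-padded fixed-length representation. The names tokenize, PAD_TOKEN, and MAX_SEQUENCE_LENGTH are hypothetical placeholders for whatever tokenizer and constants we settle on:

    from typing import Callable

    PAD_TOKEN = 22            # hypothetical padding token id, matching the example above
    MAX_SEQUENCE_LENGTH = 40  # hypothetical maximum sequence length

    def pad(tokens: list[int]) -> list[int]:
        """Right-pad a token sequence to MAX_SEQUENCE_LENGTH."""
        return tokens + [PAD_TOKEN] * (MAX_SEQUENCE_LENGTH - len(tokens))

    def expand_autoregressive(
            texts: list[str],
            tokenize: Callable[[str], list[int]],
    ) -> tuple[list[list[int]], list[int]]:
        """Turn a batch of raw strings into (padded prefix, next-token label) pairs."""
        samples: list[list[int]] = []
        labels: list[int] = []
        for text in texts:
            tokens = tokenize(text)[:MAX_SEQUENCE_LENGTH]
            # One pair per prefix: the label is the token that follows the prefix.
            for i in range(1, len(tokens)):
                samples.append(pad(tokens[:i]))
                labels.append(tokens[i])
            # Final pair: the label is the padding token, which marks end of sequence.
            if len(tokens) < MAX_SEQUENCE_LENGTH:
                samples.append(pad(tokens))
                labels.append(PAD_TOKEN)
        return samples, labels

For the four-sentence batch above, this yields one (prefix, label) pair per token position, stopping for each sentence once the label would be the padding token.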

Milestone 3: Inference stage fundamentals

  • Be able to take a text prompt, tokenize it, pad it, then:
    • pass it through the model,
    • replace the first padding token with the token sampled from the output logits,
    • repeat the indented steps until the sampled token is the padding token (or max_sequence_length is reached)

Milestone 4: Sampling at inference (co-requisite of milestone 3)

  • Implement temperature sampling
  • Implement top_p sampling
  • Implement top_k sampling
  • Implement presence penalty
  • Implement frequency penalty
  • Implement repetition_penalty

Milestone 5: Hugging Face Transformers integration

  • Be able to wrap our generative model as a Hugging Face causal LM (a subclass of transformers.PreTrainedModel usable with the causal LM Auto classes)
  • Be able to push it to the Hugging Face Hub
  • Be able to pull the model using AutoModelForCausalLM.from_pretrained() and reproduce model validation.

Improved write-up

Plan: Evolving Our Text Classifier into a Generative Model

This document outlines the issues we need to solve in order to translate our existing text classification model into a generative language model (LM).

👉 Note: Do not create a branch or merge from this document. Each sub-issue will be tracked as its own deliverable ticket.


Milestone 1: Model Persistence & Framework Migration

Problem: Saving and Reloading Models with Custom Components

  • We currently have custom components such as iRoPE embeddings.
  • These create persistence problems (difficulties saving/loading models using Keras load_model() functionality).
  • To broaden usability and the available talent pool, we want to migrate from TensorFlow/Keras (cerebros) to PyTorch ... unless we can find a way to make this work in Keras and can convert the result to ONNX.

Problem: Translation from TensorFlow to PyTorch

  • In TensorFlow:

    • Layers are defined sequentially inside a NAS (Neural Architecture Search) class.
    • At the end, a final tf.keras.Model() object is created using the Functional API.
    • Example:
      # Units class
      self.neural_network_layer = tf.keras.layers.Dense(...)(merged_input)
      
      # Later in the pipeline
      self.materialized_neural_network = tf.keras.Model(
          inputs=materialized_inputs,
          outputs=materialized_outputs,
      )
  • In PyTorch:

    • All layers must be defined inside the __init__() of a torch.nn.Module class.
    • Connectivity between layers is implemented inside .forward().

Challenge: How to restructure our model so it cleanly fits the PyTorch pattern, while still supporting flexibility (since layers used to be dynamically composed via functional API in Keras).
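
One possible restructuring, sketched below: have the NAS search emit a declarative graph spec (instantiated layers plus a connectivity map), register the layers in a torch.nn.ModuleDict inside __init__(), and resolve the connectivity in forward(). The names here (DynamicGraphModule, layer_specs, edges) are hypothetical, not existing cerebros code, and the sketch assumes the layers arrive in topological order, which the sequential NAS instantiation already guarantees.

    import torch
    import torch.nn as nn

    class DynamicGraphModule(nn.Module):
        """Builds a PyTorch model from a graph spec produced by the NAS search.

        layer_specs maps a layer name to an instantiated nn.Module (in topological
        order); edges maps a layer name to the upstream input/layer names whose
        outputs are merged and fed into it, mirroring the Keras functional-API
        merge-then-Dense pattern.
        """

        def __init__(self,
                     layer_specs: dict[str, nn.Module],
                     edges: dict[str, list[str]],
                     output_names: list[str]):
            super().__init__()
            # ModuleDict registers every dynamically created layer so its parameters
            # are tracked, saved, and moved to the right device automatically.
            self.layers = nn.ModuleDict(layer_specs)
            self.edges = edges
            self.output_names = output_names

        def forward(self, inputs: dict[str, torch.Tensor]) -> list[torch.Tensor]:
            # Graph connectivity is resolved here, at call time, by walking the spec.
            activations = dict(inputs)
            for name, module in self.layers.items():
                upstream = [activations[u] for u in self.edges[name]]
                merged = torch.cat(upstream, dim=-1) if len(upstream) > 1 else upstream[0]
                activations[name] = module(merged)
            return [activations[n] for n in self.output_names]

This keeps the NAS free to compose layers dynamically while still giving PyTorch a single nn.Module to persist and export.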


Milestone 2: Training Pipeline

We need a training loop that turns text into (input → label) pairs suitable for auto-regressive training.

✅ Step 1: ETL Pipeline (Tokenization + Padding)

  • Input: a batch of text samples (list of str)

    [
      "Jesus is the way.",
      "Jesus is the truth.",
      "Jesus is the life",
      "And that is the way it is."
    ]
  • Output: tokenized sequences with padding (list of list[int])

    [
      [95, 82, 23, 72, 21, 22, 22, 22, ...],  # 22 = padding token
      ...
    ]
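
A sketch of Step 1, assuming we standardize on the GPT-2 tokenizer from transformers (consistent with phishing_email_detection_gpt2.py). GPT-2 ships without a padding token, so one has to be assigned; the real padding id will be whatever the tokenizer uses, not the illustrative 22 above.

    from transformers import GPT2TokenizerFast

    MAX_SEQUENCE_LENGTH = 40  # hypothetical; use whatever length the model is built for

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as padding, a common convention

    batch = [
        "Jesus is the way.",
        "Jesus is the truth.",
        "Jesus is the life",
        "And that is the way it is.",
    ]

    encoded = tokenizer(
        batch,
        padding="max_length",
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
    )
    token_ids: list[list[int]] = encoded["input_ids"]  # one padded row of ints per sample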

✅ Step 2: Create Auto-Regressive Training Pairs

  • Each sample is expanded so that the model learns to predict the next token given previous tokens.
  • Example (for 1 input sentence):
    Input:  [95, 22, 22, ...], Label: 82
    Input:  [95, 82, 22, ...], Label: 23
    Input:  [95, 82, 23, 22, ...], Label: 72
    Input:  [95, 82, 23, 72, 22, ...], Label: 21
    ...
    
  • Expansion of a sample stops once the label would be the padding token (22).

❌ Next Step: Training Loop (to be developed)

  • Input: batch of text
  • Runs through ETL pipeline
  • Feeds autoregressive inputs into our classification model (currently phishing_email_detection_gpt2.py).
  • Updates weights based on loss (typically cross-entropy against the next-token label).
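
A minimal PyTorch sketch of that loop, assuming the Milestone 1 migration has produced a model that maps a padded token sequence to next-token logits. Here model, tokenize, and the expand_autoregressive helper from the earlier sketch are placeholders, not existing code:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_epoch(model: nn.Module,
                    texts: list[str],
                    tokenize,
                    optimizer: torch.optim.Optimizer,
                    batch_size: int = 32,
                    device: str = "cpu") -> float:
        """One pass over the text corpus with next-token cross-entropy loss."""
        # ETL: raw strings -> (padded prefix, next-token label) pairs (Milestone 2).
        samples, labels = expand_autoregressive(texts, tokenize)
        dataset = TensorDataset(torch.tensor(samples), torch.tensor(labels))
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        criterion = nn.CrossEntropyLoss()
        model.train()
        total_loss = 0.0
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)             # expected shape: (batch, vocab_size)
            loss = criterion(logits, targets)  # cross-entropy against the next-token label
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(loader)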

Milestone 3: Inference – Generating Text

We need a way to run the model in a step-by-step loop:

  1. Take a text prompt → tokenize + pad.
  2. Pass through the model → get predicted logits for next token.
  3. Sample a token → insert it into the sequence at the position of the first padded token.
  4. Repeat until either:
    • the model outputs the padding token (end of sequence), or
    • we hit max_sequence_length.
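
A greedy sketch of that loop; the argmax is where the Milestone 4 sampling strategies would plug in. The model interface, the tokenize/detokenize helpers, and the constants are the same hypothetical placeholders as in the earlier sketches.

    import torch

    @torch.no_grad()
    def generate(model, prompt: str, tokenize, detokenize,
                 pad_token: int = PAD_TOKEN,
                 max_length: int = MAX_SEQUENCE_LENGTH) -> str:
        """Fill the padding positions one token at a time until PAD or max length."""
        tokens = tokenize(prompt)[:max_length]
        while len(tokens) < max_length:
            padded = tokens + [pad_token] * (max_length - len(tokens))
            logits = model(torch.tensor([padded]))[0]      # (vocab_size,) for the next position
            next_token = int(torch.argmax(logits).item())  # greedy; replace with sampling later
            if next_token == pad_token:
                break                                      # model signalled end of sequence
            tokens.append(next_token)                      # occupies the first padding slot
        return detokenize(tokens)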

Milestone 4: Sampling Strategies

To control creativity vs. stability during text generation, we need to implement the following:

  • Temperature Sampling → scale logits before sampling (controls randomness).
  • Top-k Sampling → only keep the top-k probable tokens at each step.
  • Top-p (Nucleus) Sampling → only keep the smallest set of tokens whose probabilities sum ≥ p.
  • Presence Penalty → discourages repeating tokens already present.
  • Frequency Penalty → lowers probability of tokens proportional to how often they’ve already occurred.
  • Repetition Penalty → down-weights tokens that have already been generated, discouraging repeated spans of text.
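
A sketch of how these controls could compose at the logit level, operating on a 1-D logits tensor for the next position. The function name and defaults are illustrative, not an existing API:

    import torch

    def sample_next_token(logits: torch.Tensor,
                          generated: list[int],
                          temperature: float = 1.0,
                          top_k: int = 0,
                          top_p: float = 1.0,
                          presence_penalty: float = 0.0,
                          frequency_penalty: float = 0.0) -> int:
        """Apply penalties and temperature, restrict to top-k / top-p, then sample."""
        logits = logits.clone().float()

        # Presence / frequency penalties: subtract a fixed and a count-scaled amount
        # from every token that has already been generated.
        if generated:
            counts = torch.bincount(torch.tensor(generated), minlength=logits.numel())
            logits -= presence_penalty * (counts > 0).float()
            logits -= frequency_penalty * counts.float()

        # Temperature: higher -> flatter distribution, more randomness.
        logits = logits / max(temperature, 1e-8)

        # Top-k: keep only the k highest-scoring tokens.
        if top_k > 0:
            kth_best = torch.topk(logits, top_k).values[-1]
            logits[logits < kth_best] = float("-inf")

        # Top-p (nucleus): keep the smallest set of tokens whose probabilities sum to >= p.
        if top_p < 1.0:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
            cutoff = cumulative > top_p
            cutoff[1:] = cutoff[:-1].clone()  # always keep the first token past the threshold
            cutoff[0] = False
            logits[sorted_idx[cutoff]] = float("-inf")

        probs = torch.softmax(logits, dim=-1)
        return int(torch.multinomial(probs, num_samples=1).item())

A CTRL-style repetition_penalty (scaling the logits of already-generated tokens by a constant factor) would slot in next to the presence and frequency penalties.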

Milestone 5: Hugging Face Transformers Integration

To make the model production-ready and interoperable:

  • Wrap our PyTorch-based LM as a Hugging Face causal LM (e.g., a transformers.PreTrainedModel subclass that the causal LM Auto classes can load).
  • Implement from_pretrained()/save_pretrained() compatibility.
  • Push model to Hugging Face Hub.
  • Pull it using AutoModelForCausalLM.from_pretrained() and validate outputs match expectations.
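
A sketch of the intended round trip, using real transformers entry points but a hypothetical Hub repo id. It assumes our model class subclasses PreTrainedModel (so the save/push helpers exist) and that model and tokenizer come from the earlier milestones:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    REPO_ID = "cerebros/generative-lm"  # hypothetical Hub repo id

    # Push the trained model and tokenizer to the Hub. For a custom architecture the
    # model code also has to be registered and pushed with it (e.g. via
    # register_for_auto_class) so that the Auto classes can resolve it.
    model.push_to_hub(REPO_ID)
    tokenizer.push_to_hub(REPO_ID)

    # Pull it back and re-run validation to confirm outputs match expectations.
    reloaded = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True)
    reloaded_tokenizer = AutoTokenizer.from_pretrained(REPO_ID)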

Summary

  • Milestone 1 solves persistence + PyTorch migration.
  • Milestone 2 builds the data pipeline and training loop.
  • Milestone 3 builds inference in a generative loop.
  • Milestone 4 adds advanced sampling controls.
  • Milestone 5 integrates with Hugging Face for broader use.
