
parent-issue-for-generative-ai-model #193

@david-thrower

Description


A comprehensive list of the issues we need to solve to develop a generative loop from our text classification model.

Don't create a branch or merge from this. The individual sub-issues will be used to create issues specific to each deliverable.

Problems to solve (Milestone 1):

  • Persistence of models that is not adversely affected by custom components (e.g., the iRoPE embedding) ... This, along with the broader talent base familiar with PyTorch, leads us to want to transpile this to PyTorch.
  • Translation to PyTorch:
    • Main issue that complicates this: in cerebros, the layers are instantiated sequentially as part of the NAS class instance, and a tf.keras.Model() is composed after the last layer is instantiated, using the Keras functional API:
    • In PyTorch, the layers of a model must be instantiated within a torch.nn.Module object; the graph connectivity is then established within model.forward().
      • I am grappling with how to make this work.

In units/units.py line 458:

            self.neural_network_layer =\
                tf.keras.layers.Dense(
                    self.n_neurons,
                    self.activation,
                    name=f"{self.name}_dns_{rn_5}")(merged_neural_network_layer_input)

THEN

In neuralnetworkfuture/neural_network_future.py line 316:

            self.materialized_neural_network =\
                tf.keras.Model(inputs=materialized_neural_network_inputs,
                               outputs=materialized_neural_network_outputs,

... This issue is under construction.

Milestone 2: Training loop

  • Write an ETL pipeline that prepares text data and feeds it through the model in an auto-regressive manner (see the sketch at the end of this list):

Takes one batch of text samples (a 1-D list of strings): list[str]

[
"Jesus is the way.",
"Jesus is the truth.",
"Jesus is the life",
"And that is the way it is."
]

Step 1: Tokenizes it with padding to max_sequence_length (returns a 2-D list of int) -> list[list[int]]:

[
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22]  # 22 is the padding token
 , ... 3 more rows like this
]

Step 2: Expands each sample into n-1 pairs of sample and label, where n is the number of tokens in that sample:

[
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22]  <-----<<< This single row becomes what you see below
 , ...
]

... becomes

[
# First sample: only the token for the first word, then padding tokens
 [95, 22, 22, 22, 22, 22, 22, 22, 22, ..., 22],  # 82 is the label for this sample
# Then
 [95, 82, 22, 22, 22, 22, 22, 22, 22, ..., 22],  # 23 is the label for this sample
# Then
 [95, 82, 23, 22, 22, 22, 22, 22, 22, ..., 22],  # 72 is the label for this sample
# Then
 [95, 82, 23, 72, 22, 22, 22, 22, 22, ..., 22],  # 21 is the label for this sample
# Then
 [95, 82, 23, 72, 21, 22, 22, 22, 22, ..., 22],  # 22 is the label for this sample and happens to also be the padding token
# Because the padding token is the label for this sample, we stop expanding this one and move on to the next text sample
, ...  # Apply the same process to the next original row
]
  • A training loop that takes a list of text samples, passes them through the aforementioned ETL pipeline, and feeds the result through a cerebros text classification model (the model from the final result of the text classification example phishing_email_detection_gpt2.py).
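
A minimal sketch of the ETL step described above, assuming a right-padded fixed-length representation. The names tokenize, PAD_TOKEN, and MAX_SEQUENCE_LENGTH are hypothetical placeholders for whatever tokenizer and constants we settle on:

    from typing import Callable

    PAD_TOKEN = 22            # hypothetical padding token id, matching the example above
    MAX_SEQUENCE_LENGTH = 40  # hypothetical maximum sequence length

    def pad(tokens: list[int]) -> list[int]:
        """Right-pad a token sequence to MAX_SEQUENCE_LENGTH."""
        return tokens + [PAD_TOKEN] * (MAX_SEQUENCE_LENGTH - len(tokens))

    def expand_autoregressive(
            texts: list[str],
            tokenize: Callable[[str], list[int]],
    ) -> tuple[list[list[int]], list[int]]:
        """Turn a batch of raw strings into (padded prefix, next-token label) pairs."""
        samples: list[list[int]] = []
        labels: list[int] = []
        for text in texts:
            tokens = tokenize(text)[:MAX_SEQUENCE_LENGTH]
            # One pair per prefix: the label is the token that follows the prefix.
            for i in range(1, len(tokens)):
                samples.append(pad(tokens[:i]))
                labels.append(tokens[i])
            # Final pair: the label is the padding token, which marks end of sequence.
            if len(tokens) < MAX_SEQUENCE_LENGTH:
                samples.append(pad(tokens))
                labels.append(PAD_TOKEN)
        return samples, labels

For the four-sentence batch above, this yields one (prefix, label) pair per token position, stopping for each sentence once the label would be the padding token.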

Milestone 3: Inference stage fundamentals

  • Be able to take a text prompt, tokenize it, pad it, then:
    • pass it through the model,
    • replace the first padding token with the token sampled from the output logits,
    • repeat the indented steps until the sampled token is the padding token (or max_sequence_length is reached)

Milestone 4: Sampling at inference (co-requisite of milestone 3)

  • Implement temperature sampling
  • Implement top_p sampling
  • Implement top_k sampling
  • Implement presence penalty
  • Implement frequency penalty
  • Implement repetition_penalty

Milestone 5: Hugging Face Transformers integration

  • Be able to wrap our generative model as a Hugging Face causal LM (a subclass of transformers.PreTrainedModel usable with the causal LM Auto classes)
  • Be able to push it to the Hugging Face Hub
  • Be able to pull the model using AutoModelForCausalLM.from_pretrained() and reproduce model validation.

Improved write-up

Plan: Evolving Our Text Classifier into a Generative Model

This document outlines the issues we need to solve in order to translate our existing text classification model into a generative language model (LM).

👉 Note: Do not create a branch or merge from this document. Each sub-issue will be tracked as its own deliverable ticket.


Milestone 1: Model Persistence & Framework Migration

Problem: Saving and Reloading Models with Custom Components

  • We currently have custom components such as iRoPE embeddings.
  • These create persistence problems (difficulties saving/loading models using Keras load_model() functionality).
  • To broaden usability and the available talent pool, we want to migrate from TensorFlow/Keras (cerebros) to PyTorch ... unless we can find a way to make this work in Keras and can convert the result to ONNX.

Problem: Translation from TensorFlow to PyTorch

  • In TensorFlow:

    • Layers are defined sequentially inside a NAS (Neural Architecture Search) class.
    • At the end, a final tf.keras.Model() object is created using the Functional API.
    • Example:
      # Units class
      self.neural_network_layer = tf.keras.layers.Dense(...)(merged_input)
      
      # Later in the pipeline
      self.materialized_neural_network = tf.keras.Model(
          inputs=materialized_inputs,
          outputs=materialized_outputs,
      )
  • In PyTorch:

    • All layers must be defined inside the __init__() of a torch.nn.Module class.
    • Connectivity between layers is implemented inside .forward().

Challenge: How to restructure our model so it cleanly fits the PyTorch pattern, while still supporting flexibility (since layers used to be dynamically composed via functional API in Keras).
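
One possible restructuring, sketched below: have the NAS search emit a declarative graph spec (instantiated layers plus a connectivity map), register the layers in a torch.nn.ModuleDict inside __init__(), and resolve the connectivity in forward(). The names here (DynamicGraphModule, layer_specs, edges) are hypothetical, not existing cerebros code, and the sketch assumes the layers arrive in topological order, which the sequential NAS instantiation already guarantees.

    import torch
    import torch.nn as nn

    class DynamicGraphModule(nn.Module):
        """Builds a PyTorch model from a graph spec produced by the NAS search.

        layer_specs maps a layer name to an instantiated nn.Module (in topological
        order); edges maps a layer name to the upstream input/layer names whose
        outputs are merged and fed into it, mirroring the Keras functional-API
        merge-then-Dense pattern.
        """

        def __init__(self,
                     layer_specs: dict[str, nn.Module],
                     edges: dict[str, list[str]],
                     output_names: list[str]):
            super().__init__()
            # ModuleDict registers every dynamically created layer so its parameters
            # are tracked, saved, and moved to the right device automatically.
            self.layers = nn.ModuleDict(layer_specs)
            self.edges = edges
            self.output_names = output_names

        def forward(self, inputs: dict[str, torch.Tensor]) -> list[torch.Tensor]:
            # Graph connectivity is resolved here, at call time, by walking the spec.
            activations = dict(inputs)
            for name, module in self.layers.items():
                upstream = [activations[u] for u in self.edges[name]]
                merged = torch.cat(upstream, dim=-1) if len(upstream) > 1 else upstream[0]
                activations[name] = module(merged)
            return [activations[n] for n in self.output_names]

This keeps the NAS free to compose layers dynamically while still giving PyTorch a single nn.Module to persist and export.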


Milestone 2: Training Pipeline

We need a training loop that turns text into (input → label) pairs suitable for auto-regressive training.

✅ Step 1: ETL Pipeline (Tokenization + Padding)

  • Input: a batch of text samples (list of str)

    [
      "Jesus is the way.",
      "Jesus is the truth.",
      "Jesus is the life",
      "And that is the way it is."
    ]
  • Output: tokenized sequences with padding (list of list[int])

    [
      [95, 82, 23, 72, 21, 22, 22, 22, ...],  # 22 = padding token
      ...
    ]
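
A sketch of Step 1, assuming we standardize on the GPT-2 tokenizer from transformers (consistent with phishing_email_detection_gpt2.py). GPT-2 ships without a padding token, so one has to be assigned; the real padding id will be whatever the tokenizer uses, not the illustrative 22 above.

    from transformers import GPT2TokenizerFast

    MAX_SEQUENCE_LENGTH = 40  # hypothetical; use whatever length the model is built for

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as padding, a common convention

    batch = [
        "Jesus is the way.",
        "Jesus is the truth.",
        "Jesus is the life",
        "And that is the way it is.",
    ]

    encoded = tokenizer(
        batch,
        padding="max_length",
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
    )
    token_ids: list[list[int]] = encoded["input_ids"]  # one padded row of ints per sample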

✅ Step 2: Create Auto-Regressive Training Pairs

  • Each sample is expanded so that the model learns to predict the next token given previous tokens.
  • Example (for 1 input sentence):
    Input:  [95, 22, 22, ...], Label: 82
    Input:  [95, 82, 22, ...], Label: 23
    Input:  [95, 82, 23, 22, ...], Label: 72
    Input:  [95, 82, 23, 72, 22, ...], Label: 21
    ...
    
  • Expansion of a sample stops once the label would be the padding token (22).

❌ Next Step: Training Loop (to be developed)

  • Input: batch of text
  • Runs through ETL pipeline
  • Feeds autoregressive inputs into our classification model (currently phishing_email_detection_gpt2.py).
  • Updates weights based on loss (typically cross-entropy against the next-token label).
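
A minimal PyTorch sketch of that loop, assuming the Milestone 1 migration has produced a model that maps a padded token sequence to next-token logits. Here model, tokenize, and the expand_autoregressive helper from the earlier sketch are placeholders, not existing code:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_epoch(model: nn.Module,
                    texts: list[str],
                    tokenize,
                    optimizer: torch.optim.Optimizer,
                    batch_size: int = 32,
                    device: str = "cpu") -> float:
        """One pass over the text corpus with next-token cross-entropy loss."""
        # ETL: raw strings -> (padded prefix, next-token label) pairs (Milestone 2).
        samples, labels = expand_autoregressive(texts, tokenize)
        dataset = TensorDataset(torch.tensor(samples), torch.tensor(labels))
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        criterion = nn.CrossEntropyLoss()
        model.train()
        total_loss = 0.0
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            logits = model(inputs)             # expected shape: (batch, vocab_size)
            loss = criterion(logits, targets)  # cross-entropy against the next-token label
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        return total_loss / len(loader)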

Milestone 3: Inference – Generating Text

We need a way to run the model in a step-by-step loop:

  1. Take a text prompt → tokenize + pad.
  2. Pass through the model → get predicted logits for next token.
  3. Sample a token → insert it into the sequence at the position of the first padded token.
  4. Repeat until either:
    • the model outputs the padding token (end of sequence), or
    • we hit max_sequence_length.
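
A greedy sketch of that loop; the argmax is where the Milestone 4 sampling strategies would plug in. The model interface, the tokenize/detokenize helpers, and the constants are the same hypothetical placeholders as in the earlier sketches.

    import torch

    @torch.no_grad()
    def generate(model, prompt: str, tokenize, detokenize,
                 pad_token: int = PAD_TOKEN,
                 max_length: int = MAX_SEQUENCE_LENGTH) -> str:
        """Fill the padding positions one token at a time until PAD or max length."""
        tokens = tokenize(prompt)[:max_length]
        while len(tokens) < max_length:
            padded = tokens + [pad_token] * (max_length - len(tokens))
            logits = model(torch.tensor([padded]))[0]      # (vocab_size,) for the next position
            next_token = int(torch.argmax(logits).item())  # greedy; replace with sampling later
            if next_token == pad_token:
                break                                      # model signalled end of sequence
            tokens.append(next_token)                      # occupies the first padding slot
        return detokenize(tokens)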

Milestone 4: Sampling Strategies

To control creativity vs. stability during text generation, we need to implement the following:

  • Temperature Sampling → scale logits before sampling (controls randomness).
  • Top-k Sampling → only keep the top-k probable tokens at each step.
  • Top-p (Nucleus) Sampling → only keep the smallest set of tokens whose probabilities sum ≥ p.
  • Presence Penalty → discourages repeating tokens already present.
  • Frequency Penalty → lowers probability of tokens proportional to how often they’ve already occurred.
  • Repetition Penalty → down-weights tokens that have already been generated, discouraging repeated spans of text.
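
A sketch of how these controls could compose at the logit level, operating on a 1-D logits tensor for the next position. The function name and defaults are illustrative, not an existing API:

    import torch

    def sample_next_token(logits: torch.Tensor,
                          generated: list[int],
                          temperature: float = 1.0,
                          top_k: int = 0,
                          top_p: float = 1.0,
                          presence_penalty: float = 0.0,
                          frequency_penalty: float = 0.0) -> int:
        """Apply penalties and temperature, restrict to top-k / top-p, then sample."""
        logits = logits.clone().float()

        # Presence / frequency penalties: subtract a fixed and a count-scaled amount
        # from every token that has already been generated.
        if generated:
            counts = torch.bincount(torch.tensor(generated), minlength=logits.numel())
            logits -= presence_penalty * (counts > 0).float()
            logits -= frequency_penalty * counts.float()

        # Temperature: higher -> flatter distribution, more randomness.
        logits = logits / max(temperature, 1e-8)

        # Top-k: keep only the k highest-scoring tokens.
        if top_k > 0:
            kth_best = torch.topk(logits, top_k).values[-1]
            logits[logits < kth_best] = float("-inf")

        # Top-p (nucleus): keep the smallest set of tokens whose probabilities sum to >= p.
        if top_p < 1.0:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
            cutoff = cumulative > top_p
            cutoff[1:] = cutoff[:-1].clone()  # always keep the first token past the threshold
            cutoff[0] = False
            logits[sorted_idx[cutoff]] = float("-inf")

        probs = torch.softmax(logits, dim=-1)
        return int(torch.multinomial(probs, num_samples=1).item())

A CTRL-style repetition_penalty (scaling the logits of already-generated tokens by a constant factor) would slot in next to the presence and frequency penalties.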

Milestone 5: Hugging Face Transformers Integration

To make the model production-ready and interoperable:

  • Wrap our PyTorch-based LM as a Hugging Face causal LM (e.g., a transformers.PreTrainedModel subclass that the causal LM Auto classes can load).
  • Implement from_pretrained()/save_pretrained() compatibility.
  • Push model to Hugging Face Hub.
  • Pull it using AutoModelForCausalLM.from_pretrained() and validate outputs match expectations.
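
A sketch of the intended round trip, using real transformers entry points but a hypothetical Hub repo id. It assumes our model class subclasses PreTrainedModel (so the save/push helpers exist) and that model and tokenizer come from the earlier milestones:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    REPO_ID = "cerebros/generative-lm"  # hypothetical Hub repo id

    # Push the trained model and tokenizer to the Hub. For a custom architecture the
    # model code also has to be registered and pushed with it (e.g. via
    # register_for_auto_class) so that the Auto classes can resolve it.
    model.push_to_hub(REPO_ID)
    tokenizer.push_to_hub(REPO_ID)

    # Pull it back and re-run validation to confirm outputs match expectations.
    reloaded = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True)
    reloaded_tokenizer = AutoTokenizer.from_pretrained(REPO_ID)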

Summary

  • Milestone 1 solves persistence + PyTorch migration.
  • Milestone 2 builds the data pipeline and training loop.
  • Milestone 3 builds inference in a generative loop.
  • Milestone 4 adds advanced sampling controls.
  • Milestone 5 integrates with Hugging Face for broader use.
