Whether you're coding in Python or simply running a single command in your terminal, NarrativeMapper gives you instant insight into the dominant stories behind the noise.
Ever wonder what stories are dominating Reddit, Twitter, or any corner of the internet? NarrativeMapper clusters similar online discussions and uses OpenAI’s GPT to summarize the dominant narratives, tone, and sentiment. Built for researchers, journalists, analysts, and anyone trying to make sense of the chaos.
Extracts dominant narratives from messy text data:
- Embeds messages
- Clusters embeddings
- Summarizes clusters with GPT
- Analyzes the sentiment of clusters
Packages it all together in clean functional, class-based, and command-line interfaces.
Click to view actual models and algorithms
- Embeddings: OpenAI's text-embedding-3-small
- Preprocessing: L2 normalization + PCA (these are only used if the UMAP + HDBSCAN similarity metric is euclidean, which is the default)
- Dimensionality reduction: UMAP
- Clustering: HDBSCAN
- Cluster merging: union-find algorithm
- Topic summary + sentiment extraction: OpenAI's Chat Completions API (model gpt-4o-mini) + Hugging Face's distilbert-base-uncased-finetuned-sst-2-english
Installation:
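Assuming the package is published on PyPI under the same name as the import (an assumption, not confirmed here), installation is the usual:
pip install narrative_mapper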
Setup:
Click to view setup process
- Create a .env file in your root directory (same folder where your script runs).
- Inside the .env file, add your OpenAI API key like this:
OPENAI_API_KEY=your-api-key-here
- Before importing narrative_mapper, make sure to load your .env like this:
from dotenv import load_dotenv
load_dotenv()
from narrative_mapper import *
(Make sure to keep your .env file private and add it to your .gitignore if you're using Git.)
Run NarrativeMapper directly from the terminal:
narrativemapper path/to/your.csv --flag-options
or
narrativemapper subreddit_name --reddit --other-flag-options
This will:
- Load the CSV
- Automatically embed, cluster, and summarize the comments (with pretty progress bars if using --verbose)
- Print the summarized narratives and sentiment to the terminal
- Output a formatted results file in the current directory (if using --file-output)
Output example from this dataset:
Run Timestamp: 2025-04-24 17:33:07
Online Group Name: comment_data_space
Summary: The conversation focuses on the challenges of obtaining detailed lunar imagery, doubts regarding the authenticity of moon rocks, and the optical phenomena observed during lunar observations.
Sentiment: NEUTRAL
Text Samples: 17
---
Summary: The cluster focuses on the reliability and safety challenges of space missions involving Boeing and NASA, while exploring the philosophical implications of humanity's existence and the potential for extraterrestrial life.
Sentiment: NEGATIVE
Text Samples: 257
---
Summary: The cluster focuses on astrophysical phenomena such as supernova formation, the characteristics of planets, and the interactions of celestial bodies with the Sun and other planets.
Sentiment: NEGATIVE
Text Samples: 23
---
Summary: The cluster focuses on the observation of celestial objects, specifically star clusters and galaxies, while exploring the intricacies of astrophotography and the vastness of cosmic structures.
Sentiment: NEUTRAL
Text Samples: 16
---
Summary: The conversation cluster centers on the high quality of a photo, doubts regarding its authenticity, and an interest in acquiring photography techniques.
Sentiment: NEUTRAL
Text Samples: 21
---
Summary: This cluster focuses on the emotional impact and awe of witnessing celestial events such as eclipses and auroras, as well as the logistical challenges involved in planning travel to experience these phenomena.
Sentiment: NEUTRAL
Text Samples: 47
---
Flag Options:
--verbose          Print/show detailed parameter scaling info and progress bars.
--cache            Cache embeddings and summary pkl files to the working directory.
--reddit           Full Reddit pipeline. Replace the file path with a subreddit name.
--load-embeddings  Use an embeddings pkl as the file path (must contain the mandatory cols). Skips earlier pipeline stages.
--load-summary     Use a summary pkl as the file path (must contain the mandatory cols). Skips earlier pipeline stages.
--file-output      Output summaries to a text file in the working directory.
--max-samples      Maximum number of text samples per cluster used in summarization. Default is 500.
--random-state     Sets the UMAP and PCA random state. Default is 42.
--no-pca           Skip PCA and go straight to UMAP.
--dim-pca          Change the PCA dimension. Default is 100.
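For example, a typical run that shows progress bars and writes a results file (the CSV name here is just a placeholder):
narrativemapper comments.csv --verbose --file-output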
Mandatory cols for pkl loading:
- --load-embeddings: requires 'text' and 'embeddings' cols.
- --load-summary: requires 'text', 'cluster', 'cluster_summary', and 'aggregated_sentiment' cols.
Note: Make sure you're running the CLI from the same directory where your .env file is located (unless you have set OPENAI_API_KEY globally in your environment).
Class-based usage:
from dotenv import load_dotenv
load_dotenv()
from narrative_mapper import *
import pandas as pd
file_df = pd.read_csv("file-path")
#initialize NarrativeMapper object
mapper = NarrativeMapper(file_df, "online_group_name", verbose=True)
#embeds semantic vectors
mapper.load_embeddings()
#clustering params are autocalculated based on dataset size (by default).
mapper.cluster() #Refer to Parameter Reference for optional kwargs if you want custom params
#summarizing with default parameters.
mapper.summarize() #Has max_sample_size param (refer to Parameter Reference)
#export in your preferred format
summary_dict = mapper.format_to_dict()
text_df = mapper.format_by_text()
cluster_df = mapper.format_by_cluster()
#saving DataFrames to csv
text_df.to_csv("by_texts_summary.csv", index=False)
cluster_df.to_csv("by_cluster_summary.csv", index=False)
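If you also want the dictionary output on disk, a minimal sketch using only the standard library (the filename is just an example):
import json

#summary_dict comes from mapper.format_to_dict() above
with open("narrative_summary.json", "w") as f:
    json.dump(summary_dict, f, indent=2)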
Functional usage:
from dotenv import load_dotenv
load_dotenv()
from narrative_mapper import *
import pandas as pd
file_df = pd.read_csv("file-path")
#manual control over each step:
embeddings = get_embeddings(file_df)
cluster_df = cluster_embeddings(embeddings)
summary_df = summarize_clusters(cluster_df)
#export/format options
summary_dict = format_to_dict(summary_df, online_group_name="online_group_name")
text_df = format_by_text(summary_df, online_group_name="online_group_name")
cluster_df = format_by_cluster(summary_df, online_group_name="online_group_name")
The following example outputs are based on this dataset.
The three formatter functions return the following:
format_to_dict() returns dict with following format:
Click to view
{
'online_group_name': 'r/antiwork',
'clusters': [
{
'cluster': 2,
'cluster_summary': 'The cluster focuses on the exploitation of workers under capitalism, highlighting the growing wealth disparity driven by corporate greed, the manipulation of housing markets, and the urgent need for systemic reforms to improve living conditions, wages, and labor rights.',
'sentiment': 'NEGATIVE',
'text_count': 483
},
{
'cluster': 4,
'cluster_summary': 'The conversation cluster centers on critiques of remote work policies, reflections on privilege and inequality, and humorous observations about daily frustrations and absurdities.',
'sentiment': 'NEGATIVE',
'text_count': 80
},
{
'cluster': 5,
'cluster_summary': 'This cluster highlights the frustrations and absurdities of modern job application processes, focusing on discriminatory hiring practices, excessive interview demands, and the dehumanizing effects of AI and psychometric testing on candidates.',
'sentiment': 'NEGATIVE',
'text_count': 76
},
{
'cluster': 7,
'cluster_summary': 'The conversation focuses on the low wages and poor treatment of fast food workers, emphasizing the urgent need for improved compensation and benefits in relation to living costs.',
'sentiment': 'NEGATIVE',
'text_count': 58
},
{
'cluster': 8,
'cluster_summary': 'The conversation cluster highlights pervasive issues of employee dissatisfaction stemming from wage theft, workplace exploitation, toxic environments, harassment, and inadequate labor rights, alongside the struggle for work-life balance and the necessity for legal recourse in employment disputes.',
'sentiment': 'NEGATIVE',
'text_count': 392
}
]
}
format_by_cluster() returns pandas DataFrame with columns:
Click to view
- online_group_name: online group name
- cluster: numeric cluster label
- cluster_summary: summary of the cluster
- text_count: number of sampled text messages in the cluster
- aggregated_sentiment: net sentiment, one of 'NEGATIVE', 'POSITIVE', or 'NEUTRAL'
- text: the list of text messages that are part of the cluster
- all_sentiments: a list containing dict items of the form {'label': 'NEGATIVE', 'score': 0.9896971583366394} for each message (sentiment calculated by distilbert-base-uncased-finetuned-sst-2-english).
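As a quick illustration (not part of the library, and assuming the all_sentiments column holds Python dicts as shown above), you can compute the share of NEGATIVE messages per cluster from this DataFrame:
#cluster_df comes from mapper.format_by_cluster() or format_by_cluster(summary_df)
cluster_df["negative_share"] = cluster_df["all_sentiments"].apply(
    lambda sents: sum(s["label"] == "NEGATIVE" for s in sents) / max(len(sents), 1)
)
print(cluster_df[["cluster", "negative_share", "text_count"]])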
format_by_text() returns pandas DataFrame with columns:
Click to view
- online_group_name: online group name
- cluster: numeric cluster label
- cluster_summary: summary of the cluster
- text: the sampled text message (this function returns all of them, row by row)
- sentiment: dict item holding the sentiment calculation, of the form {'label': 'NEGATIVE', 'score': 0.9896971583366394} (sentiment calculated by distilbert-base-uncased-finetuned-sst-2-english).
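For example (illustrative only, assuming text_df came from format_by_text() and the sentiment column holds dicts), you can pull out the confidently negative messages of a single cluster:
#messages from cluster 2 whose sentiment label is NEGATIVE with score > 0.9
negative_texts = text_df[
    (text_df["cluster"] == 2)
    & (text_df["sentiment"].apply(lambda s: s["label"] == "NEGATIVE" and s["score"] > 0.9))
]
print(negative_texts["text"].head())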
Pipeline:
CSV Text Data → Embeddings → Clustering → Summarization → Formatting
Parameter Reference:
Click to expand
- verbose: Print/show detailed parameter scaling info and progress bars.
- use_pca: Toggle whether to use PCA before UMAP (default is True, since it helps reduce RAM usage from UMAP).
- umap_kwargs: Allows customized input of all UMAP parameters.
- hdbscan_kwargs: Allows customized input of all HDBSCAN parameters.
- pca_kwargs: Allows customized input of all PCA parameters.
- max_sample_size: Maximum number of texts in each cluster used for summarization (limits OpenAI spending on gpt-4o-mini).
Default Parameter Values:
verbose=False
umap_kwargs={'n_components': 10, 'n_neighbors': 20, 'min_dist': 0.0}
hdbscan_kwargs={'min_cluster_size': 30, 'min_samples': 10}
pca_kwargs={'n_components': 100}
use_pca=True
max_sample_size=500
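For example, a sketch of overriding a few of these defaults through the class-based interface (the specific values here are illustrative, not recommendations):
mapper.cluster(
    use_pca=True,
    pca_kwargs={'n_components': 50},
    umap_kwargs={'n_components': 10, 'n_neighbors': 40, 'min_dist': 0.0},
    hdbscan_kwargs={'min_cluster_size': 60, 'min_samples': 15}
)
mapper.summarize(max_sample_size=300)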
#Converts each message into a 1536-dimensional vector using OpenAI's text-embedding-3-small.
get_embeddings(file_df, verbose=bool)
#Clusters the embeddings using PCA and L2 normalization (for preprocessing if metric is euclidean),
#UMAP (for reduction), and HDBSCAN (for clustering).
#Both UMAP and HDBSCAN are set to euclidean distance, all other parameters can be changed with kwargs.
#Uses Union-find algorithm to merge clusters that are too similar.
cluster_embeddings(
embeddings,
verbose=bool,
use_pca=bool,
pca_kwargs=dict,
umap_kwargs=dict,
hdbscan_kwargs=dict
)
#Uses OpenAI Chat Completions gpt-4o-mini (in 2 stages) for cluster summaries and Hugging Face's
#distilbert-base-uncased-finetuned-sst-2-english for sentiment analysis.
#If there are 2 times more negative texts than positive, that cluster is determined to be
#'NEGATIVE', and vice versa for 'POSITIVE' clusters. Otherwise they are determined 'NEUTRAL'.
summarize_clusters(clustered_df, max_sample_size=int, verbose=bool)
#Returns structured output as a dictionary (ideal for JSON export).
format_to_dict(summary_df)
#Returns a DataFrame where each row summarizes a cluster.
format_by_cluster(summary_df)
#Returns a DataFrame where each row is an individual comment with its sentiment and cluster label.
format_by_text(summary_df)
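The cluster-level sentiment rule described in the comments above is simple; here is a minimal, hypothetical sketch of it (not the library's actual code, and the exact threshold comparison is assumed):
def aggregate_sentiment(labels):
    #labels is a list of 'NEGATIVE'/'POSITIVE' strings from the per-text sentiment model
    negatives = sum(label == "NEGATIVE" for label in labels)
    positives = sum(label == "POSITIVE" for label in labels)
    if negatives >= 2 * positives and negatives > 0:
        return "NEGATIVE"
    if positives >= 2 * negatives and positives > 0:
        return "POSITIVE"
    return "NEUTRAL"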
Instance Attributes:
class NarrativeMapper:
def __init__(self, df, online_group_name: str, verbose=False):
self.verbose # Verbose for all parts of the pipeline
self.file_df # DataFrame of csv file
self.online_group_name # Name of the online community or data source
self.embeddings_df # DataFrame after embedding
self.cluster_df # DataFrame after clustering
self.summary_df # DataFrame after summarization
Methods:
load_embeddings()
cluster(
use_pca=bool,
pca_kwargs=dict,
umap_kwargs=dict,
hdbscan_kwargs=dict
)
summarize(max_sample_size=int)
format_by_text()
format_by_cluster()
format_to_dict()
from math import sqrt, log2

#num_texts is the size of the dataset
base_num_texts = 500
N = max(1, num_texts / base_num_texts)
#n_components ~ constant with respect to N
n_components = 10
#n_neighbors ~ cube root of N, range [15, 75]
n_neighbors = int(min(75, max(15, 15*(N**(1/3)))))
#min_cluster_size ~ sqrt(N), range [20, 200]
min_cluster_size = int(min(200, max(20, 20*sqrt(N))))
#min_samples ~ log2(N), range [5, 30]
min_samples = int(min(30, max(5, 5*log2(N))))
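As a worked example (not from the library's documentation): for a dataset of 5,000 texts, N = 10, so the heuristics above give n_neighbors = int(15 * 10**(1/3)) = 32, min_cluster_size = int(20 * sqrt(10)) = 63, and min_samples = int(5 * log2(10)) = 16.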
Estimated cost: $0.02 to $0.17 per 1 million tokens.
Example: a CSV containing 1,000 Reddit comments, each longer than one sentence, costs approximately $0.01 to process.
Click for pricing details
The OpenAI text-embedding-3-small model costs approximately $0.02 per 1 million input tokens, determined by the total tokens of your input text messages.
The Chat Completions model used for summarization (gpt-4o-mini) costs $0.15 per 1 million input tokens. The max_sample_size parameter (see Parameter Reference) helps reduce costs by limiting how many comments are passed into gpt-4o-mini for each cluster, which can significantly reduce Chat Completions token usage.
The gpt-4o-mini input prompt (excluding the text) and output summary (for both stages) are very short (<1000 tokens), so their cost contribution is negligible.
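A rough back-of-the-envelope check of the 1,000-comment example (assuming an average of about 50 tokens per comment and that each sampled text is sent to gpt-4o-mini once):
comments = 1_000
avg_tokens = 50                                     #assumed average tokens per comment
total_tokens = comments * avg_tokens                #~50,000 tokens
embedding_cost = total_tokens / 1_000_000 * 0.02    #~$0.001
summary_cost = total_tokens / 1_000_000 * 0.15      #~$0.0075 upper bound (max_sample_size may cut this)
print(embedding_cost + summary_cost)                #~$0.0085, i.e. roughly $0.01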