- Cross-lingual Stance Detection for Climate Change Discourse
- Table of Contents
- 1. Project Overview
- 2. Getting Started
- 3. Motivation
- 4. Languages Covered
- 5. Technical Architecture
- 6. Project Steps
- 7. Results and Analysis
- 8. Future Work
- Deployment and Maintenance
- Testing and Quality Assurance
- Monitoring and Logging
- Quick Start Guide
- Support and Contact
- License
- Development Status
This project introduces an innovative approach to cross-lingual stance detection in climate change discussions, focusing on two critical challenges in modern NLP: linguistic inclusivity and computational efficiency. Our solution represents a significant departure from traditional transformer-based approaches, demonstrating that effective cross-lingual analysis can be achieved with limited computational resources.
-
Resource-Efficient Architecture
- Developed a lightweight ensemble approach combining multiple efficient classifiers
- Achieved comparable performance to transformer models while requiring <8GB RAM
- Implemented chunk-based processing for handling large datasets
- Utilized efficient feature extraction techniques optimized for multi-lingual text
-
Cross-lingual Capabilities
- Successfully analyzes stance across five European languages
- Language-agnostic feature extraction pipeline
- Robust performance across diverse linguistic patterns
- Effective handling of language-specific nuances in climate discourse
Our approach evolved through several stages:
-
Initial Transformer Attempt
- Started with XLM-RoBERTa for its proven cross-lingual capabilities
- Encountered significant resource constraints:
- Memory requirements exceeded available hardware (>16GB RAM)
- Training times were prohibitively long on CPU
- Model size made deployment challenging
-
Naive Bayes Exploration
- Attempted a simpler statistical approach
- Faced challenges:
- Poor performance on minority classes
- Limited ability to capture cross-lingual patterns
- Insufficient handling of linguistic nuances
-
Final Ensemble Solution
- Developed a novel ensemble combining:
- Optimized TF-IDF vectorization
- Multiple lightweight classifiers
- Language-aware feature selection
- Efficient memory management techniques
- Minimum: 8GB RAM
- Recommended: 16GB RAM
- Storage: 5GB free space
- CPU: Multi-core processor (Intel i5/AMD Ryzen 5 or better)
- Python 3.12.4
- Git (for version control)
- Virtual environment capability
- PRAW API access for Reddit data collection
-
Clone the Repository
git clone https://github.com/jaxendutta/climate-stance-detection.git
cd climate-stance-detection
-
Set Up Virtual Environment
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Unix/MacOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
-
Install Dependencies
pip install -r requirements.txt
-
Configure API Access. Create config.ini in the project root:
[Reddit]
client_id = your_client_id
client_secret = your_client_secret
user_agent = your_user_agent
-
Verify Installation
# Checks environment and dependencies
python src/verify_setup.py
Our project addresses several critical challenges in modern NLP and climate change research:
-
Cross-lingual Understanding
- Climate change discussions occur across language barriers
- Important insights are often isolated within language communities
- Need for unified analysis across linguistic boundaries
- Current solutions require extensive computational resources
-
Resource Constraints
- Most cross-lingual models require significant computational power
- Many researchers lack access to high-end GPU resources
- Need for efficient solutions that run on standard hardware
- Importance of accessibility in research tools
-
Real-world Application
- Climate change communication requires immediate action
- Need for tools that can be deployed in resource-constrained environments
- Importance of analyzing regional perspectives
- Requirement for scalable solutions
-
Methodological Innovation
- Challenge traditional assumptions about resource requirements
- Demonstrate alternative approaches to cross-lingual NLP
- Contribute to democratizing NLP research
- Advance efficient computing practices
Our project focuses on five widely spoken European languages, spanning the Germanic and Romance branches of the Indo-European family:
- English
- German
- French
- Spanish
- Italian
These languages were chosen to cover several major European language communities while keeping the project scope manageable. The choice also reflects the availability of publicly accessible subreddits. Future iterations may expand to include more languages.
climate_stance_detection/
├── data/
│   ├── raw/
│   │   └── reddit_climate_data_YYYYMMDD_HHMMSS.csv    # Raw data from collect_data.py
│   └── processed/
│       ├── collection_stats_YYYYMMDD_HHMMSS.json      # Collection stats from collect_data.py
│       ├── processed_data.joblib                      # Preprocessed features from 02_preprocessing.ipynb
│       ├── stance_classifier.joblib                   # Trained model
│       └── cross_lingual_metrics.json                 # Analysis results (possible addition)
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_model_development.ipynb
│   └── 04_cross_lingual_analysis.ipynb
├── src/
│   └── collect_data.py
├── requirements.txt
└── README.md
graph TD
A[Data Collection] -->|Raw Data| B[Data Exploration]
B -->|Analyzed Data| C[Preprocessing]
C -->|Processed Features| D[Model Development]
D -->|Trained Model| E[Cross-lingual Analysis]
%% Data Flow
subgraph Data Storage
F[data/raw] -->|Processing| G[data/processed]
end
flowchart TD
A[Raw Reddit Data] --> B[Language Detection]
B --> C[Text Cleaning]
C --> D[Feature Extraction]
D --> E[Model Input]
flowchart TD
A[Input Text] --> B[TF-IDF Vectorization]
B --> C[Ensemble Classification]
C --> D[Stance Prediction]
- Data Collection:
- PRAW-based Reddit scraper
- Multi-language subreddit targeting
- Automated data cleaning
- Metadata tracking
- Feature Extraction: TF-IDF with n-grams (1-3)
- Ensemble Model:
- MultinomialNB (Probability-based)
- LogisticRegression (Linear)
- RandomForestClassifier (Non-linear)
- SMOTE for class balancing
- Soft voting for final prediction
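To make the soft-voting step concrete, here is a minimal standard-library sketch. It assumes each classifier returns a probability per stance; the label order and the three example outputs are hypothetical, not values from the trained model.

```python
from typing import Dict, List

# Label names are an assumption for illustration
STANCES = ["negative", "neutral", "positive"]


def soft_vote(probabilities: List[Dict[str, float]]) -> str:
    """Average per-class probabilities across classifiers and return the top class."""
    if not probabilities:
        raise ValueError("need at least one classifier output")
    averaged = {
        stance: sum(p[stance] for p in probabilities) / len(probabilities)
        for stance in STANCES
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of the three ensemble members for one post
nb = {"negative": 0.05, "neutral": 0.51, "positive": 0.44}
lr = {"negative": 0.05, "neutral": 0.52, "positive": 0.43}
rf = {"negative": 0.05, "neutral": 0.10, "positive": 0.85}

prediction = soft_vote([nb, lr, rf])
print(prediction)
```

In this example two of the three classifiers narrowly prefer "neutral", so a majority (hard) vote would pick it. Soft voting lets the one confident classifier outweigh the two uncertain ones.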
OBJECTIVE
Gather multi-lingual climate change discussions from Reddit across five languages (English, German, French, Spanish, Italian).
STEPS
A. Create config.ini in project root:
[Reddit]
client_id = your_client_id
client_secret = your_client_secret
user_agent = your_user_agent
B. Run collection script:
python src/collect_data.py
This script collects posts from climate-related subreddits in multiple languages, handling rate limiting and API interactions automatically. The data is saved to data/raw/ with timestamps, post content, and metadata that we'll need for analysis.
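As an illustration of how the credentials file could be read, the sketch below parses the `[Reddit]` section with the standard-library `configparser`. The helper name `load_reddit_credentials` is hypothetical, and this is not necessarily how `collect_data.py` does it.

```python
import configparser

REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")


def load_reddit_credentials(config_text: str) -> dict:
    """Parse INI text and return the three Reddit credentials (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if "Reddit" not in parser:
        raise KeyError("config.ini has no [Reddit] section")
    section = parser["Reddit"]
    missing = [key for key in REQUIRED_KEYS if not section.get(key)]
    if missing:
        raise KeyError(f"config.ini is missing: {', '.join(missing)}")
    return {key: section[key] for key in REQUIRED_KEYS}


# Same layout as the config.ini shown above, inlined so the sketch is self-contained
SAMPLE = """
[Reddit]
client_id = your_client_id
client_secret = your_client_secret
user_agent = your_user_agent
"""

credentials = load_reddit_credentials(SAMPLE)
print(sorted(credentials))
```

In a real script the text would come from the file itself (for example via `parser.read("config.ini")`), and the resulting dictionary could be passed as keyword arguments to `praw.Reddit`.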
OBJECTIVE
Analyze the collected data to understand patterns, distributions, and characteristics across languages.
STEPS
Run the exploration notebook:
jupyter notebook notebooks/01_data_exploration.ipynb
Running this notebook provides crucial insights about our dataset:
- Language distribution visualization
- Temporal trend analysis
- Content pattern examination
- Engagement metric calculation
These insights informed our preprocessing decisions and modeling approach, particularly highlighting the class imbalance and language distribution challenges we needed to address.
OBJECTIVE
Transform raw Reddit data into a clean, structured format suitable for model development.
STEPS
Run the preprocessing notebook:
jupyter notebook notebooks/02_preprocessing.ipynb
The preprocessing pipeline includes:
- Text cleaning and normalization
- Language verification
- Stance determination
- Train/validation/test splitting (70/15/15)
The processed data is saved in data/processed/ as processed_data.joblib.
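The 70/15/15 split can be sketched as follows. Stratifying by stance label is an assumption made here because the classes are heavily imbalanced; the notebook may split differently. The function name and the toy labels are illustrative only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Return train/validation/test index lists, split separately within each label."""
    rng = random.Random(seed)
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * ratios[0])
        n_val = round(len(indices) * ratios[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# Toy label list with an 80/20 imbalance
labels = ["neutral"] * 80 + ["positive"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each label keeps the rare classes represented in all three subsets, which a plain random split does not guarantee.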
OBJECTIVE
Implement and train the stance detection model.
We explored several approaches before finding an efficient solution:
-
XLM-RoBERTa
jupyter notebook notebooks/initial_attempts/04_model_development_xlm.ipynb
ISSUE This approach proved too resource-intensive for our computational constraints.
-
XLM-RoBERTa (Lightweight)
jupyter notebook notebooks/initial_attempts/04_model_development_xlm_lite.ipynb
ISSUE Still exceeded our memory limitations despite optimizations.
-
Naive Bayes
jupyter notebook notebooks/initial_attempts/04_model_development_naive_bayes.ipynb
ISSUE Poor performance on minority classes and limited ability to capture cross-lingual patterns.
jupyter notebook notebooks/03_model_development.ipynb
This method implements:
- Feature extraction pipeline
- Ensemble model architecture
- Training procedures
- Initial model evaluation
- Good performance within resource constraints
The trained model is saved to models/stance_classifier.joblib.
OBJECTIVE
Evaluate model performance across different languages, and identify and analyze cross-lingual patterns.
STEPS
Run the analysis notebook:
jupyter notebook notebooks/04_cross_lingual_analysis.ipynb
This notebook analyzes:
- Per-language evaluation
- Error pattern analysis
- Performance visualization
- Cross-lingual comparison
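Per-language evaluation amounts to grouping predictions by language before scoring them. The sketch below shows the idea with plain accuracy; the function name and the sample records are hypothetical and not taken from the notebook.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_language(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per language from (language, true_label, predicted_label) records."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for language, true_label, predicted_label in records:
        total[language] += 1
        correct[language] += int(true_label == predicted_label)
    return {language: correct[language] / total[language] for language in total}


# Hypothetical predictions for illustration only
records = [
    ("en", "neutral", "neutral"),
    ("en", "positive", "positive"),
    ("en", "positive", "neutral"),
    ("en", "negative", "negative"),
    ("de", "neutral", "neutral"),
    ("de", "positive", "neutral"),
]

scores = accuracy_by_language(records)
print(scores)
```

Accuracy alone can look strong when one class dominates, so the same grouping is best repeated for per-class precision, recall, and F1.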
-
Reddit API Rate Limiting:
- Wait a few minutes and try again
- Check API credentials in config.ini
- Retry using the command:
python src/collect_data.py --retry
-
Memory Issues:
- Use institution or specialized servers for model training
- Ensure no other memory-intensive processes are running
- Monitor memory usage during model training
watch -n 1 'free -m'
- If needed, adjust batch size in config:
batch_size: 32 # Reduce if memory issues occur
-
Environment Setup Issues:
- Create fresh environment if issues occur
conda create -n stance_detection python=3.12.4
conda activate stance_detection
pip install -r requirements.txt
Metric | Value |
---|---|
Total Posts | 8,080 |
Language Verification Rate | 96.99% |
Collection Date | November 6, 2024 |
Successful Subreddits | 12/14 |
Total Languages | 5 |
Language | Posts | Percentage | Primary Focus | Engagement Level | Stance Balance |
---|---|---|---|---|---|
English | 3,971 | 49.1% | Climate Discourse | High (avg. 38.2 comments) | Most Balanced |
German | 1,993 | 24.7% | Policy Discussion | Medium (avg. 15.3 comments) | Neutral Heavy |
Italian | 998 | 12.4% | Environmental Activism | Low (avg. 8.7 comments) | Neutral Dominant |
French | 988 | 12.2% | Energy Policy | Medium (avg. 12.4 comments) | Neutral Biased |
Spanish | 130 | 1.6% | Environmental Justice | Low (avg. 5.2 comments) | Limited Data |
Category | Count | Percentage | Distribution |
---|---|---|---|
Total Samples | 1,212 | 100% | - |
English Samples | 623 | 51.4% | Balanced |
German Samples | 297 | 24.5% | Neutral Heavy |
French Samples | 144 | 11.9% | Neutral Dominant |
Italian Samples | 130 | 10.7% | Neutral Biased |
Spanish Samples | 18 | 1.5% | Limited |
Stance | Count | Percentage | Primary Languages |
---|---|---|---|
Neutral | 1,052 | 86.8% | All Languages |
Positive | 148 | 12.2% | Mainly English |
Negative | 12 | 1.0% | Mainly English |
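The imbalance in this table can be quantified with a short calculation. The sketch below recomputes the percentages from the counts and derives inverse-frequency class weights using the formula `n_samples / (n_classes * count)`, which is the heuristic behind scikit-learn's `class_weight='balanced'`. The weights are illustrative and were not reported by the project.

```python
# Stance counts copied from the table above
counts = {"neutral": 1052, "positive": 148, "negative": 12}
total = sum(counts.values())

# Share of each stance in percent, rounded to one decimal
shares = {stance: round(100 * count / total, 1) for stance, count in counts.items()}

# Inverse-frequency weights: rarer classes get proportionally larger weights
weights = {stance: total / (len(counts) * count) for stance, count in counts.items()}

for stance in counts:
    print(f"{stance:<9} share={shares[stance]:>5}%  weight={weights[stance]:.2f}")
```

The negative class ends up with a weight roughly 88 times that of the neutral class, which shows how little signal the classifier has for that stance.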
Language | Neutral F1 | Positive F1 | Negative F1 | Overall Accuracy |
---|---|---|---|---|
English | 0.888 | 0.644 | 0.375 | 0.84 |
German | 0.974 | 0.000 | 0.000 | 0.92 |
French | 0.982 | 0.000 | N/A | 0.97 |
Italian | 0.984 | 0.000 | N/A | 0.94 |
Spanish | 1.000 | N/A | N/A | 1.00 |
True\Predicted | Neutral | Positive | Negative | Misclassification Rate |
---|---|---|---|---|
Neutral | 987 | 63 | 2 | 6.2% |
Positive | 65 | 80 | 3 | 45.9% |
Negative | 4 | 3 | 5 | 58.3% |
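The misclassification rates in the last column follow directly from the row counts. As a check, the sketch below recomputes them from the confusion matrix above and derives the pooled accuracy. That figure is calculated here from the table and is not separately reported in the source.

```python
# Rows are true labels, columns are predicted labels (values from the table above)
CONFUSION = {
    "neutral": {"neutral": 987, "positive": 63, "negative": 2},
    "positive": {"neutral": 65, "positive": 80, "negative": 3},
    "negative": {"neutral": 4, "positive": 3, "negative": 5},
}


def misclassification_rate(true_label: str) -> float:
    """Percentage of samples with this true label that were predicted as something else."""
    row = CONFUSION[true_label]
    row_total = sum(row.values())
    return 100 * (row_total - row[true_label]) / row_total


rates = {label: round(misclassification_rate(label), 1) for label in CONFUSION}

# Pooled accuracy: correct predictions on the diagonal over all samples
n_correct = sum(CONFUSION[label][label] for label in CONFUSION)
n_total = sum(sum(row.values()) for row in CONFUSION.values())
overall_accuracy = round(100 * n_correct / n_total, 1)

print(rates, overall_accuracy)
```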
Language | Stance | Precision | Recall | F1-Score | Support |
---|---|---|---|---|---|
English | Neutral | 0.926 | 0.852 | 0.888 | 415 |
English | Positive | 0.559 | 0.760 | 0.644 | 95 |
English | Negative | 0.600 | 0.273 | 0.375 | 11 |
German | Neutral | 0.949 | 1.000 | 0.974 | 282 |
German | Positive | 0.000 | 0.000 | 0.000 | 14 |
German | Negative | 0.000 | 0.000 | 0.000 | 1 |
French | Neutral | 0.965 | 1.000 | 0.982 | 139 |
French | Positive | 0.000 | 0.000 | 0.000 | 5 |
Italian | Neutral | 0.969 | 1.000 | 0.984 | 126 |
Italian | Positive | 0.000 | 0.000 | 0.000 | 4 |
Spanish | Neutral | 1.000 | 1.000 | 1.000 | 18 |
Priority | Enhancement | Status | Expected Impact |
---|---|---|---|
High | Dynamic feature selection | Planned | +5% accuracy |
Medium | Adaptive ensemble weights | In Progress | +3% on minority classes |
Low | Meta-learning integration | Proposed | Better generalization |
- Planned Implementations
# Example: Dynamic Feature Selection
class DynamicFeatureSelector:
    def __init__(self, threshold=0.01):
        self.threshold = threshold

    def select_features(self, X, y, language):
        """
        Dynamically select features based on language
        and importance scores
        """
        feature_importance = self._calculate_importance(X, y)
        return self._filter_features(feature_importance)
-
Additional Languages
- Arabic (RTL support needed)
- Mandarin (character-based tokenization)
- Hindi (new script support)
- Portuguese (Brazilian/European variants)
-
Language-Specific Optimizations
Language | Planned Feature | Implementation Path |
---|---|---|
Arabic | RTL handling | Q2 2024 |
Mandarin | Character embedding | Q3 2024 |
Hindi | Script normalization | Q3 2024 |
Portuguese | Variant handling | Q4 2024 |
-
Memory Reduction Targets
Component | Current | Target | Method |
---|---|---|---|
Feature Extraction | 4GB | 2GB | Sparse matrices |
Model Storage | 2GB | 1GB | Quantization |
Runtime Memory | 2GB | 1GB | Streaming |
-
Speed Improvements
# Planned optimization
class StreamingEnsemble:
    def predict_stream(self, text_iterator):
        """
        Stream-based prediction for memory efficiency
        """
        for batch in self._create_batches(text_iterator):
            yield self._predict_batch(batch)
-
Feature Extraction Pipeline
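from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

class FeatureExtractor:
    def __init__(self):
        # Character n-grams (1 to 3 characters) taken within word boundaries
        self.vectorizer = TfidfVectorizer(
            max_features=10000,
            ngram_range=(1, 3),
            analyzer='char_wb'
        )
        # Keep the features a class-balanced linear model weights most heavily
        self.feature_selector = SelectFromModel(
            LogisticRegression(class_weight='balanced')
        )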
-
Model Architecture
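from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

def build_model():
    # CharNGramTransformer is a project-specific transformer, not part of scikit-learn
    return Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer()),
            ('char_ngrams', CharNGramTransformer())
        ])),
        ('ensemble', VotingClassifier(
            estimators=[
                ('nb', MultinomialNB()),
                ('lr', LogisticRegression()),
                ('rf', RandomForestClassifier())
            ],
            voting='soft'  # Average class probabilities, as described above
        ))
    ])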
-
Memory Management
import pandas as pd

class MemoryEfficientDataset:
    def __init__(self, data_path, chunk_size=1000):
        self.data_path = data_path
        self.chunk_size = chunk_size

    def __iter__(self):
        # Read the CSV lazily so only one chunk is held in memory at a time
        with pd.read_csv(self.data_path, chunksize=self.chunk_size) as reader:
            for chunk in reader:
                yield self.process_chunk(chunk)
-
Batch Processing
class BatchProcessor:
    def process_data(self, data_iterator):
        results = []
        for batch in data_iterator:
            processed = self.model.predict_proba(batch)
            results.extend(self._aggregate_predictions(processed))
        return results
-
Environment Configuration
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Verify installation
python scripts/verify_setup.py
-
Data Preparation
# Example configuration
config = {
    'data_paths': {
        'raw': 'data/raw',
        'processed': 'data/processed',
        'models': 'models/'
    },
    'model_params': {
        'feature_count': 10000,
        'ngram_range': (1, 3),
        'batch_size': 1000
    }
}
-
Basic Usage
from src.models.ensemble_model import StanceDetector

# Initialize detector
detector = StanceDetector()

# Single prediction
text = "Climate change requires immediate action"
stance = detector.predict(text)
-
Batch Processing
# Process multiple texts
texts = [
    "Global warming is a serious threat",
    "We need more research on climate impact",
    "Environmental regulations are important"
]
stances = detector.predict_batch(texts)
-
Cross-lingual Analysis
# Analyze texts in different languages
multilingual_texts = {
    'en': "Climate change is real",
    'de': "Klimawandel ist real",
    'fr': "Le changement climatique est réel"
}
results = detector.analyze_multilingual(multilingual_texts)
-
Data Handling
Practice | Reason | Implementation |
---|---|---|
Chunk Processing | Memory efficiency | Use data iterators |
Text Normalization | Consistency | Apply standard cleanup |
Language Verification | Accuracy | Check before processing |
-
Model Usage
Scenario | Recommended Approach | Consideration |
---|---|---|
Single Text | Direct prediction | Quick results |
Large Dataset | Batch processing | Memory efficient |
Mixed Languages | Language detection | Accuracy first |
- Docker Deployment
# Dockerfile for stance detection
FROM python:3.12.4-slim
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONPATH=/app
ENV MODEL_PATH=/app/models
ENV DATA_PATH=/app/data
# Run the application
CMD ["python", "src/api/serve.py"]
- Server Configuration
# config/production.yaml
server:
  host: '0.0.0.0'
  port: 8080
  workers: 4
model:
  batch_size: 32
  max_queue_size: 100
  timeout: 30
monitoring:
  log_level: INFO
  metrics_port: 9090
- API Implementation
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class TextInput(BaseModel):
    text: str
    language: Optional[str] = None  # Optional language code

@app.post("/predict")
async def predict_stance(input_data: TextInput):
    # `detector` is assumed to be a StanceDetector instance created at startup
    try:
        result = detector.predict(
            text=input_data.text,
            language=input_data.language
        )
        return {"stance": result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
Component | Scaling Method | Considerations |
---|---|---|
API Server | Horizontal | Load balancing needed |
Model Inference | Vertical | Memory constraints |
Data Processing | Distributed | Network overhead |
- Performance Monitoring
import numpy as np

class ModelMonitor:
    def __init__(self):
        self.metrics = {
            'accuracy': [],
            'latency': [],
            'memory_usage': []
        }

    def log_prediction(self, true_label, pred_label, latency):
        """Log prediction metrics"""
        self.metrics['accuracy'].append(true_label == pred_label)
        self.metrics['latency'].append(latency)

    def get_statistics(self):
        """Calculate and return monitoring statistics"""
        return {
            'accuracy': np.mean(self.metrics['accuracy']),
            'avg_latency': np.mean(self.metrics['latency']),
            'p95_latency': np.percentile(self.metrics['latency'], 95)
        }
- Update Protocol
Step | Action | Validation Criteria |
---|---|---|
1 | Collect new data | Min 1000 samples per language |
2 | Validate labels | 95% confidence level |
3 | Retrain model | Equal/better performance |
4 | A/B test | 1% traffic for 24h |
5 | Gradual rollout | Monitor errors |
- Automated Testing
# tests/test_model_quality.py
import unittest

class TestModelQuality(unittest.TestCase):
    def setUp(self):
        self.model = load_production_model()
        self.test_cases = load_test_cases()

    def test_cross_lingual_performance(self):
        """Test model performance across languages"""
        for lang, cases in self.test_cases.items():
            accuracy = self.evaluate_language(lang, cases)
            self.assertGreaterEqual(
                accuracy,
                MINIMUM_LANGUAGE_ACCURACY[lang]
            )

    def test_error_cases(self):
        """Test known edge cases"""
        # self.error_cases is expected to be populated elsewhere; setUp does not define it
        for case in self.error_cases:
            prediction = self.model.predict(case.text)
            self.assertEqual(prediction, case.expected)
- Quality Metrics
Metric | Threshold | Monitoring Frequency |
---|---|---|
Accuracy | > 90% | Daily |
F1 Score | > 0.85 | Daily |
Error Rate | < 5% | Real-time |
Response Time | < 100ms | Real-time |
Memory Usage | < 8GB | Hourly |
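One way to enforce these thresholds automatically is a small gate check like the sketch below. The metric keys, units, and the `failed_gates` helper are hypothetical names chosen for illustration. Only the threshold values come from the table above.

```python
import operator
from typing import Dict, List

# (comparison, limit) per metric; values mirror the quality metrics table
THRESHOLDS = {
    "accuracy": (operator.gt, 0.90),
    "f1_score": (operator.gt, 0.85),
    "error_rate": (operator.lt, 0.05),
    "response_time_ms": (operator.lt, 100),
    "memory_usage_gb": (operator.lt, 8),
}


def failed_gates(measured: Dict[str, float]) -> List[str]:
    """Return the names of measured metrics that violate their threshold."""
    return [
        name
        for name, (compare, limit) in THRESHOLDS.items()
        if name in measured and not compare(measured[name], limit)
    ]


# Hypothetical measurements for illustration only
measured = {
    "accuracy": 0.92,
    "f1_score": 0.80,
    "error_rate": 0.03,
    "response_time_ms": 120,
    "memory_usage_gb": 6,
}
failures = failed_gates(measured)
print(failures)
```

Metrics missing from the measurement dictionary are skipped rather than treated as failures, which suits metrics that are sampled at different frequencies.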
- Memory Problems
Issue: Memory spike during batch processing
Solution:
1. Check batch size configuration
2. Monitor memory usage:
watch -n 1 'free -m'
3. Adjust chunk_size parameter:
optimal_chunk_size = available_memory // 4
- Performance Degradation
Symptom | Cause | Solution |
---|---|---|
High Latency | Large batch size | Reduce batch size |
Low Accuracy | Concept drift | Retrain model |
Memory Leaks | Resource cleanup | Implement gc.collect() |
- Cross-lingual Issues
Problem | Check | Fix |
---|---|---|
Wrong Language | Language detection | Update detection threshold |
Missing Features | Feature extraction | Adjust n-gram range |
Encoding Errors | Text preprocessing | Set proper encodings |
- Python Standards
# Example of expected code style
from typing import Any, Dict, Optional

def process_text(
    text: str,
    language: Optional[str] = None,
    **kwargs
) -> Dict[str, Any]:
    """
    Process input text for stance detection.

    Args:
        text: Input text to process
        language: ISO language code
        **kwargs: Additional parameters

    Returns:
        Dict containing processed features

    Raises:
        ValueError: If text is empty
    """
    if not text:
        raise ValueError("Empty text provided")
    return {
        'features': extract_features(text),
        'language': detect_language(text) if not language else language
    }
- Documentation Requirements
Component | Required Documentation |
---|---|
Functions | Docstrings with args/returns |
Classes | Class and method documentation |
Modules | Module-level docstring |
Tests | Test case descriptions |
- Branch Naming
Type | Pattern | Example |
---|---|---|
Feature | feature/XXX-desc | feature/123-add-language |
Bugfix | fix/XXX-desc | fix/456-memory-leak |
Improvement | improve/XXX-desc | improve/789-performance |
- Commit Messages
Format:
<type>(<scope>): <description>
Examples:
feat(model): Add support for Spanish language
fix(memory): Resolve memory leak in batch processing
perf(speed): Optimize feature extraction
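The commit format above is regular enough to validate mechanically, for example in a pre-commit hook. The pattern and the helper name below are assumptions for illustration. The project does not specify which types or scope characters are allowed, so the regular expression accepts any lowercase type.

```python
import re
from typing import Dict, Optional

# <type>(<scope>): <description>
COMMIT_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)\((?P<scope>[a-z0-9_-]+)\): (?P<description>\S.*)$"
)


def parse_commit_message(message: str) -> Optional[Dict[str, str]]:
    """Return the parts of a well-formed commit subject line, or None if it does not match."""
    match = COMMIT_PATTERN.match(message)
    return match.groupdict() if match else None


parsed = parse_commit_message("feat(model): Add support for Spanish language")
print(parsed)
print(parse_commit_message("Add support for Spanish language"))
```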
# tests/unit/test_feature_extraction.py
import unittest
import numpy as np
from src.models.feature_extraction import FeatureExtractor

class TestFeatureExtraction(unittest.TestCase):
    def setUp(self):
        self.extractor = FeatureExtractor(
            max_features=1000,
            ngram_range=(1, 3)
        )
        self.test_texts = {
            'en': "Climate change is real",
            'de': "Klimawandel ist real",
            'fr': "Le changement climatique est réel"
        }

    def test_cross_lingual_features(self):
        """Test feature extraction across languages"""
        for lang, text in self.test_texts.items():
            features = self.extractor.transform([text])
            self.assertIsNotNone(features)
            self.assertTrue(isinstance(features, np.ndarray))
            self.assertTrue(features.shape[1] == self.extractor.n_features_)

    def test_memory_efficiency(self):
        """Test memory usage during feature extraction"""
        import psutil
        process = psutil.Process()
        initial_memory = process.memory_info().rss
        # Process large batch
        large_text = ["Sample text"] * 1000
        _ = self.extractor.transform(large_text)
        peak_memory = process.memory_info().rss
        memory_increase = peak_memory - initial_memory
        self.assertLess(memory_increase / 1024 / 1024, 100)  # Max 100MB increase
# tests/integration/test_pipeline.py
import unittest

class TestStanceDetectionPipeline(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.pipeline = StanceDetectionPipeline(
            model_path='models/ensemble_v1.joblib'
        )
        cls.test_dataset = load_test_dataset()

    def test_end_to_end_processing(self):
        """Test complete processing pipeline"""
        test_cases = [
            {
                'text': 'Climate action is urgent',
                'language': 'en',
                'expected_stance': 2  # Support
            },
            {
                'text': 'Klimawandel ist übertrieben',
                'language': 'de',
                'expected_stance': 0  # Against
            }
        ]
        for case in test_cases:
            result = self.pipeline.process(
                text=case['text'],
                language=case['language']
            )
            self.assertEqual(result['stance'], case['expected_stance'])
            self.assertIn('confidence', result)
            self.assertGreater(result['confidence'], 0.7)
# .github/workflows/main.yml
name: Stance Detection CI/CD
on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.12.4'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Run tests
        run: |
          python -m pytest tests/ --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v2
        with:
          file: ./coverage.xml
  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          # Deployment steps here
# .github/workflows/quality.yml
quality-gates:
  metrics:
    test-coverage: 80%
    code-quality:
      maintainability: A
      reliability: A
      security: A
    performance:
      max-memory: 8GB
      max-latency: 100ms
  thresholds:
    critical-bugs: 0
    major-bugs: 2
    code-smells: 10
# src/security/data_protection.py
from cryptography.fernet import Fernet
import hashlib

class DataProtector:
    def __init__(self, key_path: str):
        self.key = self._load_key(key_path)
        self.cipher = Fernet(self.key)

    def anonymize_text(self, text: str) -> str:
        """Anonymize sensitive information in text"""
        # Replace personal identifiers
        for pattern in SENSITIVE_PATTERNS:
            text = pattern.sub('[REDACTED]', text)
        return text

    def encrypt_data(self, data: str) -> bytes:
        """Encrypt sensitive data"""
        return self.cipher.encrypt(data.encode())

    def secure_storage(self, data: dict) -> None:
        """Securely store processed data"""
        hashed_id = hashlib.sha256(
            str(data['id']).encode()
        ).hexdigest()
        self._store_secure_data(
            hashed_id,
            self.encrypt_data(str(data))
        )
# src/api/security.py
from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    if not is_valid_api_key(api_key):
        raise HTTPException(
            status_code=403,
            detail="Invalid API key"
        )
    return api_key

@app.post("/predict")
async def predict_stance(
    input_data: TextInput,
    api_key: str = Security(verify_api_key)
):
    # Process prediction
    pass
# src/optimization/memory.py
import gc
from contextlib import contextmanager
from typing import List

import psutil

class MemoryOptimizer:
    def __init__(self, threshold_mb: int = 1000):
        self.threshold = threshold_mb * 1024 * 1024  # Threshold in bytes
        self.current_usage = 0

    def check_memory(self) -> bool:
        """Return True while resident memory stays below the threshold"""
        process = psutil.Process()
        self.current_usage = process.memory_info().rss
        return self.current_usage < self.threshold

    @contextmanager
    def memory_check(self):
        """Context manager for memory monitoring"""
        try:
            yield
        finally:
            if not self.check_memory():
                self._optimize_memory()

    def _optimize_memory(self):
        """Optimize memory usage"""
        # The ensemble runs on CPU with scikit-learn, so garbage collection is
        # the only cleanup step; the original torch.cuda.empty_cache() call
        # referenced a module that was never imported
        gc.collect()

class BatchOptimizer:
    def __init__(self, max_batch_size: int = 32):
        self.max_batch_size = max_batch_size
        self.memory_optimizer = MemoryOptimizer()

    def optimize_batch_size(self, data_size: int) -> int:
        """Dynamically adjust batch size"""
        if not self.memory_optimizer.check_memory():
            return self.max_batch_size // 2
        return self.max_batch_size

    def process_in_batches(self, data: List[str]) -> List[int]:
        """Process data in optimized batches"""
        results = []
        batch_size = self.optimize_batch_size(len(data))
        for i in range(0, len(data), batch_size):
            with self.memory_optimizer.memory_check():
                batch = data[i:i + batch_size]
                results.extend(self.process_batch(batch))
        return results
# src/monitoring/metrics.py
from dataclasses import dataclass
from datetime import datetime
import logging

@dataclass
class PerformanceMetrics:
    timestamp: datetime
    response_time: float
    memory_usage: float
    prediction_confidence: float
    language: str

class MetricsCollector:
    def __init__(self):
        self.logger = logging.getLogger('metrics')
        self.metrics = []

    def log_prediction(self, text: str, result: dict, metrics: PerformanceMetrics):
        """Log prediction metrics"""
        self.logger.info(
            f"Prediction: {result['stance']} | "
            f"Confidence: {metrics.prediction_confidence:.2f} | "
            f"Response Time: {metrics.response_time:.3f}s | "
            f"Language: {metrics.language}"
        )
        self.metrics.append(metrics)
# src/monitoring/health.py
from datetime import datetime

class HealthMonitor:
    def __init__(self, check_interval: int = 300):
        self.check_interval = check_interval  # Seconds between checks
        self.last_check = datetime.now()

    async def health_check(self) -> dict:
        """Perform system health check"""
        return {
            'status': 'healthy',
            'memory_usage': self.get_memory_usage(),
            'model_loaded': self.verify_model(),
            'response_time': self.check_response_time(),
            'last_prediction': self.last_prediction_time
        }
# src/logging_config.py
import logging.config

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': '%(asctime)s [%(levelname)s] %(name)s: %(message)s'
        },
        'detailed': {
            'format': '%(asctime)s [%(levelname)s] %(name)s.%(funcName)s:%(lineno)d: %(message)s'
        }
    },
    'handlers': {
        'console': {
            'level': 'INFO',
            'formatter': 'standard',
            'class': 'logging.StreamHandler',
        },
        'file': {
            'level': 'DEBUG',
            'formatter': 'detailed',
            'class': 'logging.FileHandler',
            'filename': 'logs/stance_detection.log',
        }
    },
    'loggers': {
        '': {  # Root logger
            'handlers': ['console', 'file'],
            'level': 'INFO',
        }
    }
}
# Clone repository
git clone https://github.com/jaxendutta/climate-stance-detection.git
cd climate-stance-detection
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
from src.models.ensemble_model import StanceDetector
# Initialize detector
detector = StanceDetector()
# Single prediction
text = "Climate change requires immediate action"
result = detector.predict(text)
print(f"Stance: {result['stance']}")
# Process multiple texts
texts = [
    "Global warming is a serious threat",
    "We need more research on climate impact",
    "Environmental regulations are important"
]
results = detector.predict_batch(texts)
for text, result in zip(texts, results):
    print(f"Text: {text[:30]}... -> Stance: {result['stance']}")
For issues and support:
- Create an issue on GitHub
- Contact: your.email@institution.edu
- Documentation: Project Wiki
This project is licensed under the MIT License - see the LICENSE file for details.
- Data Collection
- Preprocessing Pipeline
- Model Development
- Cross-lingual Analysis
- Advanced Feature Implementation
- Production Deployment