Data Masker

A flexible, configurable tool for masking sensitive information in advertising data exports while preserving analytics capabilities.

Overview

Data Masker anonymizes sensitive data in CSV files while keeping the data usable for analytics. It matches column names against configurable indicator terms to decide which columns to mask and which to preserve (such as numeric metrics and dates). Masked values are produced with either consistent salted hashing or reversible encryption. Because the same input and salt always yield the same hashed value, referential integrity is maintained across files, so masked data can still be joined and compared across exports.

Features

  • Multi-platform support for Google Ads, Facebook Ads, and other advertising platforms
  • Configurable masking rules via YAML or JSON configuration files
  • Smart column detection for numeric, date, and sensitive data
  • Reversible masking with private key encryption
  • Consistent masking with salted hashing to maintain relationships in data
  • Type-aware prefixes that indicate the type of masked data (e.g., Campaign_abc123)
  • Column name normalization to handle variations in column headers
  • Analysis mode to preview which columns will be masked before processing
  • Minimal dependencies: Python with pandas, pyyaml, and cryptography; everything else comes from the standard library

Installation

# Clone the repository
git clone https://github.com/revupp-ai/data-masker.git
cd data-masker

# Install dependencies
pip install pandas pyyaml cryptography

Usage

Basic Usage

Masking Data:

python masker.py mask input_file.csv

This creates a file named masked_input_file.csv in which sensitive values are replaced using the default hash-based (irreversible) method.

Reversible Masking:

python masker.py mask input_file.csv --private-key "your-secret-key" --salt "your-salt"

This creates a masked file that can later be unmasked with the same private key and salt.

Unmasking Data:

python masker.py unmask masked_file.csv --private-key "your-secret-key" --salt "your-salt"

This will restore the original values in columns that were encrypted with reversible masking.
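
To illustrate how unmasking can tell reversible values apart from hashed ones, here is a minimal sketch based on the value formats listed under Masking Methods (Prefix_a1b2c3d4 versus Prefix_enc:...). The helper name `split_masked` is hypothetical and is not part of the tool's documented API; the sketch also assumes the prefix itself contains no underscore.

```python
from typing import Optional, Tuple


def split_masked(value: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: split 'Prefix_enc:payload' into (prefix, payload).

    Returns None for values that are not in the reversible format,
    for example hash-based values or untouched numeric cells.
    """
    prefix, sep, rest = value.partition("_")
    if sep and rest.startswith("enc:"):
        return prefix, rest[len("enc:"):]
    return None


reversible = split_masked("Campaign_enc:Z0FBQUFB")
hashed = split_masked("Campaign_a1b2c3d4")
plain = split_masked("120.50")
```

Only values for which the helper returns a payload would be candidates for decryption; everything else is left as-is.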

Command Line Options

Masking:

python masker.py mask input_file.csv [options]

Unmasking:

python masker.py unmask masked_file.csv --private-key KEY --salt SALT [options]

Options:

Option               Description
--output, -o         Specify the output file path
--config, -c         Path to a custom configuration file
--salt, -s           Salt string for consistent hashing/encryption
--private-key, -k    Private key for reversible encryption/decryption
--save-config        Save default config to specified path and exit
--analyze-only, -a   Only analyze the file without masking it

Examples

# Use a custom configuration file
python masker.py mask google_ads_report.csv --config google_ads_config.yaml

# Preview which columns will be masked without applying changes
python masker.py mask facebook_ads_report.csv --analyze-only

# Save the default configuration as a starting point
python masker.py --save-config my_config.yaml

# Specify an output file
python masker.py mask data.csv --output masked_data.csv

# Use a specific salt for consistent masking across runs
python masker.py mask data.csv --salt "my-salt-2023"

# Enable reversible masking with a private key
python masker.py mask data.csv --private-key "secret-key-2023" --salt "my-salt-2023"

# Unmask previously masked data
python masker.py unmask masked_data.csv --private-key "secret-key-2023" --salt "my-salt-2023"

Configuration

The masking behavior is controlled by a configuration file in YAML or JSON format. You can generate a default configuration file using the --save-config option.

Configuration Options

  • numeric_indicators: Terms that indicate a column contains numeric data (to be preserved)
  • date_indicators: Terms that indicate a column contains date information (to be preserved)
  • sensitive_indicators: Terms that indicate a column contains sensitive data (to be masked)
  • masking_patterns: Rules for determining the prefix of masked values based on content patterns
  • default_mask_prefix: Default prefix for masked values that don't match patterns
  • hash_length: Number of characters to use from the hash
  • normalize_column_names: Whether to normalize column names for consistent matching
  • reversible_masking: Whether to use reversible encryption when a private key is provided
  • iterations: Number of iterations for the key derivation function; higher values make brute-force guessing of the private key slower, at the cost of a longer key derivation step
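
To make the `iterations` option concrete, here is a minimal sketch of stretching a private key and salt into an encryption key with PBKDF2-HMAC-SHA256 from the Python standard library. That Data Masker uses this particular key derivation function is an assumption, and `derive_key` is a hypothetical helper name. The result is 32 bytes encoded as URL-safe base64, which is the key format Fernet accepts.

```python
import base64
import hashlib


def derive_key(private_key: str, salt: str, iterations: int = 100_000) -> bytes:
    """Hypothetical helper: derive a 32-byte key from a passphrase and salt.

    Assumption: PBKDF2-HMAC-SHA256; the actual tool may use a different KDF.
    """
    raw = hashlib.pbkdf2_hmac(
        "sha256",
        private_key.encode("utf-8"),
        salt.encode("utf-8"),
        iterations,
        dklen=32,
    )
    # Fernet keys are 32 bytes, URL-safe base64 encoded.
    return base64.urlsafe_b64encode(raw)


key_a = derive_key("secret-key-2023", "my-salt-2023")
key_b = derive_key("secret-key-2023", "my-salt-2023")
key_c = derive_key("secret-key-2023", "another-salt")
```

The same private key and salt always give the same derived key (`key_a == key_b`), which is why both must be supplied again when unmasking; changing either one (`key_c`) gives an unrelated key.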

Example Configuration

# Data Masking Configuration
numeric_indicators:
  # Financial metrics
  - cost
  - spend
  - revenue
  # ... more indicators

date_indicators:
  - date
  - day
  # ... more indicators

sensitive_indicators:
  # Campaign structure
  - campaign
  - ad set
  # ... more indicators

masking_patterns:
  - pattern: "^campaign"
    prefix: "Campaign"
  - pattern: "^ad set|^adset"
    prefix: "AdSet"
  # ... more patterns

default_mask_prefix: "Item"
hash_length: 8
normalize_column_names: true
reversible_masking: true
iterations: 100000
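
As a rough sketch of how such a configuration could drive column detection, the snippet below classifies column names and picks a mask prefix. Several details are assumptions made for illustration: the normalization rules, substring matching, checking the preserve rules before the sensitive ones, and matching `masking_patterns` against the column name. The function names `normalize`, `classify`, and `pick_prefix` are hypothetical.

```python
import re

# A trimmed-down config using only terms from the example above.
CONFIG = {
    "numeric_indicators": ["cost", "spend", "revenue"],
    "date_indicators": ["date", "day"],
    "sensitive_indicators": ["campaign", "ad set"],
    "masking_patterns": [
        {"pattern": "^campaign", "prefix": "Campaign"},
        {"pattern": "^ad set|^adset", "prefix": "AdSet"},
    ],
    "default_mask_prefix": "Item",
}


def normalize(name: str) -> str:
    """Assumed normalization: separators to single spaces, trim, lowercase."""
    return re.sub(r"[\s_\-]+", " ", name).strip().lower()


def classify(column: str, config: dict = CONFIG) -> str:
    """Return 'numeric', 'date', 'sensitive', or 'other' for a column name.

    Assumption: substring matching, with preserve rules checked first.
    """
    name = normalize(column)
    if any(term in name for term in config["numeric_indicators"]):
        return "numeric"
    if any(term in name for term in config["date_indicators"]):
        return "date"
    if any(term in name for term in config["sensitive_indicators"]):
        return "sensitive"
    return "other"


def pick_prefix(column: str, config: dict = CONFIG) -> str:
    """Choose the mask prefix from the first matching pattern."""
    name = normalize(column)
    for rule in config["masking_patterns"]:
        if re.search(rule["pattern"], name):
            return rule["prefix"]
    return config["default_mask_prefix"]
```

With this sketch, a header such as "Ad_Set-Name" normalizes to "ad set name", is classified as sensitive, and receives the "AdSet" prefix, while "Total Spend" is classified as numeric and would be preserved.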

Masking Methods

Data Masker supports two masking methods:

  1. Hash-based Masking (default): Uses a one-way hash with a salt to create consistent but irreversible masked values.

  2. Reversible Encryption: When a private key is provided, uses the Fernet symmetric encryption algorithm to create reversible masked values. This allows data to be unmasked later using the same private key and salt.

The format of masked values differs:

  • Hash-based: Prefix_a1b2c3d4
  • Reversible: Prefix_enc:encrypted-data-in-base64
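
The hash-based format can be illustrated with a short sketch. The choice of SHA-256, the way salt and value are combined, and the helper name `hash_mask` are assumptions for illustration; only the Prefix_hash shape and the `hash_length` truncation come from the description above.

```python
import hashlib


def hash_mask(value: str, salt: str, prefix: str = "Item", hash_length: int = 8) -> str:
    """Hypothetical helper: salted SHA-256, truncated to `hash_length` hex chars."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:hash_length]}"


first = hash_mask("Spring Sale", salt="my-salt-2023", prefix="Campaign")
second = hash_mask("Spring Sale", salt="my-salt-2023", prefix="Campaign")
other_salt = hash_mask("Spring Sale", salt="different-salt", prefix="Campaign")
```

Here `first` and `second` are identical because the input and salt are the same, while `other_salt` differs. A shorter `hash_length` gives more compact values but raises the chance that two different inputs collide.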

Security Considerations

When using reversible masking:

  1. Keep your private key secure - anyone with the key and salt can unmask the data
  2. Use a strong private key - longer, more complex keys are more secure
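  3. Change the salt periodically for improved security; keep in mind that data masked with an old salt can only be unmasked with that same salt, and hashed values produced with different salts no longer match across files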
  4. Limit access to masked files that contain reversibly masked data

Supported Advertising Platforms

The default configuration includes patterns for:

  • Google Ads - campaigns, ad groups, keywords, etc.
  • Facebook Ads - campaigns, ad sets, custom audiences, etc.
  • Other platforms - generic patterns that work across platforms

Programmatic Usage

You can also use the DataMasker class in your Python code:

import pandas as pd

from masker import DataMasker

# Initialize with default configuration
masker = DataMasker()

# Or with custom configuration
masker = DataMasker("config.yaml")

# Analyze a file without masking
analysis = masker.mask_file("data.csv", analyze_only=True)
print(f"Will mask {len(analysis['to_mask_columns'])} columns")

# Mask a DataFrame with reversible encryption
df = pd.read_csv("data.csv")
masked_df = masker.mask_dataframe(df, salt="my-salt", private_key="my-secret-key")

# Unmask a previously masked DataFrame
unmasked_df = masker.unmask_dataframe(masked_df, salt="my-salt", private_key="my-secret-key")

# Or mask a file directly
result = masker.mask_file("data.csv", "masked_data.csv", 
                         salt="my-salt", private_key="my-secret-key")
print(f"Masked {len(result['masked_columns'])} columns")

# Unmask a file
result = masker.unmask_file("masked_data.csv", "unmasked_data.csv", 
                           salt="my-salt", private_key="my-secret-key")
print(f"Unmasked {len(result['unmasked_columns'])} columns")

How It Works

  1. Column names are normalized and analyzed against configuration patterns
  2. The script identifies columns containing numeric data, dates, and sensitive information
  3. For columns that need masking:
    • With hash-based masking: values are hashed with a salt for consistency
    • With reversible masking: values are encrypted with a private key and salt
  4. Pattern matching determines appropriate type prefixes for masked values
  5. The result maintains the structure of the original data but with sensitive info masked
  6. For unmasking, the process is reversed using the same private key and salt
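
The steps above can be sketched end to end on two small in-memory CSV "files" to show why consistent hashing keeps cross-file joins working. This is a simplified illustration using only the standard library, not the tool's implementation: `mask_value` and `mask_csv` are hypothetical helpers, the sensitive-column check is reduced to a single term, and SHA-256 is an assumed hash function.

```python
import csv
import hashlib
import io

SENSITIVE_TERMS = ("campaign",)  # assumed subset of sensitive_indicators
SALT = "my-salt-2023"


def mask_value(value: str, prefix: str = "Campaign", length: int = 8) -> str:
    """Salted, truncated hash with a type prefix (assumed scheme)."""
    digest = hashlib.sha256(f"{SALT}:{value}".encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:length]}"


def mask_csv(text: str) -> list:
    """Mask sensitive columns of CSV text; leave every other column as-is."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        masked = {}
        for column, value in row.items():
            is_sensitive = any(term in column.lower() for term in SENSITIVE_TERMS)
            masked[column] = mask_value(value) if is_sensitive else value
        rows.append(masked)
    return rows


spend = mask_csv("Campaign,Cost\nSpring Sale,120.50\nBrand,80.00\n")
clicks = mask_csv("Campaign,Clicks\nBrand,40\nSpring Sale,95\n")

# Join the two masked "files" on the masked campaign token.
clicks_by_campaign = {row["Campaign"]: row["Clicks"] for row in clicks}
joined = [
    (row["Campaign"], row["Cost"], clicks_by_campaign[row["Campaign"]])
    for row in spend
]
```

Each entry in `joined` pairs the cost and click metrics for the same campaign even though the campaign name itself never appears in the masked data.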

Dependencies

  • pandas - for data handling
  • pyyaml - for configuration file parsing
  • cryptography - for secure encryption/decryption (required for reversible masking)

License

MIT License