AshtonIzmev/spark-anonymization-toolkit

Why?

This project is a Spark/Scala toolkit for data classification and anonymization as code: anonymization rules are declared in a policy file and applied to the data with Spark.

Features

{
  "db": "db1",
  "tables": [
    {
      "table": "tb1",
      "limit": -1,
      "columns": [
        {
          "column": "col1",
          "classification": "personnelle",
          "policy": "fake_firstname"
        },
        {
          "column": "col2",
          "classification": "sensible",
          "policy": "round_ten"
        },
        {
          "column": "col3",
          "classification": "sensible",
          "policy": "weaken_month"
        }
      ]
    }
  ]
}

A policy file names a database and lists its tables. Within each table, every listed column is given a classification level and an anonymization policy. Columns that are not listed are kept as is.
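
The way a policy file drives anonymization can be sketched in plain Scala, without Spark. Everything below is illustrative: the case classes, the `PolicySketch` object and the `applyPolicies` helper are hypothetical names, and the behaviour attached to each policy name (`round_ten` rounding to the nearest multiple of ten, `weaken_month` truncating a date to the first day of its month, `fake_firstname` substituting a placeholder name) is an assumption inferred from the names, not the toolkit's confirmed implementation.

```scala
import java.time.LocalDate

// Hypothetical model mirroring the JSON policy file; the toolkit's real classes may differ.
final case class ColumnPolicy(column: String, classification: String, policy: String)
final case class TablePolicy(table: String, limit: Int, columns: Seq[ColumnPolicy])

object PolicySketch {
  // Assumed semantics: replace any value with a fixed placeholder first name.
  val fakeFirstname: Any => Any = _ => "Jane"

  // Assumed semantics: round an integer to the nearest multiple of ten.
  val roundTen: Any => Any = {
    case n: Int => (math.round(n / 10.0) * 10).toInt
    case other  => other
  }

  // Assumed semantics: keep year and month, reset the day to the 1st.
  val weakenMonth: Any => Any = {
    case d: LocalDate => d.withDayOfMonth(1)
    case other        => other
  }

  val policies: Map[String, Any => Any] = Map(
    "fake_firstname" -> fakeFirstname,
    "round_ten"      -> roundTen,
    "weaken_month"   -> weakenMonth
  )

  // Apply each column policy to a row; columns without a policy are left untouched.
  def applyPolicies(row: Map[String, Any], table: TablePolicy): Map[String, Any] =
    table.columns.foldLeft(row) { (acc, cp) =>
      (acc.get(cp.column), policies.get(cp.policy)) match {
        case (Some(value), Some(f)) => acc.updated(cp.column, f(value))
        case _                      => acc
      }
    }
}
```

With the `tb1` policy from the example file, a row would have `col1`, `col2` and `col3` rewritten while any other column passes through unchanged, which matches the "kept as is" rule above.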

Controlling the anonymization using entropy

The entropy of categorical columns can be computed with meanCategEntropy, and an approximation of the entropy of numerical and date columns with meanContinuousEntropy.
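
To make the idea concrete, here is a minimal plain-Scala sketch of the two kinds of measurement. It is not the code behind meanCategEntropy or meanContinuousEntropy: `EntropySketch`, `shannonEntropy` and `binnedEntropy` are hypothetical names, the categorical case uses the standard Shannon entropy in bits, and the continuous case assumes a simple equal-width binning as the approximation, which may differ from what the toolkit does.

```scala
object EntropySketch {
  // Shannon entropy in bits: H = -sum(p * log2(p)) over the distinct values.
  def shannonEntropy[A](values: Seq[A]): Double = {
    val n = values.size.toDouble
    if (n == 0) 0.0
    else
      values.groupBy(identity).values.map { group =>
        val p = group.size / n
        -p * math.log(p) / math.log(2)
      }.sum
  }

  // Approximate entropy of a continuous column (assumption: equal-width bins
  // between the min and max, followed by the categorical entropy of the bin ids).
  def binnedEntropy(values: Seq[Double], bins: Int): Double = {
    require(bins > 0, "bins must be positive")
    if (values.isEmpty) 0.0
    else {
      val lo = values.min
      val width = (values.max - lo) / bins
      val binIds = values.map { v =>
        if (width == 0) 0 else math.min(((v - lo) / width).toInt, bins - 1)
      }
      shannonEntropy(binIds)
    }
  }
}
```

Comparing such a measure on a column before and after a policy is applied gives a rough indication of how much information the anonymization removed: a column whose values have all been collapsed to one value has entropy 0.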

TODO

  • Date conversion for entropy calculation
  • Complete dataframe entropy calculation (detecting column types)
  • Hierarchy anonymization implicit
  • Continuous variable "normalization and binning" anonymization implicit
  • Big-data-compatible variant of k-anonymity
