Feature: distribution calculators #352

Merged
merged 19 commits into from
Jan 22, 2024
770a42c
First version of continuous distribution calculator working
nnansters Jan 10, 2024
9c9c3eb
Refactor plotting to support drift results for alert specification
nnansters Jan 11, 2024
ebb900a
Support running ContinuousDistributionCalculator in the Runner
nnansters Jan 12, 2024
2a52cda
Fix pickling ContinuousDistributionCalculator
michael-nml Jan 12, 2024
56a16f1
Working version of CategoricalDistributionCalculator
nnansters Jan 12, 2024
d42ecdd
This is not how overload works.
nnansters Jan 12, 2024
26a78e4
Getting index-based plots to work
nnansters Jan 12, 2024
fd2e868
Support categorical distribution calculator in the runner
nnansters Jan 12, 2024
0d0fd8d
Fix Flake8 & mypy
nnansters Jan 15, 2024
60a2597
Merge branch 'main' into feat/distribution-calculators
nnansters Jan 17, 2024
3cbb56e
Expose option to downscale resolution of individual joyplots for cont…
nnansters Jan 17, 2024
529876b
Expose cumulative density for KDE quartiles
michael-nml Jan 18, 2024
78c87c5
Use first point >= quartile instead of closest
michael-nml Jan 18, 2024
331dab1
Updated default thresholds for Univariate Drift detection methods
nnansters Jan 17, 2024
4fe8bbd
Fix broken ranker tests. This is why we do PR's kids.
nnansters Jan 17, 2024
975e4b5
Fix linting
nnansters Jan 17, 2024
9dec43c
Register summary stats in CLI runner (#353)
michael-nml Jan 17, 2024
eaef148
Unique identifier column to nannyML datasets (#348)
santiviquez Jan 18, 2024
581277d
Merge branch 'main' into feat/distribution-calculators
nnansters Jan 22, 2024
1 change: 1 addition & 0 deletions nannyml/__init__.py
@@ -50,6 +50,7 @@
    load_titanic_dataset,
    load_us_census_ma_employment_data,
)
from .distribution import CategoricalDistributionCalculator, ContinuousDistributionCalculator
from .drift import AlertCountRanker, CorrelationRanker, DataReconstructionDriftCalculator, UnivariateDriftCalculator
from .exceptions import ChunkerException, InvalidArgumentsException, MissingMetadataException
from .io import DatabaseWriter, PickleFileWriter, RawFilesWriter
1 change: 1 addition & 0 deletions nannyml/data_quality/unseen/calculator.py
@@ -219,6 +219,7 @@ def _calculate(self, data: pd.DataFrame, *args, **kwargs) -> Result:
# Applicable here but to many of the base classes as well (e.g. fitting and calculating)
self.result = self.result.filter(period='reference')
self.result.data = pd.concat([self.result.data, res]).reset_index(drop=True)
self.result.data.sort_index(inplace=True)

return self.result

2 changes: 2 additions & 0 deletions nannyml/distribution/__init__.py
@@ -0,0 +1,2 @@
from .categorical import CategoricalDistributionCalculator
from .continuous import ContinuousDistributionCalculator
1 change: 1 addition & 0 deletions nannyml/distribution/categorical/__init__.py
@@ -0,0 +1 @@
from .calculator import CategoricalDistributionCalculator
140 changes: 140 additions & 0 deletions nannyml/distribution/categorical/calculator.py
@@ -0,0 +1,140 @@
from typing import List, Optional, Union

import numpy as np
import pandas as pd
from typing_extensions import Self

from nannyml import Chunker
from nannyml.base import AbstractCalculator, _list_missing
from nannyml.distribution.categorical.result import Result
from nannyml.exceptions import InvalidArgumentsException


class CategoricalDistributionCalculator(AbstractCalculator):
    def __init__(
        self,
        column_names: Union[str, List[str]],
        timestamp_column_name: Optional[str] = None,
        chunk_size: Optional[int] = None,
        chunk_number: Optional[int] = None,
        chunk_period: Optional[str] = None,
        chunker: Optional[Chunker] = None,
    ):
        super().__init__(
            chunk_size,
            chunk_number,
            chunk_period,
            chunker,
            timestamp_column_name,
        )

        self.column_names = column_names if isinstance(column_names, List) else [column_names]
        self.result: Optional[Result] = None
        self._was_fitted: bool = False

    def _fit(self, reference_data: pd.DataFrame, *args, **kwargs) -> Self:
        self.result = self._calculate(reference_data)
        self._was_fitted = True

        return self

    def _calculate(self, data: pd.DataFrame, *args, **kwargs) -> Result:
        if data.empty:
            raise InvalidArgumentsException('data contains no rows. Please provide a valid data set.')

        _list_missing(self.column_names, data)

        # result_data = pd.DataFrame(columns=_create_multilevel_index(self.column_names))
        result_data = pd.DataFrame()

        chunks = self.chunker.split(data)
        chunks_data = pd.DataFrame(
            {
                'key': [c.key for c in chunks],
                'chunk_index': [c.chunk_index for c in chunks],
                'start_datetime': [c.start_datetime for c in chunks],
                'end_datetime': [c.end_datetime for c in chunks],
                'start_index': [c.start_index for c in chunks],
                'end_index': [c.end_index for c in chunks],
                'period': ['analysis' if self._was_fitted else 'reference' for _ in chunks],
            }
        )

        for column in self.column_names:
            value_counts = calculate_value_counts(
                data=data[column],
                chunker=self.chunker,
                timestamps=data.get(self.timestamp_column_name, default=None),
                max_number_of_categories=5,
                missing_category_label='Missing',
                column_name=column,
            )
            result_data = pd.concat([result_data, pd.merge(chunks_data, value_counts, on='chunk_index')])

        # result_data.index = pd.MultiIndex.from_tuples(list(zip(result_data['column_name'], result_data['value'])))

        if self.result is None:
            self.result = Result(result_data, self.column_names, self.timestamp_column_name, self.chunker)
        else:
            # self.result = self.result.data.loc[self.result.data['period'] == 'reference', :]
            self.result.data = pd.concat([self.result.data, result_data]).reset_index(drop=True)

        return self.result


def calculate_value_counts(
    data: Union[np.ndarray, pd.Series],
    chunker: Chunker,
    missing_category_label,
    max_number_of_categories,
    timestamps: Optional[Union[np.ndarray, pd.Series]] = None,
    column_name: Optional[str] = None,
):
    if isinstance(data, np.ndarray):
        if column_name is None:
            raise InvalidArgumentsException("'column_name' can not be None when 'data' is of type 'np.ndarray'.")
        data = pd.Series(data, name=column_name)
    else:
        column_name = data.name

    data = data.astype("category")
    cat_str = [str(value) for value in data.cat.categories.values]
    data = data.cat.rename_categories(cat_str)
    data = data.cat.add_categories([missing_category_label, 'Other'])
    data = data.fillna(missing_category_label)

    if max_number_of_categories:
        top_categories = data.value_counts().index.tolist()[:max_number_of_categories]
        if data.nunique() > max_number_of_categories + 1:
            data.loc[~data.isin(top_categories)] = 'Other'

    data = data.cat.remove_unused_categories()

    categories_ordered = data.value_counts().index.tolist()
    categorical_data = pd.Categorical(data, categories_ordered)

    # TODO: deal with None timestamps
    if isinstance(timestamps, pd.Series):
        timestamps = timestamps.reset_index()

    chunks = chunker.split(pd.concat([pd.Series(categorical_data, name=column_name), timestamps], axis=1))
    data_with_chunk_keys = pd.concat([chunk.data.assign(chunk_index=chunk.chunk_index) for chunk in chunks])

    value_counts_table = (
        data_with_chunk_keys.groupby(['chunk_index'])[column_name]
        .value_counts()
        .to_frame('value_counts')
        .sort_values(by=['chunk_index', 'value_counts'])
        .reset_index()
        .rename(columns={column_name: 'value'})
        .assign(column_name=column_name)
    )

    value_counts_table['value_counts_total'] = value_counts_table['chunk_index'].map(
        value_counts_table.groupby('chunk_index')['value_counts'].sum()
    )
    value_counts_table['value_counts_normalised'] = (
        value_counts_table['value_counts'] / value_counts_table['value_counts_total']
    )

    return value_counts_table
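
For readers who want to see what `calculate_value_counts` does to a column without setting up a `Chunker`, the following is a minimal pandas-only sketch of the same two steps: fold everything outside the top N labels into `'Other'`, then count and normalise per chunk. The helper names `bucket_rare_categories` and `chunked_value_counts` are hypothetical and not part of nannyml, and the chunk index is passed in directly as an assumption in place of `chunker.split`.

```python
import pandas as pd


def bucket_rare_categories(
    values: pd.Series,
    max_categories: int = 5,
    missing_label: str = "Missing",
    other_label: str = "Other",
) -> pd.Series:
    """Hypothetical helper: label missing values, then fold rare labels into `other_label`."""
    # Missing values become an ordinary label, as in the diff above.
    labelled = values.astype("object").where(values.notna(), missing_label).astype(str)
    top = labelled.value_counts().index[:max_categories]
    # Same guard as the diff: only bucket when there are more than N + 1 distinct labels.
    if labelled.nunique() > max_categories + 1:
        labelled = labelled.where(labelled.isin(top), other_label)
    return labelled


def chunked_value_counts(labels: pd.Series, chunk_index: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: per-chunk counts, totals and normalised counts."""
    table = (
        pd.DataFrame({"chunk_index": chunk_index.to_numpy(), "value": labels.to_numpy()})
        .groupby(["chunk_index", "value"])
        .size()
        .rename("value_counts")
        .reset_index()
    )
    table["value_counts_total"] = table.groupby("chunk_index")["value_counts"].transform("sum")
    table["value_counts_normalised"] = table["value_counts"] / table["value_counts_total"]
    return table


# Twelve rows split into two chunks of six; the chunk assignment is made up for illustration.
values = pd.Series(["a", "a", "a", "b", "b", None, "c", "d", "a", "b", "e", "a"])
chunk_index = pd.Series([0] * 6 + [1] * 6)

labels = bucket_rare_categories(values, max_categories=2)
table = chunked_value_counts(labels, chunk_index)
print(table)
```

Note that the missing-value label competes with every other label for a top-N slot, so with a small `max_categories` it can itself end up in `'Other'`. The code in the diff behaves the same way, because `'Missing'` is filled in before the top categories are selected.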