Test test_update_shuffle_no_partition_on[None] is flaky #500

Open
steffen-schroeder-by opened this issue Nov 29, 2021 · 0 comments

Problem description

The test io/dask/dataframe/test_shuffle.py::test_update_shuffle_no_partition_on[None] is flaky and fails in a small fraction of runs.
This was first observed in https://github.com/JDASoftwareGroup/kartothek/runs/4334825494?check_suite_focus=true but is also reproducible on master:

================================================================================ FAILURES =================================================================================
___________________________________________________________ test_update_shuffle_no_partition_on[None-550-2000] ____________________________________________________________

store_factory = functools.partial(<function get_store_from_url at 0x7f884861eb80>, 'hfs:///private/var/folders/78/9wnl2y0s66dcy42qb8_20nwm0000gr/T/pytest-of-clbb/pytest-81/test_update_shuffle_no_partiti549/store')
bucket_by = None

    @pytest.mark.repeat(2000)
    @pytest.mark.parametrize("bucket_by", [None, "range"])
    def test_update_shuffle_no_partition_on(store_factory, bucket_by):
        df = pd.DataFrame(
            {
                "range": np.arange(10),
                "range_duplicated": np.repeat(np.arange(2), 5),
                "random": np.random.randint(0, 100, 10),
            }
        )
        ddf = dd.from_pandas(df, npartitions=10)

        with pytest.raises(
            ValueError, match="``num_buckets`` must not be None when shuffling data."
        ):
            update_dataset_from_ddf(
                ddf,
                store_factory,
                dataset_uuid="output_dataset_uuid",
                table="table",
                shuffle=True,
                num_buckets=None,
                bucket_by=bucket_by,
            ).compute()

        res_default = update_dataset_from_ddf(
            ddf,
            store_factory,
            dataset_uuid="output_dataset_uuid_default",
            table="table",
            shuffle=True,
            bucket_by=bucket_by,
        ).compute()
        assert len(res_default.partitions) == 1

        res = update_dataset_from_ddf(
            ddf,
            store_factory,
            dataset_uuid="output_dataset_uuid",
            table="table",
            shuffle=True,
            num_buckets=2,
            bucket_by=bucket_by,
        ).compute()

>       assert len(res.partitions) == 2
E       assert 1 == 2
E         +1
E         -2

bucket_by  = None
ddf        = Dask DataFrame Structure:
               range range_duplicated random __KTK_HASH_BUCKET
npartitions=9                ...   ...               ...
9                ...              ...    ...               ...
Dask Name: from_pandas, 9 tasks
df         =    range  range_duplicated  random
0      0                 0      58
1      1                 0      32
2      2     ...     1      99
7      7                 1      18
8      8                 1      78
9      9                 1      69
res        = DatasetMetadata(uuid=output_dataset_uuid, tables=['table'], partition_keys=[], metadata_version=4, indices=[], explicit_partitions=True)
res_default = DatasetMetadata(uuid=output_dataset_uuid_default, tables=['table'], partition_keys=[], metadata_version=4, indices=[], explicit_partitions=True)
store_factory = functools.partial(<function get_store_from_url at 0x7f884861eb80>, 'hfs:///private/var/folders/78/9wnl2y0s66dcy42qb8_20nwm0000gr/T/pytest-of-clbb/pytest-81/test_update_shuffle_no_partiti549/store')

io/dask/dataframe/test_shuffle.py:91: AssertionError
============================================================================ warnings summary =============================================================================
tests/io/dask/dataframe/test_shuffle.py::test_update_shuffle_no_partition_on[None-550-2000]
tests/io/dask/dataframe/test_shuffle.py::test_update_shuffle_no_partition_on[None-550-2000]
  /Users/clbb/dev/kartothek/kartothek/core/dataset.py:107: DeprecationWarning: The attribute `DatasetMetadataBase.table_meta` will be removed in kartothek 4.0 in favour of `DatasetMetadataBase.schema`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
========================================================================= short test summary info =========================================================================
FAILED io/dask/dataframe/test_shuffle.py::test_update_shuffle_no_partition_on[None-550-2000] - assert 1 == 2
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================= 1 failed, 549 passed, 7729 deselected, 2 warnings in 343.11s (0:05:43) ==================================================
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52668
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52667
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:52668', name: tcp://127.0.0.1:52668, status: running, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52668
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:52667', name: tcp://127.0.0.1:52667, status: closing, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52667
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing..

This was reproduced using pytest-repeat, with an @pytest.mark.repeat(2000) marker added on top of the test.
Executed command: pytest -v -k "test_update_shuffle_no_partition_on[None" -x --showlocals --repeat-scope session

@pytest.mark.repeat(2000)
@pytest.mark.parametrize("bucket_by", [None, "range"])
def test_update_shuffle_no_partition_on(store_factory, bucket_by):
    df = pd.DataFrame(
        {
            "range": np.arange(10),
            "range_duplicated": np.repeat(np.arange(2), 5),
            "random": np.random.randint(0, 100, 10),
        }
    )
    ddf = dd.from_pandas(df, npartitions=10)

    with pytest.raises(
        ValueError, match="``num_buckets`` must not be None when shuffling data."
    ):
        update_dataset_from_ddf(
            ddf,
            store_factory,
            dataset_uuid="output_dataset_uuid",
            table="table",
            shuffle=True,
            num_buckets=None,
            bucket_by=bucket_by,
        ).compute()

    res_default = update_dataset_from_ddf(
        ddf,
        store_factory,
        dataset_uuid="output_dataset_uuid_default",
        table="table",
        shuffle=True,
        bucket_by=bucket_by,
    ).compute()
    assert len(res_default.partitions) == 1

    res = update_dataset_from_ddf(
        ddf,
        store_factory,
        dataset_uuid="output_dataset_uuid",
        table="table",
        shuffle=True,
        num_buckets=2,
        bucket_by=bucket_by,
    ).compute()

    assert len(res.partitions) == 2
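
A possible explanation, not confirmed in this issue: with bucket_by=None the bucket of each row may depend on the row's contents, including the "random" column, so occasionally all 10 rows could end up in the same bucket and only one partition is written. The sketch below estimates how often that would happen, assuming each row lands in one of the 2 buckets uniformly and independently. Both helper functions are hypothetical and not part of kartothek. Under that assumption the per-run failure rate is 1/512, which is in the same range as the log above, where the first failure occurred at repetition 550.

```python
def prob_all_rows_in_one_bucket(n_rows: int, num_buckets: int) -> float:
    """Probability that all rows share a single bucket.

    Assumption (not taken from kartothek): every row is assigned to a
    bucket uniformly at random and independently of the other rows.
    """
    if n_rows < 1 or num_buckets < 1:
        raise ValueError("n_rows and num_buckets must be positive")
    # Choose the shared bucket (num_buckets ways); each row must land in it.
    return num_buckets * (1.0 / num_buckets) ** n_rows


def prob_at_least_one_failure(p_fail: float, n_runs: int) -> float:
    """Probability that at least one of n_runs independent runs fails."""
    return 1.0 - (1.0 - p_fail) ** n_runs


if __name__ == "__main__":
    # Values from the test: a 10-row DataFrame and num_buckets=2.
    p_fail = prob_all_rows_in_one_bucket(n_rows=10, num_buckets=2)  # 2 * 0.5**10 = 1/512
    print(f"per-run failure probability: {p_fail:.6f}")
    print(
        "probability of at least one failure in 2000 runs: "
        f"{prob_at_least_one_failure(p_fail, 2000):.3f}"
    )
```

If this assumption holds, the assertion on len(res.partitions) cannot be made reliable without fixing the random data, for example by seeding NumPy or using constant values in the "random" column.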

Used versions

Package                           Version             Location                 
--------------------------------- ------------------- -------------------------
appnope                           0.1.2               
asv                               0.4.2               
attrs                             21.2.0              
backcall                          0.2.0               
backports.entry-points-selectable 1.1.1               
cffi                              1.15.0              
cfgv                              3.3.1               
click                             8.0.3               
cloudpickle                       2.0.0               
coverage                          6.1.2               
dask                              2021.11.2           
decorator                         5.1.0               
deprecation                       2.1.0               
distlib                           0.3.3               
distributed                       2021.11.2           
filelock                          3.4.0               
flake8                            4.0.1               
flake8-mutable                    1.2.0               
freezegun                         1.1.0               
fsspec                            2021.11.0           
HeapDict                          1.0.1               
hypothesis                        6.24.6              
identify                          2.3.6               
iniconfig                         1.1.1               
ipython                           7.29.0              
jedi                              0.18.1              
Jinja2                            3.0.3               
kartothek                         5.2.1.dev4+gf4dc09f /Users/xxx/dev/kartothek
locket                            0.2.1               
MarkupSafe                        2.0.1               
matplotlib-inline                 0.1.3               
mccabe                            0.6.1               
milksnake                         0.1.5               
msgpack                           1.0.2               
nodeenv                           1.6.0               
numpy                             1.21.4              
packaging                         21.3                
pandas                            1.3.4               
parso                             0.8.2               
partd                             1.2.0               
pexpect                           4.8.0               
pickleshare                       0.7.5               
pip                               19.2.3              
platformdirs                      2.4.0               
pluggy                            1.0.0               
pre-commit                        2.15.0              
prompt-toolkit                    3.0.22              
psutil                            5.8.0               
ptyprocess                        0.7.0               
py                                1.11.0              
pyarrow                           3.0.0               
pycodestyle                       2.8.0               
pycparser                         2.21                
pyflakes                          2.4.0               
Pygments                          2.10.0              
pyparsing                         3.0.6               
pytest                            6.2.5               
pytest-cov                        3.0.0               
pytest-mock                       3.6.1               
pytest-repeat                     0.9.1               
python-dateutil                   2.8.2               
pytz                              2021.3              
PyYAML                            6.0                 
setuptools                        41.2.0              
setuptools-scm                    6.3.2               
simplejson                        3.17.6              
simplekv                          0.14.1              
six                               1.16.0              
sortedcontainers                  2.4.0               
storefact                         0.10.0              
tblib                             1.7.0               
toml                              0.10.2              
tomli                             1.2.2               
toolz                             0.11.2              
tornado                           6.1                 
traitlets                         5.1.1               
uritools                          3.0.2               
urlquote                          1.1.4               
virtualenv                        20.10.0             
wcwidth                           0.2.5               
zict                              2.0.0               
zstandard                         0.16.0              

