Fix error in AXLearn tests, and remove unnecessary ones #1443


Open · wants to merge 48 commits into main
Conversation

@Steboss Steboss commented May 9, 2025

This PR does the following:

  • remove array_serialization_test.py, which causes the tests to hang with the following error:
[pod/axlearn-14925960362-xb57t/axlearn]   File "/opt/jax/jax/experimental/array_serialization/serialization.py", line 193, in __del__
[pod/axlearn-14925960362-xb57t/axlearn]     logger.warning('Please add `.wait_until_finished()` in the main thread '
[pod/axlearn-14925960362-xb57t/axlearn] Message: 'Please add `.wait_until_finished()` in the main thread before your program finishes because there is a possibility of losing errors raised if the this class is deleted before writing is completed.'
[pod/axlearn-14925960362-xb57t/axlearn] Arguments: ()
[pod/axlearn-14925960362-xb57t/axlearn] sssssssssssssss.ssssssssssss.ssssssssssssssssssssssFsFs.s...sssssssFs.Fs [ 97%]

This causes the EKS job to run out of time, so we can't collect the test results.

  • remove tests that are redundant (similar tests already run elsewhere), such as:
"/opt/axlearn/axlearn/common/deberta_test.py"
"/opt/axlearn/axlearn/common/distilbert_test.py"
"/opt/axlearn/axlearn/common/trainer_test.py"
"/opt/axlearn/axlearn/common/decoder_test.py"
"/opt/axlearn/axlearn/common/adapter_torch_test.py"
"/opt/axlearn/axlearn/common/attention_test.py"
"/opt/axlearn/axlearn/common/convolution_test.py"
  • remove tests for models that we're not currently using:
"/opt/axlearn/axlearn/common/mixture_of_experts_test.py"
"/opt/axlearn/axlearn/common/t5_test.py"
"/opt/axlearn/axlearn/common/vision_transformer_test.py"
"/opt/axlearn/axlearn/common/input_reading_comprehension_test.py"
"/opt/axlearn/axlearn/common/input_t5_test.py"
  • remove tests like summary_writer_test.py that mostly exercise Python libraries we're not using here (e.g. wandb)
  • install pytest-xdist and pytest-reportlog to avoid the following error (a sketch of the fix follows this list):
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --report-log=/tmp/tmp.iL7rVQtXBq --dist=load --tx --tx 
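
Both sets of unrecognized flags come from pytest plugins: --dist/--tx from pytest-xdist, --report-log from pytest-reportlog. A minimal sketch of the fix; the invocation below is illustrative, not the exact CI command:

# Install the plugins that provide --dist/--tx and --report-log.
pip install pytest-xdist pytest-reportlog
# Illustrative invocation; real paths and worker counts come from the CI scripts.
pytest --report-log=/tmp/report.jsonl --dist=load -n 6 /opt/axlearn/axlearn/common/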

Overall, this should reduce the testing time from 50 minutes to 30 minutes while still covering the most important tests, i.e. those exercising the general AXLearn infrastructure.

@olupton olupton left a comment


I see an error that seems like a test/infra bug, rather than a failing test: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/14931992154/job/41952767038#step:6:41440

Some of the tests look like they are failing due to missing input data.

Also, the CI job is marked as successful despite the tests failing.

Steboss commented May 13, 2025

In this PR I added a workflow dispatch, so we can test individual parts of the CI.
In particular, this could be a useful pattern for the CI in general: set MODE=SOMETHING to trigger only specific parts of the workflow, rather than running everything. A sketch of the idea follows.
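
A minimal sketch of that dispatch in the CI driver script; the MODE values and helper functions are hypothetical, not the actual workflow inputs:

# Hypothetical dispatch on MODE; the cases and helpers are placeholders.
case "${MODE:-all}" in
  tests) run_unit_tests ;;
  eks)   run_eks_job ;;
  all)   run_unit_tests && run_eks_job ;;
  *)     echo "Unknown MODE: ${MODE}" >&2; exit 1 ;;
esac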

@Steboss Steboss requested a review from olupton May 13, 2025 16:41
Steboss commented May 14, 2025

@olupton
The AXLearn tests are now running fine, though a few are still failing; I can have a look at those.
I'm also working on how we monitor and return the k8s job status, since the AXLearn EKS job still reports green even when tests fail.

Steboss commented May 14, 2025

@olupton
It looks like we need XLA_FLAGS="--xla_force_host_platform_device_count=8" for the for_8_devices tests; otherwise they fail with:

[pod/axlearn-15024599931-2vfmc/axlearn] ___________________ HostArrayTest.test_fixed_process_shape67 ___________________
[pod/axlearn-15024599931-2vfmc/axlearn] [gw93] linux -- Python 3.12.3 /usr/bin/python3
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn] self = <axlearn.common.host_array_test.HostArrayTest testMethod=test_fixed_process_shape67>
[pod/axlearn-15024599931-2vfmc/axlearn] platform = 'cpu', mesh_shape = (-1, 2), process_shape = [1]
[pod/axlearn-15024599931-2vfmc/axlearn] partition = PartitionSpec('data', 'model')
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn]     @parameterized.product(
[pod/axlearn-15024599931-2vfmc/axlearn]         platform=("cpu", "tpu"),
[pod/axlearn-15024599931-2vfmc/axlearn]         mesh_shape=[
[pod/axlearn-15024599931-2vfmc/axlearn]             (-1, 1),  # Fully partitioned along one dim.
[pod/axlearn-15024599931-2vfmc/axlearn]             (2, -1),  # Partitioned along multiple dims.
[pod/axlearn-15024599931-2vfmc/axlearn]             (-1, 2),  # Test the other way.
[pod/axlearn-15024599931-2vfmc/axlearn]             (1, -1),
[pod/axlearn-15024599931-2vfmc/axlearn]         ],
[pod/axlearn-15024599931-2vfmc/axlearn]         process_shape=[
[pod/axlearn-15024599931-2vfmc/axlearn]             # Each process produces single dim.
[pod/axlearn-15024599931-2vfmc/axlearn]             [1],  # Not divisible by number of devices (replicated).
[pod/axlearn-15024599931-2vfmc/axlearn]             [8],  # Divisible by number of devices.
[pod/axlearn-15024599931-2vfmc/axlearn]             [16],  # Multiple elements per device.
[pod/axlearn-15024599931-2vfmc/axlearn]             # Each process produces multiple dims.
[pod/axlearn-15024599931-2vfmc/axlearn]             [1, 1],  # Not divisible by number of devices (replicated).
[pod/axlearn-15024599931-2vfmc/axlearn]             [2, 1],  # Can be partitioned over dim=0, replicated on dim=1.
[pod/axlearn-15024599931-2vfmc/axlearn]             [16, 1],  # Multiple elements per device.
[pod/axlearn-15024599931-2vfmc/axlearn]             [2, 4],  # Can be fully partitioned.
[pod/axlearn-15024599931-2vfmc/axlearn]             [8, 8],  # Can be fully partitioned.
[pod/axlearn-15024599931-2vfmc/axlearn]         ],
[pod/axlearn-15024599931-2vfmc/axlearn]         partition=(
[pod/axlearn-15024599931-2vfmc/axlearn]             DataPartitionType.FULL,
[pod/axlearn-15024599931-2vfmc/axlearn]             DataPartitionType.REPLICATED,
[pod/axlearn-15024599931-2vfmc/axlearn]             PartitionSpec("data"),
[pod/axlearn-15024599931-2vfmc/axlearn]             PartitionSpec("data", "model"),
[pod/axlearn-15024599931-2vfmc/axlearn]         ),
[pod/axlearn-15024599931-2vfmc/axlearn]     )
[pod/axlearn-15024599931-2vfmc/axlearn]     # NOTE: while annotated with `for_8_devices`, this runs on other configurations.
[pod/axlearn-15024599931-2vfmc/axlearn]     @pytest.mark.for_8_devices
[pod/axlearn-15024599931-2vfmc/axlearn]     def test_fixed_process_shape(
[pod/axlearn-15024599931-2vfmc/axlearn]         self,
[pod/axlearn-15024599931-2vfmc/axlearn]         platform: str,
[pod/axlearn-15024599931-2vfmc/axlearn]         mesh_shape: tuple[int, int],
[pod/axlearn-15024599931-2vfmc/axlearn]         process_shape: Sequence[int],
[pod/axlearn-15024599931-2vfmc/axlearn]         partition: Union[DataPartitionType, PartitionSpec],
[pod/axlearn-15024599931-2vfmc/axlearn]     ):
[pod/axlearn-15024599931-2vfmc/axlearn]         """Tests roundtrip host-to-global and global-to-host with fixed process shape."""
[pod/axlearn-15024599931-2vfmc/axlearn]     
[pod/axlearn-15024599931-2vfmc/axlearn] >       mesh_shape = infer_mesh_shape(mesh_shape)
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn] axlearn/common/host_array_test.py:124: 
[pod/axlearn-15024599931-2vfmc/axlearn] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn] mesh_shape = (-1, 2)
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn]     def infer_mesh_shape(mesh_shape: MeshShape, *, num_devices: Optional[int] = None) -> MeshShape:
[pod/axlearn-15024599931-2vfmc/axlearn]         """Infer the value for -1 from len(jax.devices()) and other dims if there is -1 in mesh shape.
[pod/axlearn-15024599931-2vfmc/axlearn]     
[pod/axlearn-15024599931-2vfmc/axlearn]         Args:
[pod/axlearn-15024599931-2vfmc/axlearn]             mesh_shape: The original MeshShape, which might have -1 in one axis.
[pod/axlearn-15024599931-2vfmc/axlearn]             num_devices: The devices that will be used to construct the mesh.
[pod/axlearn-15024599931-2vfmc/axlearn]                 If None, defaults to len(jax.devices()).
[pod/axlearn-15024599931-2vfmc/axlearn]     
[pod/axlearn-15024599931-2vfmc/axlearn]         Returns
[pod/axlearn-15024599931-2vfmc/axlearn]             A new MeshShape with inferred value for -1.
[pod/axlearn-15024599931-2vfmc/axlearn]         """
[pod/axlearn-15024599931-2vfmc/axlearn]         if -1 not in mesh_shape:
[pod/axlearn-15024599931-2vfmc/axlearn]             return mesh_shape
[pod/axlearn-15024599931-2vfmc/axlearn]     
[pod/axlearn-15024599931-2vfmc/axlearn]         if mesh_shape.count(-1) > 1:
[pod/axlearn-15024599931-2vfmc/axlearn]             raise ValueError(f"Only one axis can be -1 in {mesh_shape=}.")
[pod/axlearn-15024599931-2vfmc/axlearn]     
[pod/axlearn-15024599931-2vfmc/axlearn]         # Handle the case with one -1.
[pod/axlearn-15024599931-2vfmc/axlearn]         prod = math.prod(mesh_shape, start=-1)
[pod/axlearn-15024599931-2vfmc/axlearn]         if num_devices is None:
[pod/axlearn-15024599931-2vfmc/axlearn]             num_devices = len(jax.devices())
[pod/axlearn-15024599931-2vfmc/axlearn]         if num_devices % prod != 0:
[pod/axlearn-15024599931-2vfmc/axlearn] >           raise ValueError(
[pod/axlearn-15024599931-2vfmc/axlearn]                 f"Unable to infer -1 in mesh shape {mesh_shape} as num_devices {num_devices} "
[pod/axlearn-15024599931-2vfmc/axlearn]                 f"is not a multiple of the product {prod} of mesh axes."
[pod/axlearn-15024599931-2vfmc/axlearn]             )
[pod/axlearn-15024599931-2vfmc/axlearn] E           ValueError: Unable to infer -1 in mesh shape (-1, 2) as num_devices 1 is not a multiple of the product 2 of mesh axes.
[pod/axlearn-15024599931-2vfmc/axlearn] 
[pod/axlearn-15024599931-2vfmc/axlearn] axlearn/common/utils.py:1834: ValueError

I ran the tests both ways:

  • with the flag: 1 failed, 161 passed
  • without the flag: 203 failed, 100 passed
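
For reference, a sketch of setting the flag before the run; only the XLA_FLAGS value is taken from above, the pytest invocation itself is illustrative:

# Expose 8 virtual CPU devices so meshes like (-1, 2) can be inferred.
export XLA_FLAGS="--xla_force_host_platform_device_count=8"
pytest -m for_8_devices axlearn/common/host_array_test.py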

Steboss commented May 14, 2025

This may be an error introduced by a newer JAX version:

AttributeError: module 'jax.experimental.array_serialization.serialization' has no attribute '_spec_has_metadata'

@Steboss Steboss requested a review from olupton May 16, 2025 16:29
@Steboss Steboss requested a review from olupton May 19, 2025 16:39
olupton previously approved these changes May 20, 2025
@olupton olupton left a comment


I think we can merge this to speed up the pipeline, but I left some more comments on error handling and robustness.

# Run tests
pytest-xdist.sh 1 6 ${LOG_DIR}/axlearn-unittests.jsonl test-axlearn.sh --directory "." --output ${LOG_DIR} --test-files "/opt/axlearn/axlearn/common/*_test.py" | tee -a ${LOG_DIR}/pytest_stdout.log

# test on JAX, make sure 8 devices are visible
@olupton (Collaborator) commented:
Presumably this means we are leaving parallelism on the table by launching 1-GPU tests with 8 visible.
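
One possible way to reclaim that parallelism, sketched under the assumption that execnet's popen//env: spec syntax is available (untested here): pin each pytest-xdist worker to its own GPU.

# Build one --tx spec per GPU so each worker sees a single device.
tx_args=()
for gpu in $(seq 0 7); do
  tx_args+=(--tx "popen//env:CUDA_VISIBLE_DEVICES=${gpu}")
done
pytest --dist=load "${tx_args[@]}" /opt/axlearn/axlearn/common/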

@Steboss (Contributor Author) replied:
Fixed :)

echo "Total number of failed tests ${failed}"
echo "Total number of skipped tests ${skipped}"
# add these to summary.txt; we use it to extract the counts later
echo "PASSED: ${passed} FAILED: ${failed} SKIPPED: ${skipped}" >> ${LOG_DIRECTORY}/summary.txt
@Steboss Steboss requested a review from olupton May 21, 2025 12:32