Question: Reproducing clusterd moka-housekeeper segfault #20136
-
Hi. I am the creator of moka, and I am trying to reproduce #19746. If I can reproduce it, I can start a root cause analysis. From #19746:

I am running the "Long Zippy w/ user tables test" locally at commit 9ab5f83, but I have not reproduced the segfault yet.

moka-rs/moka#281

So far, I have run it with the following parameters:

I think I need to keep running it so that I can eventually reproduce the segfault, but I am not sure whether I am doing it right because I am new to Materialize.

Question 1: How often did the segfault happen?

Question 2: Am I running the test correctly?

I am running the "Long Zippy w/ user tables test" as follows. Am I doing it right?

$ git clone git@github.com:MaterializeInc/materialize.git
$ cd materialize
$ git checkout 9ab5f833b
$ ./bin/mzcompose --find zippy down -v
...
$ ./bin/mzcompose --find zippy run default --scenario UserTablesLarge --actions 80000
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-WBP7JFLWFQYUIHLXR5JSOAJ74VVKI6O7
materialize/clusterd:mzbuild-XEZDVCORJVQEPNOVJEOF33HFME4GV6QA
materialize/materialized:mzbuild-TV5MN5FZHFZO2FGJWS43E7Y2ASI423UW
materialize/test-certs:mzbuild-6JWZ2MGLSYOCL3MLHIEIJMMLQLEOLCNY
materialize/postgres:mzbuild-ENTVWBFMO762DEEKAMURDMVMQ52GIO63
materialize/testdrive:mzbuild-NRRGSD7T3OT3MVNN4UFU7H3AXQWWPW4N
==> Running test case workflow-default
==> Running workflow default
...
Generating test...
Running test...
--- #1: KafkaStart
...
--- #2: CockroachStart
...
--- #3: MinioStart
...
--- #4: MzStart
...
--- #5: StoragedStart
...
--- #6: CreateTable
> CREATE TABLE table_0 (f1 INTEGER);
rows match; continuing at ts 1687609226.7478156
> INSERT INTO table_0 VALUES (0);
rows match; continuing at ts 1687609226.7642455
--- #7: CreateTable
> CREATE TABLE table_1 (f1 INTEGER);
rows match; continuing at ts 1687609227.1155088
> INSERT INTO table_1 VALUES (0);
rows match; continuing at ts 1687609227.1261883
--- #8: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 1538;
rows match; continuing at ts 1687609227.4865289
--- #9: ShiftForward table_1
> UPDATE table_1 SET f1 = f1 + 7091;
rows match; continuing at ts 1687609227.7569017
...
--- #79998: DeleteFromHead table_1
> DELETE FROM table_1 WHERE f1 > 16601224;
rows match; continuing at ts 1687651205.2661505
--- #79999: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 443;
rows match; continuing at ts 1687651205.803389
--- #80000: ShiftBackward table_0
> UPDATE table_0 SET f1 = f1 - 1272;
rows match; continuing at ts 1687651206.5515945
==> mzcompose: test case workflow-default succeeded

My environment

I allocated only 8 logical CPU cores to WSL2 for the following reasons:

Other things I tried
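
Editorial sketch (not part of the original post): since a single 80000-action run like the one above can finish without a crash, one way to keep trying unattended is to wrap the command in a small retry helper that stops at the first failing run. The helper name `run_until_failure` and the attempt budget are hypothetical, and the sketch assumes that a crashed run makes the wrapped command exit with a nonzero status, which the thread does not confirm for `mzcompose`.

```shell
# Re-run a command until it fails or the attempt budget is used up.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if every attempt succeeded, otherwise the failing exit status.
# ASSUMPTION: a crash surfaces as a nonzero exit status of COMMAND.
run_until_failure() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "attempt $attempt/$max_attempts: $*" >&2
    if "$@"; then
      attempt=$((attempt + 1))
    else
      status=$?
      echo "attempt $attempt failed with exit status $status" >&2
      return "$status"
    fi
  done
  return 0
}

# Example with the command from this post (run from the materialize checkout):
# run_until_failure 10 ./bin/mzcompose --find zippy run default \
#   --scenario UserTablesLarge --actions 80000
```

Stopping at the first failure leaves the containers and logs of the failing run in place for inspection, rather than letting a later attempt overwrite them.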
-
Hi @tatsuya6502, thanks very much for trying to repro this!

As I think someone pointed out somewhere, we were using moka v0.9 when we observed the crashes, so it's possible that the segfaults were fixed in a more recent version of moka. That said, the commit of Materialize that you're testing definitely includes moka v0.9 (line 3233 in 9ab5f83), so I'm surprised you're not able to readily repro.

I wonder if it's something to do with the fact that you're using WSL. That introduces a layer of emulation, right? I wonder if that slows things down enough that you no longer see the segfaults.

CC'ing @MaterializeInc/qa. We can try to repro this on a scratch EC2 instance that looks very similar to our CI hardware. If it repros, we can give you SSH access to the machine for further debugging. How does that sound as a plan?
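
Editorial sketch (not part of the original reply): the moka version pinned at a given commit can be checked locally before starting a long run. The helper below reads a Cargo lockfile and prints the locked version of a crate. It assumes the dependency is recorded in the repository's `Cargo.lock`, which is the usual place for a Rust workspace but is not named explicitly in the thread; the function name `locked_version` is hypothetical.

```shell
# Print the locked version(s) of a crate recorded in a Cargo lockfile.
# Usage: locked_version CRATE [LOCKFILE]
# ASSUMPTION: the lockfile uses the standard layout, where each package has
#   name = "..."
#   version = "..."
# on consecutive lines.
locked_version() {
  crate=$1
  lockfile=${2:-Cargo.lock}
  awk -v crate="$crate" '
    # Remember that the wanted package entry has started.
    $1 == "name" && $3 == ("\"" crate "\"") { found = 1; next }
    # Print its version with the surrounding quotes stripped.
    found && $1 == "version" { gsub(/"/, "", $3); print $3; found = 0 }
  ' "$lockfile"
}

# Example (run from the materialize checkout after `git checkout 9ab5f833b`):
# locked_version moka Cargo.lock
```

The match is on the exact crate name, so a crate whose name merely starts with `moka` is not reported.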
-
Sounds good! Hopefully someone from the QA team has time to attempt the
repro early next week and give you the full set of instructions.
On Sun, Jun 25, 2023 at 10:26 AM Tatsuya Kawano wrote:
Thank you for the quick reply.
We can try to repro this on a scratch EC2 instance that looks very similar
to our CI hardware. If it repros, we can give you SSH access to the machine
for further debugging. How does that sound as a plan?
Thanks. I think no SSH access is needed; I do not think I can figure out
the root cause just by accessing the instance after reproducing. To fix
this kind of issue, I will have to do many experiments: modifying the moka
and Materialize code and running the same test to see whether the issue
remains or not. This can take weeks (depending on how hard it is to
reproduce the issue).
So it will be very helpful if you try to reproduce the issue using a
scratch EC2 instance, and if it repros, then tell me how to build such an
EC2 instance and how to run the test (without setting up Buildkite). This
way, I can do the experiments on my own EC2 instance.
-
I'm guessing that using a larger machine makes it easier to reproduce. I have now started a run on a c6a.12xlarge EC2 instance. Based on previous experience with this issue, I hope to have a segfault in 3-6 hours. Will check back then.