[Discussion] Linux perf tool profiling with stargz-snapshotter #349

Open
sumeetb25 opened this issue Jun 30, 2021 · 10 comments

@sumeetb25

sumeetb25 commented Jun 30, 2021

Hi All! I've been doing benchmarks to compare overall container execution times w/ and w/o estargz images. The benchmarking strategy is similar to https://github.com/Tintri/hello-bench/blob/master/hello.py. For the https://hub.docker.com/_/rabbitmq image the overall execution time is 9.921 seconds w/o estargz and 12.7 seconds w/ estargz. (The perf numbers w/ estargz look similar for prefetch enabled + background_fetch enabled and for prefetch enabled + background_fetch disabled.)

To dig further into this regression I was hoping to use the Linux perf tool. perf requires programs to be built with frame pointers enabled, and reading golang/go#15840 it seems like Go enables frame pointers by default.

I tried to record a perf trace by executing perf record -ag -F 99 -p $(pidof containerd-stargz-grpc). Unfortunately, perf report spits out hex addresses rather than a human-readable stack trace.

63.54%     0.00%  containerd-star  containerd-stargz-grpc  [.] 0x00000000002eb338
            |
            ---0x6eb338
               |          
               |--47.49%--0x6eb616
               |          |          
               |          |--27.68%--0x6df2ff
               |          |          0x703585
               |          |          |          
               |          |          |--11.31%--0x655c49
               |          |          |          0x654008
               |          |          |          |          
               |          |          |          |--7.73%--0xe6c94
               |          |          |          |          0xddddf
               |          |          |          |          0x87f4a
               |          |          |          |          entry_SYSCALL_64
               |          |          |          |          do_syscall_64
               |          |          |          |          ksys_pread64
               |          |          |          |          vfs_read
               |          |          |          |          new_sync_read
               |          |          |          |          xfs_file_read_iter
               |          |          |          |          xfs_file_buffered_aio_read
               |          |          |          |          generic_file_read_iter

This problem only seems to exist when profiling stargz-snapshotter. I tried profiling containerd and it prints out a human-readable stack trace. I'm reaching out for help on how I can go about getting human-readable stack traces for the stargz snapshotter.

(I'm new to Go, so my apologies for such a naive question.)

@ktock
Member

ktock commented Jun 30, 2021

We strip debug information from the binary by default. Does removing -s and -w from the following line in the Makefile solve it?

GO_LD_FLAGS=-ldflags '-s -w -X $(PKG)/version.Version=$(VERSION) -X $(PKG)/version.Revision=$(REVISION) $(GO_EXTRA_LDFLAGS)'
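
For reference, the same line with only the stripping flags removed would look something like this (everything else stays as-is):

GO_LD_FLAGS=-ldflags '-X $(PKG)/version.Version=$(VERSION) -X $(PKG)/version.Revision=$(REVISION) $(GO_EXTRA_LDFLAGS)'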

@sumeetb25
Author

We strip debug information from the binary by default. Does removing -s and -w from the following line in the Makefile solve it?

GO_LD_FLAGS=-ldflags '-s -w -X $(PKG)/version.Version=$(VERSION) -X $(PKG)/version.Revision=$(REVISION) $(GO_EXTRA_LDFLAGS)'

Removing -s and -w did the trick. Thank you!

@sumeetb25 sumeetb25 reopened this Jun 30, 2021
@sumeetb25
Author

I'll keep the issue open while I investigate the regression for the rabbitmq image.

@sumeetb25
Author

The overall execution time for the OCI rabbitmq image (pulled from Docker Hub) is 9.921 seconds, and it jumps to 12.7 seconds with an optimized estargz image. While investigating this regression I observed that launches with estargz images consistently show a ~2.5-second stall between two log events.

+ echo '  Starting broker...2021-07-01 22:45:58.991 [info] <0.272.0> '
+ echo '2021-07-01 22:45:59.265 [debug] <0.283.0> Lager installed handler lager_backend_throttle into lager_event^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags: list of feature flags found:^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] drop_unroutable_metric^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] empty_basic_get_metric^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] implicit_default_bindings^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] maintenance_mode_status^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] quorum_queue^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] user_limits^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags:   [ ] virtual_host_metadata^M'
+ echo '2021-07-01 22:45:59.265 [info] <0.272.0> Feature flags: feature flag states written to disk: yes^M'
+ echo '2021-07-01 22:46:01.857 [info] <0.272.0> Running boot step pre_boot defined by app rabbit^M'
+ echo '2021-07-01 22:46:01.857 [info] <0.272.0> Running boot step rabbit_core_metrics defined by app rabbit^M'

When launching containers with the vanilla OCI image, this stall is consistently around 1.1 seconds.

+ echo '  Starting broker...2021-07-01 13:54:32.617 [info] <0.272.0> '
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags: list of feature flags found:^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] drop_unroutable_metric^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] empty_basic_get_metric^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] implicit_default_bindings^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] maintenance_mode_status^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] quorum_queue^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] user_limits^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags:   [ ] virtual_host_metadata^M'
+ echo '2021-07-01 13:54:32.805 [info] <0.272.0> Feature flags: feature flag states written to disk: yes^M'
+ echo '2021-07-01 13:54:32.974 [debug] <0.283.0> Lager installed handler lager_backend_throttle into lager_event^M'
+ echo '2021-07-01 13:54:33.992 [info] <0.272.0> Running boot step pre_boot defined by app rabbit^M'

Following are the investigations I've done so far:

  1. I verified no disk IO was taking longer than 5 ms by running fileslower.py while the container launch was in progress.
  2. Added the following instrumentation to measure on-demand file access latency:
func (f *file) Read(ctx context.Context, dest []byte, off int64) (fuse.ReadResult, syscall.Errno) {
        // Time how long the on-demand read takes.
        start := time.Now()

        n, err := f.ra.ReadAt(dest, off)
        if err != nil && err != io.EOF {
                f.n.s.report(fmt.Errorf("failed to read node: %v", err))
                return nil, syscall.EIO
        }

        duration := time.Since(start)

        // Log the per-read latency in nanoseconds along with the file name and sizes.
        log.G(ctx).Infof("read_file=%v read_file_len=%v read_file_data_served=%v read_file_latency=%v",
                f.n.name, len(dest), n, duration.Nanoseconds())

        return fuse.ReadResultData(dest[:n]), 0
}

Logs indicated no on-demand file access had a latency higher than 2 ms.

  3. I launched the snapshotter with GC tracing enabled and validated that GC was not triggered when we observed the stall in the container launch. Command to enable GC tracing: sudo GODEBUG=gctrace=1 ./stargz-snapshotter/out/containerd-stargz-grpc --log-level=debug >& logs &.

Following are the relevant sections of the snapshotter log:

time="2021-07-01T22:46:42.014738448Z" level=info msg="read_file=libc-2.31.so read_file_len=45056 read_file_data_served=45056 read_file_latency=79682 "
gc 28 @60.980s 0%: 0.16+3.5+0.027 ms clock, 15+0.17/53/20+2.6 ms cpu, 253->253->144 MB, 262 MB goal, 96 P
time="2021-07-01T22:46:42.541053558Z" level=debug msg="estargz open_file=etc/rabbitmq/enabled_plugins"

....

time="2021-07-01T22:48:42.277128643Z" level=debug msg="ReadAt file=&{name:usr/lib/x86_64-linux-gnu/ld-2.31.so digest:sha256:96493303ba8ba364a8da6b77fbb9f04d0f170cbecbc6bbacca616161bd0f0008 ra:0xc0006ad4d0 r:0xc000e2a000 cache:0xc0007421e0 gr:0xc0002d0150}\n"
gc 29 time="2021-07-01T22:48:42.277718539Z" level=info msg="read_file=ld-2.31.so read_file_len=102400 read_file_data_served=102400 read_file_latency=605507 "
 @181.237s 0%: 0.14+3.9+0.079 ms clock, 13+0.29/67/34+7.6 ms cpu, 183->184->118 MB, 288 MB goal, 96 P
time="2021-07-01T22:48:42.277968206Z" level=debug msg="estargz open_file=etc/ld.so.cache"
  4. Nothing unusual in the containerd logs.
  5. Ran the perf tool and nothing abnormal stands out. (If it helps, I'm happy to share the flame graph.)
  6. Disabled background fetch to ensure there is no contention between background fetch tasks and the container launch. The results with and without background fetch enabled are similar.
  7. Added instrumentation to measure the total time required to complete prefetch. Logs indicated it takes ~1.3 seconds to complete prefetch (a minimal sketch of this kind of timing wrapper is shown below).
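
Roughly, the measurement in (7) boils down to a start/stop timer around the prefetch call. A minimal, self-contained sketch of that pattern; prefetchLayer here is a hypothetical stand-in for the snapshotter's actual prefetch routine:

package main

import (
        "log"
        "time"
)

// prefetchLayer is a placeholder for the snapshotter's real prefetch work.
func prefetchLayer() error {
        time.Sleep(100 * time.Millisecond)
        return nil
}

func main() {
        // Record how long the prefetch takes end to end.
        start := time.Now()
        if err := prefetchLayer(); err != nil {
                log.Printf("prefetch failed: %v", err)
        }
        log.Printf("prefetch completed in %v", time.Since(start))
}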

It'd be super helpful if someone could give some ideas about possible next steps for investigating this regression.

@ktock
Member

ktock commented Jul 2, 2021

@sumeetb25 Could you describe the steps you used to create the optimized eStargz image?
Did you enable the optimization feature provided by ctr-remote? https://github.com/containerd/stargz-snapshotter/blob/v0.6.4/docs/ctr-remote.md

@sumeetb25
Author

Yeah, I had enabled the optimization feature. Following is the command I used to create the optimized eStargz image:

time sudo ./stargz-snapshotter/out/ctr-remote images optimize --oci $IMAGE $ESTARGZ_IMAGE

$IMAGE referred to the image pulled from Docker Hub by executing docker pull rabbitmq and then exporting it into containerd.

@sumeetb25
Author

@ktock Please let me know if you have any pointers on where I can start digging next.

@ktock
Member

ktock commented Jul 8, 2021

@sumeetb25 Some pointers you can look deeper into:

  • Which phase (pull, create, run) shows worse performance for estargz?
  • If pull takes long, how long does it take to complete Mount for all layers? And what is the bottleneck of each one?
  • If create takes long, how long does prefetch take? create waits for completion of prefetch, so it can take a long time if the image contains lots of files marked as prefetch targets.
  • If run takes long, how much worse is the performance of the stargz snapshotter filesystem compared to overlayfs? (In the benchmarking script, containerd uses the overlayfs snapshotter for non-estargz images.)

@sumeetb25
Author

@ktock Following are the image pull and run times for the rabbitmq image.

Pull time for OCI image = Time required to execute ctr i pull --user "$CR_USER:$CR_PASS" $IMAGE
Pull time for estargz optimized image = Time required to execute ./stargz-snapshotter/out/ctr-remote images rpull --user "$CR_USER:$CR_PASS" $IMAGE

Run time for OCI image = Time required to execute ctr run $IMAGE container-id and READY_LINE appearing on the console
Run time for estargz optimized image = Time required to execute ./stargz-snapshotter/out/ctr-remote run --snapshotter=stargz $IMAGE container-id and READY_LINE appearing on the console

READY_LINE="Server startup complete"

Image               Pull time   Run time
OCI                 3.9 sec     6.5 sec
estargz optimized   1.2 sec     11.5 sec
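
(As a side note, the "run time until READY_LINE" measurement can be approximated with a small wrapper like the one below. This is a hypothetical sketch, not the exact harness used; it reads the container's output from stdin, e.g. piped from the ctr-remote run command above.)

package main

import (
        "bufio"
        "log"
        "os"
        "strings"
        "time"
)

func main() {
        // Scan container output line by line and report the elapsed time
        // once READY_LINE shows up.
        const readyLine = "Server startup complete"
        start := time.Now()
        sc := bufio.NewScanner(os.Stdin)
        for sc.Scan() {
                if strings.Contains(sc.Text(), readyLine) {
                        log.Printf("READY_LINE seen after %v", time.Since(start))
                        return
                }
        }
        log.Printf("READY_LINE not seen (scanner error: %v)", sc.Err())
}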

One layer in the estargz optimized image took ~455 ms to complete the Mount operation; the other layers took negligible time to complete the Mount operation. I assume this is all as expected.

The conundrum is why the run time with the estargz optimized image is almost twice that of the OCI image.

Next I added some instrumentation to containerd, and here are my observations:

Task              Time taken (OCI image)   Time taken (estargz optimized image)   Code link
Create container  438 ms                   1.5 sec                                https://github.com/containerd/containerd/blob/main/cmd/ctr/commands/run/run.go#L161-L167
Create task       81 ms                    80 ms                                  https://github.com/containerd/containerd/blob/main/cmd/ctr/commands/run/run.go#L185
Start task        13 ms                    17.9 ms                                https://github.com/containerd/containerd/blob/main/cmd/ctr/commands/run/run.go#L215

Instrumentation in the stargz snapshotter suggested one layer in the estargz optimized image took ~1.3 seconds to complete the prefetch operation. This seems to explain the high latency for creating the container. Up to this point everything seems to work as expected.

The big difference between the OCI image and the estargz optimized image is that the first log line from the container appears ~6 seconds after the task is started.

+ echo 07/20/2021_00:17:14:019219657 'Creating new task pid=8201'
+ echo 07/20/2021_00:17:14:105041374 'Created new task pid=8201 duration=80506739'
+ echo 07/20/2021_00:17:14:106352186 'Waiting for new task pid=8201'
+ echo 07/20/2021_00:17:14:107899686 'Wait for new task completed pid=8201'
+ echo 07/20/2021_00:17:14:109199518 'Starting new task pid=8201'
+ echo 07/20/2021_00:17:14:123260705 'Started new task pid=8201 duration=17938450'
+ echo 07/20/2021_00:17:14:124609299 'Waiting for status result pid=8201'
+ echo 07/20/2021_00:17:20:900223216 'Configuring logger redirection'

"Configuring logger redirection" is the first log line from the container and as seen from the above trace it appears after ~6 seconds delay (This is roughly the difference between the runtimes of OCI and estargz image).

The same trace for an OCI image looks as follows:

+ echo 07/19/2021_16:06:33:963409654 'Creating new task pid=73886'
+ echo 07/19/2021_16:06:34:056134144 'Created new task pid=73886 duration=94511835'
+ echo 07/19/2021_16:06:34:057402312 'Waiting new task pid=73886'
+ echo 07/19/2021_16:06:34:058880356 'Wait for new task completed pid=73886'
+ echo 07/19/2021_16:06:34:060144344 'Starting new task pid=73886'
+ echo 07/19/2021_16:06:34:068169773 'Started new task pid=73886 duration=12125714'
+ echo 07/19/2021_16:06:34:069452348 'Waiting for status result pid=73886'
+ echo 07/19/2021_16:06:36:537389542 'Configuring logger redirection'

As seen from the trace above, there is hardly any wait time between starting the task and the container's log lines appearing on the console.

I'm exploring running the container under strace to get a better sense of why we see a delay of ~6 seconds with the estargz optimized image.

Please do let me know if you have any recommendations.

@rdpsin
Contributor

rdpsin commented Oct 6, 2021

Instrumentation added to go-fuse suggests the container is doing ~88600 filesystem operations before it starts up. Following is the breakdown by operation type:

<opcode_number> <opcode_name> (number_of_operations)
1 - lookup (4519)
3 - Getattr (2836)
5 - readlink (1589)
14 - open (2738)
15 - read (6518)
17 - statfs (20)
18 - release (2646)
22 - getxattr (729)
23 - listxattr (8)
25 - flush (10)
26 - init (10)
27 - opendir (1688)
29 - releasedir (1688)
34 - access (59772)
36 - interrupt (77)
40 - poll (10)
44 - readdirplus (3742)

Most (about 90%) of these are sequential operations. The latency of these operations adds up to ~6 seconds, which is roughly equal to the time delta between the OCI and estargz optimized images.
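
For context, here is a minimal sketch of the kind of per-opcode counter behind the numbers above. The recordOp hook is hypothetical and would need to be wired into go-fuse's request dispatch path:

package main

import (
        "fmt"
        "sync"
)

// opCounter tallies FUSE requests by opcode.
type opCounter struct {
        mu     sync.Mutex
        counts map[uint32]uint64
}

// recordOp increments the counter for a single FUSE opcode.
func (c *opCounter) recordOp(opcode uint32) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.counts == nil {
                c.counts = make(map[uint32]uint64)
        }
        c.counts[opcode]++
}

// dump prints the per-opcode totals in "<opcode> (<count>)" form.
func (c *opCounter) dump() {
        c.mu.Lock()
        defer c.mu.Unlock()
        for op, n := range c.counts {
                fmt.Printf("%d (%d)\n", op, n)
        }
}

func main() {
        var c opCounter
        c.recordOp(34) // e.g. access
        c.recordOp(34)
        c.recordOp(1) // e.g. lookup
        c.dump()
}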
