[Discussion] Linux perf tool profiling with stargz-snapshotter #349
We strip debug information from the binary by default. Does removing it (Line 24 in 7f024d4) help?
Removing it helped. I'll keep the issue open while I investigate the regression for the rabbitmq image.
The overall execution time for the OCI rabbitmq image (pulled from Docker Hub) is 9.921 seconds, and it jumps to 12.7 seconds with an optimized eStargz image. While investigating this regression I observed that launches with eStargz images consistently show a 2.5 second stall between two events.
When launching containers with a vanilla OCI image, this stall is consistently around ~1.1 seconds.
Following are the investigations I've done so far:
// Instrumentation added to the FUSE Read handler to measure per-read latency.
func (f *file) Read(ctx context.Context, dest []byte, off int64) (fuse.ReadResult, syscall.Errno) {
	start := time.Now()
	n, err := f.ra.ReadAt(dest, off)
	if err != nil && err != io.EOF {
		f.n.s.report(fmt.Errorf("failed to read node: %v", err))
		return nil, syscall.EIO
	}
	// Log the file name, requested length, bytes served, and latency in nanoseconds.
	duration := time.Since(start)
	log.G(ctx).Infof("read_file=%v read_file_len=%v read_file_data_served=%v read_file_latency=%v",
		f.n.name, len(dest), n, duration.Nanoseconds())
	return fuse.ReadResultData(dest[:n]), 0
}

Logs indicated that no on-demand file access had a latency higher than 2 ms.
Following are the relevant sections of the log
It'd be super helpful if someone could give some ideas about possible next steps to investigate this regression.
@sumeetb25 Could you describe the steps you used to create the optimized eStargz image?
Yeah, I had enabled the optimization feature. Following is the command I used to create the optimized eStargz image:
$IMAGE referred to the image pulled from Docker Hub by executing
@ktock Please let me know if you have any pointers on where I can start digging next.
@sumeetb25 Some pointers you can look deeper at:
@ktock Following are the image pull and run times for the rabbitmq image. Pull time for the OCI image = time required to execute the image pull. Run time for the OCI image = time required to run the container until READY_LINE="Server startup complete" appears in its log output.
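To make the measurement concrete, here is a minimal sketch of how such a run time can be taken: start the container command and stop the clock when READY_LINE appears in its output. The helper name and the example ctr invocation below are assumptions, not the exact harness used.

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// measureRunTime starts the given command and returns the elapsed time until
// readyLine appears on its stdout (i.e. in the container's log output).
func measureRunTime(readyLine, name string, args ...string) (time.Duration, error) {
	start := time.Now()
	cmd := exec.Command(name, args...)
	out, err := cmd.StdoutPipe()
	if err != nil {
		return 0, err
	}
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), readyLine) {
			return time.Since(start), nil
		}
	}
	return 0, fmt.Errorf("ready line %q never appeared", readyLine)
}

func main() {
	// Hypothetical invocation; the actual run command used in the benchmark is not shown above.
	// The container is left running; stopping it is out of scope for this sketch.
	d, err := measureRunTime("Server startup complete",
		"ctr", "run", "--rm", "docker.io/library/rabbitmq:latest", "rabbitmq-bench")
	if err != nil {
		panic(err)
	}
	fmt.Printf("run time: %v\n", d)
}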
One layer in the eStargz-optimized image took ~455 ms to complete the Mount operation; the other layers took negligible time. I assume this is all as expected. The conundrum is why the run time with the eStargz-optimized image is twice as high as with the OCI image. Next I added some instrumentation to containerd, and here are my observations:
Instrumentation in the stargz snapshotter suggested one layer in the eStargz-optimized image took ~1.3 seconds to complete the prefetch operation. This seems to explain the high latency of container creation. Up to this point everything seems to work as expected. The big difference between the OCI image and the eStargz-optimized one is that the first log line from the container appears ~6 seconds after starting the task.
"Configuring logger redirection" is the first log line from the container and as seen from the above trace it appears after ~6 seconds delay (This is roughly the difference between the runtimes of OCI and estargz image). The same trace for an OCI image looks as follows
As seen from the trace above, there is hardly any wait time between starting the task and the container's log lines appearing on the console. I'm exploring running the container with
Please do let me know if you have any recommendations.
Instrumentation added to go-fuse suggests the container performs ~88,600 filesystem operations before it starts up. Following is the breakdown by operation type:
Most (about 90%) of these are sequential operations. The latency of these operations adds up to ~6 seconds, which is roughly equal to the time delta between the OCI and eStargz-optimized images.
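For reference, a per-operation breakdown like the one described above can be collected with a small aggregator along these lines. This is a minimal sketch under my own naming; it assumes each instrumented go-fuse handler calls record with its operation name and measured duration.

package main

import (
	"fmt"
	"sync"
	"time"
)

// opStats aggregates the count and cumulative latency of filesystem operations,
// keyed by operation type (READ, LOOKUP, GETATTR, ...).
type opStats struct {
	mu      sync.Mutex
	count   map[string]int64
	latency map[string]time.Duration
}

func newOpStats() *opStats {
	return &opStats{count: map[string]int64{}, latency: map[string]time.Duration{}}
}

// record is called from each instrumented handler, e.g.:
//   defer func(start time.Time) { stats.record("READ", time.Since(start)) }(time.Now())
func (s *opStats) record(op string, d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count[op]++
	s.latency[op] += d
}

// dump prints the breakdown by operation type.
func (s *opStats) dump() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for op, n := range s.count {
		fmt.Printf("%-10s count=%d total_latency=%v\n", op, n, s.latency[op])
	}
}

func main() {
	stats := newOpStats()
	start := time.Now()
	// ... handler body would run here ...
	stats.record("READ", time.Since(start))
	stats.dump()
}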
Hi All! I've been doing benchmarks to compare overall container execution times w/ and w/o eStargz images. The benchmarking strategy is similar to https://github.com/Tintri/hello-bench/blob/master/hello.py. For the https://hub.docker.com/_/rabbitmq image, the overall execution time is 9.921 seconds w/o estargz and 12.7 seconds w/ estargz. (The perf numbers w/ estargz look similar for prefetch enabled + background_fetch enabled and prefetch enabled + background_fetch disabled.)
To dig further into this regression I was hoping to use the Linux perf tool. perf requires programs to be built with frame pointers enabled, and reading golang/go#15840 it seems Go enables frame pointers by default.
I tried to record a perf trace by executing
perf record -ag -F 99 -p $(pidof containerd-stargz-grpc)
Unfortunately, the perf report spits out hex addresses rather than human-readable stack traces. This problem only seems to exist when profiling stargz-snapshotter; profiling containerd prints human-readable stack traces. I'm reaching out for help on how I can get human-readable stack traces for the stargz snapshotter.
(I'm new to Go, so my apologies for such a naive question.)
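As a side note, an alternative route that does not depend on perf symbolization is Go's built-in net/http/pprof endpoint. The following is a minimal sketch assuming the daemon can be made to listen on a local debug port; whether stargz-snapshotter already exposes such an endpoint is not checked here.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running inside the daemon, a CPU profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// Go's runtime keeps its own function tables, so the profile is symbolized
	// even for binaries built with -ldflags "-s -w".
	log.Println(http.ListenAndServe("localhost:6060", nil))
}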