Improve logging to demo Triage and Log grouping #2130

mmanciop · 2025-03-21T08:51:04Z

Changes

1. Remove explicit attributes from log records (we want Log Grouping to extract them)
2. Set severity to ERROR for logs about product not found (LogAI does not override the severity_number set by the log bridge)
3. Remove the error event; we have logs now instead
4. Add stack trace to error log

Also:

* Propagate the change from bool to float introduced in open-telemetry#1237 more consistently via the proto definitions by differentiating between the GetFlag operation (which evaluates the probability and therefore returns a bool) and all other operations, which need to operate on the float value/probability directly. To that end, the Flag gRPC message has been split into two new types, FlagEvaluationResult and FlagDefinition.
* Rename the UpdateFlag operation to UpdateFlagProbability, as it only updates the enabled/probability value, not the description or the name.
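The split between evaluating a flag and reading its raw value could be sketched roughly as follows. This is an illustrative in-memory model, not the service's actual code: the names GetFlag, UpdateFlagProbability, FlagEvaluationResult and FlagDefinition come from the description above, while `FlagStore`, `create_flag`, `get_flag_definition` and all field names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagDefinition:
    """Raw flag data; field names are assumptions for illustration."""
    name: str
    description: str
    value: float  # probability that the flag evaluates to "enabled"


@dataclass(frozen=True)
class FlagEvaluationResult:
    """Outcome of one evaluation (one random experiment)."""
    enabled: bool


class FlagStore:
    """Hypothetical in-memory stand-in for the feature flag service."""

    def __init__(self, rng=None):
        self._flags = {}
        self._rng = rng or random.Random()

    def create_flag(self, name, description, value):
        self._flags[name] = FlagDefinition(name, description, value)

    def get_flag(self, name):
        # GetFlag: evaluates the probability, so the caller only sees a bool.
        flag = self._flags[name]
        return FlagEvaluationResult(enabled=self._rng.random() < flag.value)

    def get_flag_definition(self, name):
        # Every other operation works with the raw float value directly.
        return self._flags[name]

    def update_flag_probability(self, name, value):
        # UpdateFlagProbability: only the value changes, never the
        # name or the description.
        old = self._flags[name]
        self._flags[name] = FlagDefinition(old.name, old.description, value)
```

Keeping the two return types separate means a caller cannot accidentally treat a raw probability as an already evaluated on/off decision.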


Building the frontend container image on an Apple M1 would result in the following error during docker build:

    15.56 > Build error occurred
    15.56 [Error: ENOENT: no such file or directory, copyfile '/app/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node' -> '/app/.next/standalone/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node'] {
    15.56   errno: -2,
    15.56   code: 'ENOENT',
    15.56   syscall: 'copyfile',
    15.56   path: '/app/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node',
    15.56   dest: '/app/.next/standalone/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node'
    15.56 }

Updating sharp to v0.33.x avoids this problem, as that version includes pre-built sharp binaries for various platforms, see https://sharp.pixelplumbing.com/changelog#v0330---29th-november-2023


This allows attaching a persistent volume to the PostgreSQL service while still always running the init scripts on startup. For the first startup with a fresh volume, the database will be initialized correctly; on subsequent starts the init scripts will do nothing.


When error rates/probabilities were introduced with 430b4c9, the correct move would have been to remove the hard-coded 1/10 error probability in adservice (similar to b55b147 for cart service), since asking the feature flag service whether adServiceFailure is enabled is already a random experiment each time. Leaving the 1/10 probability in place in adservice as well scales the effective error rate down to 1/10 of the configured probability.
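The arithmetic behind this can be checked with a small sketch. It assumes the two random experiments (the flag evaluation and the leftover hard-coded gate) are independent, so their probabilities multiply; the function names and the simulation itself are illustrative and not taken from adservice.

```python
import random


def effective_error_rate(flag_probability, hard_coded_probability=0.1):
    """Probability that both independent gates let the error through."""
    return flag_probability * hard_coded_probability


def simulate(flag_probability, hard_coded_probability, trials, seed=0):
    """Estimate the observed error rate by running both gates per request."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Gate 1: the feature flag service evaluates the configured probability.
        flag_enabled = rng.random() < flag_probability
        # Gate 2: the leftover hard-coded check in the service.
        extra_gate = rng.random() < hard_coded_probability
        if flag_enabled and extra_gate:
            failures += 1
    return failures / trials
```

With a configured probability of 0.5, the extra 1/10 gate brings the expected error rate down to 0.05; setting the hard-coded probability to 1.0 (that is, removing the gate) restores the configured rate.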


* remove deployment to two namespaces, we can resurrect that if we pick up https://linear.app/dash0/issue/ENG-951/make-the-deployment-topology-more-interesting again
* rename namespace from otel-demo-ns to otel-demo
* extract processing of values file into separate script
* add support for sending data to a configurable ingress URL via OTEL_EXPORTER_OTLP_ENDPOINT
* add support for using a custom values.yaml file via VALUES_YAML
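As a rough illustration of the last two points, environment-driven settings like these are typically resolved with fallbacks. The sketch below is not the deployment script from this commit: only the variable names OTEL_EXPORTER_OTLP_ENDPOINT and VALUES_YAML and the namespace otel-demo come from the list above, while the function name, the default file name and the returned keys are assumptions.

```python
import os

# Hypothetical default; the actual script may use a different file.
DEFAULT_VALUES_YAML = "values.yaml"


def resolve_deployment_settings(env=None):
    """Pick deployment settings from the environment, with fallbacks."""
    env = os.environ if env is None else env
    return {
        "namespace": "otel-demo",
        # A custom values file takes precedence over the assumed default.
        "values_yaml": env.get("VALUES_YAML") or DEFAULT_VALUES_YAML,
        # None means "do not override the endpoint from the values file".
        "otlp_endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
    }
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the real environment.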



mmanciop · 2025-03-21T08:51:50Z

Opened against wrong repo, sorry

basti1302 and others added 30 commits January 30, 2024 12:18

[dash0] use images from AWS ECR repository

e5ce4e8

[dash0] remove qemu support

aefb610

[dash0] temporarily allow publishing images on demand

99705c7

Instead of requiring a git release as the trigger for publishing container images, add a workflow_dispatch trigger and set a fixed version number.

[dash0] remove workflow dispatch trigger, publish on release again

0f55509

[featureflag] support arbitrary numerical settings

c6d237c

* allow arbitrary values, remove restriction for values to be >= 0 and <= 1
* rename feature flag gRPC methods accordingly
* distinguish between evaluating a probability feature flag and fetching the raw value

[paymentservice] support range feature flag for simulated slowness

36a0681

This also adds support for range feature flags in the feature flag service and its API.

[shippingservice] add support for simulated slowness

bb47ccd

[dash0] add link to additional docs

db9f691

[skip ci]

[dash0] fix image version

0746612

[skip ci]

[dash0] remove markdownlinkcheck

bf6c5a6

it takes ages and is sometimes flaky and has zero value

[dash0] remove pull request template

c2b7549

[skip ci]

[dash0] deploy new releases automatically

34e127e

[dash0] update .env file automatically after release

3f97825

[dash0] bring back arm64 builds via qemu

581a824

This effectively reverts aefb610. Having no arm64 images makes running the demo locally via K8s much harder, because K8s tries to pull the arm64 version of the images and that of course fails when it doesn't exist.

[dash0] fix release workflow

80bf7d4

[adservice,recommendationservice] add debug logs

fd3b167

[checkoutservice] fix typo in log message

6914c0f

[frontend] fix typo in ShippingGateway

46d4dd1

[dash0] scripts for deploying the otel demo to k8s locally

213a3f8

Either deploy to one namespace or to two different namespaces (just to make the deployment topology a bit more "interesting").

[dash0] fix action to update the .env file

b0be0b8

[dash0] temporary workflow to update the .env file

43799a1

[dash0] update image version in .env to 1.1.1 (#15)

c0b2154

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Revert "[dash0] temporary workflow to update the .env file"

7189f02

This reverts commit bbe3d38.

docs(readme): add Dash0 to the list of forks

f1b0955

Merge remote-tracking branch 'upstream/main'

276c60b

[dash0] fix merge screwup in docker-compose.yml

61a6d0c

basti1302 and others added 21 commits March 11, 2024 17:26

[dash0] improve local k8s deployment

87bf8b4

Do not report local data to prod, report only to dash0-dev.com.

[dash0] frontend: add k8s resource detector

5a45834

[dash0] frontend: convert k8s resource detector from ts to js

1d52ea4

The resource detectors are configured in a file loaded via a --require hook (in utils/telemetry/Instrumentation.js), which needs to be pure JS.

[dash0] update image version in .env to 1.2.1

58242b4

[ci skip]

[dash0] frontend: k8s resource detector again

b3ba547

This reverts the following commits:
- 1d52ea4
- 5a45834

[dash0] update image version in .env to 1.2.2

cc29ca5

[ci skip]

[dash0] chore: improve local k8s deployment

18f92fd

* remove Grafana, Jaeger, Prometheus and OpenSearch by default

[dash0] chore: improve local k8s deployment

633d021

* add option to send data to recorder

[dash0] chore: improve local k8s deployment

d740148

* add recorder collector to write telemetry to files.

[dash0] docs: remove link to google docs from public README

d3a49e5

Too many random people (not Dash0 employees) requesting access. [skip ci]

Added the otellogrus package to integrate logrus with OpenTelemetry (#21)

14fea03

[dash0] update image version in .env to 1.2.3

22df463

[ci skip]

[dash0] update image version in .env to 1.3.0

4c7ed92

[ci skip]

ci(deploy): add commit SHA & workflow run info to commit message (#22)

ae05c5b

Fill otel demo with more data required before kubecon (#23)

bd7894d

* Add more useful logging messages
* Update README with release instructions
* Update README

[dash0] update image version in .env to 1.3.1

b4d408e

[ci skip]

Update logging (#24)

84beec1

[dash0] update image version in .env to 1.3.2

f141b71

[ci skip]

Change severity of logs about product id not found to ERROR

d49dbc4

mmanciop requested a review from a team as a code owner March 21, 2025 08:51

github-actions bot requested review from jack-berg, mateuszrzeszutek and trask March 21, 2025 08:51

github-actions bot added the helm-update-required Requires an update to the Helm chart when released label Mar 21, 2025

mmanciop closed this Mar 21, 2025

mmanciop deleted the productcatalog-error-logs branch March 21, 2025 08:51

mmanciop restored the productcatalog-error-logs branch March 21, 2025 08:52
