Improve logging to demo Triage and Log grouping #2130

Closed
wants to merge 60 commits into from


mmanciop

Changes

  1. Remove explicit attributes from log records, so that Log Grouping extracts them instead
  2. Set the severity to Error for logs about a product not being found (LogAI does not override the severity_number set by the log bridge)
  3. Remove the error event, since logs now cover it
  4. Add the stack trace to the error log
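
The logging changes listed above can be sketched roughly as follows. This is an illustration using only Python's standard `logging` module, not the service's actual code: `get_product`, `ProductNotFoundError`, the logger name, and the catalog contents are all hypothetical.

```python
import logging


class ProductNotFoundError(Exception):
    """Hypothetical error raised when a product ID is unknown."""


# Hypothetical in-memory catalog, for illustration only.
CATALOG = {"OLJCESPC7Z": "Sunglasses"}

logger = logging.getLogger("productcatalog-sketch")


def get_product(product_id: str) -> str:
    try:
        if product_id not in CATALOG:
            raise ProductNotFoundError(product_id)
        return CATALOG[product_id]
    except ProductNotFoundError:
        # Severity is ERROR, the product ID lives in the message body
        # (no explicit `extra=` attributes), and exc_info=True attaches
        # the stack trace to the log record.
        logger.error("Product not found: id=%s", product_id, exc_info=True)
        raise
```

Keeping the ID in the message body, rather than in a structured attribute, leaves it to a downstream grouping step to extract it.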

basti1302 and others added 30 commits January 30, 2024 12:18
Also:
* propagate the change from bool to float introduced in
  open-telemetry#1237
  more consistently via proto definitions by differentiating between
  the GetFlag operation (which evaluates the probability and therefore
  returns a bool) and all other operations, which need to operate with
  a float value/probability directly. To that end, the Flag grpc
  message has been split into two new types, FlagEvaluationResult
  and FlagDefinition.
* Rename the UpdateFlag operation to UpdateFlagProbability, as it
  actually only updates the enabled/probability value, but not the
  description or the name.
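
A minimal sketch of the split described above, written as plain Python rather than the actual proto definitions. The type names `FlagDefinition` and `FlagEvaluationResult` come from the commit message; the field names, the function names, and the injected random number generator are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class FlagDefinition:
    """Raw flag as stored: the value is a float probability, not a bool."""
    name: str
    description: str
    probability: float  # hypothetical field name


@dataclass
class FlagEvaluationResult:
    """Outcome of one evaluation: a plain bool."""
    enabled: bool  # hypothetical field name


def get_flag(definition: FlagDefinition, rng: random.Random) -> FlagEvaluationResult:
    # Evaluating the flag is a random experiment against the stored probability.
    # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
    return FlagEvaluationResult(enabled=rng.random() < definition.probability)


def update_flag_probability(definition: FlagDefinition, probability: float) -> FlagDefinition:
    # Only the probability changes; name and description stay untouched.
    return FlagDefinition(definition.name, definition.description, probability)
```

Only `get_flag` returns a bool; every other operation works with the float directly.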
Instead of requiring a git release as the trigger for publishing
container images, add a workflow_dispatch trigger and set a fixed
version number.
* allow arbitrary values, remove restriction for values to be >= 0 and
  <= 1
* rename feature flag GRPC methods accordingly
* distinguish between evaluating a probability feature flag and fetching
  the raw value
This also adds support for range feature flags in the feature flag
service and its API.
it takes ages and is sometimes flaky and has zero value
Building the frontend container image on an Apple M1 would result in
the following error during docker build:

    15.56 > Build error occurred
    15.56 [Error: ENOENT: no such file or directory, copyfile '/app/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node' -> '/app/.next/standalone/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node'] {
    15.56   errno: -2,
    15.56   code: 'ENOENT',
    15.56   syscall: 'copyfile',
    15.56   path: '/app/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node',
    15.56   dest: '/app/.next/standalone/node_modules/sharp/build/Release/sharp-darwin-arm64v8.node'
    15.56 }

Updating sharp to v0.33.x avoids this problem, as that version includes
pre-built sharp binaries for various platforms, see
https://sharp.pixelplumbing.com/changelog#v0330---29th-november-2023
This effectively reverts aefb610.
Having no arm64 images makes running the demo locally via K8s much
harder, because K8s tries to pull the arm64 version of the images and
that of course fails when it doesn't exist.
This allows attaching a persistent volume to the PostgreSQL service while
still running the init scripts on every startup. For the first startup with
a fresh volume, the database will be initialized correctly; on
subsequent starts the init scripts will do nothing.
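
The idea of init scripts that are safe to run on every start can be sketched as below. This is a stand-in using Python's built-in `sqlite3` module, not the PostgreSQL scripts from this commit; the table name, columns, and seed row are hypothetical. `INSERT OR IGNORE` is SQLite syntax (PostgreSQL would use `ON CONFLICT DO NOTHING`).

```python
import sqlite3

# Hypothetical schema and seed data, for illustration only.
INIT_SQL = """
CREATE TABLE IF NOT EXISTS featureflags (
    name TEXT PRIMARY KEY,
    enabled REAL NOT NULL
);
INSERT OR IGNORE INTO featureflags (name, enabled)
VALUES ('adServiceFailure', 0.0);
"""


def run_init(conn: sqlite3.Connection) -> None:
    # Safe to call on every startup: creates and seeds a fresh database,
    # and changes nothing if the table and row already exist.
    conn.executescript(INIT_SQL)
    conn.commit()
```

Running `run_init` a second time against the same database leaves existing rows, including any values changed in the meantime, as they are.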
Either deploy to one namespace or to two different namespaces (just to
make the deployment topology a bit more "interesting").
When introducing error rates/probabilities with 430b4c9, the correct
move would have been to remove the 1/10 hard coded error probability in
adservice (similar to b55b147 for cart service), since asking the
feature flag service whether adServiceFailure is enabled is a random
experiment each time anyway. Leaving the 1/10 probability in place in
adservice as well multiplies the configured error rate by 1/10.
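
The arithmetic behind this can be sketched as follows, assuming the feature flag evaluation and the hard-coded check are independent random experiments. The function name and the example probability of 0.5 are hypothetical.

```python
def effective_error_rate(flag_probability: float,
                         hard_coded_probability: float = 1.0) -> float:
    # An error only occurs if both independent checks fire,
    # so the two probabilities multiply.
    return flag_probability * hard_coded_probability


# With the leftover 1/10 check, a flag configured at 0.5 yields far fewer errors.
with_leftover_check = effective_error_rate(0.5, hard_coded_probability=0.1)
without_leftover_check = effective_error_rate(0.5)
```

Under these assumptions, removing the hard-coded check makes the observed error rate match the probability configured on the flag.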
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
basti1302 and others added 21 commits March 11, 2024 17:26
Do not report local data to prod, report only to dash0-dev.com.
The resource detectors are configured in a file loaded via a
--require hook (in utils/telemetry/Instrumentation.js), which needs
to be pure JS.
This reverts the following commits:
- 1d52ea4.
- 5a45834
* remove deployment to two namespaces, we can resurrect that if we pick
  up https://linear.app/dash0/issue/ENG-951/make-the-deployment-topology-more-interesting
  again
* rename namespace from otel-demo-ns to otel-demo
* extract processing of values file into separate script
* add support for sending data to a configurable ingress URL via
  OTEL_EXPORTER_OTLP_ENDPOINT
* add support for using a custom values.yaml file via VALUES_YAML
* remove Grafana, Jaeger, Prometheus and OpenSearch by default
* add option to send data to recorder
* add recorder collector to write telemetry to files.
Too many random people (not Dash0 employees) requesting access.

[skip ci]
* Add more useful logging messages

* Update README with release instructions

* Update README
@mmanciop mmanciop requested a review from a team as a code owner March 21, 2025 08:51
@github-actions github-actions bot added the helm-update-required Requires an update to the Helm chart when released label Mar 21, 2025
@mmanciop mmanciop closed this Mar 21, 2025
@mmanciop mmanciop deleted the productcatalog-error-logs branch March 21, 2025 08:51
@mmanciop

Opened against wrong repo, sorry

@mmanciop mmanciop restored the productcatalog-error-logs branch March 21, 2025 08:52
4 participants