Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for SQL Statement Sanitization in OpenTelemetryTraceEventListener #132

Open
agent-adam opened this issue Dec 5, 2024 · 11 comments

Comments

@agent-adam
Copy link

agent-adam commented Dec 5, 2024

Description:

Currently, the OpenTelemetryTraceEventListener enriches spans with valuable metadata such as SQL operation, connection ID, and database user. However, the addition of sensitive information, such as Original SQL Text and Actual SQL Text (enabled by the OPEN_TELEMETRY_TRACE_EVENT_LISTENER_SENSITIVE_ENABLED flag), raises concerns about exposing sensitive data in trace spans.

Proposed Feature:

I propose adding support for SQL statement sanitization in the OpenTelemetryTraceEventListener. This feature would allow users to include SQL statements in spans while masking sensitive information, such as literal values, without compromising visibility into database interactions.

Sanitization Capability:

  • Introduce a mechanism to sanitize SQL statements before adding them to spans.
  • Replace sensitive literals (e.g., numbers, strings, dates) with placeholders (?) while retaining the query structure.

Example:

  • Original SQL: SELECT * FROM users WHERE id = 123 AND name = 'John';
  • Sanitized SQL: SELECT * FROM users WHERE id = ? AND name = ?;

I am happy to contribute to this feature and provide an initial implementation if the maintainers agree on its inclusion. Please let me know your thoughts!

@Michael-A-McMahon
Copy link
Member

We should explore this idea!

What approach did you have in mind for parsing SQL strings and detecting sensitive values?

@agent-adam
Copy link
Author

agent-adam commented Dec 6, 2024

We should explore this idea!

Great, hopefully will free up some time to get a draft RP out.

What approach did you have in mind for parsing SQL strings and detecting sensitive values?

Going with a lexical analysis approach similar to the existing SqlSanitizer in OpenTelemetry should work well.
The SQL string would be tokenized to identify sensitive values like string literals, numbers, and parameter markers, which can then be replaced with placeholders (otel uses ?). With a cache of Sanitized statements to save some time with repeated DB statements.

@Michael-A-McMahon
Copy link
Member

One option might be to have our project directly use SqlStatementSanitizer from the Open Telemetry project. Maintaining our own sanitizer could be costly, so my preference would be to reuse what's already available.

I see "incubator" in the package. Usually, that means the API is not stable, and might even be removed. That's something we should consider. If we have to update our code to use a new API, it's not the end of the world. But if the API is removed, I guess we would have to think about forking and maintaining our on copy. At the very least, we would be able to preserve whatever functionality was already there.

Maybe the next step is to kick tires on SqlStatementSanitizer? I wonder how capable it is in the current state. I'd want to know if it's ready for use in production systems.

@Michael-A-McMahon
Copy link
Member

Michael-A-McMahon commented Dec 6, 2024

Some decent tire kicking happening here: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/57c7cf2ad5f7272c1eb83b9816e652bf832c91d4/instrumentation-api-incubator/src/test/java/io/opentelemetry/instrumentation/api/incubator/semconv/db/SqlStatementSanitizerTest.java#L143

I'll try to find some time to go hard on this. I want to toss in a CREATE JSON DUALITY VIEW command and see what happens :)

@agent-adam
Copy link
Author

agent-adam commented Dec 6, 2024

I completely agree that reusing SqlStatementSanitizer would be better than maintaining a new sanitizer.

It’s also worth noting that OpenTelemetry’s specification mandates database statements must be sanitized, which is why SqlStatementSanitizer is already widely used across the auto-instrumentation package. This makes it likely that OpenTelemetry will continue to maintain SqlStatementSanitizer in a reusable form.

I'll try to find some time to go hard on this. I want to toss in a CREATE JSON DUALITY VIEW command and see what happens :)

Output from my local test app:
Screenshot 2024-12-06 at 1 57 46 PM

@Michael-A-McMahon
Copy link
Member

I threw a few cases at this:

import io.opentelemetry.instrumentation.api.incubator.semconv.db.SqlStatementInfo;
import io.opentelemetry.instrumentation.api.incubator.semconv.db.SqlStatementSanitizer;

public class SanitizerTest {

  public static void main(String[] args) {
    SqlStatementSanitizer sanitizer = SqlStatementSanitizer.create(true);
    print(sanitizer, "SELECT 1");

    String[] cases = new String[]{
      "select ? from dual",
      "insert into dual values (?)",
      "delete from dual",
      "update dual set dummy = ?",
      "merge tab into dual",
      //weird spacing
      " select ? from dual where ? = ?",
      "select ?from dual where?=?for update",
      // string literals
      // NOTE: Sanitizer is not recognizing Oracle's wonderful q-tick syntax
      "select '?', n'?', q'???', q'{?}', q'{cat's}' from dual",
      "select'?',n'?',q'???',q'{?}',q'{cat's}'from dual",
      // comments
      // NOTE: Sanitizer is not preserving new lines,
      "select--line\n? from dual",
      "select --line\n? from dual",
      "--line\nselect ? from dual",
      " --line\nselect ? from dual",
      "--line\n select ? from dual",
      // problems
      // NOTE: Sanitizer is not recognizing PL/SQL procedure call, getOperation() == null
      "begin proc4in4out (:x1, :x2, :x3, :x4); end;",
      "{CALL tkpjpn01(:pin, :pinout, :pout)}",
      "select :NumberBindVar as the_number from dual",
      "select {fn locate(bob(carol(),ted(alice,sue)), 'xfy')} from dual",
      "CREATE USER vijay6 IDENTIFIED BY \"vjay?\"",
      "ALTER SESSION SET TIME",
      "SELECT ename FROM emp WHERE hiredate BETWEEN {ts'1980-12-17'} AND {ts '1981-09-28'} ",
    };
    for (String sql : cases)
      print(sanitizer, sql);

    print(sanitizer, "CREATE USER \"scott\" IDENTIFIED BY \"tiger\"");
  }

  static void print(SqlStatementSanitizer sanitizer, String sql) {
    SqlStatementInfo info = sanitizer.sanitize(sql);
    System.out.printf("%s\n%s\n\t%s\n\t%s\n\n",
      sql, info.getFullStatement(), info.getOperation(), info.getMainIdentifier());
  }
}

Here's what I noted:

  • The last one is the most concerning, as it seems double-quoted strings aren't detected. (Are double-quotes an Oracle specific thing?)
  • The q' syntax also isn't detected, but I that's Oracle specific and may not be too widely used.
  • PL/SQL is not detected as a Call operation. The getOperation() method just returns null for that. Not sure how important that is.

I'm also noting that a SqlDialect interface is defined. Probably the best thing we can do for Oracle users is contribute an implementation of that. This would help Oracle users who are just using the instrumentation library. And, if we decide to include sanitization in our extensions project here, it would help our users too.

IMO, we should be adhering to the specifications, which means we should be sanitizing SQL and conforming to the semantic conventions.

I'll give the rest of my team some time to weigh in on this.

@agent-adam
Copy link
Author

Yes seems that double-quoted strings aren't detected by default, but SqlDialect.COUCHBASE is used to address this: Relevant Code

{DOUBLE_QUOTED_STR} {
    if (dialect == SqlDialect.COUCHBASE) {
        builder.append('?');
    } else {
        if (!insideComment && !extractionDone) {
            extractionDone = operation.handleIdentifier();
        }
        appendCurrentFragment();
    }
    if (isOverLimit()) return YYEOF;
}

We can probably use this dialect until we add proper support for SqlDialect.ORACLE which to your point would allow the sanitizer to handle Oracle-specific syntax, including double-quoted strings, q-tick literals & PL/SQL, properly.


IMO, we should be adhering to the specifications, which means we should be sanitizing SQL and conforming to the semantic conventions.

On the topic of adhering to the specifications would folks be open to the idea of apply the full set of semantic standards defined by otel DB query execute call spans ? #134


I'll give the rest of my team some time to weigh in on this.

Sounds good, thanks again for the quick follow ups here! 🚀

@jeandelavarene
Copy link
Member

Hi @agent-adam, you've got a valid point that we should sanitize the SQL before it gets published in the span. We would love to get your contribution. Have you reviewed the Oracle Contributor Agreement (OCA)? You'll see that it will take you to an Oracle website where you can fill the agreement. If we can use the OT sanitizer in production, then we should go with that.

@agent-adam
Copy link
Author

Sounds good, thanks folks.
I’ll review the Oracle Contributor Agreement (OCA) and work towards getting a pull request (PR) out for folks to review

@jeandelavarene
Copy link
Member

Hi @agent-adam, we discussed this more internally and concluded that because we don't include the SQL text by default (SENSITIVE mode needs to be enabled) we follow the OpenTelemetry requirement. Furthermore, doing sanitization on the client will be challenging as it requires parsing SQL. What do you think?

@agent-adam
Copy link
Author

Yes, to your point, sanitization is only required if the SQL statement is going to be added to the spans by default. OpenTelemetry’s DB semantic conventions emphasize this.
I believe that having the SQL statement added and sanitized would be a valuable feature for extension users. It would provide better insights into what might be causing execution spans to delay or error. Notably, most DB auto-instrumentation packages in OpenTelemetry already include the SQL statement by default when feasible. OpenTelemetry also offers excellent sanitization packages that we could leverage or extend, which would save us from having to build a SQL parser from scratch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants