feat: [FC-0074] add support for annotated python dicts as avro map type #433

Open
wants to merge 17 commits into
base: main

Conversation

mariajgrimaldi
Member

@mariajgrimaldi mariajgrimaldi commented Dec 12, 2024

Description

This PR adds support for mapping annotated Python dict types to the Avro map type during Avro schema generation, so that a broader set of event payloads can be supported. A previous attempt mapped Python dicts -> records; this approach maps Python dicts -> maps instead, which avoids conflicts with the existing data attributes -> record mapping. Using maps also avoids errors like the following when we don't know the contents of a dictionary in advance:

image

This PR also refactors forum-related events so they can be sent through the event bus. For backward-compatibility reasons, those changes should be reviewed in a separate PR; they are included here only to simplify testing.
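
As a rough illustration of the dict -> map idea described above, here is a minimal sketch. It is not the code from this PR: the helper name `dict_annotation_to_avro_map` and the simplified `SIMPLE_TYPES` table are assumptions made for the example, and the real mapping in openedx-events may differ. It only relies on the fact that an Avro map always has string keys and a single declared value type.

```python
from typing import Dict, get_args, get_origin

# Hypothetical, simplified type table (an assumption for this sketch).
SIMPLE_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def dict_annotation_to_avro_map(data_type):
    """Return an Avro map schema for an annotated dict such as Dict[str, int]."""
    # A bare `dict` has no origin/args, so only annotated dicts are accepted.
    if get_origin(data_type) is not dict:
        raise TypeError(f"{data_type} is not an annotated dict")
    key_type, value_type = get_args(data_type)
    # Avro map keys are always strings.
    if key_type is not str:
        raise TypeError("Avro map keys must be strings")
    if value_type not in SIMPLE_TYPES:
        raise TypeError(f"Unsupported value type: {value_type}")
    return {"type": "map", "values": SIMPLE_TYPES[value_type]}


schema = dict_annotation_to_avro_map(Dict[str, int])
```

Because the value type comes from the annotation, a bare `dict` is rejected in this sketch, which mirrors the "annotated dicts only" tradeoff discussed in this PR.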

Supporting information

This PR addresses #428 (comment)

Testing instructions

To test with the event bus:

  1. Install this branch into your environment
  2. Install an event bus backend like the redis implementation in the LMS
  3. Create a tutor plugin with this configuration so the event is produced to redis:
from tutor import hooks

redis_config = [
    "EVENT_BUS_PRODUCER = 'edx_event_bus_redis.create_producer'",
    "EVENT_BUS_REDIS_CONNECTION_URL = 'redis://@redis:6379/'",
    "EVENT_BUS_TOPIC_PREFIX = 'dev'",
    "EVENT_BUS_CONSUMER = 'edx_event_bus_redis.RedisEventConsumer'",
]

event_bus_config = """
EVENT_BUS_PRODUCER_CONFIG = {
    'org.openedx.learning.forum.thread.created.v1': {
        'forum-thread-created': {'event_key_field': 'thread.id', 'enabled': True},
    },
    'org.openedx.learning.forum.thread.response.created.v1': {
        'forum-thread-response-created': {'event_key_field': 'thread.id', 'enabled': True},
    },
    'org.openedx.learning.forum.thread.response.comment.created.v1': {
        'forum-thread-response-comment-created': {'event_key_field': 'thread.id', 'enabled': True},
    },
}
"""

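# Register the redis settings too; redis_config is otherwise defined but never used.
hooks.Filters.ENV_PATCHES.add_item(
    ("openedx-common-settings", "\n".join(redis_config))
)
# Register the producer configuration for the forum events.
hooks.Filters.ENV_PATCHES.add_item(
    ("openedx-common-settings", event_bus_config)
)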

You should see a log similar to this in the LMS:
image

Deadline

None

Other information

This was previously attempted in #232, but using the record type instead of the map type.

Checklists

Check off if complete or not applicable:

Merge Checklist:

  • All reviewers approved
  • Reviewer tested the code following the testing instructions
  • CI build is green
  • Version bumped
  • Changelog record added with short description of the change and current date
  • Documentation updated (not only docstrings)
  • Code dependencies reviewed
  • Fixup commits are squashed away
  • Unit tests added/updated
  • Noted any: Concerns, dependencies, migration issues, deadlines, tickets

Post Merge:

  • Create a tag
  • Create a release on GitHub
  • Check new version is pushed to PyPI after tag-triggered build is
    finished.
  • Delete working branch (if not needed anymore)
  • Upgrade the package in the Open edX platform requirements (if applicable)

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Dec 12, 2024
@openedx-webhooks

openedx-webhooks commented Dec 12, 2024

Thanks for the pull request, @mariajgrimaldi!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently maintained by @openedx/hooks-extension-framework. Tag them in a comment and let them know that your changes are ready for review.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

dict[str, str]: {'key': 'value'},
dict[str, int]: {'key': 1},
dict[str, float]: {'key': 1.0},
dict[str, bool]: {'key': True},
Contributor

what about union types:

dict[str, Union[str, int]]: ...,

Member Author

I added more test cases and this covers even more than I initially thought: a701f78. Thanks for the suggestion!
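
For readers wondering what a union-valued dict could look like in Avro, here is a minimal sketch. It is an assumption about one possible translation, not a description of what commit a701f78 actually does: Avro expresses unions as a JSON array of types, so `Dict[str, Union[str, int]]` could become a map whose `values` is such an array. The helper names and the `SIMPLE_TYPES` table are hypothetical.

```python
from typing import Dict, Union, get_args, get_origin

# Hypothetical, simplified type table (an assumption for this sketch).
SIMPLE_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def value_type_to_avro(value_type):
    """Translate a simple type, or a Union of simple types, to an Avro type."""
    if get_origin(value_type) is Union:
        # Avro unions are written as a JSON array of the member types.
        return [SIMPLE_TYPES[member] for member in get_args(value_type)]
    return SIMPLE_TYPES[value_type]


def dict_to_avro_map(data_type):
    """Build an Avro map schema from an annotated dict type."""
    _key_type, value_type = get_args(data_type)
    return {"type": "map", "values": value_type_to_avro(value_type)}


union_schema = dict_to_avro_map(Dict[str, Union[str, int]])
```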

@mariajgrimaldi mariajgrimaldi changed the title from "Mjg/add support for dicts" to "feat: add support for annotated python dicts as avro map type" Dec 13, 2024
@mariajgrimaldi mariajgrimaldi force-pushed the MJG/add-support-for-dicts branch from e71efcc to acdcaa9 Compare December 13, 2024 16:23
@mariajgrimaldi mariajgrimaldi marked this pull request as ready for review December 13, 2024 18:42
@mariajgrimaldi mariajgrimaldi requested a review from a team as a code owner December 13, 2024 18:42
Comment on lines 87 to 88
# returns types of dict contents
# if data_type == Dict[str, int], arg_data_type = (str, int)
Contributor

nit: remove code comments

Member Author

I don't think that's commented-out code; it's an explanation of what arg_data_type is.

Contributor

I had the same confusion. What about a change like the following? Note that this was copied from above, so I'd make the same fix there as well.

Suggested change
- # returns types of dict contents
- # if data_type == Dict[str, int], arg_data_type = (str, int)
+ # Returns types of dict contents.
+ # Example: if data_type == Dict[str, int], arg_data_type = (str, int)

Member Author

Will do. Thanks for the suggestion!

@@ -297,7 +297,7 @@
 # .. event_data: DiscussionThreadData
 # .. event_warning: This event is currently incompatible with the event bus, list/dict cannot be serialized yet
 FORUM_THREAD_CREATED = OpenEdxPublicSignal(
-    event_type="org.openedx.learning.thread.created.v1",
+    event_type="org.openedx.learning.forum.thread.created.v1",
Contributor

Does this need to be changed?
Should this emit a FORUM_THREAD_CREATED_V2 signal?

I imagine this will break existing code

Member Author

Oh yes, as I mentioned in the cover letter, those changes are there for testing. That should be addressed in a follow-up PR.

Contributor

@Ian2012 Ian2012 left a comment

Some nits and concerns around changing existing code and some code cleanup

Member

@felipemontoya felipemontoya left a comment

The tradeoff of supporting dicts only when annotated seems completely fair to me.

I agree that the refactoring of the forum events should go in a separate PR. Do you think we need to bump the version of the events, or are the changes backwards compatible?

@mariajgrimaldi mariajgrimaldi changed the title from "feat: add support for annotated python dicts as avro map type" to "feat: [FC-0074] add support for annotated python dicts as avro map type" Dec 20, 2024
@mariajgrimaldi mariajgrimaldi added the FC Relates to an Axim Funded Contribution project label Dec 20, 2024
@mariajgrimaldi
Member Author

I agree that the refactoring of the forum events should go in a separate PR. Do you think we need to bump the version of the events, or are the changes backwards compatible?

I do think it's backwards compatible. We'd be affecting the event bus support, which we didn't have previously, so I think we're okay.

Contributor

@bmtcril bmtcril left a comment

This looks great to me, but I haven't been in this code much so I'll refrain from thumbing it.

@mariajgrimaldi
Member Author

I'm tagging @robrap @timmc-edx for a review since they're more familiar with the event bus code. Can you help us out here? Do you think this looks reasonable?

@timmc-edx
Contributor

We're currently on break but should be able to review after Jan 6.

Contributor

@robrap robrap left a comment

@mariajgrimaldi: I gave some comments, but I feel like I don't know enough to really understand the change (see other comments for details).

@@ -63,7 +63,7 @@ def _create_avro_field_definition(data_key, data_type, previously_seen_types,
         field["type"] = field_type
     # Case 2: data_type is a simple type that can be converted directly to an Avro type
     elif data_type in PYTHON_TYPE_TO_AVRO_MAPPING:
-        if PYTHON_TYPE_TO_AVRO_MAPPING[data_type] in ["record", "array"]:
+        if PYTHON_TYPE_TO_AVRO_MAPPING[data_type] in ["map", "array"]:
Contributor

This is a breaking change, right? Were dicts only introduced and used with the forums events, which you plan to break anyway? You don't have a changelog entry or a major version change (yet), but is that your plan?

Member Author

As far as I understand, dictionaries were never fully supported. Besides this list with known unserializable events, which contain dictionaries in their payloads, we can confirm this by generating a schema for an event using dictionaries. Let's take this event as an example:

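# openedx_events/learning/signals.py@main
from openedx_events.learning.data import MyEventData
from openedx_events.tooling import OpenEdxPublicSignal

MY_EVENT = OpenEdxPublicSignal(
    event_type="org.openedx.learning.my_event.v1",
    data={
        "my_data": MyEventData,
    }
)

# openedx_events/learning/data.py@main
import attr

@attr.s(frozen=True)
class MyEventData:
    """Example event data with a dict-typed attribute."""

    event_type = attr.ib(type=str)
    event_data = attr.ib(type=dict[str, str])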

Which immediately raises a serialization error:

openedx-events/openedx_events/event_bus/avro/schema.py", line 107, in _create_avro_field_definition
    raise TypeError(
TypeError: Data type dict[str, str] is not supported. The data type needs to either be one of the types in PYTHON_TYPE_TO_AVRO_MAPPING, an attrs decorated class, or one of the types defined in custom_type_to_avro_type.

This happens because dicts were not a supported type in event_bus/avro/schema.py::_create_avro_field_definition before this PR, so no events with dict types were ever sent through the event bus. Forum events were not supported either, since they were listed among the known unserializable events; I included them here as an example and will remove them shortly.
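
To make the failure mode above concrete, here is a minimal sketch of why an annotated dict would fall through a lookup table keyed by plain classes. The table `SIMPLIFIED_MAPPING` and the `lookup` helper are hypothetical simplifications, not the actual contents of PYTHON_TYPE_TO_AVRO_MAPPING or schema.py. The point is only that `dict[str, str]` is a different object from `dict`, so a membership test on a table that contains `dict` does not match it.

```python
# Hypothetical, simplified table keyed by plain classes (an assumption).
SIMPLIFIED_MAPPING = {str: "string", int: "long", dict: "record", list: "array"}


def lookup(data_type):
    """Return the Avro type name for data_type, or raise if it is unknown."""
    if data_type in SIMPLIFIED_MAPPING:
        return SIMPLIFIED_MAPPING[data_type]
    raise TypeError(f"Data type {data_type} is not supported.")


bare_dict_result = lookup(dict)

try:
    lookup(dict[str, str])
    annotated_supported = True
except TypeError:
    # The annotated alias is not equal to the bare class, so it is not found.
    annotated_supported = False
```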

To maintain backward compatibility, I'll include the map type as an additional type instead of changing "record" to "map".

Contributor

@robrap robrap Jan 8, 2025

Thank you @mariajgrimaldi.

  1. It sounds like changing from "record" to "map" is not a breaking change for dicts if they aren't yet in use, and is really just a fix.
  2. Whether or not dropping "record" is a breaking change elsewhere is still a question. No tests broke when you had dropped it. Does that mean it isn't used, or that we are missing test coverage?
  3. Highly related, why do I not see the /schemas introduced in https://github.com/openedx/openedx-events/pull/225/files for unit testing? Is that part of the missing coverage?

UPDATE: If we are missing test coverage, it would be great to get that back. Or, if this is a non-breaking change, then you should drop "record" to clean up the code.

Member Author

I added some tests to address the missing coverage. However, I still wonder if we're breaking something by replacing "record" with "map". I also checked the /schemas added in https://github.com/openedx/openedx-events/pull/225/files and they are still in the repository. Could you clarify which schemas are missing?

Contributor

Oops. I now see the schemas. In theory, if our tests are working as planned, these should be snapshots of all the schemas and if there is no change detected, all would be good.

You say you fear dropping record is a breaking change. Is it possible to create a breaking test?

Member Author

@mariajgrimaldi mariajgrimaldi Jan 13, 2025

From my understanding, changing "record" to "map" should not be a breaking change, since dicts were not supported before this PR. From testing the code with dicts before this change, I infer that L66 only served to warn developers that dicts without annotations were not allowed (they mapped to Avro records, a type we can't use because it clashes with the data attributes mapping). Dicts with annotations were not supported either. In any case, I added a test case that checks that the schema generated for dicts uses the map type, for future reference.

Although I said I wondered whether this was a breaking change, given my findings and what you mentioned before, I think we should trust the repo's coverage.

Contributor

I'm curious why there are no old test schemas? I would have expected a change from old (deleted files) to new (these files). They would have shown before/after.

My head isn't in this world enough to understand the difference between record and map, and why it is important to move from one to the other.

Member Author

Sorry for all the confusion with these events. Forum events were never supported by the event bus because they included dict types in their payloads; that's why there is no old schema. As for why we use maps instead of records, it's mainly because of this:

This PR adds support for mapping annotated Python dict types to the Avro map type during Avro schema generation, so that a broader set of event payloads can be supported. A previous attempt mapped Python dicts -> records; this approach maps Python dicts -> maps instead, which avoids conflicts with the existing data attributes -> record mapping. Using maps also avoids errors like the following when we don't know the contents of a dictionary in advance:

image
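
For anyone else unsure about the record vs. map distinction, the sketch below contrasts the two. It is a simplified illustration, not Avro library validation: the schema dicts follow the shapes defined by the Avro specification, while the record name `ThreadOptions`, its field, and the `fits_record` / `fits_map` helpers are hypothetical. A record needs every field name declared up front; a map only fixes the value type and accepts any string keys.

```python
# Avro record: field names must be known when the schema is written.
record_schema = {
    "type": "record",
    "name": "ThreadOptions",  # hypothetical record name
    "fields": [{"name": "followed", "type": "boolean"}],
}

# Avro map: arbitrary string keys, one declared value type.
map_schema = {"type": "map", "values": "boolean"}


def fits_record(payload, fields):
    """Simplified check: payload keys must exactly match the declared fields."""
    return set(payload) == set(fields) and all(
        isinstance(payload[name], expected) for name, expected in fields.items()
    )


def fits_map(payload, value_type):
    """Simplified check: any string keys are fine if values share one type."""
    return all(
        isinstance(key, str) and isinstance(value, value_type)
        for key, value in payload.items()
    )


declared_fields = {"followed": bool}
payload = {"followed": True, "pinned": False}  # "pinned" was not declared

record_ok = fits_record(payload, declared_fields)
map_ok = fits_map(payload, bool)
```

In this sketch the payload with an undeclared key fails the record check but passes the map check, which is the situation described above where the dictionary contents are not known ahead of time.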

Contributor

Thank you. This does clarify for forum events.

Comment on lines 63 to 64
if arg_data_type[1] in SIMPLE_PYTHON_TYPE_TO_AVRO_MAPPING:
    return data
Contributor

Do we also need to check whether the dict key type is a simple Python type?

Member Author

I added the check to study the behavior, and I got a few errors because the keys are unhashable by nature. Therefore, only basic Python types were allowed. Do you think we need an explicit check for this?

Contributor

Can you say more about keys being unhashable? What's a situation where that happens?

Member Author

@mariajgrimaldi mariajgrimaldi Jan 10, 2025

When using types that are not simple Python types as keys, let's say SIGNAL = create_simple_signal({"dict_input": Dict[ComplexAttrs, int]}) where ComplexAttrs is unhashable, I get this error:

FAILED openedx_events/event_bus/avro/tests/test_deserializer.py::TestAvroSignalDeserializerCache::test_deserialization_of_dicts_with_keys_of_complex_types_fails - TypeError: unhashable type: 'ComplexAttrs'

But keys of type CourseKey work. So, as you suggested, I'm going to add the additional check.
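
The unhashable-key situation can be reproduced in isolation with a minimal sketch. It uses the standard library `dataclasses` module as a stand-in for an attrs class, which is an assumption made to keep the example dependency-free; `ComplexAttrs` and `FrozenKey` here are hypothetical classes, not the ones from the test suite. A class that defines equality but is not frozen gets no usable hash, so it cannot be a dict key, while a frozen one can.

```python
from dataclasses import dataclass


@dataclass
class ComplexAttrs:
    """Mutable class with generated __eq__: Python leaves it unhashable."""
    name: str


@dataclass(frozen=True)
class FrozenKey:
    """Frozen class with generated __eq__: a __hash__ is generated too."""
    name: str


def try_as_key(obj):
    """Return a dict keyed by obj, or the error message if obj is unhashable."""
    try:
        return {obj: 1}
    except TypeError as exc:
        return str(exc)


unhashable_result = try_as_key(ComplexAttrs("a"))  # an error message string
hashable_result = try_as_key(FrozenKey("a"))       # a one-item dict
```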

@mariajgrimaldi mariajgrimaldi force-pushed the MJG/add-support-for-dicts branch from f86464a to 8f67ea2 Compare January 8, 2025 10:10
Contributor

@robrap robrap left a comment

@mariajgrimaldi: 👍. Please follow your normal process for landing this. I don't see any blockers. The only reason I am not approving is because I don't plan to review all the final code (tests) in depth. Thanks for checking in on this.

Also, just a reminder to update version and changelog at some point.

Contributor

@bmtcril bmtcril left a comment

Based on the existing comments and my tests I think this is good and non-breaking. Just needs the usual version bump stuff. 👍

Labels
  • FC: Relates to an Axim Funded Contribution project
  • open-source-contribution: PR author is not from Axim or 2U
Projects
Status: In Eng Review
7 participants