feat: [FC-0074] add support for annotated python dicts as avro map type #433
dict[str, str]: {'key': 'value'},
dict[str, int]: {'key': 1},
dict[str, float]: {'key': 1.0},
dict[str, bool]: {'key': True},
what about union types:
dict[str, Union[str, int]]: ...,
I added more test cases and this covers even more than I initially thought: a701f78. Thanks for the suggestion!
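For readers following the union suggestion, here is a minimal sketch of how a `Union` value type could translate into an Avro map. It is not the code from this PR: `SIMPLE_TYPES` and `avro_map_values` are hypothetical names, and the type mapping is an assumption. The only Avro fact relied on is that a union is written as a JSON array of its member types.

```python
from typing import Dict, Union, get_args, get_origin

# Hypothetical mapping for illustration; the library's real table may differ.
SIMPLE_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


def avro_map_values(data_type):
    """Return the Avro ``values`` entry for an annotated dict type."""
    if get_origin(data_type) is not dict:
        raise TypeError(f"{data_type} is not an annotated dict")
    _key_type, value_type = get_args(data_type)
    if get_origin(value_type) is Union:
        # Avro expresses a union as a JSON array of its member types.
        return [SIMPLE_TYPES[member] for member in get_args(value_type)]
    return SIMPLE_TYPES[value_type]


union_values = avro_map_values(Dict[str, Union[str, int]])
plain_values = avro_map_values(dict[str, bool])
```

Under these assumptions a `Union[str, int]` value type becomes `["string", "long"]`, while a single value type stays a plain string.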
e71efcc to acdcaa9
# returns types of dict contents
# if data_type == Dict[str, int], arg_data_type = (str, int)
nit: remove code comments
I don't think that's commented code but an explanation of what arg_data_type is
I had the same confusion. What about a change like the following? Note that this was copied from above, so I'd make the same fix there as well.
- # returns types of dict contents
- # if data_type == Dict[str, int], arg_data_type = (str, int)
+ # Returns types of dict contents.
+ # Example: if data_type == Dict[str, int], arg_data_type = (str, int)
Will do. Thanks for the suggestion!
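To make the comment under discussion concrete, this short sketch shows where a tuple like `(str, int)` can come from. It assumes the arguments are read with `typing.get_args`, which is a guess about the implementation rather than something confirmed in this thread.

```python
from typing import Dict, get_args, get_origin

data_type = Dict[str, int]

# get_origin gives the runtime container class behind the annotation.
origin = get_origin(data_type)

# get_args gives the annotation's parameters: key type first, then value type.
arg_data_type = get_args(data_type)

# A bare dict carries no parameters, so there is nothing to inspect.
bare_args = get_args(dict)
```

Here `arg_data_type` is `(str, int)` and `bare_args` is an empty tuple, which is one reason an unannotated dict gives a schema generator nothing to work with.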
openedx_events/learning/signals.py
@@ -297,7 +297,7 @@
 # .. event_data: DiscussionThreadData
 # .. event_warning: This event is currently incompatible with the event bus, list/dict cannot be serialized yet
 FORUM_THREAD_CREATED = OpenEdxPublicSignal(
-    event_type="org.openedx.learning.thread.created.v1",
+    event_type="org.openedx.learning.forum.thread.created.v1",
Does this need to be changed? Should this emit a FORUM_THREAD_CREATED_V2 signal? I imagine this will break existing code.
Oh yes, as I mentioned in the cover letter, those changes are there for testing. That should be addressed in a follow-up PR.
Some nits and concerns around changing existing code and some code cleanup
The tradeoff of supporting dicts only when annotated seems completely fair to me.
I agree that the refactoring of the forum events should go in a separate PR. Do you think we need to bump the version of the events, or are they backwards compatible?
I do think it's backwards compatible. We'd be affecting the event bus support, which we didn't have previously, so I think we're okay.
This looks great to me, but I haven't been in this code much so I'll refrain from thumbing it.
I'm tagging @robrap @timmc-edx for a review since they're more familiar with the event bus code. Can you help us out here? Do you think this looks reasonable?
We're currently on break but should be able to review after Jan 6.
@mariajgrimaldi: I gave some comments, but I feel like I don't know enough to really understand the change (see other comments for details).
@@ -63,7 +63,7 @@ def _create_avro_field_definition(data_key, data_type, previously_seen_types,
         field["type"] = field_type
     # Case 2: data_type is a simple type that can be converted directly to an Avro type
     elif data_type in PYTHON_TYPE_TO_AVRO_MAPPING:
-        if PYTHON_TYPE_TO_AVRO_MAPPING[data_type] in ["record", "array"]:
+        if PYTHON_TYPE_TO_AVRO_MAPPING[data_type] in ["map", "array"]:
This is a breaking change, right? Were dicts only introduced and used with the forums events, which you plan to break anyway? You don't have a changelog entry or a major version change (yet), but is that your plan?
As far as I understand, dictionaries were never fully supported. Besides this list with known unserializable events, which contain dictionaries in their payloads, we can confirm this by generating a schema for an event using dictionaries. Let's take this event as an example:
# openedx_events/learning/signals.py@main
MY_EVENT = OpenEdxPublicSignal(
    event_type="org.openedx.learning.my_event.v1",
    data={
        "my_data": MyEventData,
    }
)

# openedx_events/learning/data.py@main
@attr.s(frozen=True)
class MyEventData:
    event_type = attr.ib(type=str)
    event_data = attr.ib(type=dict[str, str])
Which immediately raises a serialization error:
openedx-events/openedx_events/event_bus/avro/schema.py", line 107, in _create_avro_field_definition
raise TypeError(
TypeError: Data type dict[str, str] is not supported. The data type needs to either be one of the types in PYTHON_TYPE_TO_AVRO_MAPPING, an attrs decorated class, or one of the types defined in custom_type_to_avro_type.
This happens because dicts were not a supported type in event_bus/avro/schema.py::_create_avro_field_definition before this PR, so no events with dict types were ever sent through the event bus. Forum events were not supported either, since they were listed among the known unserializable events, but I included them here as an example and will remove them shortly.
To maintain backward compatibility, I'll include the map type as an additional type instead of changing "record" to "map".
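As a rough illustration of the behavior being described, the sketch below builds a field definition for an annotated dict and rejects a bare one. The names `PYTHON_TO_AVRO` and `map_field` are hypothetical and the mapping is an assumption; the actual `_create_avro_field_definition` takes more arguments and handles more cases.

```python
from typing import get_args, get_origin

# Hypothetical stand-in for the library's simple-type mapping.
PYTHON_TO_AVRO = {str: "string", int: "long", float: "double", bool: "boolean"}


def map_field(data_key, data_type):
    """Build an Avro field definition for an annotated dict such as dict[str, str]."""
    if get_origin(data_type) is not dict:
        # A bare ``dict`` has no value type, so no ``values`` entry can be derived.
        raise TypeError(f"Data type {data_type} is not supported; annotate the dict.")
    _key_type, value_type = get_args(data_type)
    if value_type not in PYTHON_TO_AVRO:
        raise TypeError(f"Value type {value_type} is not a simple type.")
    # An Avro map only declares its value type; keys are always strings.
    return {"name": data_key, "type": {"type": "map", "values": PYTHON_TO_AVRO[value_type]}}


event_data_field = map_field("event_data", dict[str, str])
```

With these assumptions, the `event_data` attribute from the example event would be described as a map of strings instead of triggering the `TypeError`.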
Thank you @mariajgrimaldi.
- It sounds like changing from "record" to "map" is not a breaking change for dicts if they aren't yet in use, and is really just a fix.
- Whether or not dropping "record" is a breaking change elsewhere is still a question. No tests broke when you had dropped it. Does that mean it isn't used, or that we are missing test coverage?
- Highly related, why do I not see the /schemas introduced in https://github.com/openedx/openedx-events/pull/225/files for unit testing? Is that part of the missing coverage?
UPDATE: If we are missing test coverage, it would be great to get that back. Or, if this is a non-breaking change, then you should drop "record" to clean up the code.
I added some tests to cover the missing cases. However, I still wonder if we're breaking something by replacing "record" with "map". I also checked the /schemas added in https://github.com/openedx/openedx-events/pull/225/files and they still appear to be in the repository. Could you clarify which schemas are missing?
Oops. I now see the schemas. In theory, if our tests are working as planned, these should be snapshots of all the schemas and if there is no change detected, all would be good.
You say you fear dropping record is a breaking change. Is it possible to create a breaking test?
From my understanding, changing "record" to "map" should not be a breaking change, since dicts were not supported before this PR. By testing the code with dicts without any support, I can infer that L66 was only used to warn developers that dicts without annotations were not allowed (they mapped to Avro records, a type we can't use because it clashes with the data attributes mapping); dicts with annotations were not supported either. In any case, I added a test case that checks that the schema generated for dicts maps to map, for future reference.
Although I said I wondered whether this was a breaking change given my findings and considering what you mentioned before, I think we should trust the repo's coverage.
I'm curious why there are no old test schemas? I would have expected a change from old (deleted files) to new (these files). They would have shown before/after.
My head isn't in this world enough to understand the difference between record and map, and why it is important to move from one to the other.
Sorry for all the confusion with these events. Forum events were never supported by the event bus because they included dict types in their payloads; that's why there is no old schema. As for why we use maps instead of records, it's mainly because of this:
This PR adds support for Python dict types, mapped to the Avro map type, in Avro schema generation, aiming to support a broader list of event payloads. This was previously attempted by mapping Python dicts -> records. This approach maps Python dicts -> maps instead, to avoid conflicts with the data attributes -> record mapping. Using maps also avoids this kind of error when we don't know the content of the dictionaries:
Thank you. This does clarify for forum events.
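For anyone else unsure about record versus map, a small sketch may help. It relies only on the general shape of the two Avro types: a record lists its field names in the schema, while a map fixes only the value type. The helper names and sample payloads are made up for illustration.

```python
def record_schema_for(name, sample):
    """Build a record schema whose fields mirror the keys of one sample dict."""
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": key, "type": "string"} for key in sorted(sample)],
    }


# A map schema says nothing about key names, only about the value type.
MAP_OF_STRINGS = {"type": "map", "values": "string"}


def fits_map_of_strings(value):
    """Check a dict against MAP_OF_STRINGS: string keys and string values."""
    return isinstance(value, dict) and all(
        isinstance(key, str) and isinstance(item, str) for key, item in value.items()
    )


first = {"title": "Welcome", "body": "Hello"}
second = {"url": "https://example.com"}

# Different keys force different record schemas...
records_differ = record_schema_for("Payload", first) != record_schema_for("Payload", second)
# ...while one map schema describes both payloads.
both_fit_map = fits_map_of_strings(first) and fits_map_of_strings(second)
```

When the keys of a dictionary are not known ahead of time, a record schema cannot be written once and reused, whereas a single map schema can.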
if arg_data_type[1] in SIMPLE_PYTHON_TYPE_TO_AVRO_MAPPING:
    return data
Do we also need to check whether the dict key type is a simple Python type?
I added the check to study the behavior, and I got a few errors because the keys are unhashable by nature. Therefore, only basic Python types were allowed. Do you think we need an explicit check for this?
Can you say more about keys being unhashable? What's a situation where that happens?
When using types as keys that are not simple Python types, let's say SIGNAL = create_simple_signal({"dict_input": Dict[ComplexAttrs, int]}) where ComplexAttrs is unhashable, I get this error:
FAILED openedx_events/event_bus/avro/tests/test_deserializer.py::TestAvroSignalDeserializerCache::test_deserialization_of_dicts_with_keys_of_complex_types_fails - TypeError: unhashable type: 'ComplexAttrs'
But keys of type CourseKey work. So, as you suggested, I'm going to add the additional check.
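The unhashable-key situation can be reproduced without the event bus. The sketch below uses stdlib dataclasses as a stand-in for attrs classes (an assumption made to keep it self-contained): a class that defines equality but is not frozen loses its `__hash__`, so its instances cannot be dict keys. `FrozenKey` and `usable_as_dict_key` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ComplexAttrs:
    # eq=True (the default) without frozen=True sets __hash__ to None.
    name: str


@dataclass(frozen=True)
class FrozenKey:
    # frozen=True together with eq=True generates a field-based __hash__.
    name: str


def usable_as_dict_key(obj):
    """Return True if obj is hashable and can therefore be a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


complex_ok = usable_as_dict_key(ComplexAttrs("a"))
frozen_ok = usable_as_dict_key(FrozenKey("a"))
```

In this sketch `complex_ok` is `False` and `frozen_ok` is `True`, which mirrors why a hashable key type can work where `ComplexAttrs` fails.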
f86464a to 8f67ea2
@mariajgrimaldi: 👍. Please follow your normal process for landing this. I don't see any blockers. The only reason I am not approving is because I don't plan to review all the final code (tests) in depth. Thanks for checking in on this.
Also, just a reminder to update version and changelog at some point.
Based on the existing comments and my tests I think this is good and non-breaking. Just needs the usual version bump stuff. 👍
Description
This PR adds support for Python dict types, mapped to the Avro map type, in Avro schema generation, aiming to support a broader list of event payloads. This was previously attempted by mapping Python dicts -> records. This approach maps Python dicts -> maps instead, to avoid conflicts with the data attributes -> record mapping. Using maps also avoids this kind of error when we don't know the content of the dictionaries:
This PR also refactors forum-related events so they can be sent through the event bus. Because of backward compatibility, those changes should be studied in a different PR but are here for testing simplicity.
Supporting information
This PR addresses #428 (comment)
Testing instructions
To test with the event bus:
You should see a log similar to this in the LMS:
Deadline
None
Other information
This was previously attempted in #232, but using the record type instead of the map type.