-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compile html5lib with Cython #524
base: master
Are you sure you want to change the base?
Conversation
This added a fair bit of complexity, and notable made the Phase classes dynamically generated. However, by doing this, we no longer include "process the token using the rules for" phases in the debug log.
This allows us to define the argument as an int in Cython
This is in preparation for Cython using C function pointers for _state
The current _ascii module is a placeholder, because I accidentally deleted the original implementation of it (but I needed to rewrite it to be even quicker anyway!)
This makes duplicate checking much quicker, and avoids the conversion to a dict at the end
# then stop | ||
if self.chunkOffset != self.chunkSize: | ||
while True: | ||
# this really should be a slice of self.chunk, but https://github.com/cython/cython/issues/3536 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with cython.boundscheck(False), cython.wraparound(False): | ||
c = self.chunk[i] | ||
if c > 0x7F and opposite or c <= 0x7F and (bitmap[index(c)] & bit(c)): | ||
cyrv += self.chunk[self.chunkOffset:i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref cython/cython#3887 (IIRC, this was a significant perf issue)
Thanks @gsnedders - I see that the attribute names/values are no longer accessed via tuple/list indexes (a nice improvement), and that the element attributes are initialized as attribute maps (rather than being converted into them at the end of the process). Those are similar to the approach taken in #521 as well. What's the general approach to review, quality/performance assurance and merge for each pull request? It looks tricky to review this changeset as one combined change (given the size of the diff) - could we create a list of the proposed changes (with dependencies / ordering between them) and queue them up for review? |
Yeah, I was mostly just throwing this up as one PR so at least the branch exists somewhere that isn't my local disk! Also not sure why it fails on 2.7 or 3.6 on CI; it doesn't fail locally! |
Use Cython to make the parser quicker; see #445. This builds on top of #272.
This is a long way from ready to land, but shows potential. We probably also want to split this up so many of the earlier changes land first if they make performance sense without Cython.
The change to attribute representation especially might be of interest to #521 (cc @jayaddison).
There's also some API changes towards the end of the branch which we may well want to delay landing even beyond the rest of the Cython stuff.