
Performance Improvements #63


Open · Asachoo wants to merge 8 commits into master from speed_up

Conversation

Asachoo commented Apr 8, 2025

By using closures instead of lambdas, caching the field-parser methods, and precompiling regular-expression objects, a performance improvement of about 15% was achieved (mainly from the closures).
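To illustrate the kind of change involved, here is a small sketch of a lambda-based field parser versus a closure that precompiles its struct format; the function names and the U2 example are hypothetical, not pystdf's actual code:

import struct

def make_u2_parser_lambda(endian):
    # Old style: a lambda that re-parses the format string on every call.
    return lambda data, offset: struct.unpack(endian + "H", data[offset:offset + 2])[0]

def make_u2_parser_closure(endian):
    # New style: compile the struct format once and close over it.
    fmt = struct.Struct(endian + "H")
    def parse(data, offset):
        return fmt.unpack_from(data, offset)[0]
    return parse

The closure does the format-string parsing once, up front, so the per-call work is just the unpack.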

(pystdf) ➜  pystdf git:(speed_up) ✗ python test.py                                             
Time taken to import data/lot3.stdf: 0.4714 seconds for 4.35 MB of data (lot3)
Time taken to import data/lot2.stdf: 0.4827 seconds for 4.21 MB of data (lot2)
Time taken to import data/demofile.stdf: 0.4606 seconds for 4.35 MB of data (demofile)
(pystdf) ➜  pystdf git:(speed_up) ✗ git switch master
(pystdf) ➜  pystdf git:(master) ✗ python test.py
Time taken to import data/lot3.stdf: 0.5451 seconds for 4.35 MB of data (lot3)
Time taken to import data/lot2.stdf: 0.5531 seconds for 4.21 MB of data (lot2)
Time taken to import data/demofile.stdf: 0.5512 seconds for 4.35 MB of data (demofile)

Here is my benchmark code:

from pystdf.Importer import ImportSTDF
from time import time
from pathlib import Path


if __name__ == "__main__":
    root = Path(".")
    # Import every STDF file under ./data and report how long each import takes.
    for std_file in (root / "data").glob("*.stdf"):
        start_time = time()
        data = ImportSTDF(std_file)
        print(
            f"Time taken to import {std_file}: {(time() - start_time):.4f} seconds for {std_file.stat().st_size / 1024**2:.2f} MB of data ({std_file.stem})"
        )

cmars (Owner) commented Apr 8, 2025

Awesome, thanks for this! Will try to get to this later this evening.

cmars (Owner) left a comment

A couple of questions below.

I'm a little nervous about these changes without good test coverage in place to guard against regressions.

pystdf/IO.py Outdated
@@ -193,17 +184,26 @@ def parse(self, count=0):
raise

def getFieldParser(self, fieldType):
cmars (Owner):

Could you get the same effect with a memoizing decorator here?

Asachoo (Author):

Decorator has been added.
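For context, a memoizing decorator of the kind being discussed could look something like this; the decorator name and the way it keys the cache are illustrative assumptions, not a description of the PR's implementation:

from functools import wraps

def memoize(method):
    # Cache the result per (instance, field type) so each field-type parser
    # is built only once and reused on subsequent calls.
    cache = {}

    @wraps(method)
    def wrapper(self, field_type):
        key = (id(self), field_type)
        if key not in cache:
            cache[key] = method(self, field_type)
        return cache[key]

    return wrapper

Applied as @memoize on getFieldParser, repeated lookups of the same field type return the cached parser instead of rebuilding it; functools.lru_cache would give a similar effect with less code.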

try:
    for parse_field in field_parsers:
        fields.append(parse_field(parser, header, fields))
except EndOfRecordException:
    pass
cmars (Owner) commented Apr 9, 2025

This seems like a change in behavior [swallowing the end of record exception, I mean]. Why was this added? Could it subtly change the behavior of how the parser is used in an unexpected way? I'm concerned that it might... can you provide more context here to derisk that concern?

Asachoo (Author):

The EndOfRecordException handling that now appears in the createRecordParser method comes from the original appendFieldParser method.

The original appendFieldParser method did two things: it chained the field-parsing actions in order, and it handled this exception.

Now the memoize decorator provides only the caching. The ordering of the actions and the exception handling have been moved into createRecordParser (the original caller of appendFieldParser), where they are implemented explicitly.
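To make the comparison concrete, here is a sketch of the per-field wrapping pattern described above, with the exception swallowed once per field action; it is reconstructed from the description in this thread rather than copied from pystdf, and the EndOfRecordException class here is a stand-in:

class EndOfRecordException(Exception):
    # Stand-in for pystdf's end-of-record signal.
    pass

def appendFieldParser(fn, action):
    # Chain one field action onto an existing record-parsing function and
    # swallow EndOfRecordException for that single field.
    def newRecordParser(*args):
        fields = fn(*args)
        try:
            fields.append(action(*args))
        except EndOfRecordException:
            pass
        return fields
    return newRecordParser

With the decorator handling only caching, that per-field try/except collapses into the single loop shown in the diff above.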

@Asachoo Asachoo marked this pull request as draft April 9, 2025 06:31
@Asachoo Asachoo marked this pull request as ready for review April 9, 2025 09:03
Asachoo commented Apr 9, 2025

A new batchReadFields method has been added to read runs of the same field type in batches (the length of each run of consecutive identical fields is obtained through groupConsecutiveDuplicates).

Performance improves further (the time taken is about 75% of master's):

(pystdf) ➜  pystdf git:(speed_up) ✗ python test.py                                                                        
Time taken to import data/lot3.stdf: 0.4064 seconds for 4.35 MB of data (lot3)
Time taken to import data/lot2.stdf: 0.4240 seconds for 4.21 MB of data (lot2)
Time taken to import data/demofile.stdf: 0.4097 seconds for 4.35 MB of data (demofile)
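
For illustration, the grouping step could be expressed with itertools.groupby; this is a minimal sketch of what a helper like groupConsecutiveDuplicates might do, not the PR's actual implementation:

from itertools import groupby

def groupConsecutiveDuplicates(items):
    # Yield (value, run_length) pairs for each run of consecutive equal items.
    for value, run in groupby(items):
        yield value, sum(1 for _ in run)

# Example: runs of identical field types that batchReadFields could read in one call.
print(list(groupConsecutiveDuplicates(["U2", "U2", "U2", "R4", "R4", "B1"])))
# -> [('U2', 3), ('R4', 2), ('B1', 1)]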

Asachoo commented Apr 9, 2025

Batch processing of readCn looks like a promising next step for further performance gains.

However, this PR is already quite complex, and introducing batchReadFields has made it even more difficult to review and maintain.

To keep things manageable, I think it would be better to address the optimization of readCn in a separate PR.

This way, we can focus on refining the current changes while ensuring clarity and maintainability.


Let me know your thoughts!
