Add async support #169

Open
VMRuiz opened this issue Jan 17, 2023 · 8 comments

@VMRuiz

VMRuiz commented Jan 17, 2023

Scrapy now supports calling async methods from many components. It would be great to provide the option to call the Scrapinghub API using async methods, to avoid either blocking calls or having to work with Twisted Deferreds and threads.

In order to do that, there are several design decisions that need to be made:

Define the async interface

  • Do we add an async version for each method already available?
  • Do we create an AsyncScrapinghubClient that only provides support for async methods?
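For illustration, a separate `AsyncScrapinghubClient` could mirror the existing sync methods as coroutines. This is only a sketch: the class, method names, and return values below are hypothetical, not the real API.

```python
import asyncio


class AsyncScrapinghubClient:
    """Hypothetical async counterpart of ScrapinghubClient.

    Method names mirror the sync client; bodies simulate the I/O."""

    def __init__(self, apikey: str):
        self.apikey = apikey

    async def get_job(self, job_key: str) -> dict:
        # A real implementation would perform an async HTTP request
        # (e.g. with aiohttp); asyncio.sleep(0) stands in for the I/O.
        await asyncio.sleep(0)
        return {"key": job_key, "state": "finished"}


async def main() -> dict:
    client = AsyncScrapinghubClient(apikey="APIKEY")
    return await client.get_job("123/1/2")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With this shape, adding async versions of each existing method and shipping a separate async-only client are not mutually exclusive: the separate client simply becomes the home of the coroutine methods.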

How to implement async methods

Possible obstacles?

If we use python-scrapinghub in standalone mode, will it create its own event loop, or should it expect one to already exist?
If we run it inside Scrapy, could this cause conflicts?

I don't have much experience designing async libs, so it would be great if someone with a better understanding could provide some pointers here.

@Gallaecio
Member

I think a separate client class makes sense. As for everything else, I think an approach like that of https://github.com/zytedata/python-zyte-api would be best.

@pawelmhm
Member

pawelmhm commented Jan 18, 2023

If we use python-scrapinghub in standalone mode, will it create its own event loop?

asyncio.get_running_loop() should be fine, but I'm not sure about interop with Twisted in Scrapy. Because asyncio is still optional for Scrapy, many users would use Twisted, and there might be conflicts between asyncio and Twisted. Still, I'm not sure we need to worry too much about it, because there are several use cases for python-scrapinghub and many are outside Scrapy. If async were optional, users could simply opt out and use the blocking version without worries.
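As a sketch of that point: library coroutines can attach to whatever loop is already running and leave loop creation entirely to the caller (`asyncio.run()` for standalone scripts, the asyncio reactor inside Scrapy). The function name `get_job_metadata` is made up for illustration.

```python
import asyncio


async def get_job_metadata(job_key: str) -> dict:
    # Library code never creates a loop: inside a coroutine a loop is
    # guaranteed to be running, so get_running_loop() always succeeds.
    loop = asyncio.get_running_loop()
    await asyncio.sleep(0)  # stand-in for real async I/O
    return {"key": job_key, "loop_running": not loop.is_closed()}


# Standalone callers own the loop: asyncio.run() creates and closes it.
print(asyncio.run(get_job_metadata("123/1/2")))
```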

I guess the main problem will be that the async API will be breaking for many applications. You will have to call await before each method call, so users would have to update their code. It would be nice to have just one way of doing things and make the library async by default, but for backward compatibility it might make more sense to offer both async and non-async interfaces at the same time. That will add complexity and require more development effort to maintain.

So maybe a separate client would indeed be better? Then we could phase out this library and replace it with the new client.

@VMRuiz
Author

VMRuiz commented Jan 18, 2023

I guess the main problem will be that the async API will be breaking for many applications

That's right, my idea is to keep offering the old interface and allow people to migrate to the new one if they want.

So maybe a separate client would indeed be better? Then we could phase out this library and replace it with the new client.

Do you think it would be better to create a new library instead? I was reviewing the code and I found a lot of legacy code which I'm not sure is still used anywhere. That would make adding new features to this package much more difficult.

@Gallaecio
Member

Using the asyncio reactor in Scrapy should not be an issue, and we are promoting the asyncio Twisted reactor in Scrapy at the moment (the settings.py file generated by scrapy startproject sets the asyncio reactor).

I don’t think maintaining both interfaces, sync and async, in the same library, will add complexity. If anything, it will remove it. We can implement new features only in the async interface, to avoid duplicate work. Or we can port them to the sync interface but have the sync interface basically call the asyncio code. In any case, I would keep it in the same library.
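A sketch of that "sync calls the asyncio code" approach, with made-up names: the sync client is a thin facade that runs the async implementation to completion on each call. Note that `asyncio.run()` only works when no event loop is already running in the calling thread, so this pattern fits plain scripts rather than code already inside Scrapy's asyncio reactor.

```python
import asyncio


class _AsyncImpl:
    """All real logic lives here, written once as coroutines."""

    async def get_job(self, job_key: str) -> dict:
        await asyncio.sleep(0)  # stand-in for async HTTP I/O
        return {"key": job_key}


class SyncClient:
    """Thin blocking facade over the async implementation."""

    def __init__(self):
        self._impl = _AsyncImpl()

    def get_job(self, job_key: str) -> dict:
        # Spins up a fresh event loop per call and blocks until done.
        return asyncio.run(self._impl.get_job(job_key))


print(SyncClient().get_job("123/1/2"))
```

The design benefit is that new features are implemented once, in the coroutine layer, and the sync interface stays a mechanical wrapper.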

@kmike
Member

kmike commented Jan 19, 2023

I think if the interface is the same or similar, it makes sense to keep it in a single library.

If the API is significantly different, then a separate library looks good to me. It also allows separate releases, etc., and it makes it easier to justify not supporting all the features from day 1, which could be an expectation from users.

I'm thinking of the example of the scrapy-autoextract library, where we provided 2 ways of integration, and it seems it caused nothing but confusion.

@Gallaecio
Member

Well, another point for the separate library is that it gives us a chance for rebranding, i.e. python-zyte-scrapycloud.

@VMRuiz
Author

VMRuiz commented Jan 20, 2023

If we scope the lib to the APIs of Scrapy Cloud then it should be python-scrapycloud or python-zyte-scrapycloud as @Gallaecio proposes.

Also, given that we are still using the endpoints app.scrapinghub.com and storage.scrapinghub.com someone could argue that it makes sense to keep scrapinghub in the name. However, I think we shouldn't link the endpoint name to the library name.

@Ousret

Ousret commented Jan 21, 2024

Maybe you could consider Niquests as part of the solution? It's a drop-in replacement for Requests, with both sync and async interfaces (and more...).
I am willing to help on the subject.


7 participants