Add async support #169

Open
VMRuiz opened this issue Jan 17, 2023 · 8 comments

@VMRuiz

VMRuiz commented Jan 17, 2023

Scrapy now supports calling async methods from many components. It would be great to provide the option to call the Scrapinghub API using async methods, to avoid either blocking calls or having to work with Twisted Deferreds and threads.

In order to do that, there are several design decisions that need to be made:

Define the async interface

  • Do we add an async version for each method already available?
  • Do we create an AsyncScrapinghubClient that only provides support for async methods?
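For illustration, a separate `AsyncScrapinghubClient` could mirror the existing sync methods as coroutines. This is only a sketch: the class, method names, and return values below are hypothetical, not the real API.

```python
import asyncio


class AsyncScrapinghubClient:
    """Hypothetical async counterpart of ScrapinghubClient.

    Method names mirror the sync client; bodies simulate the I/O."""

    def __init__(self, apikey: str):
        self.apikey = apikey

    async def get_job(self, job_key: str) -> dict:
        # A real implementation would perform an async HTTP request
        # (e.g. with aiohttp); asyncio.sleep(0) stands in for the I/O.
        await asyncio.sleep(0)
        return {"key": job_key, "state": "finished"}


async def main() -> dict:
    client = AsyncScrapinghubClient(apikey="APIKEY")
    return await client.get_job("123/1/2")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With this shape, adding async versions of each existing method and shipping a separate async-only client are not mutually exclusive: the separate client simply becomes the home of the coroutine methods.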

How to implement async methods

Possible obstacles?

If we use python-scrapinghub in standalone mode, will it create its own event loop, or should it expect one to already exist?
If we run it inside Scrapy, could this cause conflicts?

I don't have much experience designing async libs, so it would be great if someone with a better understanding could provide some pointers here.

@Gallaecio
Member

I think a separate client class makes sense. As for everything else, I think an approach like that of https://github.com/zytedata/python-zyte-api would be best.

@pawelmhm
Member

pawelmhm commented Jan 18, 2023

If we use python-scrapinghub in standalone mode, will it create its own event loop?

asyncio.get_running_loop() should be fine, but I'm not sure about interop with Twisted in Scrapy. Because asyncio is still optional for Scrapy, many users would use Twisted, and there might be conflicts between asyncio and Twisted. Still, I'm not sure we need to worry too much about it, because there are several use cases for python-scrapinghub and many are outside Scrapy. If async were optional, users could simply opt out and use the blocking version without worries.
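As a sketch of that point: library coroutines can attach to whatever loop is already running and leave loop creation entirely to the caller (`asyncio.run()` for standalone scripts, the asyncio reactor inside Scrapy). The function name `get_job_metadata` is made up for illustration.

```python
import asyncio


async def get_job_metadata(job_key: str) -> dict:
    # Library code never creates a loop: inside a coroutine a loop is
    # guaranteed to be running, so get_running_loop() always succeeds.
    loop = asyncio.get_running_loop()
    await asyncio.sleep(0)  # stand-in for real async I/O
    return {"key": job_key, "loop_running": not loop.is_closed()}


# Standalone callers own the loop: asyncio.run() creates and closes it.
print(asyncio.run(get_job_metadata("123/1/2")))
```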

I guess the main problem will be that the async API will be breaking for many applications. You will have to call await before each method call, so users would have to update their code. It would be nice to have just one way of doing things and make the library async by default, but for backward compatibility it might make more sense to offer both async and non-async interfaces at the same time. That will add complexity and require more development effort to maintain.

So maybe a separate client would indeed be better? Then we could phase out this library and replace it with the new client.

@VMRuiz
Author

VMRuiz commented Jan 18, 2023

I guess the main problem will be that the async API will be breaking for many applications

That's right, my idea is to keep offering the old interface and allow people to migrate to the new one if they want.

So maybe a separate client would indeed be better? Then we could phase out this library and replace it with the new client.

Do you think it would be better to create a new library instead? I was reviewing the code and I found a lot of legacy code which I'm not sure is still used anywhere. That would make adding new features to this package much more difficult.

@Gallaecio
Member

Using the asyncio reactor in Scrapy should not be an issue, and we are promoting the asyncio Twisted reactor in Scrapy at the moment (the settings.py file generated by scrapy startproject sets the asyncio reactor).

I don’t think maintaining both interfaces, sync and async, in the same library, will add complexity. If anything, it will remove it. We can implement new features only in the async interface, to avoid duplicate work. Or we can port them to the sync interface but have the sync interface basically call the asyncio code. In any case, I would keep it in the same library.
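A sketch of that "sync calls the asyncio code" approach, with made-up names: the sync client is a thin facade that runs the async implementation to completion on each call. Note that `asyncio.run()` only works when no event loop is already running in the calling thread, so this pattern fits plain scripts rather than code already inside Scrapy's asyncio reactor.

```python
import asyncio


class _AsyncImpl:
    """All real logic lives here, written once as coroutines."""

    async def get_job(self, job_key: str) -> dict:
        await asyncio.sleep(0)  # stand-in for async HTTP I/O
        return {"key": job_key}


class SyncClient:
    """Thin blocking facade over the async implementation."""

    def __init__(self):
        self._impl = _AsyncImpl()

    def get_job(self, job_key: str) -> dict:
        # Spins up a fresh event loop per call and blocks until done.
        return asyncio.run(self._impl.get_job(job_key))


print(SyncClient().get_job("123/1/2"))
```

The design benefit is that new features are implemented once, in the coroutine layer, and the sync interface stays a mechanical wrapper.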

@kmike
Member

kmike commented Jan 19, 2023

I think if the interface is the same or similar, it makes sense to keep it in a single library.

If the API is significantly different, then a separate library looks good to me. It also allows separate releases, etc., and it makes it easier to justify not supporting all the features from day 1, which could be an expectation from users.

I'm thinking of the example of the scrapy-autoextract library, where we provided 2 ways of integration, and it seems it caused nothing but confusion.

@Gallaecio
Member

Well, another point for the separate library is that it gives us a chance for rebranding, i.e. python-zyte-scrapycloud.

@VMRuiz
Author

VMRuiz commented Jan 20, 2023

If we scope the lib to the APIs of Scrapy Cloud then it should be python-scrapycloud or python-zyte-scrapycloud as @Gallaecio proposes.

Also, given that we are still using the endpoints app.scrapinghub.com and storage.scrapinghub.com someone could argue that it makes sense to keep scrapinghub in the name. However, I think we shouldn't link the endpoint name to the library name.

@Ousret

Ousret commented Jan 21, 2024

Maybe you could consider Niquests as part of the solution? It's a drop-in replacement for Requests, with both sync and async interfaces (and more...).
I am willing to help on the subject.


7 participants