
Commit 35f900b

✨ Add shutdown event and save per page option (#102)

* ✨ Add shutdown event and save per page option
* Update documentation and tests
* Lint
* Add docstring
* Bump version
* Update feature list
* Add test for shutdown

1 parent a2f3097 commit 35f900b

27 files changed, +622 −230 lines changed

README.md (+3 −9)

@@ -113,7 +113,9 @@ The output in `data.json` should contain the actual URL and the metadata prepend
 - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`
 - [Pyppeteer](https://roniemartinez.github.io/dude/advanced/12_pyppeteer.html) - `pip install pydude[pyppeteer]`
 - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`
-- Option to follow all links indefinitely (Crawler/Spider). WARNING: Do not use yet until https://github.com/roniemartinez/dude/pull/27 has been implemented.
+- Option to follow all links indefinitely (Crawler/Spider).
+- Events - attach functions to startup, pre-setup, post-setup and shutdown events.
+- Option to save data on every page.
 
 ## Supported Parser Backends
 
@@ -219,14 +221,6 @@ Here is the summary of features supported by each parser backend.
 Read the complete documentation at [https://roniemartinez.github.io/dude/](https://roniemartinez.github.io/dude/).
 All the advanced and useful features are documented there.
 
-## Support
-
-This project is at a very early stage. This dude needs some love! ❤️
-
-Contribute to this project by feature requests, idea discussions, reporting bugs, opening pull requests, or through Github Sponsors. Your help is highly appreciated.
-
-[![Github Sponsors](https://img.shields.io/github/sponsors/roniemartinez?label=github%20sponsors&logo=github%20sponsors&style=for-the-badge)](https://github.com/sponsors/roniemartinez)
-
 ## Requirements
 
 - ✅ Any dude should know how to work with selectors (CSS or XPath).

docs/advanced/06_custom_storage.md (+38 −1)

@@ -42,6 +42,43 @@ The custom storage above can then be called using any of the options below.
     dude.run(urls=["<url>"], pages=2, format="table")
     ```
 
+## Saving on every page
+
+It is possible to call the save functions after each page.
+This is useful when running in spider mode to prevent loss of data.
+To make use of this option, set the `is_per_page` flag in the `@save()` decorator to `True`.
+
+=== "Python"
+
+    ```python
+    @save("table", is_per_page=True)
+    def save_table(data, output) -> bool:
+        ...
+    ```
+
+To run the scraper with per-page saving, pass the `--save-per-page` argument.
+
+=== "Terminal"
+
+    ```bash
+    dude scrape --url "<url>" path/to/script.py --format table --save-per-page
+    ```
+
+=== "Python"
+
+    ```python
+    if __name__ == "__main__":
+        import dude
+
+        dude.run(urls=["<url>"], pages=2, format="table", save_per_page=True)
+    ```
+
+!!! note
+
+    The option `--save-per-page` is best used with events to make sure that connections or file handles are opened
+    and closed properly. Check the examples below.
+
 ## Examples
 
-A more extensive example can be found at [examples/custom_storage.py](https://github.com/roniemartinez/dude/tree/master/examples/custom_storage.py).
+A more extensive example can be found at [examples/custom_storage.py](https://github.com/roniemartinez/dude/tree/master/examples/custom_storage.py) and
+[examples/save_per_page.py](https://github.com/roniemartinez/dude/tree/master/examples/save_per_page.py).
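As a concrete illustration of the note above (not part of the commit): a minimal sketch pairing a per-page save function with the startup and shutdown events so a file handle outlives individual page saves. The file name and helper names are hypothetical; the decorators and the `(data, output)` save signature come from the docs diffed above.

```python
import csv

from dude import save, shutdown, startup

csv_file = None  # hypothetical module-level handle, opened once per run


@startup()
def open_csv():
    # Open the output file once, before scraping starts.
    global csv_file
    csv_file = open("output.csv", "w", newline="")


@save("csv", is_per_page=True)
def append_rows(data, output) -> bool:
    # Called after every page; append this page's flattened rows.
    writer = csv.writer(csv_file)
    for item in data:
        writer.writerow(item.values())
    return True  # a truthy return tells dude the save succeeded


@shutdown()
def close_csv():
    # Close the handle only after the last page has been saved.
    csv_file.close()
```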

docs/advanced/14_events.md (+17)

@@ -67,3 +67,20 @@ def print_pdf(page):
     unique_name = str(uuid.uuid4())
     page.pdf(path=SAVE_DIR / f"{unique_name}.pdf")
 ```
+
+## Shutdown Event
+
+The Shutdown event is executed before terminating the application.
+
+The `@shutdown()` decorator can be used to register a function for shutdown.
+
+```python
+import shutil
+
+from dude import shutdown
+
+
+@shutdown()
+def zip_all():
+    shutil.make_archive("images-and-pdfs", "zip", SAVE_DIR)
+```
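One detail the docs do not spell out, though the dude/base.py diff below supports it: event handlers may also be coroutine functions, since the event runner checks `asyncio.iscoroutinefunction()` and awaits them on the event loop. A hedged sketch with a hypothetical async cleanup:

```python
import asyncio

from dude import shutdown


@shutdown()
async def flush_pending():
    # Hypothetical async cleanup, e.g. awaiting a client session's close().
    await asyncio.sleep(0)
```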

docs/cli.md (+6 −6)

@@ -3,10 +3,10 @@
 === "CLI"
 
     ```commandline
-    usage: dude scrape [-h] --url URL [--playwright | --bs4 | --parsel | --lxml | --pyppeteer | --selenium] [--headed] [--browser {chromium,firefox,webkit}] [--pages PAGES] [--output OUTPUT]
-                       [--format FORMAT] [--proxy-server PROXY_SERVER] [--proxy-user PROXY_USER] [--proxy-pass PROXY_PASS] [--follow-urls]
-                       PATH [PATH ...]
-
+    usage: dude scrape [-h] --url URL [--playwright | --bs4 | --parsel | --lxml | --pyppeteer | --selenium] [--headed] [--browser {chromium,firefox,webkit}] [--pages PAGES] [--output OUTPUT] [--format FORMAT] [--proxy-server PROXY_SERVER] [--proxy-user PROXY_USER]
+                       [--proxy-pass PROXY_PASS] [--follow-urls] [--save-per-page]
+                       PATH [PATH ...]
+
     Run the dude scraper.
@@ -28,13 +28,13 @@
                             Browser type to use.
       --pages PAGES         Maximum number of pages to crawl before exiting (default=1). This is only valid when a navigate handler is defined.
       --output OUTPUT       Output file. If not provided, prints into the terminal.
-      --format FORMAT       Output file format. If not provided, uses the extension of the output file or defaults to "json". Supports "json", "yaml/yml", and "csv" but can be extended using the @save()
-                            decorator.
+      --format FORMAT       Output file format. If not provided, uses the extension of the output file or defaults to "json". Supports "json", "yaml/yml", and "csv" but can be extended using the @save() decorator.
       --proxy-server PROXY_SERVER
                             Proxy server.
       --proxy-user PROXY_USER
                             Proxy username.
       --proxy-pass PROXY_PASS
                             Proxy password.
       --follow-urls         Automatically follow URLs.
+      --save-per-page       Flag to save data on every page extraction or not. If not, saves all the data at the end. If --follow-urls is set to true, this variable will be automatically set to true.
     ```

docs/diagrams/events.png (binary, 27.7 KB)

docs/features.md (+3 −1)

@@ -15,4 +15,6 @@
 - [lxml](https://lxml.de/) - `pip install pydude[lxml]`
 - [Pyppeteer](https://github.com/pyppeteer/pyppeteer) - `pip install pydude[pyppeteer]`
 - [Selenium](https://github.com/SeleniumHQ/Selenium) - `pip install pydude[selenium]`
-- Option to follow all links indefinitely (Crawler/Spider). WARNING: Do not use yet until https://github.com/roniemartinez/dude/pull/27 has been implemented.
+- Option to follow all links indefinitely (Crawler/Spider).
+- Events - attach functions to startup, pre-setup, post-setup and shutdown events.
+- Option to save data on every page.

dude/__init__.py (+11 −2)

@@ -3,10 +3,10 @@
 from pathlib import Path
 from typing import Any
 
-from .context import group, post_setup, pre_setup, run, save, select, startup  # noqa: F401
+from .context import group, post_setup, pre_setup, run, save, select, shutdown, startup  # noqa: F401
 from .scraper import Scraper  # noqa: F401
 
-__all__ = ["Scraper", "group", "run", "save", "select", "startup", "pre_setup", "post_setup"]
+__all__ = ["Scraper", "group", "run", "save", "select", "startup", "shutdown", "pre_setup", "post_setup"]
 
 
 logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
@@ -148,6 +148,14 @@ def cli() -> None:  # pragma: no cover
         action="store_true",
         help="Automatically follow URLs.",
     )
+    optional.add_argument(
+        "--save-per-page",
+        dest="save_per_page",
+        default=False,
+        action="store_true",
+        help="Flag to save data on every page extraction or not. If not, saves all the data at the end. "
+        "If --follow-urls is set to true, this variable will be automatically set to true.",
+    )
     arguments = parser.parse_args()
 
     if arguments.version:
@@ -204,4 +212,5 @@ def cli() -> None:  # pragma: no cover
         output=arguments.output,
         format=arguments.format,
         follow_urls=arguments.follow_urls,
+        save_per_page=arguments.save_per_page,
     )

dude/base.py (+55 −16)

@@ -43,14 +43,14 @@ def __init__(
         self,
         rules: List[Rule] = None,
         groups: Dict[Callable, Selector] = None,
-        save_rules: Dict[str, Any] = None,
+        save_rules: Dict[Tuple[str, bool], Any] = None,
         events: Optional[DefaultDict] = None,
         has_async: bool = False,
         scraper: Optional["ScraperAbstract"] = None,
     ) -> None:
         self.rules: List[Rule] = rules or []
         self.groups: Dict[Callable, Selector] = groups or {}
-        self.save_rules: Dict[str, Any] = save_rules or {"json": save_json}
+        self.save_rules: Dict[Tuple[str, bool], Any] = save_rules or {("json", False): save_json}
         self.events: DefaultDict = events or collections.defaultdict(list)
         self.has_async = has_async
         self.scraper = scraper
@@ -67,6 +67,7 @@ def run(
         output: Optional[str],
         format: str,
         follow_urls: bool = False,
+        save_per_page: bool = False,
     ) -> None:
         """
         Abstract method for executing the scraper.
@@ -77,6 +78,7 @@ def run(
         :param output: Output file. If not provided, prints in the terminal.
         :param format: Output file format. If not provided, uses the extension of the output file or defaults to json.
         :param follow_urls: Automatically follow URLs.
+        :param save_per_page: Flag to save data on every page extraction or not. If not, saves all the data at the end.
         """
         raise NotImplementedError  # pragma: no cover
 
@@ -188,19 +190,20 @@ def wrapper(func: Callable) -> Union[Callable, Coroutine]:
 
         return wrapper
 
-    def save(self, format: str) -> Callable:
+    def save(self, format: str, is_per_page: bool = False) -> Callable:
         """
         Decorator to register a save function to a format.
 
         :param format: Format (json, csv, or any custom string).
+        :param is_per_page: Flag to identify if func will be called after each page.
         """
 
         def wrapper(func: Callable) -> Callable:
             if asyncio.iscoroutinefunction(func):
                 self.has_async = True
 
             save_rules = self.scraper.save_rules if self.scraper else self.save_rules
-            save_rules[format] = func
+            save_rules[format, is_per_page] = func
             return func
 
         return wrapper
@@ -258,6 +261,24 @@ def wrapper(func: Callable) -> Callable:
 
         return wrapper
 
+    def shutdown(self) -> Callable:
+        """
+        Decorator to register a function to the shutdown events.
+
+        Shutdown events are executed before terminating the application for cleaning up or closing resources like
+        files and database sessions.
+        """
+
+        def wrapper(func: Callable) -> Callable:
+            if asyncio.iscoroutinefunction(func):
+                self.has_async = True
+
+            events = self.scraper.events if self.scraper else self.events
+            events["shutdown"].append(func)
+            return func
+
+        return wrapper
+
     def iter_urls(self) -> Iterable[str]:
         try:
             while True:
@@ -291,11 +312,20 @@ def event_startup(self) -> None:
         """
         Run all startup events
         """
+        self.run_event("startup")
+
+    def event_shutdown(self) -> None:
+        """
+        Run all shutdown events
+        """
+        self.run_event("shutdown")
+
+    def run_event(self, event_name: str) -> None:
         loop = None
         if self.has_async:
             loop = asyncio.get_event_loop()
 
-        for func in self.events["startup"]:
+        for func in self.events[event_name]:
             if asyncio.iscoroutinefunction(func):
                 assert loop is not None
                 loop.run_until_complete(func())
@@ -308,7 +338,7 @@ def __init__(
         self,
         rules: List[Rule] = None,
         groups: Dict[Callable, Selector] = None,
-        save_rules: Dict[str, Any] = None,
+        save_rules: Dict[Tuple[str, bool], Any] = None,
         events: Optional[DefaultDict] = None,
         has_async: bool = False,
     ) -> None:
@@ -441,29 +471,39 @@ def get_flattened_data(self) -> List[Dict]:
                 items.append(item)
         return items
 
-    def _save(self, format: str, output: Optional[str] = None) -> None:
+    def _save(self, format: str, output: Optional[str] = None, save_per_page: bool = False) -> None:
         if output:
             extension = Path(output).suffix.lower()[1:]
             format = extension
-
-        data = self.get_flattened_data()
         try:
-            if self.save_rules[format](data, output):
+            handler = self.save_rules[format, save_per_page]
+            data = self.get_flattened_data()
+            if not len(data):
+                logger.info(
+                    "No data was scraped. Skipped saving %s.",
+                    dict(format=format, output=output, save_per_page=save_per_page),
+                )
+                return
+            if handler(data, output):
                 self.collected_data.clear()
             else:
                 raise Exception("Failed to save output %s.", {"output": output, "format": format})
         except KeyError:
-            self.collected_data.clear()
             raise
 
-    async def _save_async(self, format: str, output: Optional[str] = None) -> None:
+    async def _save_async(self, format: str, output: Optional[str] = None, save_per_page: bool = False) -> None:
         if output:
             extension = Path(output).suffix.lower()[1:]
             format = extension
-
-        data = self.get_flattened_data()
         try:
-            handler = self.save_rules[format]
+            handler = self.save_rules[format, save_per_page]
+            data = self.get_flattened_data()
+            if not len(data):
+                logger.info(
+                    "No data was scraped. Skipped saving %s.",
+                    dict(format=format, output=output, save_per_page=save_per_page),
+                )
+                return
             if asyncio.iscoroutinefunction(handler):
                 is_successful = await handler(data, output)
             else:
@@ -473,7 +513,6 @@ async def _save_async(self, format: str, output: Optional[str] = None) -> None:
             else:
                 raise Exception("Failed to save output %s.", {"output": output, "format": format})
         except KeyError:
-            self.collected_data.clear()
             raise
 
     @staticmethod
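The tuple keying above means one format string can carry two independent handlers. A hypothetical sketch (names invented here, not from the commit) of how `@save` registration and the `_save` lookup line up:

```python
# Illustration only: with save rules keyed by (format, is_per_page),
# the same format string maps to different handlers depending on
# whether saving happens per page or once at the end.
from typing import Any, Callable, Dict, Tuple

save_rules: Dict[Tuple[str, bool], Callable[[Any, Any], bool]] = {}


def save_table_final(data: Any, output: Any) -> bool:
    print("final save:", len(data), "items")
    return True


def save_table_per_page(data: Any, output: Any) -> bool:
    print("per-page save:", len(data), "items")
    return True


# @save("table") registers under ("table", False);
# @save("table", is_per_page=True) registers under ("table", True).
save_rules["table", False] = save_table_final
save_rules["table", True] = save_table_per_page

# _save(format, output, save_per_page) then resolves the matching handler:
handler = save_rules["table", True]
handler([{"a": 1}], None)  # -> per-page save: 1 items
```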

dude/context.py (+1)

@@ -11,6 +11,7 @@
 save = _scraper.save
 select = _scraper.select
 startup = _scraper.startup
+shutdown = _scraper.shutdown
 pre_setup = _scraper.pre_setup
 post_setup = _scraper.post_setup
