What's Changed

✨ Add shutdown event and save per page option by @roniemartinez in #102

Other

⬆️ Bump playwright from 1.20.0 to 1.20.1 by @dependabot in #101
⬆️ Bump mypy from 0.941 to 0.942 by @dependabot in #104
⬆️ Bump mkdocs-material from 8.2.6 to 8.2.7 by @dependabot in #105

Full Changelog: 0.11.0...0.12.0

✨ Save data on each page

You can now save data after each page is scraped. To use this, decorate the save function with is_per_page=True and run the scraper with --save-per-page.

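import json

from dude import save


@save("jsonl", is_per_page=True)
def save_jsonl(data, output) -> bool:
    # jsonl_file is assumed to be a module-level file handle opened elsewhere (e.g. in a startup event).
    jsonl_file.writelines(json.dumps(item) + "\n" for item in data)
    return True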
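The snippet above relies on jsonl_file already being open. As a minimal, library-independent sketch of the same idea, the code below keeps one handle open for the whole run and appends each page's items as JSON lines. The helper name append_page and the sample items are hypothetical and are not part of dude.

```python
import io
import json
from typing import Dict, Iterable, TextIO


def append_page(handle: TextIO, items: Iterable[Dict]) -> bool:
    """Append one page's items to an open handle, one JSON object per line."""
    handle.writelines(json.dumps(item) + "\n" for item in items)
    return True


# An in-memory buffer stands in for a file that would be opened once at startup.
buffer = io.StringIO()

# Hypothetical items standing in for two scraped pages.
append_page(buffer, [{"title": "a"}, {"title": "b"}])
append_page(buffer, [{"title": "c"}])

lines = buffer.getvalue().splitlines()
```

Because each call only appends, data from earlier pages is already written if a later page fails.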

✨ Shutdown event

The shutdown event is called before the application terminates. It is useful for releasing resources such as file handles and database connections, or for other cleanup tasks.

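import shutil

from dude import shutdown

@shutdown()
def zip_all():
    # SAVE_DIR is assumed to be a module-level path set elsewhere (e.g. in a startup event).
    shutil.make_archive("images-and-pdfs", "zip", SAVE_DIR)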
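To see what the archive step produces without running a scraper, here is a small standalone sketch of shutil.make_archive. The temporary directory and the page.png file are stand-ins invented for illustration; only the archive name comes from the example above.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# A temporary directory stands in for both SAVE_DIR and the output location.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / "temp"
    save_dir.mkdir()
    (save_dir / "page.png").write_bytes(b"fake image bytes")

    # base_name carries no extension: make_archive appends ".zip"
    # and returns the full path of the archive it created.
    archive_path = shutil.make_archive(
        str(Path(tmp) / "images-and-pdfs"), "zip", save_dir
    )
    archive_name = Path(archive_path).name

    # Inspect the archive while the temporary directory still exists.
    with zipfile.ZipFile(archive_path) as archive:
        archived_files = archive.namelist()
```

Paths inside the archive are relative to the directory passed as the third argument, so the zip contains page.png rather than temp/page.png.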

✨ How dude runs internally

What's Changed

Features

✨ Events by @roniemartinez in #99
🔗 Follow URLs by @roniemartinez in #90

Documentation

📚 Update docs by @roniemartinez in #93

Fixes

💚 Fix Actions rate limit error by @roniemartinez in #81
🐛 Fix DevToolsActivePort file doesn't exist by @roniemartinez in #84
🐛 Fix selenium failing on Windows by @roniemartinez in #94

Other

⬆️ Bump selenium-wire from 4.6.2 to 4.6.3 by @dependabot in #80
⬆️ Bump mypy from 0.931 to 0.941 by @dependabot in #82
⬆️ Bump pytest from 7.0.1 to 7.1.0 by @dependabot in #78
⬆️ Bump braveblock from 0.1.13 to 0.2.0 by @dependabot in #83
⬆️ Bump playwright from 1.19.1 to 1.20.0 by @dependabot in #87
⬆️ Bump types-pyyaml from 6.0.4 to 6.0.5 by @dependabot in #88
⬆️ Bump pytest from 7.1.0 to 7.1.1 by @dependabot in #91
⬆️ Bump webdriver-manager from 3.5.3 to 3.5.4 by @dependabot in #97
⬆️ Bump mkdocs-material from 8.2.5 to 8.2.6 by @dependabot in #100

✨ Basic Spider

Example

dude scrape ... --follow-urls

if __name__ == "__main__":
    import dude

    dude.run(..., follow_urls=True)

✨ Events

More details at https://roniemartinez.github.io/dude/advanced/14_events.html

Example

import uuid
from pathlib import Path

from dude import post_setup, pre_setup, startup

SAVE_DIR: Path


@startup()
def initialize_csv():
    """
    Connection to databases or API and other use-cases can be done here before the web scraping process is started.
    """
    global SAVE_DIR
    SAVE_DIR = Path(__file__).resolve().parent / "temp"
    SAVE_DIR.mkdir(exist_ok=True)


@pre_setup()
def screenshot(page):
    """
    Perform actions here after loading a page (or after a successful HTTP response) and before modifying things in the
    setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.screenshot(path=SAVE_DIR / f"{unique_name}.png")  # noqa


@post_setup()
def print_pdf(page):
    """
    Perform actions here after running the setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.pdf(path=SAVE_DIR / f"{unique_name}.pdf")  # noqa


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"])
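The docstrings above imply an order: startup once, then pre_setup and post_setup around the setup stage for each page, with the shutdown event (added in 0.12.0) before the application terminates. The toy registry below only illustrates that ordering. It is not dude's implementation, and the names on, fire and run are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical registry: event name -> handlers, in registration order.
_handlers: DefaultDict[str, List[Callable]] = defaultdict(list)


def on(event: str) -> Callable:
    """Return a decorator that registers a function for the named event."""
    def register(func: Callable) -> Callable:
        _handlers[event].append(func)
        return func
    return register


def fire(event: str, *args) -> None:
    """Call every handler registered for the event."""
    for handler in _handlers[event]:
        handler(*args)


def run(urls: List[str]) -> None:
    """Toy run loop: startup once, pre/post setup per page, shutdown once."""
    fire("startup")
    for url in urls:
        fire("pre_setup", url)   # page loaded, setup stage not yet run
        # ... the setup stage would run here ...
        fire("post_setup", url)  # setup stage finished
    fire("shutdown")


calls: List[str] = []


@on("startup")
def record_startup() -> None:
    calls.append("startup")


@on("pre_setup")
def record_pre_setup(url: str) -> None:
    calls.append("pre_setup")


@on("post_setup")
def record_post_setup(url: str) -> None:
    calls.append("post_setup")


@on("shutdown")
def record_shutdown() -> None:
    calls.append("shutdown")


run(["https://dude.ron.sh"])
```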

[Diagram: when events are executed]

Full Changelog: 0.10.1...0.11.0

Releases: roniemartinez/dude

🔨 Run adblock on HTTPX request event hook

✨ Use fnmatch

fnmatch: The URL pattern matcher now uses Unix-style wildcards (fnmatch) instead of regular expressions.
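For readers used to regex patterns, the sketch below shows how Unix-style wildcards behave on URLs using Python's standard fnmatch module. The pattern and URLs are made up for illustration, and whether dude calls fnmatch or fnmatchcase internally is not stated here, so treat the choice of fnmatchcase as an assumption.

```python
import re
from fnmatch import fnmatchcase, translate

# Hypothetical URL pattern; "." is literal here, unlike in a regex.
pattern = "https://dude.ron.sh/blog/*"

matches_post = fnmatchcase("https://dude.ron.sh/blog/post-1", pattern)
matches_home = fnmatchcase("https://dude.ron.sh/", pattern)

# "*" matches any run of characters, including "/", so nested paths match too.
matches_nested = fnmatchcase("https://dude.ron.sh/blog/2022/03/post", pattern)

# translate() converts the wildcard pattern into an equivalent regex string.
regex = re.compile(translate(pattern))
regex_matches_post = regex.match("https://dude.ron.sh/blog/post-1") is not None
```

Note that "?" is also a wildcard (it matches any single character), which matters for URLs that contain query strings.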

✨ Make return value of decorated functions optional

🐛 Fix PlaywrightScraper overwriting output file

🔨 Refactor for Alpha

🏁 Fix Windows support

✨ Block ads

🔧 Disable notifications
