Releases: roniemartinez/dude
🔨 Run adblock on HTTPX request event hook
What's Changed
- 🔨 Run adblock on HTTPX request event hook by @roniemartinez in #126
- docs: add roniemartinez as a contributor for maintenance, code, doc, infra by @allcontributors in #125
New Contributors
- @allcontributors made their first contribution in #125
Full Changelog: 0.14.0...0.15.0
✨ Use fnmatch
What's Changed
- ✨ Use fnmatch by @roniemartinez in #122
Other
- ⬆️ Bump pyproject-flake8 from 0.0.1a2 to 0.0.1a3 by @dependabot in #120
- ⬆️ Bump black from 22.1.0 to 22.3.0 by @dependabot in #121
fnmatch: the URL pattern matcher now uses Unix-style wildcards (fnmatch) instead of regular expressions.
See: https://docs.python.org/3/library/fnmatch.html
Wildcards are easier to understand and simpler to use than regular expressions.
- @select(css=".title", url=r".*\.com")
+ @select(css=".title", url="*.com/*")
  def result_title(element):
      return {"title": element.text_content()}
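To get a feel for how Unix-style wildcards behave on URLs, the standard library's fnmatch module can be tried on its own. This is a minimal sketch and not dude's internal matching code: the `matches` helper and the sample URLs are made up for illustration, and `fnmatchcase` is used here only so that the result does not depend on the operating system.

```python
from fnmatch import fnmatchcase


def matches(url: str, pattern: str) -> bool:
    """Return True if the URL matches the Unix-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform;
    # "*" matches any run of characters, including "/" and ".".
    return fnmatchcase(url, pattern)


if __name__ == "__main__":
    pattern = "*.com/*"
    for url in (
        "https://example.com/news",
        "https://example.org/news",
        "https://example.com",
    ):
        print(url, matches(url, pattern))
```

Note that `*.com/*` requires a slash after `.com`, so a bare `https://example.com` does not match this pattern.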
Full Changelog: 0.13.0...0.14.0
✨ Make return value of decorated functions optional
What's Changed
- ✨ Make return value of decorated functions optional by @roniemartinez in #119
Full Changelog: 0.12.2...0.13.0
🐛 Fix PlaywrightScraper overwriting output file
What's Changed
- 🐛 Fix PlaywrightScraper overwriting output file by @roniemartinez in #118
Full Changelog: 0.12.1...0.12.2
🔨 Refactor for Alpha
✨ Add shutdown event and save per page option
What's Changed
- ✨ Add shutdown event and save per page option by @roniemartinez in #102
Other
- ⬆️ Bump playwright from 1.20.0 to 1.20.1 by @dependabot in #101
- ⬆️ Bump mypy from 0.941 to 0.942 by @dependabot in #104
- ⬆️ Bump mkdocs-material from 8.2.6 to 8.2.7 by @dependabot in #105
Full Changelog: 0.11.0...0.12.0
✨ Save data on each page
You can now save data after each page is scraped. To use this, decorate the save function with is_per_page=True and run the scraper with --save-per-page.
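import json
@save("jsonl", is_per_page=True)
def save_jsonl(data, output) -> bool:
    # jsonl_file is assumed to be a file object opened elsewhere (e.g. in a startup handler).
    global jsonl_file
    jsonl_file.writelines((json.dumps(item) + "\n" for item in data))
    return True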
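JSON Lines is a natural fit for per-page saving, because each page's items can be appended without rewriting earlier output. The sketch below shows that idea with the standard library only. It does not use dude and is not how dude stores data; `append_jsonl` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def append_jsonl(path: Path, items: Iterable[Dict[str, Any]]) -> int:
    """Append each item as one JSON line and return how many lines were written."""
    count = 0
    # Append mode keeps the lines already written for earlier pages.
    with path.open("a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp()) / "output.jsonl"
    # Simulate two pages being saved one after the other.
    append_jsonl(out, [{"title": "page 1, item 1"}, {"title": "page 1, item 2"}])
    append_jsonl(out, [{"title": "page 2, item 1"}])
    print(out.read_text(encoding="utf-8"))
```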
✨ Shutdown event
The shutdown event is called before the application terminates. This is useful for freeing resources such as file handles and database connections, or for other cleanup before exit.
@shutdown()
def zip_all():
    global SAVE_DIR
    shutil.make_archive("images-and-pdfs", "zip", SAVE_DIR)
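The archiving step inside `zip_all` can be tried outside a scraper. The following is a rough standalone sketch using only the standard library; `zip_directory`, the temporary directory, and the placeholder file are assumptions made for the demonstration and are not part of dude.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_base: Path) -> Path:
    """Zip the contents of source_dir into '<archive_base>.zip' and return its path."""
    # make_archive appends the ".zip" extension itself and returns the archive's file name.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir)))


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    save_dir = work / "temp"
    save_dir.mkdir()
    (save_dir / "example.txt").write_text("placeholder for a screenshot or PDF")

    # Keep the archive outside the directory being zipped so it does not include itself.
    archive = zip_directory(save_dir, work / "images-and-pdfs")
    with zipfile.ZipFile(archive) as zf:
        print(archive.name, zf.namelist())
```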
✨ How dude runs internally
✨ Events and Basic Spider
What's Changed
Features
- ✨ Events by @roniemartinez in #99
- 🔗 Follow URLs by @roniemartinez in #90
Documentation
- 📚 Update docs by @roniemartinez in #93
Fixes
- 💚 Fix Actions rate limit error by @roniemartinez in #81
- 🐛 Fix DevToolsActivePort file doesn't exist by @roniemartinez in #84
- 🐛 Fix selenium failing on Windows by @roniemartinez in #94
Other
- ⬆️ Bump selenium-wire from 4.6.2 to 4.6.3 by @dependabot in #80
- ⬆️ Bump mypy from 0.931 to 0.941 by @dependabot in #82
- ⬆️ Bump pytest from 7.0.1 to 7.1.0 by @dependabot in #78
- ⬆️ Bump braveblock from 0.1.13 to 0.2.0 by @dependabot in #83
- ⬆️ Bump playwright from 1.19.1 to 1.20.0 by @dependabot in #87
- ⬆️ Bump types-pyyaml from 6.0.4 to 6.0.5 by @dependabot in #88
- ⬆️ Bump pytest from 7.1.0 to 7.1.1 by @dependabot in #91
- ⬆️ Bump webdriver-manager from 3.5.3 to 3.5.4 by @dependabot in #97
- ⬆️ Bump mkdocs-material from 8.2.5 to 8.2.6 by @dependabot in #100
✨ Basic Spider
Example
dude scrape ... --follow-urls
or
if __name__ == "__main__":
    import dude
    dude.run(..., follow_urls=True)
✨ Events
More details at https://roniemartinez.github.io/dude/advanced/14_events.html
Example
import uuid
from pathlib import Path
from dude import post_setup, pre_setup, startup
SAVE_DIR: Path
@startup()
def initialize_csv():
    """
    Connection to databases or API and other use-cases can be done here before the web scraping process is started.
    """
    global SAVE_DIR
    SAVE_DIR = Path(__file__).resolve().parent / "temp"
    SAVE_DIR.mkdir(exist_ok=True)
@pre_setup()
def screenshot(page):
    """
    Perform actions here after loading a page (or after a successful HTTP response) and before modifying things in the
    setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.screenshot(path=SAVE_DIR / f"{unique_name}.png")  # noqa
@post_setup()
def print_pdf(page):
    """
    Perform actions here after running the setup stage.
    """
    unique_name = str(uuid.uuid4())
    page.pdf(path=SAVE_DIR / f"{unique_name}.pdf")  # noqa
if __name__ == "__main__":
    import dude
    dude.run(urls=["https://dude.ron.sh"])
Diagram showing when events are executed
Full Changelog: 0.10.1...0.11.0
🏁 Fix Windows support
✨ Block ads
What's Changed
Added
- ✨ Block ads by @roniemartinez in #74
Changed
- 🔨 Refactor and update docs by @roniemartinez in #75
Full Changelog: 0.9.2...0.10.0