Commit babff66

🔨 Refactor and update docs (#75)
* 🔨 Refactor and update docs
* Change to workflow_dispatch
1 parent d51bb02 commit babff66

16 files changed, +55 -55 lines changed

.github/workflows/documentation.yml

+1
@@ -7,6 +7,7 @@ on:
   push:
     tags:
       - '*'
+  workflow_dispatch:
 
 concurrency:
   group: ${{ github.ref }}

README.md

+9-9
@@ -76,24 +76,24 @@ dude scrape --url "<url>" --output data.json path/to/script.py
 - Navigate function - enable navigation steps to move to other pages.
 - Custom storage - option to save data to other formats or database.
 - Async support - write async handlers.
-- Option to use other parsers aside from Playwright.
+- Option to use other parser backends aside from Playwright.
     - [BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html) - `pip install pydude[bs4]`
     - [Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html) - `pip install pydude[parsel]`
     - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`
     - [Pyppeteer](https://roniemartinez.github.io/dude/advanced/12_pyppeteer.html) - `pip install pydude[pyppeteer]`
     - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`
 
-## Supported Parsers
+## Supported Parser Backends
 
-By default, Dude uses Playwright but gives you an option to use parsers that you are familiar with.
+By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.
 It is possible to use parser backends like
-[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/),
-[Parsel](https://github.com/scrapy/parsel),
-[lxml](https://lxml.de/),
-[Pyppeteer](https://github.com/pyppeteer/pyppeteer),
-and [Selenium](https://github.com/SeleniumHQ/Selenium).
+[BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html),
+[Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html),
+[lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html),
+[Pyppeteer](https://roniemartinez.github.io/dude/advanced/12_pyppeteer.html),
+and [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html).
 
-Here is the summary of features supported by each parser.
+Here is the summary of features supported by each parser backend.
 
 <table>
 <thead>

docs/advanced/09_beautifulsoup4.md

+3-3
@@ -1,6 +1,6 @@
 # BeautifulSoup4 Scraper
 
-Option to use [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) as parser instead of Playwright has been added in [Release 0.2.0](https://github.com/roniemartinez/dude/releases/tag/0.2.0).
+Option to use [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) as parser backend instead of Playwright has been added in [Release 0.2.0](https://github.com/roniemartinez/dude/releases/tag/0.2.0).
 BeautifulSoup4 is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ BeautifulSoup4 is an optional dependency and can only be installed via `pip` usi
 
 ## Required changes to your script in order to use BeautifulSoup4
 
-Instead of ElementHandle objects when using Playwright as parser, Soup objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, Soup objects are passed to the decorated functions.
 
 
 === "Python"
@@ -36,7 +36,7 @@ Instead of ElementHandle objects when using Playwright as parser, Soup objects a
 
 ## Running Dude with BeautifulSoup4
 
-You can run BeautifulSoup4 parser using the `--bs4` command-line argument or `parser="bs4"` parameter to `run()`.
+You can run BeautifulSoup4 parser backend using the `--bs4` command-line argument or `parser="bs4"` parameter to `run()`.
 
 
 === "Terminal"

docs/advanced/10_parsel.md

+3-3
@@ -1,6 +1,6 @@
 # Parsel Scraper
 
-Option to use [Parsel](https://github.com/scrapy/parsel) as parser instead of Playwright has been added in [Release 0.5.0](https://github.com/roniemartinez/dude/releases/tag/0.5.0).
+Option to use [Parsel](https://github.com/scrapy/parsel) as parser backend instead of Playwright has been added in [Release 0.5.0](https://github.com/roniemartinez/dude/releases/tag/0.5.0).
 Parsel is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ Parsel is an optional dependency and can only be installed via `pip` using the c
 
 ## Required changes to your script in order to use Parsel
 
-Instead of ElementHandle objects when using Playwright as parser, Selector objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, Selector objects are passed to the decorated functions.
 
 
 === "Python"
@@ -37,7 +37,7 @@ Instead of ElementHandle objects when using Playwright as parser, Selector objec
 
 ## Running Dude with Parsel
 
-You can run Parsel parser using the `--parsel` command-line argument or `parser="parsel"` parameter to `run()`.
+You can run Parsel parser backend using the `--parsel` command-line argument or `parser="parsel"` parameter to `run()`.
 
 
 === "Terminal"

docs/advanced/11_lxml.md

+7-7
@@ -1,6 +1,6 @@
 # lxml Scraper
 
-Option to use [lxml](https://lxml.de/) as parser instead of Playwright has been added in [Release 0.6.0](https://github.com/roniemartinez/dude/releases/tag/0.6.0).
+Option to use [lxml](https://lxml.de/) as parser backend instead of Playwright has been added in [Release 0.6.0](https://github.com/roniemartinez/dude/releases/tag/0.6.0).
 lxml is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ lxml is an optional dependency and can only be installed via `pip` using the com
 
 ## Required changes to your script in order to use lxml
 
-Instead of ElementHandle objects when using Playwright as parser, [Element, "smart" strings, etc.](https://lxml.de/xpathxslt.html#xpath-return-values) objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, [Element, "smart" strings, etc.](https://lxml.de/xpathxslt.html#xpath-return-values) objects are passed to the decorated functions.
 
 
 === "Python"
@@ -24,10 +24,10 @@ Instead of ElementHandle objects when using Playwright as parser, [Element, "sma
     def result_url(href):
         return {"url": href} # (2)
 
-
-    # Option to get url using cssselect
-    @select(css="a.url", priority=2)
-    def result_url(element):
+
+    """Option to get url using cssselect""" # style.css hides a comment
+    @select(css="a.url")
+    def result_url_css(element):
         return {"url_css": element.attrib["href"]} # (3)
 
 
@@ -44,7 +44,7 @@ Instead of ElementHandle objects when using Playwright as parser, [Element, "sma
 
 ## Running Dude with lxml
 
-You can run lxml parser using the `--lxml` command-line argument or `parser="lxml"` parameter to `run()`.
+You can run lxml parser backend using the `--lxml` command-line argument or `parser="lxml"` parameter to `run()`.
 
 
 === "Terminal"

docs/advanced/12_pyppeteer.md

+3-3
@@ -1,6 +1,6 @@
 # Pyppeteer Scraper
 
-Option to use [Pyppeteer](https://github.com/pyppeteer/pyppeteer) as parser instead of Playwright has been added in [Release 0.8.0](https://github.com/roniemartinez/dude/releases/tag/0.8.0).
+Option to use [Pyppeteer](https://github.com/pyppeteer/pyppeteer) as parser backend instead of Playwright has been added in [Release 0.8.0](https://github.com/roniemartinez/dude/releases/tag/0.8.0).
 Pyppeteer is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -14,7 +14,7 @@ Pyppeteer is an optional dependency and can only be installed via `pip` using th
 
 ## Required changes to your script in order to use Pyppeteer
 
-Instead of Playwright's `ElementHandle` objects when using Playwright as parser, Pyppeteer has its own `ElementHandle` objects that are passed to the decorated functions.
+Instead of Playwright's `ElementHandle` objects when using Playwright as parser backend, Pyppeteer has its own `ElementHandle` objects that are passed to the decorated functions.
 The decorated functions will need to accept 2 arguments, `element` and `page` objects.
 This is needed because Pyppeteer elements does not expose a convenient function to get the text content.
 
@@ -46,7 +46,7 @@ This is needed because Pyppeteer elements does not expose a convenient function
 
 ## Running Dude with Pyppeteer
 
-You can run Pyppeteer parser using the `--pyppeteer` command-line argument or `parser="pyppeteer"` parameter to `run()`.
+You can run Pyppeteer parser backend using the `--pyppeteer` command-line argument or `parser="pyppeteer"` parameter to `run()`.
 
 === "Terminal"
 

docs/advanced/13_selenium.md

+3-3
@@ -1,6 +1,6 @@
 # Selenium Scraper
 
-Option to use [Selenium](https://github.com/SeleniumHQ/Selenium) as parser instead of Playwright has been added in [Release 0.9.0](https://github.com/roniemartinez/dude/releases/tag/0.9.0).
+Option to use [Selenium](https://github.com/SeleniumHQ/Selenium) as parser backend instead of Playwright has been added in [Release 0.9.0](https://github.com/roniemartinez/dude/releases/tag/0.9.0).
 Selenium is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ Selenium is an optional dependency and can only be installed via `pip` using the
 
 ## Required changes to your script in order to use Selenium
 
-Instead of Playwright's `ElementHandle` objects when using Playwright as parser, `WebElement` objects are passed to the decorated functions.
+Instead of Playwright's `ElementHandle` objects when using Playwright as parser backend, `WebElement` objects are passed to the decorated functions.
 
 === "Python"
 
@@ -31,7 +31,7 @@ Instead of Playwright's `ElementHandle` objects when using Playwright as parser,
 
 ## Running Dude with Selenium
 
-You can run Selenium parser using the `--selenium` command-line argument or `parser="selenium"` parameter to `run()`.
+You can run Selenium parser backend using the `--selenium` command-line argument or `parser="selenium"` parameter to `run()`.
 
 === "Terminal"
 

docs/features.md

+1-1
@@ -9,7 +9,7 @@
 - Navigate function - enable navigation steps to move to other pages.
 - Custom storage - option to save data to other formats or database.
 - Async support - write async handlers.
-- Option to use other parsers aside from Playwright.
+- Option to use other parser backends aside from Playwright.
     - [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - `pip install pydude[bs4]`
     - [Parsel](https://github.com/scrapy/parsel) - `pip install pydude[parsel]`
     - [lxml](https://lxml.de/) - `pip install pydude[lxml]`

docs/supported_parsers/index.md → docs/supported_parser_backends/index.md

+3-3
@@ -1,9 +1,9 @@
-# Supported Parsers
+# Supported Parser Backends
 
-By default, Dude uses Playwright but gives you an option to use parsers that you are familiar with.
+By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.
 It is possible to use parser backends like [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [Parsel](https://github.com/scrapy/parsel) and [lxml](https://lxml.de/).
 
-Here is the summary of features supported by each parser.
+Here is the summary of features supported by each parser backend.
 
 <table>
 <thead>

dude/optional/lxml_scraper.py

+1-1
@@ -16,7 +16,7 @@
 
 class LxmlScraper(ScraperAbstract):
     """
-    Scraper using lxml parser and HTTPX for requests
+    Scraper using lxml parser backend and HTTPX for requests
     """
 
     def run(

dude/optional/parsel_scraper.py

+1-1
@@ -15,7 +15,7 @@
 
 class ParselScraper(ScraperAbstract):
     """
-    Scraper using Parsel parser and HTTPX for requests
+    Scraper using Parsel parser backend and HTTPX for requests
     """
 
     def run(

dude/playwright_scraper.py

+12-13
@@ -1,7 +1,7 @@
 import asyncio
 import itertools
 import logging
-from typing import Any, AsyncIterable, Callable, Iterable, Optional, Sequence, Tuple, Union
+from typing import Any, AsyncIterable, Callable, Dict, Iterable, Optional, Sequence, Tuple, Union
 
 from playwright import async_api, sync_api
 from playwright.async_api import async_playwright
@@ -141,6 +141,13 @@ async def navigate_async(self, page: async_api.Page = None) -> bool:
             return True
         return False
 
+    @staticmethod
+    def _get_launch_kwargs(browser_type: str) -> Dict[str, Any]:
+        args = []
+        if browser_type == "chromium":
+            args.append("--disable-notifications")
+        return {"args": args, "firefox_user_prefs": {"dom.webnotifications.enabled": False}}
+
     def _run_sync(
         self,
         urls: Sequence[str],
@@ -151,14 +158,10 @@ def _run_sync(
         output: Optional[str],
         format: str,
     ) -> None:
+        launch_kwargs = self._get_launch_kwargs(browser_type)
         # FIXME: Coverage fails to cover anything within this context manager block
         with sync_playwright() as p:
-            args = []
-            if browser_type == "chromium":
-                args.append("--disable-notifications")
-            browser = p[browser_type].launch(
-                headless=headless, proxy=proxy, args=args, firefox_user_prefs={"dom.webnotifications.enabled": False}
-            )
+            browser = p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)
             page = browser.new_page()
             self._scrape_sync(page, urls, pages)
             browser.close()
@@ -186,13 +189,9 @@ async def _run_async(
         output: Optional[str],
         format: str,
     ) -> None:
+        launch_kwargs = self._get_launch_kwargs(browser_type)
         async with async_playwright() as p:
-            args = []
-            if browser_type == "chromium":
-                args.append("--disable-notifications")
-            browser = await p[browser_type].launch(
-                headless=headless, proxy=proxy, args=args, firefox_user_prefs={"dom.webnotifications.enabled": False}
-            )
+            browser = await p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)
             page = await browser.new_page()
             for url in urls:
                 await page.goto(url)
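
The `playwright_scraper.py` change above pulls the duplicated launch-argument logic out of `_run_sync` and `_run_async` into one helper. Below is a minimal standalone sketch of that helper, written as a plain function so it runs without Playwright installed. The name `get_launch_kwargs` (no leading underscore) and the two demo calls are illustrative assumptions, not part of the commit.

```python
from typing import Any, Dict, List


def get_launch_kwargs(browser_type: str) -> Dict[str, Any]:
    """Build the keyword arguments shared by the sync and async launch calls."""
    args: List[str] = []
    if browser_type == "chromium":
        # Only Chromium receives the command-line switch.
        args.append("--disable-notifications")
    # The Firefox preference is always included, mirroring the diff above.
    return {"args": args, "firefox_user_prefs": {"dom.webnotifications.enabled": False}}


chromium_kwargs = get_launch_kwargs("chromium")
firefox_kwargs = get_launch_kwargs("firefox")
print(chromium_kwargs["args"])  # ['--disable-notifications']
print(firefox_kwargs["args"])   # []
```

Because the helper returns a plain dict, both call sites can unpack it with `**launch_kwargs`, and the mapping can be checked in a unit test without starting a browser.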

dude/scraper.py

+2-2
@@ -25,15 +25,15 @@ def run(
         browser_type: str = "chromium",
     ) -> None:
         """
-        Convenience method to handle switching between different types of parsers.
+        Convenience method to handle switching between different types of parser backends.
 
         :param urls: List of website URLs.
         :param pages: Maximum number of pages to crawl before exiting (default=1). This is only used when a navigate handler is defined. # noqa
         :param proxy: Proxy settings.
         :param output: Output file. If not provided, prints in the terminal.
         :param format: Output file format. If not provided, uses the extension of the output file or defaults to json.
 
-        :param parser: Parser type ["playwright" (default), "bs4", "parsel, "lxml", "pyppeteer" or "selenium"]
+        :param parser: Parser backend ["playwright" (default), "bs4", "parsel, "lxml", "pyppeteer" or "selenium"]
         :param headless: Enables headless browser. (default=True)
         :param browser_type: Playwright supported browser types ("chromium", "chrome", "webkit", or "firefox").
         """

dude/storage.py

+3-3
@@ -27,7 +27,7 @@ def _save_json(data: List[Dict], output: str) -> None: # pragma: no cover
 
     with open(output, "w") as f:
         json.dump(data, f, indent=2)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
 
 
 def save_csv(data: List[Dict], output: Optional[str]) -> bool:
@@ -79,12 +79,12 @@ def _save_csv(data: List[Dict], output: str) -> None: # pragma: no cover
         writer = csv.DictWriter(f, fieldnames=headers)
         writer.writeheader()
         writer.writerows(rows)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
 
 
 def _save_yaml(data: List[Dict], output: str) -> None: # pragma: no cover
     import yaml
 
     with open(output, "w") as f:
         yaml.safe_dump(data, f)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
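
The `storage.py` change above swaps the generic "Data saved" message for one that reports the item count. The sketch below reproduces only the logging call, capturing the message in memory so the result is visible. The logger name `storage_demo`, the sample records, and the file name are placeholders chosen for illustration, and nothing is written to disk.

```python
import io
import logging

# Capture log output in memory instead of printing to the terminal.
stream = io.StringIO()
logger = logging.getLogger("storage_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

data = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
output = "data.json"  # placeholder; this sketch does not create the file

# Arguments are passed separately so logging interpolates them only
# when the record is actually emitted.
logger.info("%d items saved to %s.", len(data), output)

message = stream.getvalue().strip()
print(message)  # 2 items saved to data.json.
```

Passing `len(data)` and `output` as arguments, rather than pre-formatting the string, keeps the same lazy `%`-style formatting the original message used.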

mkdocs.yml

+3-3
@@ -41,9 +41,9 @@ nav:
     - lxml Scraper: advanced/11_lxml.md
     - Pyppeteer Scraper: advanced/12_pyppeteer.md
     - Selenium Scraper: advanced/13_selenium.md
-  - Supported Parsers:
-    - supported_parsers/index.md
-    - Migrating Your Web Scrapers to Dude: supported_parsers/migrating.md
+  - Supported Parser Backends:
+    - supported_parser_backends/index.md
+    - Migrating Your Web Scrapers to Dude: supported_parser_backends/migrating.md
   - cli.md
   - reference.md
 
4949
