🔨 Refactor and update docs (#75)

roniemartinez · web-flow · commit babff66202dc · 2022-03-13T14:13:42.000+01:00
* 🔨 Refactor and update docs

* Change to workflow_dispatch
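
The main code change in this commit moves the duplicated Playwright launch options out of `_run_sync` and `_run_async` into a shared `_get_launch_kwargs` static method. Below is a minimal sketch of that pattern, assuming nothing beyond what the diff shows. To keep it runnable without Playwright installed, `fake_launch` is a hypothetical stand-in that only echoes its arguments; it is not part of Dude or Playwright.

```python
from typing import Any, Dict, List


def get_launch_kwargs(browser_type: str) -> Dict[str, Any]:
    """Build the browser launch options once so both code paths can share them."""
    args: List[str] = []
    if browser_type == "chromium":
        # Chromium takes a command-line switch to suppress notification prompts.
        args.append("--disable-notifications")
    # As in the diff, the Firefox preference is included for every browser type.
    return {"args": args, "firefox_user_prefs": {"dom.webnotifications.enabled": False}}


def fake_launch(headless: bool = True, proxy: Any = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a browser launch call; it only echoes its arguments."""
    return {"headless": headless, "proxy": proxy, **kwargs}


# The same helper feeds every call site through keyword unpacking.
chromium_options = fake_launch(headless=True, proxy=None, **get_launch_kwargs("chromium"))
firefox_options = fake_launch(headless=True, proxy=None, **get_launch_kwargs("firefox"))
```

Building the dictionary outside the `with sync_playwright()` block also keeps that logic out of the region the FIXME comment says coverage cannot measure.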
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -7,6 +7,7 @@ on:
   push:
     tags:
       - '*'
+  workflow_dispatch:
 
 concurrency:
   group: ${{ github.ref }}
diff --git a/README.md b/README.md
@@ -76,24 +76,24 @@ dude scrape --url "<url>" --output data.json path/to/script.py
 - Navigate function - enable navigation steps to move to other pages.
 - Custom storage - option to save data to other formats or database.
 - Async support - write async handlers.
-- Option to use other parsers aside from Playwright.
+- Option to use other parser backends aside from Playwright.
   - [BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html) - `pip install pydude[bs4]`
   - [Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html) - `pip install pydude[parsel]`
   - [lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html) - `pip install pydude[lxml]`
   - [Pyppeteer](https://roniemartinez.github.io/dude/advanced/12_pyppeteer.html) - `pip install pydude[pyppeteer]`
   - [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html) - `pip install pydude[selenium]`
 
-## Supported Parsers
+## Supported Parser Backends
 
-By default, Dude uses Playwright but gives you an option to use parsers that you are familiar with.
+By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.
 It is possible to use parser backends like 
-[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), 
-[Parsel](https://github.com/scrapy/parsel),
-[lxml](https://lxml.de/),
-[Pyppeteer](https://github.com/pyppeteer/pyppeteer), 
-and [Selenium](https://github.com/SeleniumHQ/Selenium).
+[BeautifulSoup4](https://roniemartinez.github.io/dude/advanced/09_beautifulsoup4.html), 
+[Parsel](https://roniemartinez.github.io/dude/advanced/10_parsel.html),
+[lxml](https://roniemartinez.github.io/dude/advanced/11_lxml.html),
+[Pyppeteer](https://roniemartinez.github.io/dude/advanced/12_pyppeteer.html), 
+and [Selenium](https://roniemartinez.github.io/dude/advanced/13_selenium.html).
 
-Here is the summary of features supported by each parser.
+Here is the summary of features supported by each parser backend.
 
 <table>
 <thead>
diff --git a/docs/advanced/09_beautifulsoup4.md b/docs/advanced/09_beautifulsoup4.md
@@ -1,6 +1,6 @@
 # BeautifulSoup4 Scraper
 
-Option to use [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) as parser instead of Playwright has been added in [Release 0.2.0](https://github.com/roniemartinez/dude/releases/tag/0.2.0).
+Option to use [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) as parser backend instead of Playwright has been added in [Release 0.2.0](https://github.com/roniemartinez/dude/releases/tag/0.2.0).
 BeautifulSoup4 is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ BeautifulSoup4 is an optional dependency and can only be installed via `pip` usi
 
 ## Required changes to your script in order to use BeautifulSoup4
 
-Instead of ElementHandle objects when using Playwright as parser, Soup objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, Soup objects are passed to the decorated functions.
 
 
 === "Python"
@@ -36,7 +36,7 @@ Instead of ElementHandle objects when using Playwright as parser, Soup objects a
 
 ## Running Dude with BeautifulSoup4 
 
-You can run BeautifulSoup4 parser using the `--bs4` command-line argument or `parser="bs4"` parameter to `run()`.
+You can run BeautifulSoup4 parser backend using the `--bs4` command-line argument or `parser="bs4"` parameter to `run()`.
 
 
 === "Terminal"
diff --git a/docs/advanced/10_parsel.md b/docs/advanced/10_parsel.md
@@ -1,6 +1,6 @@
 # Parsel Scraper
 
-Option to use [Parsel](https://github.com/scrapy/parsel) as parser instead of Playwright has been added in [Release 0.5.0](https://github.com/roniemartinez/dude/releases/tag/0.5.0).
+Option to use [Parsel](https://github.com/scrapy/parsel) as parser backend instead of Playwright has been added in [Release 0.5.0](https://github.com/roniemartinez/dude/releases/tag/0.5.0).
 Parsel is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ Parsel is an optional dependency and can only be installed via `pip` using the c
 
 ## Required changes to your script in order to use Parsel
 
-Instead of ElementHandle objects when using Playwright as parser, Selector objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, Selector objects are passed to the decorated functions.
 
 
 === "Python"
@@ -37,7 +37,7 @@ Instead of ElementHandle objects when using Playwright as parser, Selector objec
 
 ## Running Dude with Parsel 
 
-You can run Parsel parser using the `--parsel` command-line argument or `parser="parsel"` parameter to `run()`.
+You can run Parsel parser backend using the `--parsel` command-line argument or `parser="parsel"` parameter to `run()`.
 
 
 === "Terminal"
diff --git a/docs/advanced/11_lxml.md b/docs/advanced/11_lxml.md
@@ -1,6 +1,6 @@
 # lxml Scraper
 
-Option to use [lxml](https://lxml.de/) as parser instead of Playwright has been added in [Release 0.6.0](https://github.com/roniemartinez/dude/releases/tag/0.6.0).
+Option to use [lxml](https://lxml.de/) as parser backend instead of Playwright has been added in [Release 0.6.0](https://github.com/roniemartinez/dude/releases/tag/0.6.0).
 lxml is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ lxml is an optional dependency and can only be installed via `pip` using the com
 
 ## Required changes to your script in order to use lxml
 
-Instead of ElementHandle objects when using Playwright as parser, [Element, "smart" strings, etc.](https://lxml.de/xpathxslt.html#xpath-return-values) objects are passed to the decorated functions.
+Instead of ElementHandle objects when using Playwright as parser backend, [Element, "smart" strings, etc.](https://lxml.de/xpathxslt.html#xpath-return-values) objects are passed to the decorated functions.
 
 
 === "Python"
@@ -24,10 +24,10 @@ Instead of ElementHandle objects when using Playwright as parser, [Element, "sma
     def result_url(href):
         return {"url": href} # (2)
     
-    
-    # Option to get url using cssselect
-    @select(css="a.url", priority=2)
-    def result_url(element):
+
+    """Option to get url using cssselect"""  # style.css hides a comment
+    @select(css="a.url")
+    def result_url_css(element):
         return {"url_css": element.attrib["href"]} # (3)
     
     
@@ -44,7 +44,7 @@ Instead of ElementHandle objects when using Playwright as parser, [Element, "sma
 
 ## Running Dude with lxml 
 
-You can run lxml parser using the `--lxml` command-line argument or `parser="lxml"` parameter to `run()`.
+You can run lxml parser backend using the `--lxml` command-line argument or `parser="lxml"` parameter to `run()`.
 
 
 === "Terminal"
diff --git a/docs/advanced/12_pyppeteer.md b/docs/advanced/12_pyppeteer.md
@@ -1,6 +1,6 @@
 # Pyppeteer Scraper
 
-Option to use [Pyppeteer](https://github.com/pyppeteer/pyppeteer) as parser instead of Playwright has been added in [Release 0.8.0](https://github.com/roniemartinez/dude/releases/tag/0.8.0).
+Option to use [Pyppeteer](https://github.com/pyppeteer/pyppeteer) as parser backend instead of Playwright has been added in [Release 0.8.0](https://github.com/roniemartinez/dude/releases/tag/0.8.0).
 Pyppeteer is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -14,7 +14,7 @@ Pyppeteer is an optional dependency and can only be installed via `pip` using th
 
 ## Required changes to your script in order to use Pyppeteer
 
-Instead of Playwright's `ElementHandle` objects when using Playwright as parser, Pyppeteer has its own `ElementHandle` objects that are passed to the decorated functions.
+Instead of Playwright's `ElementHandle` objects when using Playwright as parser backend, Pyppeteer has its own `ElementHandle` objects that are passed to the decorated functions.
 The decorated functions will need to accept 2 arguments, `element` and `page` objects. 
 This is needed because Pyppeteer elements do not expose a convenient function to get the text content.
 
@@ -46,7 +46,7 @@ This is needed because Pyppeteer elements does not expose a convenient function
 
 ## Running Dude with Pyppeteer 
 
-You can run Pyppeteer parser using the `--pyppeteer` command-line argument or `parser="pyppeteer"` parameter to `run()`.
+You can run Pyppeteer parser backend using the `--pyppeteer` command-line argument or `parser="pyppeteer"` parameter to `run()`.
 
 === "Terminal"
 
diff --git a/docs/advanced/13_selenium.md b/docs/advanced/13_selenium.md
@@ -1,6 +1,6 @@
 # Selenium Scraper
 
-Option to use [Selenium](https://github.com/SeleniumHQ/Selenium) as parser instead of Playwright has been added in [Release 0.9.0](https://github.com/roniemartinez/dude/releases/tag/0.9.0).
+Option to use [Selenium](https://github.com/SeleniumHQ/Selenium) as parser backend instead of Playwright has been added in [Release 0.9.0](https://github.com/roniemartinez/dude/releases/tag/0.9.0).
 Selenium is an optional dependency and can only be installed via `pip` using the command below.
 
 === "Terminal"
@@ -11,7 +11,7 @@ Selenium is an optional dependency and can only be installed via `pip` using the
 
 ## Required changes to your script in order to use Selenium
 
-Instead of Playwright's `ElementHandle` objects when using Playwright as parser, `WebElement` objects are passed to the decorated functions.
+Instead of Playwright's `ElementHandle` objects when using Playwright as parser backend, `WebElement` objects are passed to the decorated functions.
 
 === "Python"
 
@@ -31,7 +31,7 @@ Instead of Playwright's `ElementHandle` objects when using Playwright as parser,
 
 ## Running Dude with Selenium 
 
-You can run Selenium parser using the `--selenium` command-line argument or `parser="selenium"` parameter to `run()`.
+You can run Selenium parser backend using the `--selenium` command-line argument or `parser="selenium"` parameter to `run()`.
 
 === "Terminal"
 
diff --git a/docs/features.md b/docs/features.md
@@ -9,7 +9,7 @@
 - Navigate function - enable navigation steps to move to other pages.
 - Custom storage - option to save data to other formats or database.
 - Async support - write async handlers.
-- Option to use other parsers aside from Playwright.
+- Option to use other parser backends aside from Playwright.
     - [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - `pip install pydude[bs4]`
     - [Parsel](https://github.com/scrapy/parsel) - `pip install pydude[parsel]`
     - [lxml](https://lxml.de/) - `pip install pydude[lxml]`
diff --git a/docs/supported_parser_backends/index.md b/docs/supported_parser_backends/index.md
@@ -1,9 +1,9 @@
-# Supported Parsers
+# Supported Parser Backends
 
-By default, Dude uses Playwright but gives you an option to use parsers that you are familiar with.
+By default, Dude uses Playwright but gives you an option to use parser backends that you are familiar with.
 It is possible to use parser backends like [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [Parsel](https://github.com/scrapy/parsel) and [lxml](https://lxml.de/).
 
-Here is the summary of features supported by each parser.
+Here is the summary of features supported by each parser backend.
 
 <table>
 <thead>
diff --git a/docs/supported_parser_backends/migrating.md b/docs/supported_parser_backends/migrating.md
diff --git a/dude/optional/lxml_scraper.py b/dude/optional/lxml_scraper.py
@@ -16,7 +16,7 @@
 
 class LxmlScraper(ScraperAbstract):
     """
-    Scraper using lxml parser and HTTPX for requests
+    Scraper using lxml parser backend and HTTPX for requests
     """
 
     def run(
diff --git a/dude/optional/parsel_scraper.py b/dude/optional/parsel_scraper.py
@@ -15,7 +15,7 @@
 
 class ParselScraper(ScraperAbstract):
     """
-    Scraper using Parsel parser and HTTPX for requests
+    Scraper using Parsel parser backend and HTTPX for requests
     """
 
     def run(
diff --git a/dude/playwright_scraper.py b/dude/playwright_scraper.py
@@ -1,7 +1,7 @@
 import asyncio
 import itertools
 import logging
-from typing import Any, AsyncIterable, Callable, Iterable, Optional, Sequence, Tuple, Union
+from typing import Any, AsyncIterable, Callable, Dict, Iterable, Optional, Sequence, Tuple, Union
 
 from playwright import async_api, sync_api
 from playwright.async_api import async_playwright
@@ -141,6 +141,13 @@ async def navigate_async(self, page: async_api.Page = None) -> bool:
                 return True
         return False
 
+    @staticmethod
+    def _get_launch_kwargs(browser_type: str) -> Dict[str, Any]:
+        args = []
+        if browser_type == "chromium":
+            args.append("--disable-notifications")
+        return {"args": args, "firefox_user_prefs": {"dom.webnotifications.enabled": False}}
+
     def _run_sync(
         self,
         urls: Sequence[str],
@@ -151,14 +158,10 @@ def _run_sync(
         output: Optional[str],
         format: str,
     ) -> None:
+        launch_kwargs = self._get_launch_kwargs(browser_type)
         # FIXME: Coverage fails to cover anything within this context manager block
         with sync_playwright() as p:
-            args = []
-            if browser_type == "chromium":
-                args.append("--disable-notifications")
-            browser = p[browser_type].launch(
-                headless=headless, proxy=proxy, args=args, firefox_user_prefs={"dom.webnotifications.enabled": False}
-            )
+            browser = p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)
             page = browser.new_page()
             self._scrape_sync(page, urls, pages)
             browser.close()
@@ -186,13 +189,9 @@ async def _run_async(
         output: Optional[str],
         format: str,
     ) -> None:
+        launch_kwargs = self._get_launch_kwargs(browser_type)
         async with async_playwright() as p:
-            args = []
-            if browser_type == "chromium":
-                args.append("--disable-notifications")
-            browser = await p[browser_type].launch(
-                headless=headless, proxy=proxy, args=args, firefox_user_prefs={"dom.webnotifications.enabled": False}
-            )
+            browser = await p[browser_type].launch(headless=headless, proxy=proxy, **launch_kwargs)
             page = await browser.new_page()
             for url in urls:
                 await page.goto(url)
diff --git a/dude/scraper.py b/dude/scraper.py
@@ -25,15 +25,15 @@ def run(
         browser_type: str = "chromium",
     ) -> None:
         """
-        Convenience method to handle switching between different types of parsers.
+        Convenience method to handle switching between different types of parser backends.
 
         :param urls: List of website URLs.
         :param pages: Maximum number of pages to crawl before exiting (default=1). This is only used when a navigate handler is defined. # noqa
         :param proxy: Proxy settings.
         :param output: Output file. If not provided, prints in the terminal.
         :param format: Output file format. If not provided, uses the extension of the output file or defaults to json.
 
-        :param parser: Parser type ["playwright" (default), "bs4", "parsel, "lxml", "pyppeteer" or "selenium"]
+        :param parser: Parser backend ["playwright" (default), "bs4", "parsel", "lxml", "pyppeteer" or "selenium"]
         :param headless: Enables headless browser. (default=True)
         :param browser_type: Playwright supported browser types ("chromium", "chrome", "webkit", or "firefox").
         """
diff --git a/dude/storage.py b/dude/storage.py
@@ -27,7 +27,7 @@ def _save_json(data: List[Dict], output: str) -> None:  # pragma: no cover
 
     with open(output, "w") as f:
         json.dump(data, f, indent=2)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
 
 
 def save_csv(data: List[Dict], output: Optional[str]) -> bool:
@@ -79,12 +79,12 @@ def _save_csv(data: List[Dict], output: str) -> None:  # pragma: no cover
         writer = csv.DictWriter(f, fieldnames=headers)
         writer.writeheader()
         writer.writerows(rows)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
 
 
 def _save_yaml(data: List[Dict], output: str) -> None:  # pragma: no cover
     import yaml
 
     with open(output, "w") as f:
         yaml.safe_dump(data, f)
-    logger.info("Data saved to %s", output)
+    logger.info("%d items saved to %s.", len(data), output)
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -41,9 +41,9 @@ nav:
       - lxml Scraper: advanced/11_lxml.md
       - Pyppeteer Scraper: advanced/12_pyppeteer.md
       - Selenium Scraper: advanced/13_selenium.md
-  - Supported Parsers:
-      - supported_parsers/index.md
-      - Migrating Your Web Scrapers to Dude: supported_parsers/migrating.md
+  - Supported Parser Backends:
+      - supported_parser_backends/index.md
+      - Migrating Your Web Scrapers to Dude: supported_parser_backends/migrating.md
   - cli.md
   - reference.md