## Deprecated: use https://github.com/noanchovies/scraper-engine instead

This version was cool but not good enough. scraper-engine is faster, plug & play, "more better". The old version is kept here for the log and for learning purposes.
# base-scraper-py

A generic base template project for building web scrapers using Python, Selenium, BeautifulSoup, and Typer. Designed to be easily copied and adapted for various scraping targets.
Note: To quickly brief AI assistants (like Google Gemini) on this template's structure and how to adapt it, refer to the `AI_CONTEXT.txt` file in the project root. It includes a summary and an adaptation checklist for next steps.
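For orientation, the files referenced throughout this README imply a layout roughly like the following (the `__init__.py` is an assumption; everything else is named in the text):

```
base-scraper-py/
├── src/
│   └── basescraper/
│       ├── __init__.py     # assumed, for the package import
│       ├── config.py       # configuration defaults
│       ├── scraper.py      # core logic: extract_data, handle_data
│       └── cli.py          # Typer CLI entry point
├── .env                    # optional environment overrides
├── AI_CONTEXT.txt          # AI assistant briefing
├── HOW_TO_USE_BASE.txt     # adaptation guide
├── LICENSE
├── pyproject.toml
└── requirements.txt
```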
## Features

- Selenium WebDriver: Uses Selenium with `webdriver-manager` for automated browser control, capable of handling dynamic, JavaScript-heavy websites. Headless mode is configurable.
- HTML Parsing: Integrates BeautifulSoup for parsing the HTML obtained via Selenium.
- Configurable: Easily configure target URLs, output filenames, wait times, and headless mode via `src/basescraper/config.py`, `.env` files, or command-line arguments.
- CLI Interface: Uses Typer to provide a clean command-line interface for running the scraper.
- Modular Structure: Separates concerns into configuration (`config.py`), core scraping logic (`scraper.py`), and CLI (`cli.py`) within a standard `src` layout.
- Placeholder Implementation: The core data extraction (`extract_data`) and data handling (`handle_data`) functions are provided as clear placeholders (raising `NotImplementedError`) that must be implemented for each specific scraping project. A sketch of the overall flow follows this list.
- Structured Output (Example): Includes an optional pattern for saving data to CSV using Pandas (a `save_to_csv` call commented out within the `handle_data` placeholder).
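To make the features above concrete, here is a minimal sketch of the flow the template describes: Selenium (via webdriver-manager) fetches a rendered page, and the two placeholder functions gate the project-specific work. `fetch_page_source` and its internals are illustrative names, not necessarily the template's actual ones; `extract_data` and `handle_data` are the names the template uses.

```python
from __future__ import annotations  # allows list[dict] annotations on Python 3.8

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def fetch_page_source(url: str, headless: bool = True) -> str:
    """Drive a real Chrome browser so JavaScript-rendered content is present."""
    options = Options()
    if headless:
        options.add_argument("--headless=new")  # older Chrome versions use --headless
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def extract_data(page_source: str) -> list[dict]:
    """Placeholder: parse page_source and return a list of record dicts."""
    raise NotImplementedError("Implement per target site; see HOW_TO_USE_BASE.txt.")


def handle_data(data: list[dict], output_target: str) -> None:
    """Placeholder: persist the records returned by extract_data."""
    # Optional CSV pattern (mirrors the commented-out save_to_csv idea):
    # import pandas as pd
    # pd.DataFrame(data).to_csv(output_target, index=False)
    raise NotImplementedError("Implement per project; see HOW_TO_USE_BASE.txt.")
```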
## Tech Stack

- Language: Python 3.8+
- Browser Automation: Selenium
- Driver Management: webdriver-manager
- HTML Parsing: BeautifulSoup4
- Data Handling (Example): Pandas (for the CSV-saving pattern)
- CLI: Typer, Rich
- Configuration: python-dotenv
- Packaging: setuptools, pyproject.toml
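Assuming the stack above maps one-to-one to pip packages (this README does not show the actual `requirements.txt`), a minimal, unpinned version might look like:

```
selenium
webdriver-manager
beautifulsoup4
pandas
typer
rich
python-dotenv
```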
## Getting Started

- Copy Template: Create a new project by copying this entire `base-scraper-py` directory.
- Navigate: `cd` into your new project directory.
- Create Virtual Environment: `python -m venv venv`
- Activate Environment:
  - Windows: `.\venv\Scripts\activate`
  - macOS/Linux: `source venv/bin/activate`
- Install Dependencies: `pip install -r requirements.txt`
- (Optional) Git Init: If desired, delete the copied `.git` folder, run `git init`, create a new remote repository, and link it (`git remote add origin <url>`).
- Implement Logic: Follow the detailed steps in `HOW_TO_USE_BASE.txt` to implement the required `extract_data` and `handle_data` functions within `src/basescraper/scraper.py` for your specific target website.
- Configure: Set your target URL and other parameters in `.env` or `src/basescraper/config.py`.
- Run from CLI:
  - Option A (run as a module): `python -m src.basescraper.cli run [OPTIONS]`
  - Option B (if installed editable via `pip install -e .`): `basescraper run [OPTIONS]`
- CLI Options: Use `--help` to see the available options (a Typer sketch of the `run` command follows this list):

  ```
  python -m src.basescraper.cli run --help
  # or
  basescraper run --help
  ```

  Example:

  ```
  basescraper run --url "your-target-url.com" -o "my_output.csv" --no-headless
  ```
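For reference, a minimal sketch of how the `run` command and the options shown in the example above could be wired with Typer. The template's real `src/basescraper/cli.py` may differ; the defaults shown here are assumptions.

```python
import typer

app = typer.Typer()


@app.callback()
def main() -> None:
    """Base scraper CLI (an explicit callback keeps `run` as a named subcommand)."""


@app.command()
def run(
    url: str = typer.Option(..., "--url", help="Target URL to scrape."),
    output: str = typer.Option("output.csv", "--output", "-o", help="Output file path."),
    headless: bool = typer.Option(True, "--headless/--no-headless", help="Run the browser headless."),
) -> None:
    """Fetch the page, extract records, and hand them off for output."""
    typer.echo(f"Scraping {url} -> {output} (headless={headless})")
    # Hypothetical wiring to the pipeline sketched earlier:
    # page_source = fetch_page_source(url, headless)
    # handle_data(extract_data(page_source), output)


if __name__ == "__main__":
    app()
```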
## Adapting the Template

The core adaptation steps are detailed in `HOW_TO_USE_BASE.txt`. Adaptation primarily involves implementing two functions (an example sketch follows this list):

- `extract_data(page_source)`: Add logic using BeautifulSoup selectors to parse the HTML (`page_source`) from your target site and return a list of dictionaries.
- `handle_data(data, output_target)`: Add logic to process the list of dictionaries returned by `extract_data` (e.g., save to CSV, a database, or an API).
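As a hedged illustration only (the selectors, field names, and page structure are all invented, not part of the template), the two functions might look like this for a site that renders each record as a `div.item` block:

```python
from __future__ import annotations  # allows list[dict] annotations on Python 3.8

from bs4 import BeautifulSoup
import pandas as pd


def extract_data(page_source: str) -> list[dict]:
    """Parse a hypothetical listing page into a list of record dicts."""
    soup = BeautifulSoup(page_source, "html.parser")
    records = []
    for item in soup.select("div.item"):  # hypothetical selector for the target site
        title = item.select_one("h2.title")    # hypothetical field
        price = item.select_one("span.price")  # hypothetical field
        records.append(
            {
                "title": title.get_text(strip=True) if title else None,
                "price": price.get_text(strip=True) if price else None,
            }
        )
    return records


def handle_data(data: list[dict], output_target: str) -> None:
    """Save the records as CSV, following the save_to_csv pattern noted above."""
    pd.DataFrame(data).to_csv(output_target, index=False)
```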
## License

MIT License. (Update the `LICENSE` file and `pyproject.toml` if using a different license.)
## Contributing

(Add details here if you plan for others to contribute, or how to contact you.)