news-watch: Indonesia's top news websites scraper

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites. It supports keyword and date filtering for targeted research.

⚠️ Ethical Considerations & Disclaimer ⚠️

Purpose: This project is intended for educational and research purposes only. It is not designed for commercial use that could be detrimental to the news source providers.

User Responsibility:

  • Users of this software are solely responsible for their actions and must comply with the Terms of Service and robots.txt file of each news website they intend to scrape.
  • Aggressive scraping or any use that violates a website's terms may lead to IP blocking or other consequences from the website owners.
  • We strongly advise users to scrape responsibly, respect website limitations, and avoid overloading servers.
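As a rough illustration of the robots.txt point above, the sketch below uses Python's standard `urllib.robotparser` to test whether a URL may be fetched. It is not part of news-watch; the helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Crawl-delay: 5
"""

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/news/article-1"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/search?q=ihsg"))
```

In practice the rules would come from each site's own robots.txt file, and the Terms of Service still need to be read separately, since robots.txt does not cover them.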

Installation

You can install news-watch via pip:

pip install news-watch

To install the development version:

pip install git+https://github.com/okkymabruri/news-watch.git@dev

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v

Command-Line Arguments

--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").

--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2025-01-01).

--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers will be used by default.

--output_format, -of: Optional. The output format (currently supports csv and xlsx).

--verbose, -v: Optional. Show all logging output (silent by default).

--list_scrapers: Optional. List supported scrapers.
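For scripted runs, the arguments above can be assembled in Python before calling the CLI. This is a minimal sketch that relies only on the flags documented in this section; the helper `build_newswatch_command` is hypothetical and not part of the package.

```python
import shlex
from datetime import datetime


def build_newswatch_command(keywords, start_date, scrapers=None,
                            output_format=None, verbose=False):
    """Build an argument list for the newswatch CLI from the documented flags."""
    # Raises ValueError if start_date does not parse as a year-month-day date.
    datetime.strptime(start_date, "%Y-%m-%d")
    if not keywords:
        raise ValueError("at least one keyword is required")

    cmd = ["newswatch", "--keywords", ",".join(keywords),
           "--start_date", start_date]
    if scrapers:
        cmd += ["--scrapers", ",".join(scrapers)]
    if output_format:
        if output_format not in ("csv", "xlsx"):
            raise ValueError("output_format must be 'csv' or 'xlsx'")
        cmd += ["--output_format", output_format]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    command = build_newswatch_command(["ihsg", "bank"], "2025-01-01",
                                      scrapers=["detik"], output_format="xlsx")
    # Print the shell-quoted form; nothing is executed here.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once news-watch is installed.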

Examples

Scrape articles related to "ihsg" from January 1st, 2025:

newswatch --keywords ihsg --start_date 2025-01-01

Scrape articles for multiple keywords (ihsg, bank, keuangan) with verbose logging:

newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 -v

List supported scrapers:

newswatch --list_scrapers

Scrape articles from a specific news website (detik) and save the output as an Excel file:

newswatch -k "ihsg" -sd 2025-01-01 -s "detik" --output_format xlsx

Run on Google Colab

You can run news-watch on Google Colab.

Output

The scraped articles are saved as a CSV or XLSX file in the current working directory, named news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

  • title
  • publish_date
  • author
  • content
  • keyword
  • category
  • source
  • link
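A short sketch of working with a CSV output file using only the standard library: it locates the most recently modified file matching the naming pattern above and counts articles per source. The helper names are hypothetical, and the sketch assumes the CSV is UTF-8 encoded with a header row that uses the column names listed above.

```python
import csv
from collections import Counter
from pathlib import Path


def find_latest_output(directory="."):
    """Return the most recently modified news-watch CSV in directory, or None."""
    candidates = sorted(Path(directory).glob("news-watch-*.csv"),
                        key=lambda path: path.stat().st_mtime)
    return candidates[-1] if candidates else None


def count_by_source(csv_path):
    """Count rows per value of the 'source' column."""
    # Assumption: UTF-8 encoding and a header row containing 'source'.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row["source"] for row in reader)


if __name__ == "__main__":
    latest = find_latest_output()
    if latest is None:
        print("No news-watch CSV output found in the current directory.")
    else:
        for source, total in count_by_source(latest).most_common():
            print(f"{source}: {total}")
```

The same pattern works for any other column, for example grouping by keyword instead of source.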

Supported Websites

Note:

  • Running the Kontan.co.id and Jawapos scrapers from cloud environments currently fails because of Cloudflare restrictions.
  • The Kontan.co.id scraper can process a maximum of 50 pages.

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.

Citation

If you use this software, please cite it using the following:

@software{mabruri_newswatch,
  author       = {Okky Mabruri},
  title        = {news-watch},
  version      = {0.2.2},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.14912258},
  url          = {https://doi.org/10.5281/zenodo.14912258}
}

Available on Zenodo: https://doi.org/10.5281/zenodo.14912258
