news-watch: Indonesia's top news websites scraper

news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filters for targeted research.

⚠️ Ethical Considerations & Disclaimer ⚠️

Purpose: This project is intended for educational and research purposes only. It is not designed for commercial use that could be detrimental to the news source providers.

User Responsibility:

Users of this software are solely responsible for their actions and must comply with the Terms of Service and robots.txt file of each news website they intend to scrape.

Aggressive scraping or any use that violates a website's terms may lead to IP blocking or other consequences from the website owners.

We strongly advise users to scrape responsibly, respect website limitations, and avoid overloading servers.
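One way to act on the robots.txt advice above is to check a URL against a site's rules before fetching it. The following is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the sample rules, and the example.com URLs are illustrative assumptions and are not part of news-watch.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/news/article-1"))  # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))    # False
```

For a live site, the same parser can load the rules itself with `set_url(...)` followed by `read()`. A site's Terms of Service still have to be reviewed separately.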

Installation

You can install news-watch via pip:

pip install news-watch

To install the development version:

pip install git+https://github.com/okkymabruri/news-watch.git@dev

Usage

To run the scraper from the command line:

newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v

Command-Line Arguments

--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").

--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2025-01-01).

--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers will be used by default.

--output_format, -of: Optional. The output file format; csv and xlsx are currently supported.

--verbose, -v: Optional. Show all logging output (silent by default).

--list_scrapers: Optional. List supported scrapers.
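The value formats described above (comma-separated lists and YYYY-MM-DD dates) can be checked before the values are passed to the CLI. This is a minimal sketch of such a check. The helper names `split_csv_arg` and `parse_start_date` are hypothetical and do not describe how news-watch parses its arguments internally.

```python
from datetime import date, datetime


def split_csv_arg(value: str) -> list[str]:
    """Split a comma-separated argument such as "ojk,bank,npl" into clean items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_start_date(value: str) -> date:
    """Parse a year-month-day string; raises ValueError if it does not match."""
    return datetime.strptime(value, "%Y-%m-%d").date()


print(split_csv_arg("ojk,bank,npl"))   # ['ojk', 'bank', 'npl']
print(parse_start_date("2025-01-01"))  # 2025-01-01
```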

Examples

Scrape articles related to "ihsg" from January 1st, 2025:

newswatch --keywords ihsg --start_date 2025-01-01

Scrape articles for multiple keywords (ihsg, bank, keuangan) with verbose logging:

newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 -v

List supported scrapers:

newswatch --list_scrapers

Scrape articles from a specific news website (detik) and save them in Excel format:

newswatch -k "ihsg" -sd 2025-01-01 -s "detik" --output_format xlsx
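The commands above can also be assembled and launched from a Python script, for example to run several keyword sets in sequence. This sketch uses only the flags documented in this README. The function names `build_command` and `run_newswatch` are hypothetical wrappers, not part of the package.

```python
import subprocess


def build_command(keywords, start_date, scrapers=None, output_format=None, verbose=False):
    """Build the argument list for the newswatch CLI from the documented flags."""
    cmd = ["newswatch", "-k", ",".join(keywords), "-sd", start_date]
    if scrapers:
        cmd += ["-s", ",".join(scrapers)]
    if output_format:
        cmd += ["-of", output_format]
    if verbose:
        cmd.append("-v")
    return cmd


def run_newswatch(cmd):
    """Run the command; requires news-watch to be installed and on PATH."""
    return subprocess.run(cmd, check=True)


cmd = build_command(["ihsg", "bank"], "2025-01-01", scrapers=["detik"], output_format="xlsx")
print(cmd)
# run_newswatch(cmd)  # uncomment to actually start scraping
```

Passing the arguments as a list avoids shell quoting problems with keywords that contain spaces.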

Run on Google Colab

You can run news-watch on Google Colab.

Output

The scraped articles are saved as a CSV or XLSX file in the current working directory, with a filename of the form news-watch-{keywords}-YYYYMMDD_HH.

The output file contains the following columns:

title
publish_date
author
content
keyword
category
source
link
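A CSV output file with the columns listed above can be loaded and checked with Python's standard `csv` module. The following is a minimal sketch. The helper name `read_articles`, the column order, and the sample row (including its date format) are made up for illustration and may differ from real news-watch output.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "title", "publish_date", "author", "content",
    "keyword", "category", "source", "link",
]


def read_articles(handle):
    """Read article rows from an open CSV file and verify the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [col for col in EXPECTED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Made-up sample data standing in for a real output file
sample = io.StringIO(
    "title,publish_date,author,content,keyword,category,source,link\n"
    "Sample headline,2025-01-02 09:00:00,Jane Doe,Body text,ihsg,market,detik,https://example.com/a\n"
)
rows = read_articles(sample)
print(rows[0]["title"], rows[0]["source"])  # Sample headline detik
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_articles`.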

Supported Websites

Note:

Running Kontan.co.id and Jawapos on the cloud currently leads to errors due to Cloudflare restrictions.

Limitation: Kontan.co.id scraper can process a maximum of 50 pages.

Contributing

Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.

Citation

If you use this software, please cite it using the following:

@software{mabruri_newswatch,
  author       = {Okky Mabruri},
  title        = {news-watch},
  version      = {0.2.2},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.14912258},
  url          = {https://doi.org/10.5281/zenodo.14912258}
}

Available on Zenodo: https://doi.org/10.5281/zenodo.14912258
