news-watch is a Python package that scrapes structured news data from Indonesia's top news websites, with keyword and date filtering for targeted research.
Purpose: This project is intended for educational and research purposes only. It is not designed for commercial use that could be detrimental to the news source providers.
User Responsibility:
- Users of this software are solely responsible for their actions and must comply with the Terms of Service and robots.txt file of each news website they intend to scrape.
- Aggressive scraping or any use that violates a website's terms may lead to IP blocking or other consequences from the website owners.
- We strongly advise users to scrape responsibly, respect website limitations, and avoid overloading servers.
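As a rough illustration of the robots.txt check mentioned above, the sketch below uses Python's standard `urllib.robotparser` to test whether a URL may be fetched. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are hypothetical and are not part of news-watch; this is one possible way to check, not how the package itself behaves.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/news/article-1"))
    print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/page"))
```

In practice you would download each site's actual robots.txt (for example with `RobotFileParser.set_url` followed by `read`) and also read the site's Terms of Service, which robots.txt does not replace.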
You can install news-watch via pip:
pip install news-watch
To install the development version:
pip install git+https://github.com/okkymabruri/news-watch.git@dev
To run the scraper from the command line:
newswatch -k <keywords> -sd <start_date> -s [<scrapers>] -of <output_format> -v
Command-Line Arguments
--keywords, -k: Required. A comma-separated list of keywords to scrape (e.g., -k "ojk,bank,npl").
--start_date, -sd: Required. The start date for scraping in YYYY-MM-DD format (e.g., -sd 2025-01-01).
--scrapers, -s: Optional. A comma-separated list of scrapers to use (e.g., -s "kompas,viva"). If not provided, all scrapers are used by default.
--output_format, -of: Optional. The output format (currently supports csv and xlsx).
--verbose, -v: Optional. Show all logging output (silent by default).
--list_scrapers: Optional. List supported scrapers.
Scrape articles related to "ihsg" from January 1st, 2025:
newswatch --keywords ihsg --start_date 2025-01-01
Scrape articles for multiple keywords (ihsg, bank, keuangan) with verbose logging:
newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 -v
List supported scrapers:
newswatch --list_scrapers
Scrape articles from a specific news website (detik) and save the output as an Excel file:
newswatch -k "ihsg" -sd 2025-01-01 -s "detik" --output_format xlsx
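If you want to drive the command line from a Python script, the examples above could be assembled programmatically. The sketch below only builds the argument list from the flags documented in this README. The helper name `build_newswatch_command` is hypothetical, and the date check is an added convenience rather than something the CLI is known to require in this form.

```python
import shlex
from datetime import datetime
from typing import List, Optional, Sequence


def build_newswatch_command(
    keywords: Sequence[str],
    start_date: str,
    scrapers: Optional[Sequence[str]] = None,
    output_format: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble a newswatch argument list from the documented flags."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Raises ValueError if start_date is not in YYYY-MM-DD format.
    datetime.strptime(start_date, "%Y-%m-%d")

    command = ["newswatch", "--keywords", ",".join(keywords),
               "--start_date", start_date]
    if scrapers:
        command += ["--scrapers", ",".join(scrapers)]
    if output_format:
        command += ["--output_format", output_format]
    if verbose:
        command.append("--verbose")
    return command


if __name__ == "__main__":
    cmd = build_newswatch_command(
        ["ihsg", "bank"], "2025-01-01", scrapers=["detik"], output_format="xlsx"
    )
    print(shlex.join(cmd))
    # To actually run the scraper (requires news-watch to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when keywords contain spaces.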
You can also run news-watch on Google Colab.
The scraped articles are saved as a CSV or XLSX file in the current working directory, with a name in the format news-watch-{keywords}-YYYYMMDD_HH.
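Since the file name includes a timestamp, a script that post-processes results may need to locate the most recent output. The sketch below is one way to do that, assuming only that output files start with `news-watch-` and end in `.csv` or `.xlsx`. The helper name `latest_output` is hypothetical, and the choice to rank by modification time rather than by parsing the name is an assumption of this example.

```python
from pathlib import Path
from typing import Optional


def latest_output(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified news-watch CSV/XLSX file, or None."""
    candidates = [
        path
        for path in Path(directory).glob("news-watch-*")
        if path.suffix.lower() in {".csv", ".xlsx"}
    ]
    # max() with default=None handles the case where nothing has been scraped yet.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(latest_output("."))
```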
The output file contains the following columns:
title
publish_date
author
content
keyword
category
source
link
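To show how these columns might be used downstream, here is a minimal sketch that reads a CSV output with the standard `csv` module, checks that the columns listed above are present, and counts articles per source. The function names and the sample rows are hypothetical; real output files will contain actual scraped articles, and their encoding is assumed here to be UTF-8.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Column names as listed in this README.
EXPECTED_COLUMNS = ["title", "publish_date", "author", "content",
                    "keyword", "category", "source", "link"]


def read_articles(handle: TextIO) -> List[Dict[str, str]]:
    """Read article rows and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def count_by_source(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Count how many articles came from each news source."""
    return dict(Counter(row["source"] for row in rows))


# Hypothetical sample data, for illustration only.
SAMPLE_CSV = (
    "title,publish_date,author,content,keyword,category,source,link\n"
    "Title A,2025-01-02,Author A,Text A,ihsg,market,detik,https://example.com/a\n"
    "Title B,2025-01-03,Author B,Text B,ihsg,market,kompas,https://example.com/b\n"
    "Title C,2025-01-04,Author C,Text C,bank,finance,detik,https://example.com/c\n"
)

if __name__ == "__main__":
    articles = read_articles(io.StringIO(SAMPLE_CSV))
    print(count_by_source(articles))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.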
- Bisnis.com
- Bloomberg Technoz
- CNBC Indonesia
- Detik.com
- Jawapos.com
- Katadata.co.id
- Kompas.com
- Kontan.co.id
- Metrotvnews.com
- Tempo.co
- Viva.co.id
Note:
- Running the Kontan.co.id and Jawapos.com scrapers in cloud environments currently leads to errors due to Cloudflare restrictions.
- Limitation: the Kontan.co.id scraper can process a maximum of 50 pages.
Contributions are welcome! If you'd like to add support for more websites or improve the existing code, please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details. The authors assume no liability for misuse of this software.
If you use this software, please cite it using the following:
@software{mabruri_newswatch,
author = {Okky Mabruri},
title = {news-watch},
version = {0.2.2},
year = {2025},
publisher = {Zenodo},
doi = {10.5281/zenodo.14912258},
url = {https://doi.org/10.5281/zenodo.14912258}
}