This documentation provides an overview and implementation details of a Robocorp RPA bot built with Python. The bot scrapes articles from the Los Angeles Times website based on specific topics and keywords and saves the data to an Excel file.
- `tasks.py`: Entry point for Robocorp tasks.
- `src/main.py`: Main script to initiate the scraping process.
- `src/scraper.py`: Contains the Scraper class responsible for web scraping.
- `src/article.py`: Defines the Article class to store article details.
- `src/browser_manager.py`: Manages browser operations using Selenium.
- `src/utils/logger.py`: Handles logging functionalities.
- `src/utils/date_validator.py`: Provides date validation and filtering functionalities.
- `output/`: Folder containing logs, images, and Excel files.
- Get the Robocorp Code extension for VS Code.
- You'll get an easy-to-use side panel and powerful command-palette commands for running, debugging, code completion, docs, etc.
- Get RCC.
- Use the command: `rcc run`
To execute the bot, you can run the `rpa_challenge` task defined in `tasks.py` from VS Code (a minimal sketch of that task follows).
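As a rough illustration, a `tasks.py` that exposes the `rpa_challenge` task might look like the sketch below. The `from src.main import main` import is an assumption about how `src/main.py` exposes its entry function; only the `robocorp.tasks` decorator usage is standard.

```python
from robocorp.tasks import task

# Assumed import: src/main.py is expected to expose a main() entry function.
from src.main import main


@task
def rpa_challenge():
    """Run the LA Times scraping bot."""
    main()
```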
🚀 After running the bot, check out the following directories in the `output` folder:
- Excel Files: `output/files/excel` - Contains the Excel files with scraped article data.
- Images: `output/files/images` - Contains the images downloaded from the articles.
🚀 After running the bot, check out `log.html` under the `output` folder.
- Purpose: Initiates the RPA Challenge by setting up logging, defining the scraping parameters, and running the Scraper.
- `url`: Target URL (https://www.latimes.com)
- `search_phrase`: Keyword to search for (e.g., "trump")
- `topic`: Topic/category for filtering articles (e.g., "sports")
- `months`: Number of past months to consider (e.g., 6)
- `categories`: Dictionary of categories and their associated keywords
- Initialize logger: Sets up logging for the script.
- Define scraping parameters: Defines the URL, search phrase, topic, months, and categories.
- Instantiate Scraper: Creates a Scraper object with the defined parameters.
- Run Scraper: Calls the `run()` method of the Scraper object to start the scraping process (see the sketch below).
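Put together, `src/main.py` could look roughly like this minimal sketch. The `setup_logger()` helper, the keyword arguments of the `Scraper` constructor, and the example category keywords are assumptions, not the actual implementation.

```python
from src.scraper import Scraper
from src.utils.logger import setup_logger  # assumed helper name


def main():
    # Set up logging for the run.
    logger = setup_logger()

    # Scraping parameters.
    url = "https://www.latimes.com"
    search_phrase = "trump"
    topic = "sports"
    months = 6
    categories = {
        # Hypothetical example mapping of categories to keywords.
        "sports": ["lakers", "dodgers", "nba"],
    }

    logger.info("Starting LA Times scraper")

    # Instantiate the Scraper and start the scraping process.
    scraper = Scraper(
        url=url,
        search_phrase=search_phrase,
        topic=topic,
        months=months,
        categories=categories,
    )
    scraper.run()


if __name__ == "__main__":
    main()
```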
- Purpose: Handles web scraping functionalities.
- `url`: Target URL
- `search_phrase`: Keyword to search for
- `topic`: Topic/category for filtering articles
- `months`: Number of past months to consider
- `valid_months`: List of valid months for filtering
- `browser_manager`: BrowserManager object
- `keep_going_to_next_page`: Flag to control pagination
- `articles`: List to store scraped articles
- `categories`: Dictionary of categories and their associated keywords
- `search_phrase_handler()`: Handles the search functionality on the website.
- `apply_topic_filters(all_topic_elements, category_keywords)`: Applies topic filters based on category keywords.
- `filter_by_category()`: Filters articles based on the selected topic.
- `sort_by_newest()`: Sorts articles by newest.
- `next_page()`: Navigates to the next page of search results.
- `get_articles()`: Scrapes article details from the current page.
- `save_articles()`: Saves scraped articles to an Excel file.
- `run()`: Main method to run the scraping process (a class skeleton is sketched below).
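A skeleton of the `Scraper` class under the attribute and method names listed above. Most method bodies are placeholders; the `run()` orchestration order, the `get_valid_months` helper, and the Excel output path are assumptions for illustration only.

```python
import pandas as pd

from src.article import Article
from src.browser_manager import BrowserManager
from src.utils.date_validator import get_valid_months  # assumed helper name


class Scraper:
    def __init__(self, url, search_phrase, topic, months, categories):
        self.url = url
        self.search_phrase = search_phrase
        self.topic = topic
        self.months = months
        self.valid_months = get_valid_months(months)
        self.browser_manager = BrowserManager()
        self.keep_going_to_next_page = True
        self.articles: list[Article] = []
        self.categories = categories

    # Placeholder bodies; only save_articles() and run() are sketched further.
    def search_phrase_handler(self): ...
    def apply_topic_filters(self, all_topic_elements, category_keywords): ...
    def filter_by_category(self): ...
    def sort_by_newest(self): ...
    def get_articles(self): ...

    def next_page(self):
        # Placeholder: the real method would advance pagination and set
        # keep_going_to_next_page to False when no further pages exist.
        self.keep_going_to_next_page = False

    def save_articles(self):
        # Assumed output path; writes the scraped articles to Excel via pandas.
        df = pd.DataFrame([vars(a) for a in self.articles])
        df.to_excel("output/files/excel/articles.xlsx", index=False)

    def run(self):
        # Assumed orchestration order, based on the method descriptions above.
        self.browser_manager.open_browser(self.url)
        self.search_phrase_handler()
        self.filter_by_category()
        self.sort_by_newest()
        while self.keep_going_to_next_page:
            self.get_articles()
            self.next_page()
        self.browser_manager.close_browser()
        self.save_articles()
```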
- Purpose: Represents an article object with its details.
- `title`: Article title
- `date`: Article publication date
- `description`: Article description
- `picture_filename`: Filename of the article image
- `title_description_search_count`: Count of search keyword occurrences in title and description
- `contains_money`: Flag indicating if the article mentions money
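A possible shape for the `Article` class, sketched here as a dataclass; the field types and the dataclass form are assumptions, not necessarily how the project defines it.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str                           # Article title
    date: str                            # Publication date
    description: str                     # Article description
    picture_filename: str                # Filename of the downloaded image
    title_description_search_count: int  # Search phrase occurrences in title + description
    contains_money: bool                 # True if the title/description mentions money
```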
- Purpose: Manages browser operations using Selenium.
- `open_browser(url)`: Opens a browser with the specified URL.
- `close_browser()`: Closes the browser.
- `find_element(selector)`: Finds and returns a web element using the provided selector.
- `wait_until_element_is_visible(selector)`: Waits until the specified element is visible.
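A minimal sketch of `BrowserManager` built on Selenium's standard API. The choice of Chrome, the default timeout, and the use of CSS selectors are assumptions about the project's setup.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class BrowserManager:
    def __init__(self, timeout: int = 10):
        self.driver = None
        self.timeout = timeout

    def open_browser(self, url: str):
        # Launch Chrome and navigate to the target URL.
        self.driver = webdriver.Chrome()
        self.driver.get(url)

    def close_browser(self):
        if self.driver:
            self.driver.quit()

    def find_element(self, selector: str):
        # Assumes CSS selectors are used throughout the project.
        return self.driver.find_element(By.CSS_SELECTOR, selector)

    def wait_until_element_is_visible(self, selector: str):
        WebDriverWait(self.driver, self.timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, selector))
        )
```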
- Purpose: Provides logging functionalities.
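A minimal logger setup using Python's standard `logging` module; the `setup_logger` name and the log file location are assumptions for illustration.

```python
import logging


def setup_logger(name: str = "rpa_bot", log_file: str = "output/bot.log"):
    # Hypothetical helper: logs to a file under output/ and to the console.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(formatter)

    stream_handler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)

    logger.addHandler(file_handler)
    logger.addHandler(stream_handler)
    return logger
```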
- Purpose: Provides date validation and filtering functionalities, as well as the logic for computing the window of past months to retrieve.
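One way the month-window logic could work, sketched below: given the number of past months, build the list of (year, month) pairs an article date must fall into. The function names and the "0 or 1 means current month only" convention are assumptions, not the project's confirmed behavior.

```python
from datetime import date


def get_valid_months(months: int) -> list[tuple[int, int]]:
    # Assumption: 0 or 1 is treated as "the current month only".
    months = max(months, 1)
    today = date.today()
    valid = []
    year, month = today.year, today.month
    for _ in range(months):
        valid.append((year, month))
        month -= 1
        if month == 0:
            month = 12
            year -= 1
    return valid


def is_date_valid(article_date: date, valid_months: list[tuple[int, int]]) -> bool:
    # An article passes the filter if its (year, month) falls inside the window.
    return (article_date.year, article_date.month) in valid_months
```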
- `robocorp.tasks`: For task definitions.
- `pandas`: For data manipulation and Excel file handling.
- `requests`: For making HTTP requests.
- `beautifulsoup4`: For parsing HTML content.
- `selenium`: For browser automation.
- `logging`: For logging functionalities.
This bot demonstrates the use of Robocorp, Python, and Selenium for web scraping tasks. It scrapes articles from the Los Angeles Times website based on specific topics and keywords, filters and sorts them, and saves the data in an Excel file.