RPA Challenge Documentation

Overview

This documentation provides an overview and implementation details of a Robocorp RPA bot built with Python. The bot scrapes articles from the Los Angeles Times website based on specific topics and keywords, then saves the data to an Excel file.

File Structure

  • tasks.py: Entry point for Robocorp tasks.
  • src/main.py: Main script to initiate the scraping process.
  • src/scraper.py: Contains the Scraper class responsible for web scraping.
  • src/article.py: Defines the Article class to store article details.
  • src/browser_manager.py: Manages browser operations using Selenium.
  • src/utils/logger.py: Handles logging functionality.
  • src/utils/date_validator.py: Provides date validation and filtering functionality.
  • output/: Folder containing logs, images, and Excel files.

Running the Robot

VS Code

  1. Install the Robocorp Code extension for VS Code.
  2. The extension gives you an easy-to-use side panel and powerful command-palette commands for running, debugging, code completion, docs, and more.

Command line

  1. Install RCC.
  2. Use the command: rcc run

How to Run Manually

To execute the bot manually, run the rpa_challenge task defined in tasks.py from VS Code.
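
For reference, a minimal sketch of what the tasks.py entry point could look like, assuming the task name rpa_challenge and a main() function in src/main.py as described below:

```python
# tasks.py: hypothetical sketch of the Robocorp entry point
from robocorp.tasks import task

from src.main import main


@task
def rpa_challenge():
    """Task picked up by `rcc run` and the Robocorp Code extension."""
    main()
```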

Results

🚀 After running the bot, check out the following directories in the output folder:

  • Excel Files: output/files/excel - Contains the Excel files with scraped article data.
  • Images: output/files/images - Contains the images downloaded from the articles.

🚀 Also check out log.html in the output folder for the execution log.

Main.py

Function: main()

  • Purpose: Initiates the RPA Challenge by setting up the logging, defining scraping parameters, and running the Scraper.

Parameters

  • url: Target URL (https://www.latimes.com)
  • search_phrase: Keyword to search for (e.g., "trump")
  • topic: Topic/category for filtering articles (e.g., "sports")
  • months: Number of past months to consider (e.g., 6)
  • categories: Dictionary of categories and their associated keywords

Steps

  1. Initialize logger: Sets up logging for the script.
  2. Define scraping parameters: Defines the URL, search phrase, topic, months, and categories.
  3. Instantiate Scraper: Creates a Scraper object with the defined parameters.
  4. Run Scraper: Calls the run() method of the Scraper object to start the scraping process.
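
Putting these steps together, a hedged sketch of main() might look as follows. The setup_logger helper name and the category keyword lists are illustrative assumptions, while the parameter values mirror the examples above:

```python
# src/main.py: illustrative sketch; helper names are assumed, not confirmed
from src.scraper import Scraper
from src.utils.logger import setup_logger  # hypothetical helper name


def main():
    # 1. Initialize logger
    logger = setup_logger()
    logger.info("Starting the RPA Challenge scraper")

    # 2. Define scraping parameters (example values from this document)
    url = "https://www.latimes.com"
    search_phrase = "trump"
    topic = "sports"
    months = 6
    categories = {
        # category -> associated keywords; these keywords are invented examples
        "sports": ["sports", "game", "team"],
    }

    # 3. Instantiate Scraper with the defined parameters
    scraper = Scraper(url, search_phrase, topic, months, categories)

    # 4. Run the scraping process
    scraper.run()
```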

Scraper.py

Class: Scraper

  • Purpose: Handles web scraping functionality.

Attributes

  • url: Target URL
  • search_phrase: Keyword to search for
  • topic: Topic/category for filtering articles
  • months: Number of past months to consider
  • valid_months: List of valid months for filtering
  • browser_manager: BrowserManager object
  • keep_going_to_next_page: Flag to control pagination
  • articles: List to store scraped articles
  • categories: Dictionary of categories and their associated keywords

Methods

  • search_phrase_handler(): Handles the search functionality on the website.
  • apply_topic_filters(all_topic_elements, category_keywords): Applies topic filters based on category keywords.
  • filter_by_category(): Filters articles based on the selected topic.
  • sort_by_newest(): Sorts search results from newest to oldest.
  • next_page(): Navigates to the next page of search results.
  • get_articles(): Scrapes article details from the current page.
  • save_articles(): Saves scraped articles to an Excel file.
  • run(): Main method to run the scraping process.
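
To show how these methods fit together, here is a hypothetical skeleton of the class. The orchestration in run() and the pandas-based Excel export are assumptions inferred from the descriptions above; the remaining method bodies are stubbed out:

```python
# Hypothetical skeleton of the Scraper class; bodies other than run() are stubs
import pandas as pd

from src.browser_manager import BrowserManager


class Scraper:
    def __init__(self, url, search_phrase, topic, months, categories):
        self.url = url
        self.search_phrase = search_phrase
        self.topic = topic
        self.months = months
        self.categories = categories
        self.browser_manager = BrowserManager()
        self.keep_going_to_next_page = True  # cleared when articles fall out of range
        self.articles = []

    def run(self):
        """Assumed orchestration of the methods listed above."""
        self.browser_manager.open_browser(self.url)
        self.search_phrase_handler()         # submit the search phrase
        self.filter_by_category()            # apply the selected topic filter
        self.sort_by_newest()                # newest results first
        while self.keep_going_to_next_page:  # paginate through the results
            self.get_articles()              # collect Article objects from the page
            self.next_page()
        self.save_articles()                 # write self.articles to Excel
        self.browser_manager.close_browser()

    def save_articles(self):
        # Assumed pandas export; the real columns and path may differ
        df = pd.DataFrame([vars(a) for a in self.articles])
        df.to_excel("output/files/excel/articles.xlsx", index=False)

    # Implemented in src/scraper.py; shown here only as stubs
    def search_phrase_handler(self): ...
    def filter_by_category(self): ...
    def sort_by_newest(self): ...
    def next_page(self): ...
    def get_articles(self): ...
```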

Article.py

Class: Article

  • Purpose: Represents an article object with its details.

Attributes

  • title: Article title
  • date: Article publication date
  • description: Article description
  • picture_filename: Filename of the article image
  • title_description_search_count: Count of search keyword occurrences in title and description
  • contains_money: Flag indicating if the article mentions money
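
One plausible shape for this class, sketched as a dataclass. The analyze() helper and the money-matching regex are illustrative assumptions for how the two derived fields could be computed:

```python
# Hypothetical sketch of the Article class as a dataclass
import re
from dataclasses import dataclass

# Matches amounts like "$11,111.11" or "11 dollars" / "11 USD" (assumed patterns)
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(\.\d{1,2})?|\b\d+\s+(dollars|usd)\b", re.IGNORECASE
)


@dataclass
class Article:
    title: str
    date: str
    description: str
    picture_filename: str
    title_description_search_count: int = 0
    contains_money: bool = False

    def analyze(self, search_phrase: str) -> None:
        """Fill the two derived fields from the title and description."""
        text = f"{self.title} {self.description}"
        self.title_description_search_count = text.lower().count(search_phrase.lower())
        self.contains_money = bool(MONEY_PATTERN.search(text))
```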

Browser_manager.py

Class: BrowserManager

  • Purpose: Manages browser operations using Selenium.

Methods

  • open_browser(url): Opens a browser with the specified URL.
  • close_browser(): Closes the browser.
  • find_element(selector): Finds and returns a web element using the provided selector.
  • wait_until_element_is_visible(selector): Waits until the specified element is visible.
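
A minimal sketch of how these methods could map onto Selenium calls, assuming a Chrome driver and CSS selectors (the real class may use a different browser or locator strategy):

```python
# Hypothetical sketch of BrowserManager built on Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BrowserManager:
    def __init__(self, timeout: int = 10):
        self.driver = None
        self.timeout = timeout

    def open_browser(self, url: str) -> None:
        self.driver = webdriver.Chrome()
        self.driver.get(url)

    def close_browser(self) -> None:
        if self.driver:
            self.driver.quit()

    def find_element(self, selector: str):
        # Assumes CSS selectors; the real class may accept other strategies
        return self.driver.find_element(By.CSS_SELECTOR, selector)

    def wait_until_element_is_visible(self, selector: str) -> None:
        WebDriverWait(self.driver, self.timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, selector))
        )
```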

Utils

Logger.py

  • Purpose: Provides logging functionality.
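
A minimal sketch of the setup using Python's standard logging module; the setup_logger name and the log file path are assumptions:

```python
# Hypothetical sketch of src/utils/logger.py
import logging


def setup_logger(name: str = "rpa_challenge") -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler("output/scraper.log")  # assumed path
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```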

Date_validator.py

  • Purpose: Provides date validation and filtering, including the logic that determines which past months fall within the requested range.
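
One way such month handling could be implemented is sketched below. The function names, and the convention that months values of 0 and 1 both mean the current month only, are assumptions rather than details confirmed by the source:

```python
# Hypothetical sketch of the valid-months computation in date_validator.py
from datetime import date


def get_valid_months(months: int, today: date | None = None) -> list[tuple[int, int]]:
    """Return (year, month) pairs covering the current month and the
    previous months - 1 months; treats 0 like 1 (assumed convention)."""
    today = today or date.today()
    span = max(months, 1)
    year, month = today.year, today.month
    valid = []
    for _ in range(span):
        valid.append((year, month))
        month -= 1
        if month == 0:  # roll back into December of the previous year
            year, month = year - 1, 12
    return valid


def is_within_valid_months(article_date: date, months: int) -> bool:
    return (article_date.year, article_date.month) in get_valid_months(months)
```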

Dependencies

  • robocorp.tasks: For task definitions.
  • pandas: For data manipulation and Excel file handling.
  • requests: For making HTTP requests.
  • beautifulsoup4: For parsing HTML content.
  • selenium: For browser automation.
  • logging: For logging (part of the Python standard library).

Conclusion

This bot demonstrates the use of Robocorp, Python, and Selenium for web scraping tasks. It scrapes articles from the Los Angeles Times website based on specific topics and keywords, filters and sorts them, and saves the data in an Excel file.
