
- 🤖 Introduction
- ⚙️ Tech Stack
- 🔋 Features
- 🤸 Quick Start
- 🕸️ Snippets
- 🕸️ Challenges
## 🤖 Introduction

This repository contains the source code of an API that scrapes data from the Open Food Facts website. The API was built as part of the Junior Developer selection process at Devnology.
## ⚙️ Tech Stack

- NestJS
- TypeScript
- Puppeteer
## 🔋 Features

👉 **Fetch Products**: Retrieve products filtered by specific criteria, such as Nutri-Score and NOVA classification.

👉 **Product Details**: Get comprehensive details of a specific product, fetched by ID, including nutritional information, ingredients, Nutri-Score classification, etc.
## 🤸 Quick Start

Follow these steps to set up the project locally on your machine.
### Prerequisites

Make sure you have the following installed on your machine:

- Git
- Node.js
- npm (Node Package Manager)
- NestJS CLI
### Cloning the Repository

```bash
git clone https://github.com/Steravy/web-crawler.git
cd web-crawler
```
### Installation

Install the project dependencies using npm:

```bash
npm install
```
### Running the Project

```bash
npm run dev
```
### Getting Familiar with the API

In your browser, visit http://localhost:5000/api to open the Swagger UI, which contains the full API documentation.
## 🕸️ Snippets

Open an API testing tool such as Postman, JetClient, or good old cURL. I will demonstrate using cURL in the following snippets.
### Fetch Products

```bash
curl -H "Accept: application/json" 'http://localhost:5000/products?nutrition=A&nova=1' | jq
```
```json
[
  {
    "id": "3155250349793",
    "name": "Creme Chantilly Président - 250 g (241 ml)",
    "nutrition": {
      "score": "D",
      "title": "Qualidade nutricional baixa"
    },
    "nova": {
      "score": "4",
      "title": "Alimentos ultra-processados"
    }
  },
  {
    "id": "3046920010603",
    "name": "Chocolate meio amargo com framboesa - Lindt - 100 g e",
    "nutrition": {
      "score": "E",
      "title": "Má qualidade nutricional"
    },
    "nova": {
      "score": "4",
      "title": "Alimentos ultra-processados"
    }
  }
]
```
### Product Details

```bash
curl -H "Accept: application/json" 'http://localhost:5000/products/7891167011724' | jq
```
```json
{
  "title": "Futuro Burger - Fazenda Futuro - 230 g",
  "quantity": "230 g",
  "ingredients": {
    "hasPalmOil": "unknown",
    "isVegan": false,
    "isVegetarian": false,
    "list": [
      "Água, preparado proteico (proteína texturizada de soja, proteína isolada de soja e proteína de ervilha), gordura de coco, óleo de canola, aroma natural, estabilizante metilcelulose, sal, beterraba em pó e corante carvão vegetal."
    ]
  },
  "nutrition": {
    "score": "D",
    "values": [
      [
        "moderate",
        "Gorduras/lípidos em quantidade moderada (11.9%)"
      ],
      [
        "high",
        "Gorduras/lípidos/ácidos gordos saturados em quantidade elevada (8%)"
      ],
      [
        "low",
        "Açúcares em quantidade baixa (0%)"
      ]
    ],
    "servingSize": "80 g",
    "data": {
      "Energia": {
        "per100g": "814 kj(194 kcal)",
        "perServing": "651 kj(155 kcal)"
      },
      "Gorduras/lípidos": {
        "per100g": "11,9 g",
        "perServing": "9,5 g"
      },
      "Carboidratos": {
        "per100g": "7,88 g",
        "perServing": "6,3 g"
      },
      "Fibra alimentar": {
        "per100g": "?",
        "perServing": "?"
      },
      "Proteínas": {
        "per100g": "13,8 g",
        "perServing": "11 g"
      },
      "Sal": {
        "per100g": "0,565 g",
        "perServing": "0,452 g"
      }
    }
  },
  "nova": {
    "score": 4,
    "title": "Alimentos ultra-processados"
  }
}
```
## 🕸️ Challenges

The first challenge was time: I received the task very close to the deadline. Still, I knew I could pull it off, and here is the result.

Another challenge was the dynamic nature of the Open Food Facts website; since its pages are not static, I had to use a headless browser to scrape the data. I used Puppeteer for this purpose, and it performed very well. The element patterns vary from product to product, so I had to establish a set of scraping rules, and discovering a pattern that worked reliably took some time. While there is still much work to be done, the deadline had to be respected.

Scraping data from the website proved to be the most challenging aspect of the project, yet it taught me a great deal about data extraction. Open Food Facts is a comprehensive website, but its complex structure and deeply nested markup make scraping it a bit challenging.
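To illustrate the rule-based extraction idea mentioned above, here is a minimal TypeScript sketch. The `ScrapeRule` type, the selector strings, and the class-name patterns are hypothetical examples for illustration only, not the exact code used in this repository:

```typescript
// Sketch of rule-based extraction: try several selector candidates,
// then normalize whatever raw attribute value is found.
// All names and patterns here are illustrative assumptions.

interface ScrapeRule {
  // CSS selector candidates, tried in order until one matches on the page.
  selectors: string[];
  // Turns the matched element's raw attribute/text into a clean value.
  extract: (raw: string) => string | null;
}

// Assume the Nutri-Score is encoded in class names like "grade_d"
// or in image paths like "nutriscore-d.svg" (hypothetical patterns).
const nutriScoreRule: ScrapeRule = {
  selectors: ['[class*="grade_"]', 'img[src*="nutriscore"]'],
  extract: (raw) => {
    const match = raw.match(/(?:grade_|nutriscore-)([a-e])/i);
    return match ? match[1].toUpperCase() : null;
  },
};

// Applying the rule to raw values as they might come out of Puppeteer.
const samples = ['grade_d big_pad', '/images/nutriscore-e.svg', 'no-score'];
const scores = samples.map((s) => nutriScoreRule.extract(s));
console.log(scores); // [ 'D', 'E', null ]
```

In the real scraper, Puppeteer would evaluate each selector candidate against the loaded page and pass the matched element's attribute into `extract`; the fallback list is what makes the rule tolerant of the per-product markup differences described above.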