
- 🤖 Introduction
- ⚙️ Tech Stack
- 🔋 Features
- 🤸 Quick Start
- 🕸️ Snippets
- 🕸️ Challenges
## 🤖 Introduction

This repository contains the source code of an API that scrapes data from the Open Food Facts website. The API was built as part of the Junior Developer selection process at Devnology.
## ⚙️ Tech Stack

- NestJS
- TypeScript
- Puppeteer
## 🔋 Features

👉 **Fetch Products**: Retrieve products filtered by specific criteria, such as Nutri-Score and NOVA classification.

👉 **Product Details**: Get comprehensive details of a specific product, fetched by ID, including nutritional information, ingredients, Nutri-Score classification, etc.
## 🤸 Quick Start

Follow these steps to set up the project locally on your machine.
### Prerequisites

Make sure you have the following installed on your machine:

- Git
- Node.js
- npm (Node Package Manager)
- NestJS CLI
### Cloning the Repository

```bash
git clone https://github.com/Steravy/web-crawler.git
cd web-crawler
```
### Installation

Install the project dependencies using npm:

```bash
npm install
```
### Running the Project

```bash
npm run dev
```
### Getting Familiar with the API

In your browser, visit http://localhost:5000/api to open the Swagger UI, which contains the full API documentation.
## 🕸️ Snippets

Open an API testing tool such as Postman, JetClient, or good old cURL. I will demonstrate using cURL in the following snippets.
### Fetch Products

```bash
curl -H "Accept: application/json" 'http://localhost:5000/products?nutrition=A&nova=1' | jq
```
```json
[
  {
    "id": "3155250349793",
    "name": "Creme Chantilly Président - 250 g (241 ml)",
    "nutrition": {
      "score": "D",
      "title": "Qualidade nutricional baixa"
    },
    "nova": {
      "score": "4",
      "title": "Alimentos ultra-processados"
    }
  },
  {
    "id": "3046920010603",
    "name": "Chocolate meio amargo com framboesa - Lindt - 100 g e",
    "nutrition": {
      "score": "E",
      "title": "Má qualidade nutricional"
    },
    "nova": {
      "score": "4",
      "title": "Alimentos ultra-processados"
    }
  }
]
```
### Product Details

```bash
curl -H "Accept: application/json" 'http://localhost:5000/products/7891167011724' | jq
```
```json
{
  "title": "Futuro Burger - Fazenda Futuro - 230 g",
  "quantity": "230 g",
  "ingredients": {
    "hasPalmOil": "unknown",
    "isVegan": false,
    "isVegetarian": false,
    "list": [
      "Água, preparado proteico (proteína texturizada de soja, proteína isolada de soja e proteína de ervilha), gordura de coco, óleo de canola, aroma natural, estabilizante metilcelulose, sal, beterraba em pó e corante carvão vegetal."
    ]
  },
  "nutrition": {
    "score": "D",
    "values": [
      [
        "moderate",
        "Gorduras/lípidos em quantidade moderada (11.9%)"
      ],
      [
        "high",
        "Gorduras/lípidos/ácidos gordos saturados em quantidade elevada (8%)"
      ],
      [
        "low",
        "Açúcares em quantidade baixa (0%)"
      ]
    ],
    "servingSize": "80 g",
    "data": {
      "Energia": {
        "per100g": "814 kj(194 kcal)",
        "perServing": "651 kj(155 kcal)"
      },
      "Gorduras/lípidos": {
        "per100g": "11,9 g",
        "perServing": "9,5 g"
      },
      "Carboidratos": {
        "per100g": "7,88 g",
        "perServing": "6,3 g"
      },
      "Fibra alimentar": {
        "per100g": "?",
        "perServing": "?"
      },
      "Proteínas": {
        "per100g": "13,8 g",
        "perServing": "11 g"
      },
      "Sal": {
        "per100g": "0,565 g",
        "perServing": "0,452 g"
      }
    }
  },
  "nova": {
    "score": 4,
    "title": "Alimentos ultra-processados"
  }
}
```
## 🕸️ Challenges

The first challenge was time: I received the task very close to the deadline. Still, I knew I could pull it off, and here is the result.

Another challenge was the dynamic nature of the Open Food Facts website; since its pages are not static, I had to use a headless browser to scrape the data. I used Puppeteer for this purpose, and it performed very well. The element patterns vary from product to product, so I had to establish a set of scraping rules, and discovering a pattern that worked reliably took some time. While there is still much work to be done, the deadline had to be respected.

Scraping data from the website proved to be the most challenging aspect of the project, yet it taught me a great deal about data extraction. Open Food Facts is a comprehensive website, but its complex structure and deeply nested markup make scraping it a bit challenging.
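To illustrate the rule-based extraction idea mentioned above, here is a minimal TypeScript sketch. The `ScrapeRule` type, the selector strings, and the class-name patterns are hypothetical examples for illustration only, not the exact code used in this repository:

```typescript
// Sketch of rule-based extraction: try several selector candidates,
// then normalize whatever raw attribute value is found.
// All names and patterns here are illustrative assumptions.

interface ScrapeRule {
  // CSS selector candidates, tried in order until one matches on the page.
  selectors: string[];
  // Turns the matched element's raw attribute/text into a clean value.
  extract: (raw: string) => string | null;
}

// Assume the Nutri-Score is encoded in class names like "grade_d"
// or in image paths like "nutriscore-d.svg" (hypothetical patterns).
const nutriScoreRule: ScrapeRule = {
  selectors: ['[class*="grade_"]', 'img[src*="nutriscore"]'],
  extract: (raw) => {
    const match = raw.match(/(?:grade_|nutriscore-)([a-e])/i);
    return match ? match[1].toUpperCase() : null;
  },
};

// Applying the rule to raw values as they might come out of Puppeteer.
const samples = ['grade_d big_pad', '/images/nutriscore-e.svg', 'no-score'];
const scores = samples.map((s) => nutriScoreRule.extract(s));
console.log(scores); // [ 'D', 'E', null ]
```

In the real scraper, Puppeteer would evaluate each selector candidate against the loaded page and pass the matched element's attribute into `extract`; the fallback list is what makes the rule tolerant of the per-product markup differences described above.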