|
1 |
| -## PlaywrightCrawler template |
| 1 | +# YouTube Autocomplete Scraper |
2 | 2 |
|
3 |
| -This template is a production ready boilerplate for developing an [Actor](https://apify.com/actors) with `PlaywrightCrawler`. Use this to bootstrap your projects using the most up-to-date code. |
| 3 | +A TypeScript library for scraping YouTube's autocomplete suggestions with intelligent deduplication. |
4 | 4 |
|
5 |
| -> We decided to split Apify SDK into two libraries, Crawlee and Apify SDK v3. Crawlee will retain all the crawling and scraping-related tools and will always strive to be the best [web scraping](https://apify.com/web-scraping) library for its community. At the same time, Apify SDK will continue to exist, but keep only the Apify-specific features related to building actors on the Apify platform. Read the upgrading guide to learn about the changes. |
6 |
| -> |
| 5 | +## Features |
7 | 6 |
|
8 |
| -## Resources |
| 7 | +- Scrapes YouTube's autocomplete API to get search suggestions |
| 8 | +- Uses pglite for efficient similarity filtering |
| 9 | +- Removes near-duplicate suggestions using trigram similarity |
| 10 | +- Configurable similarity threshold |
| 11 | +- TypeScript support |
| 12 | +- Ready to deploy on the Apify platform |
9 | 13 |
|
10 |
| -If you're looking for examples or want to learn more visit: |
| 14 | +## Installation |
11 | 15 |
|
12 |
| -- [Crawlee + Apify Platform guide](https://crawlee.dev/docs/guides/apify-platform) |
13 |
| -- [Documentation](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler) and [examples](https://crawlee.dev/docs/examples/playwright-crawler) |
14 |
| -- [Node.js tutorials](https://docs.apify.com/academy/node-js) in Academy |
15 |
| -- [Scraping single-page applications with Playwright](https://blog.apify.com/scraping-single-page-applications-with-playwright/) |
16 |
| -- [How to scale Puppeteer and Playwright](https://blog.apify.com/how-to-scale-puppeteer-and-playwright/) |
17 |
| -- [Integration with Zapier](https://apify.com/integrations), Make, GitHub, Google Drive and other apps |
18 |
| -- [Video guide on getting scraped data using Apify API](https://www.youtube.com/watch?v=ViYYDHSBAKM) |
19 |
| -- A short guide on how to build web scrapers using code templates: |
| 16 | +```bash |
| 17 | +git clone https://github.com/yourusername/youtube-autocomplete-scraper.git |
| 18 | +cd youtube-autocomplete-scraper |
| 19 | +pnpm install |
| 20 | +``` |
20 | 21 |
|
21 |
| -[web scraper template](https://www.youtube.com/watch?v=u-i-Korzf8w) |
| 22 | +## Usage |
22 | 23 |
|
| 24 | +There are two ways to use this scraper: |
23 | 25 |
|
24 |
| -## Getting started |
| 26 | +### 1. Local Development |
25 | 27 |
|
26 |
| -For complete information [see this article](https://docs.apify.com/platform/actors/development#build-actor-locally). To run the actor use the following command: |
| 28 | +Run the scraper locally by setting the required environment variables and using `pnpm start`: |
27 | 29 |
|
28 | 30 | ```bash
|
29 |
| -apify run |
| 31 | +# Set your input |
| 32 | +export INPUT='{"query": "how to make"}' |
| 33 | + |
| 34 | +# Run the scraper |
| 35 | +pnpm start |
| 36 | +``` |
| 37 | + |
| 38 | +The scraper will output results to the console and save them in the `apify_storage` directory. |
| 39 | + |
| 40 | +### 2. Deploy to Apify |
| 41 | + |
| 42 | +This scraper is designed to run on the Apify platform. To deploy: |
| 43 | + |
| 44 | +1. Push this code to your Apify Actor |
| 45 | +2. Set the input JSON in the Apify Console: |
| 46 | + |
| 47 | +```json |
| 48 | +{ |
| 49 | + "query": "how to make", |
| 50 | + "similarityThreshold": 0.7, |
| 51 | + "maxResults": 100, |
| 52 | + "language": "en", |
| 53 | + "region": "US" |
| 54 | +} |
30 | 55 | ```
|
31 | 56 |
|
32 |
| -## Deploy to Apify |
| 57 | +## How it Works |
33 | 58 |
|
34 |
| -### Connect Git repository to Apify |
| 59 | +Under the hood, this scraper does a few key things: |
35 | 60 |
|
36 |
| -If you've created a Git repository for the project, you can easily connect to Apify: |
| 61 | +1. **API Querying**: Makes requests to YouTube's internal autocomplete API endpoint to get raw suggestions |
37 | 62 |
|
38 |
| -1. Go to [Actor creation page](https://console.apify.com/actors/new) |
39 |
| -2. Click on **Link Git Repository** button |
| 63 | +2. **Deduplication**: Uses pglite (a WebAssembly build of Postgres that runs in-process) to filter out near-duplicate results: |
40 | 64 |
|
41 |
| -### Push project on your local machine to Apify |
| 65 | + - Converts suggestions to trigrams (3-character sequences) |
| 66 | + - Calculates similarity scores between suggestions using trigram matching |
| 67 | + - Filters out suggestions that are too similar based on a configurable threshold |
| 68 | + - For example, "how to cook pasta" and "how to cook noodles" might be considered unique, while "how to make pancake" and "how to make pancakes" would be filtered as duplicates |
42 | 69 |
|
43 |
| -You can also deploy the project on your local machine to Apify without the need for the Git repository. |
| 70 | +3. **Result Processing**: Cleans and normalizes the suggestions before returning them |
44 | 71 |
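The deduplication step above can be sketched in plain TypeScript. This is only an illustration and not the scraper's actual code, which runs the comparison inside pglite. It assumes the similarity measure behaves like Postgres's `pg_trgm` (shared trigrams divided by total distinct trigrams, with each word padded by two leading spaces and one trailing space). The function names `trigrams`, `similarity`, and `dedupe` are hypothetical, and the "at or above the threshold" comparison is an assumption.

```typescript
/** Build the set of trigrams for a string, padding each word the way pg_trgm does. */
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  // Lowercase and split on anything that is not an ASCII letter or digit.
  // (Assumption: ASCII-only input keeps the sketch short.)
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `; // two leading spaces, one trailing space
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

/** Similarity in [0, 1]: shared trigrams divided by all distinct trigrams. */
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

/** Keep a suggestion only if it is not too similar to one already kept. */
function dedupe(suggestions: string[], threshold = 0.7): string[] {
  const kept: string[] = [];
  for (const candidate of suggestions) {
    const isDuplicate = kept.some((k) => similarity(k, candidate) >= threshold);
    if (!isDuplicate) kept.push(candidate);
  }
  return kept;
}
```

Under these assumptions, "how to make pancake" and "how to make pancakes" score 0.9 and the second is dropped at a threshold of 0.7, while "how to cook pasta" and "how to cook noodles" score about 0.46 and are both kept.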
|
45 |
| -1. Log in to Apify. You will need to provide your [Apify API Token](https://console.apify.com/account/integrations) to complete this action. |
| 72 | +## Input Schema |
| 73 | + |
| 74 | +The scraper accepts the following input parameters: |
| 75 | + |
| 76 | +```typescript |
| 77 | +interface Input { |
| 78 | + query: string // The search query to get suggestions for |
| 79 | + similarityThreshold?: number // How similar suggestions need to be to be considered duplicates (0-1) |
| 80 | + maxResults?: number // Maximum number of suggestions to return |
| 81 | + language?: string // Language code for suggestions |
| 82 | + region?: string // Region code for suggestions |
| 83 | +} |
| 84 | +``` |
46 | 85 |
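As a rough sketch of how the `INPUT` JSON could be parsed and validated against this interface: the `parseInput` helper and the `DEFAULTS` object below are hypothetical, and the default values are simply borrowed from the sample input JSON shown earlier. The scraper's real defaults and validation rules may differ.

```typescript
interface Input {
  query: string;
  similarityThreshold?: number;
  maxResults?: number;
  language?: string;
  region?: string;
}

// Hypothetical defaults taken from the sample input; not confirmed by the scraper's code.
const DEFAULTS = {
  similarityThreshold: 0.7,
  maxResults: 100,
  language: "en",
  region: "US",
};

/** Parse a raw JSON string, fill in missing optional fields, and validate ranges. */
function parseInput(raw: string): Required<Input> {
  const parsed = JSON.parse(raw) as Partial<Input>;
  if (typeof parsed.query !== "string" || parsed.query.trim() === "") {
    throw new Error("`query` must be a non-empty string");
  }
  const input: Required<Input> = {
    query: parsed.query.trim(),
    similarityThreshold: parsed.similarityThreshold ?? DEFAULTS.similarityThreshold,
    maxResults: parsed.maxResults ?? DEFAULTS.maxResults,
    language: parsed.language ?? DEFAULTS.language,
    region: parsed.region ?? DEFAULTS.region,
  };
  // The interface documents the threshold as a value between 0 and 1.
  if (!(input.similarityThreshold >= 0 && input.similarityThreshold <= 1)) {
    throw new Error("`similarityThreshold` must be between 0 and 1");
  }
  if (!Number.isInteger(input.maxResults) || input.maxResults < 1) {
    throw new Error("`maxResults` must be a positive integer");
  }
  return input;
}
```

For local runs, this could be called as `parseInput(process.env.INPUT ?? "{}")`, which fails fast with a clear message when `query` is missing.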
|
47 |
| - ```bash |
48 |
| - apify login |
49 |
| - ``` |
| 86 | +## Output |
50 | 87 |
|
51 |
| -2. Deploy your Actor. This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my). |
| 88 | +The scraper outputs an array of unique autocomplete suggestions. Results are saved to the default dataset in Apify storage and can be accessed via the Apify API or console. |
52 | 89 |
|
53 |
| - ```bash |
54 |
| - apify push |
55 |
| - ``` |
| 90 | +## Contributing |
56 | 91 |
|
57 |
| -## Documentation reference |
| 92 | +Contributions are welcome! Please feel free to submit a Pull Request. |
58 | 93 |
|
59 |
| -To learn more about Apify and Actors, take a look at the following resources: |
| 94 | +## License |
60 | 95 |
|
61 |
| -- [Apify SDK for JavaScript documentation](https://docs.apify.com/sdk/js) |
62 |
| -- [Apify SDK for Python documentation](https://docs.apify.com/sdk/python) |
63 |
| -- [Apify Platform documentation](https://docs.apify.com/platform) |
64 |
| -- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU) |
| 96 | +MIT |