Aioscpy

A powerful, high-performance asynchronous web crawling and scraping framework built on Python's asyncio ecosystem.

English | 中文

Overview

Aioscpy is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages. It draws inspiration from Scrapy and scrapy_redis but is designed from the ground up to leverage the full power of asynchronous programming.

Key Features

  • Fully Asynchronous: Built on Python's asyncio for high-performance concurrent operations
  • Scrapy-like API: Familiar API for those coming from Scrapy
  • Distributed Crawling: Support for distributed crawling using Redis
  • Multiple HTTP Backends: Support for aiohttp, httpx, and requests
  • Dynamic Variable Injection: Powerful dependency injection system
  • Flexible Middleware System: Customizable request/response processing pipeline
  • Robust Item Processing: Pipeline for processing scraped items

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Installation

Basic Installation

pip install aioscpy

With All Dependencies

pip install aioscpy[all]

With Specific HTTP Backends

pip install aioscpy[aiohttp,httpx]

Latest Version from GitHub

pip install git+https://github.com/ihandmine/aioscpy

Quick Start

Creating a New Project

aioscpy startproject myproject
cd myproject

Creating a Spider

aioscpy genspider myspider

This will create a basic spider in the spiders directory.

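The resulting layout looks roughly like this (illustrative; the exact scaffold files may differ between versions):

myproject/
├── settings.py        # project-wide settings (see Configuration below)
└── spiders/
    └── myspider.py    # spider created by `aioscpy genspider myspider`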

Example Spider

from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Creating a Single Spider Script

aioscpy onespider single_quotes

Advanced Spider Example

from aioscpy.spider import Spider
from anti_header import Header
from pprint import pprint, pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        # Attach randomized, realistic headers before the request is sent.
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        # Hand the request back (instead of the response) when the status
        # indicates a failure, so the page is fetched again.
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        # Propagate download errors rather than swallowing them.
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()

Running Spiders

# Run a spider from a project
aioscpy crawl quotes

# Run a single spider script
aioscpy runspider quotes.py


Running from Python Code

from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings

# Method 1: Load all spiders from a directory
def load_spiders_from_directory():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./spiders')
    process.start()

# Method 2: Run a specific spider by name
def run_specific_spider():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('myspider')
    process.start()

if __name__ == '__main__':
    run_specific_spider()

Configuration

Aioscpy can be configured through the settings.py file in your project. Here are the most important settings:

Concurrency Settings

# Maximum number of concurrent items being processed
CONCURRENT_ITEMS = 100

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum number of concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 0

Download Settings

# Delay between requests (in seconds)
DOWNLOAD_DELAY = 0

# Timeout for requests (in seconds)
DOWNLOAD_TIMEOUT = 20

# Whether to randomize the download delay
RANDOMIZE_DOWNLOAD_DELAY = True

# HTTP backend to use
DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.httpx.HttpxDownloadHandler"
# Other options:
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler"
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.requests.RequestsDownloadHandler"

Scheduler Settings

# Scheduler to use (memory-based or Redis-based)
SCHEDULER = "aioscpy.core.scheduler.memory.MemoryScheduler"
# For distributed crawling:
# SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"

# Redis connection settings (for Redis scheduler)
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"
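
Individual spiders can override these values through the custom_settings attribute used in the spider examples above. A minimal sketch, assuming such per-spider overrides (the spider name and values are purely illustrative):

from aioscpy.spider import Spider


class GentleSpider(Spider):
    name = 'gentle'
    # Per-spider overrides; the keys match the settings.py names above.
    custom_settings = {
        "CONCURRENT_REQUESTS": 4,
        "DOWNLOAD_DELAY": 1,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
    }
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}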

Response API

Aioscpy provides a rich API for working with responses:

Extracting Data

# Using CSS selectors
title = response.css('title::text').get()
all_links = response.css('a::attr(href)').getall()

# Using XPath
title = response.xpath('//title/text()').get()
all_links = response.xpath('//a/@href').getall()

Following Links

# Follow a link
yield response.follow('next-page.html', self.parse)

# Follow a link with a callback
yield response.follow('details.html', self.parse_details)

# Follow all links matching a CSS selector
yield from response.follow_all(css='a.product::attr(href)', callback=self.parse_product)
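
Each followed link is fetched and passed to the given callback as a new response. A hedged sketch of the parse_product callback referenced above, assuming it lives in the same spider class (the CSS selectors are placeholders for a real product page):

# Defined inside the same Spider subclass as the follow_all call above
async def parse_product(self, response):
    # Called once per followed product page; extract fields and yield an item.
    yield {
        'name': response.css('h1::text').get(),         # placeholder selector
        'price': response.css('span.price::text').get(),  # placeholder selector
    }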

More Commands

aioscpy -h

Distributed Crawling

To enable distributed crawling with Redis:

  1. Configure Redis in settings:

SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"

  2. Run multiple instances of your spider on different machines, all connecting to the same Redis server.
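
A hedged sketch of a spider set up for distributed crawling, assuming the custom_settings mechanism shown earlier can also override the scheduler settings per spider (the class name and values are illustrative):

from aioscpy.spider import Spider


class DistributedQuotesSpider(Spider):
    name = 'quotes_distributed'
    custom_settings = {
        # Switch from the in-memory scheduler to the Redis-backed one.
        "SCHEDULER": "aioscpy.core.scheduler.redis.RedisScheduler",
        "REDIS_URI": "redis://localhost:6379",
        # "%(spider)s" is filled in with the spider name.
        "QUEUE_KEY": "%(spider)s:queue",
    }
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

Each machine then runs this same spider (for example, aioscpy crawl quotes_distributed), with the shared Redis queue coordinating the work.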

Contributing

Please submit your suggestions by opening an issue.

Thanks

  • aiohttp
  • scrapy
  • loguru
  • httpx