feat: add markdownify and localscraper
PeriniM committed Dec 5, 2024
1 parent ae1cde3 commit 6296510
Showing 14 changed files with 672 additions and 111 deletions.
31 changes: 28 additions & 3 deletions scrapegraph-py/CONTRIBUTING.md
@@ -13,11 +13,36 @@ Thank you for your interest in contributing to **ScrapeGraphAI**! We welcome con

## Getting Started

To get started with contributing, follow these steps:
### Development Setup

1. Fork the repository on GitHub **(FROM pre/beta branch)**.
2. Clone your forked repository to your local machine.
3. Install the necessary dependencies from requirements.txt or via pyproject.toml as you prefere :).
2. Clone your forked repository:
```bash
git clone https://github.com/ScrapeGraphAI/scrapegraph-sdk.git
cd scrapegraph-sdk/scrapegraph-py
```

3. Install dependencies using uv (recommended):
```bash
# Install uv if you haven't already
pip install uv

# Install dependencies
uv sync

# Install pre-commit hooks
uv run pre-commit install
```

4. Run tests:
```bash
# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_client.py
```

4. Make your changes or additions.
5. Test your changes thoroughly.
6. Commit your changes with descriptive commit messages.
203 changes: 107 additions & 96 deletions scrapegraph-py/README.md
@@ -6,164 +6,175 @@
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Documentation Status](https://readthedocs.org/projects/scrapegraph-py/badge/?version=latest)](https://scrapegraph-py.readthedocs.io/en/latest/?badge=latest)

Official Python SDK for the ScrapeGraph AI API - Smart web scraping powered by AI.

## 🚀 Features

- ✨ Smart web scraping with AI
- 🔄 Both sync and async clients
- 📊 Structured output with Pydantic schemas
- 🔍 Detailed logging with emojis
- ⚡ Automatic retries and error handling
- 🔐 Secure API authentication
Official Python SDK for the ScrapeGraph API - Smart web scraping powered by AI.

## 📦 Installation

### Using pip

```
```bash
pip install scrapegraph-py
```

### Using uv
## 🚀 Features

We recommend using [uv](https://docs.astral.sh/uv/) to install the dependencies and pre-commit hooks.
- 🤖 AI-powered web scraping
- 🔄 Both sync and async clients
- 📊 Structured output with Pydantic schemas
- 🔍 Detailed logging
- ⚡ Automatic retries
- 🔐 Secure authentication

```
# Install uv if you haven't already
pip install uv
## 🎯 Quick Start

# Install dependencies
uv sync
```python
from scrapegraph_py import Client

# Install pre-commit hooks
uv run pre-commit install
client = Client(api_key="your-api-key-here")
```

## 🔧 Quick Start

> [!NOTE]
> If you prefer, you can use the environment variables to configure the API key and load them using `load_dotenv()`
> You can set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()`
```python
from scrapegraph_py import SyncClient
from scrapegraph_py.logger import get_logger
## 📚 Available Endpoints

### 🔍 SmartScraper

# Enable debug logging
logger = get_logger(level="DEBUG")
Scrapes any webpage using AI to extract specific information.

```python
from scrapegraph_py import Client

# Initialize client
sgai_client = SyncClient(api_key="your-api-key-here")
client = Client(api_key="your-api-key-here")

# Make a request
response = sgai_client.smartscraper(
# Basic usage
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

print(response["result"])
```

## 🎯 Examples

### Async Usage

```python
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key-here") as sgai_client:
        response = await sgai_client.smartscraper(
            website_url="https://example.com",
            user_prompt="Summarize the main content"
        )
        print(response["result"])

asyncio.run(main())
print(response)
```

<details>
<summary><b>With Output Schema</b></summary>
<summary>Output Schema (Optional)</summary>

```python
from pydantic import BaseModel, Field
from scrapegraph_py import SyncClient
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

class WebsiteData(BaseModel):
    title: str = Field(description="The page title")
    description: str = Field(description="The meta description")

sgai_client = SyncClient(api_key="your-api-key-here")
response = sgai_client.smartscraper(
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the title and description",
    output_schema=WebsiteData
)

print(response["result"])
```

</details>

## 📚 Documentation
### 📝 Markdownify

For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com)
Converts any webpage into clean, formatted markdown.

## 🛠️ Development
```python
from scrapegraph_py import Client

### Setup
client = Client(api_key="your-api-key-here")

1. Clone the repository:
```
git clone https://github.com/ScrapeGraphAI/scrapegraph-sdk.git
cd scrapegraph-sdk/scrapegraph-py
```
response = client.markdownify(
    website_url="https://example.com"
)

2. Install dependencies:
```
uv sync
print(response)
```

3. Install pre-commit hooks:
```
uv run pre-commit install
```
### 💻 LocalScraper

### Running Tests
Extracts information from HTML content using AI.

```
# Run all tests
uv run pytest
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

html_content = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
        <div class="contact">
            <p>Email: contact@example.com</p>
        </div>
    </body>
</html>
"""

response = client.localscraper(
    user_prompt="Extract the company description",
    website_html=html_content
)
)

# Run specific test file
poetry run pytest tests/test_client.py
print(response)
```

## 📝 License
## ⚡ Async Support

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
All endpoints support async operations:

```python
import asyncio
from scrapegraph_py import AsyncClient

## 🤝 Contributing
async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
```
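
The `AsyncClient()` call in this block passes no `api_key`. The README text replaced by this commit noted that the client can be initialized without parameters when the `SGAI_API_KEY` environment variable is set. A minimal sketch of checking for that variable up front, so a missing key fails with a clear message; `resolve_api_key` is a hypothetical helper for illustration and is not part of the SDK:

```python
import os


def resolve_api_key(explicit_key=None, env_var="SGAI_API_KEY"):
    """Return the explicit key if given, otherwise the value of env_var.

    Hypothetical helper for illustration; not part of scrapegraph_py.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        # Fail early with a readable message instead of a later auth error.
        raise RuntimeError(f"Set {env_var} or pass an API key explicitly")
    return key
```

The returned value could then be passed as `api_key=` to `Client` or `AsyncClient`.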

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## 📖 Documentation

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
For detailed documentation, visit [scrapegraphai.com/docs](https://scrapegraphai.com/docs)

## 🔗 Links
## 🛠️ Development

- [Website](https://scrapegraphai.com)
- [Documentation](https://scrapegraphai.com/documentation)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)
For information about setting up the development environment and contributing to the project, see our [Contributing Guide](CONTRIBUTING.md).

## 💬 Support
## 💬 Support & Feedback

- 📧 Email: support@scrapegraphai.com
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/new)
- ⭐ API Feedback: You can also submit feedback programmatically using the feedback endpoint:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

client.submit_feedback(
    request_id="your-request-id",
    rating=5,
    feedback_text="Great results!"
)
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- [Website](https://scrapegraphai.com)
- [Documentation](https://scrapegraphai.com/docs)
- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk)

---

37 changes: 37 additions & 0 deletions scrapegraph-py/examples/async_markdownify_example.py
@@ -0,0 +1,37 @@
import asyncio

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")


async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # Concurrent markdownify requests
    urls = [
        "https://scrapegraphai.com/",
        "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
    ]

    tasks = [sgai_client.markdownify(website_url=url) for url in urls]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for {urls[i]}: {response}")
        else:
            print(f"\nPage {i+1} Markdown:")
            print(f"URL: {urls[i]}")
            print(f"Result: {response['result']}")

    await sgai_client.close()


if __name__ == "__main__":
    asyncio.run(main())
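
The `return_exceptions=True` pattern used in this example can be tried without an API key. A minimal sketch, where `fake_markdownify` is a hypothetical stand-in for `sgai_client.markdownify` and the failing URL is invented for illustration:

```python
import asyncio


async def fake_markdownify(website_url):
    # Hypothetical stand-in for sgai_client.markdownify; no network access.
    await asyncio.sleep(0)
    if "bad" in website_url:
        raise ValueError(f"cannot fetch {website_url}")
    return {"result": f"# Markdown for {website_url}"}


async def fetch_all(urls):
    tasks = [fake_markdownify(website_url=url) for url in urls]
    # With return_exceptions=True, exceptions come back in the result list
    # (in the same order as the inputs) instead of being raised, so one
    # failing URL does not hide the other results.
    responses = await asyncio.gather(*tasks, return_exceptions=True)
    ok, failed = {}, {}
    for url, response in zip(urls, responses):
        if isinstance(response, Exception):
            failed[url] = response
        else:
            ok[url] = response["result"]
    return ok, failed


ok, failed = asyncio.run(fetch_all(["https://example.com", "https://bad.example"]))
```

Pairing URLs with responses through `zip` relies on `asyncio.gather` returning results in input order, which avoids indexing back into `urls`.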
31 changes: 31 additions & 0 deletions scrapegraph-py/examples/localscraper_example.py
@@ -0,0 +1,31 @@
from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Initialize the client
sgai_client = Client(api_key="your-api-key-here")

# Example HTML content
html_content = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
        <div class="contact">
            <p>Email: contact@example.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""

# LocalScraper request
response = sgai_client.localscraper(
    user_prompt="Extract the company description and contact information",
    website_html=html_content,
)

# Print the response
print(f"Request ID: {response['request_id']}")
print(f"Result: {response['result']}")
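
Since `localscraper` takes the HTML as a string, the same call may also work on a page saved to disk. A minimal sketch, assuming the file is UTF-8 encoded; `scrape_local_file` is a hypothetical helper, and `client` is any object exposing `localscraper(user_prompt=..., website_html=...)` as in this example:

```python
from pathlib import Path


def scrape_local_file(client, path, user_prompt):
    """Read an HTML file and pass its contents to client.localscraper.

    Hypothetical helper for illustration; assumes UTF-8 encoded input.
    """
    html_content = Path(path).read_text(encoding="utf-8")
    return client.localscraper(
        user_prompt=user_prompt,
        website_html=html_content,
    )
```

Taking the client as a parameter keeps the helper usable with either a real `Client` instance or a test double.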
16 changes: 16 additions & 0 deletions scrapegraph-py/examples/markdownify_example.py
@@ -0,0 +1,16 @@
from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Initialize the client
sgai_client = Client(api_key="your-api-key-here")

# Markdownify request
response = sgai_client.markdownify(
    website_url="https://example.com",
)

# Print the response
print(f"Request ID: {response['request_id']}")
print(f"Result: {response['result']}")
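
Rather than printing the result, it can be written to a `.md` file. A minimal sketch, assuming `response["result"]` holds the markdown as text (the example only prints it, so the exact type is an assumption); `save_markdown` is a hypothetical helper, not part of the SDK:

```python
from pathlib import Path


def save_markdown(response, path):
    """Write response["result"] to path as UTF-8 text and return the Path.

    Hypothetical helper; str() guards against a non-string result.
    """
    target = Path(path)
    target.write_text(str(response["result"]), encoding="utf-8")
    return target
```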