pdf-extraction

Here are 47 public repositories matching this topic...

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

python ocr async mcp pandoc tesseract text-extraction metadata-extraction table-extraction pdfium rag pdf-extraction document-intelligence

Updated Aug 23, 2025
Python

24eme / signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

php pdf js signature pdf-manipulation pdf-merge pdf-format pdf-rotate pdf-merger pdf-meta-editor pdf-tools pdf-signature pdf-compression pdf-editor pdf-sign pdf-extraction pdf-signer pdf-metadata pdf-compressor

Updated Aug 8, 2025
JavaScript

pytr-org / pytr

Use TradeRepublic in terminal and mass download all documents

portfolio finance terminal-app portfolio-performance pdf-extraction traderepublic-statements traderepublic

Updated Aug 8, 2025
Python

ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF

javascript pdf typescript wasm mupdf pdf-viewer pdf-extraction

Updated Jun 4, 2025

mateogon / pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

pdf text-to-speech audiobook tts epub low-resource pdf-extraction pdf-to-audiobook immersive-reading kokoro-tts audiobook-generator pdf-audiobook

Updated Mar 28, 2025
Python

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python

pcschreiber1 / PDF_Extraction-Translation

Translate many large PDF Reports for free using Python.

python pdf-extraction pdf-translation

Updated Dec 31, 2022
Jupyter Notebook

MarkShawn2020 / video2ppt

Extract presentation slides from videos with accurate timestamps

python opencv video-processing cli-tool frame-extraction pdf-extraction video-to-slides presentation-extraction

Updated Aug 16, 2025
Python

adobe / pdftools-extract-java-sdk-samples

This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.

java pdf extract pdf-extraction

Updated Apr 8, 2024
Java

aidalinfo / extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

pdf document-processing ai-sdk pdf-extraction vision-llm

Updated Aug 12, 2025
TypeScript

heshiming / paddlefish

A Python + C implementation for image-based PDF page layout analysis and content extraction.

pdf image-processing image-segmentation image-analysis pdf-extractor table-extraction layout-analysis pdf-extraction

Updated Apr 13, 2023
C++

anyparser / anyparserjs

Anyparser TypeScript SDK for RAG/ETL pipelines: file content extraction. Supports extraction from various file formats, including PDF, Microsoft Office documents, OCR/image to text, audio to text, and website to text.

crawler ocr microsoft-word web-crawler text-extraction artificial-intelligence knowledgebase ms-office microsoft-office etl-pipeline rag pdf-extraction n8n-nodes langchain retrieval-augmented-generation graph-rag cache-augmented-generation anyparser

Updated Feb 26, 2025
TypeScript

souvik03-136 / TenderBot

rrayhka / GRI-Extractor

A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.

python nlp machine-learning pattern-matching tf-idf gri groq pdf-extraction streamlit sustainability-developoment-goals llm sustainability-reporting

Updated Jun 9, 2025
Python

bylickilabs / pdfAnalyzer

PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.

python cli open-source metadata pdf text-mining automation reporting document-analysis document-processing file-analyzer pdf-extraction streamlit pdf-analysis file-inspector

Updated Jun 30, 2025
Python

heijul / pdf2gtfs

A Python tool to extract schedule data from PDF timetables and output it in GTFS format.

gtfs pdf-extraction

Updated Sep 5, 2023
Python

billy-enrizky / pdf-extraction

Scalable PDF extraction using multimodal GPT-4o

pdf-extraction llm gpt-4o

Updated Aug 11, 2025
Python

vatsalmehta2001 / MLPapers_scraper-summarizer

A web application that scrapes ML research papers from arXiv and generates summaries using either OpenAI or Claude API.

flask machine-learning summarization research-papers pdf-extraction arxiv-papers openai-api claude-api

Updated Apr 21, 2025
Python

nickchristopherson / duluth-tourism-analysis

End-to-End Data Pipeline for Tourism Industry Analysis

python tourism jupyter pandas data-visualization data-analysis economic-analysis pdf-extraction duluth

Updated Jun 25, 2025
HTML

RaghuSharma14 / PDF-Reader

A PDF Reader application powered by AI, allowing users to upload PDF documents and extract meaningful information using advanced NLP models. Built with Streamlit, Transformers, and Langchain, this app provides a seamless interface for interacting with and analyzing PDF content.

machine-learning automation transformers text-extraction pdf-reader pdf-extraction streamlit pdf-analysis langchain natural-language-processing-nlp

Updated Apr 24, 2025
Python
