@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python 428 89

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 177 11

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java 117 10

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java 38 18

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook 20 3

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook 52 10

Repositories

Showing 10 of 69 repositories
  • cc-host-index Public

    Tools for working with the host index

    Python 0 0 0 0 Updated Apr 16, 2025
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python 177 Apache-2.0 11 0 0 Updated Apr 14, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    39 43 0 0 Updated Apr 12, 2025
  • wac2025-cc-annotator-poster Public

    A proof of concept pipeline for WARC annotation

    Rust 1 Apache-2.0 0 0 0 Updated Apr 10, 2025
  • cc-webgraph-statistics Public

    Statistics of Common Crawl monthly Web Graphs

    Python 3 Apache-2.0 0 0 0 Updated Apr 10, 2025
  • wac2025-webgraph-workshop Public

    Introduction to WebGraphs - Workshop at the IIPC Web Archiving Conference 2025

    Shell 3 MIT 0 0 0 Updated Apr 10, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java 90 Apache-2.0 5 2 (1 issue needs help) 0 Updated Apr 4, 2025
  • arc2warc-conversion Public

    Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format

    0 0 0 0 Updated Apr 3, 2025
  • cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook 52 Apache-2.0 10 0 0 Updated Apr 1, 2025
  • nutch Public

    Forked from Aloisius/nutch

    Common Crawl fork of Apache Nutch

    Java 33 Apache-2.0 1,258 6 (1 issue needs help) 0 Updated Apr 1, 2025