What's in the box? A preliminary analysis of undesirable content in the common crawl corpus

AS Luccioni, JD Viviano - arXiv preprint arXiv:2105.02732, 2021 - arxiv.org
… The Common Crawl has been used to train many of the recent neural language models in …
In the current article, we present an initial analysis of the Common Crawl, highlighting the pres…

N-gram counts and language models from the common crawl

C Buck, K Heafield, B Van Ooyen - Proceedings of the Language …, 2014 - research.ed.ac.uk
… Finally, we investigate the relation between the amount of Common Crawl data used and …
cannot rule out the possibility that some of the segments appear in the Common Crawl data. …

Introduction to common crawl datasets

JM Patel, JM Patel - Getting structured data from the internet: running web …, 2020 - Springer
… When we take the common crawl data cumulatively, across monthly crawls since 2008, it
represents one of the largest publicly accessible web crawl data corpuses on a petabyte …

Dirt cheap web-scale parallel text from the common crawl

JR Smith, H Saint-Amand, M Plamada, P Koehn… - 2013 - zora.uzh.ch
Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing
more than a set of common … algorithm mined 32 terabytes of the crawl in just under a day, at …

Elastic chatnoir: Search engine for the clueweb and the common crawl

J Bevendorff, B Stein, M Hagen, M Potthast - Advances in Information …, 2018 - Springer
… reference corpora like the ClueWebs and the Common Crawl. ChatNoir is freely available
and … In the future, we plan to incorporate further versions of the Common Crawl, so that …

Understanding regional context of World Wide Web using common crawl corpus

MA Mehmood, HM Shafiq… - 2017 IEEE 13th Malaysia …, 2017 - ieeexplore.ieee.org
… This paper presents large scale web study using Common Crawl Corpus of December 2016.
We examine 200+ terabytes of data with Amazon’s Elastic MapReduce infrastructure. We …

CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl

M Fröbe, J Bevendorff, L Gienapp, M Völske… - Proceedings of the 44th …, 2021 - dl.acm.org
… With the CopyCat resource, we provide lists of near-duplicates in the commonly used ClueWeb
and Common Crawl datasets and a software toolkit to conduct deduplication on arbitrary …

CCNet: Extracting high quality monolingual datasets from web crawl data

G Wenzek, MA Lachaux, A Conneau… - arXiv preprint arXiv …, 2019 - arxiv.org
… its application to data collected by the Common Crawl project… Common Crawl is a massive
non-curated dataset of … Our pipeline can be applied to any number of Common Crawl

A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl

S Baack - The 2024 ACM Conference on Fairness, Accountability …, 2024 - dl.acm.org
Common Crawl’s role in generative AI and how LLM builders have typically used its data for
pre-training LLMs, we review Common Crawl’s … We find that Common Crawl’s popularity has …

Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce.

A Stolz, M Hepp - COLD, 2015 - heppnetz.de
crawler, Common Crawl, with the URLs in sitemap files of respective Web sites. We show that
Common Crawl … approach as simple as a sitemap crawl yields much more product pages. …