Google Scholar

What's in the box? A preliminary analysis of undesirable content in the common crawl corpus

AS Luccioni, JD Viviano - arXiv preprint arXiv:2105.02732, 2021 - arxiv.org

… The Common Crawl has been used to train many of the recent neural language models in …
In the current article, we present an initial analysis of the Common Crawl, highlighting the pres…

Cited by 115

N-gram counts and language models from the common crawl

C Buck, K Heafield, B Van Ooyen - Proceedings of the Language …, 2014 - research.ed.ac.uk

… Finally, we investigate the relation between the amount of Common Crawl data used and …
cannot rule out the possibility that some of the segments appear in the Common Crawl data. …

Cited by 229

Introduction to common crawl datasets

JM Patel - Getting structured data from the internet: running web …, 2020 - Springer

… When we take the common crawl data cumulatively, across monthly crawls since 2008, it
represents one of the largest publicly accessible web crawl data corpuses on a petabyte …

Cited by 17

Dirt cheap web-scale parallel text from the common crawl

JR Smith, H Saint-Amand, M Plamada, P Koehn… - 2013 - zora.uzh.ch

… Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing
more than a set of common … algorithm mined 32 terabytes of the crawl in just under a day, at …

Cited by 174

Elastic chatnoir: Search engine for the clueweb and the common crawl

J Bevendorff, B Stein, M Hagen, M Potthast - Advances in Information …, 2018 - Springer

… reference corpora like the ClueWebs and the Common Crawl. ChatNoir is freely available
and … In the future, we plan to incorporate further versions of the Common Crawl, so that …

Cited by 60

Understanding regional context of World Wide Web using common crawl corpus

MA Mehmood, HM Shafiq… - 2017 IEEE 13th Malaysia …, 2017 - ieeexplore.ieee.org

… This paper presents large scale web study using Common Crawl Corpus of December 2016.
We examine 200+ terabytes of data with Amazon’s Elastic MapReduce infrastructure. We …

Cited by 20

CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl

M Fröbe, J Bevendorff, L Gienapp, M Völske… - Proceedings of the 44th …, 2021 - dl.acm.org

… With the CopyCat resource, we provide lists of near-duplicates in the commonly used ClueWeb
and Common Crawl datasets and a software toolkit to conduct deduplication on arbitrary …

Cited by 13

CCNet: Extracting high quality monolingual datasets from web crawl data

G Wenzek, MA Lachaux, A Conneau… - arXiv preprint arXiv …, 2019 - arxiv.org

… its application to data collected by the Common Crawl project… Common Crawl is a massive
non-curated dataset of … Our pipeline can be applied to any number of Common Crawl …

Cited by 539

A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl

S Baack - The 2024 ACM Conference on Fairness, Accountability …, 2024 - dl.acm.org

… Common Crawl’s role in generative AI and how LLM builders have typically used its data for
pre-training LLMs, we review Common Crawl’s … We find that Common Crawl’s popularity has …

Cited by 1

Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce.

A Stolz, M Hepp - COLD, 2015 - heppnetz.de

… crawler, Common Crawl, with the URLs in sitemap files of respective Web sites. We show that
Common Crawl … approach as simple as a sitemap crawl yields much more product pages. …

Cited by 10
