Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
Updated Dec 11, 2024 - C++
Process Common Crawl data with Python and Spark
News crawling with StormCrawler - stores content as WARC
A Python utility for downloading Common Crawl data
🕷️ The pipeline for the OSCAR corpus
Statistics of Common Crawl monthly archives mined from URL index files
Drill into WARC web archives
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Tools to construct and process webgraphs from Common Crawl data
Various Jupyter notebooks about Common Crawl data
A dataset for knowledge base population research using Common Crawl and DBpedia.
German small and large versions of GPT2.
GlotCC Dataset and Pipeline -- NeurIPS 2024
The website of the Oscar Project
Common Crawl's processing tools
Parse huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
⛏ Extract metadata of a specific target based on the results of "commoncrawl.org"
We explore data using Big Data analysis and visualization skills. To do this, we perform three main operations: i) data aggregation from different sources, ii) Big Data analysis using MapReduce, and iii) visualization through Tableau. Data analysis is critical to understanding the data and what we can do with it. For small d…
Various Common Crawl utilities in Clojure.
Distributed download scripts for Common Crawl data