πŸ•ΈοΈ commoncrawl-pwnkit

Unleash your power over web archives with commoncrawl-pwnkit!

This repository equips you with the tools you need to dominate the vast web archives of Common Crawl. Whether you're diving deep into historical web data, extracting specific content, or conducting security research, commoncrawl-pwnkit is your ultimate toolkit.

πŸš€ What’s Inside?

1. common_crawl_pwn.sh

  • 🎯 Purpose: This Bash script is your go-to for extracting metadata from Common Crawl based on specific domains. Whether you have a list of domains or a single target, common_crawl_pwn.sh helps you fetch metadata efficiently and generates complete URLs for easy access to archived content.
  • ✨ Features:
    • Extract metadata for multiple domains or a single domain.
    • Optionally include wildcard subdomains.
    • Retrieve and format metadata for easy analysis.
  • πŸ› οΈ Usage:
    ./common_crawl_pwn.sh -d domains.txt -o output_file.csv
    • πŸ“‹ Arguments:
      • -d: Path to a file containing a list of domains.
      • -t: Specify a single target domain.
      • -o: Output file where the metadata and URLs will be saved.
      • -w: (Optional) Include wildcard subdomains in the search.

2. extract_warc_segments.py

  • 🎯 Purpose: A Python script designed to download specific segments from Common Crawl WARC files based on byte offsets and lengths. extract_warc_segments.py lets you efficiently extract and save content like PDFs, HTML, or any other web data you need.
  • ✨ Features:
    • Download content from precise locations within WARC files.
    • Save extracted data with unique filenames to avoid overwriting.
    • Handle large web archives with ease.
  • πŸ› οΈ Usage:
    python extract_warc_segments.py -l urls.txt -o ./output_directory
    • πŸ“‹ Arguments:
      • -l: Path to a file containing a list of URLs with offsets and lengths.
      • -o: Directory where the extracted content will be saved.

πŸ“‚ How to Get Started

  1. πŸ’Ύ Clone the repository:

    git clone https://github.com/WildWestCyberSecurity/commoncrawl-pwnkit.git
  2. πŸ” Navigate to the directory:

    cd commoncrawl-pwnkit
  3. πŸš€ Run the scripts with your data:

    • Use common_crawl_pwn.sh to extract metadata from specific domains.
    • Use extract_warc_segments.py to download and save content from WARC files.

πŸ› οΈ Requirements

Make sure to install the necessary dependencies before running the Python script:

requests==2.31.0
warcio==1.7.4

Install them via pip:

pip install -r requirements.txt

πŸ“„ License

This project is licensed under the MIT License.


Happy web archiving/bug hunting! πŸ•΅οΈβ€β™‚οΈ
