Parallel web scraping system

Onigumo

About

Onigumo is yet another web crawler. It “crawls” websites or web apps and stores their data in a structured form suitable for further machine processing.

Architecture

Onigumo is composed of three sequentially interconnected components: the Operator, the Downloader, and the Parser.

The flowchart below illustrates the flow of data between those parts:

flowchart LR
    start([START])               -->         onigumo_operator[OPERATOR]
    onigumo_operator   -- <hash>.urls ---> onigumo_downloader[DOWNLOADER]
    onigumo_downloader -- <hash>.raw  ---> onigumo_parser[PARSER]
    onigumo_parser     -- <hash>.json ---> onigumo_operator
	
    onigumo_operator          <-.->        spider_operator[OPERATOR]
    onigumo_parser            <-.->        spider_parser[PARSER]

    onigumo_operator           -->         spider_materialization[MATERIALIZER]

    subgraph "Onigumo (kernel)"
        onigumo_operator
        onigumo_downloader
        onigumo_parser
    end

    subgraph "Spider (application)"
       spider_operator
       spider_parser
       spider_materialization
    end
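
The edge labels in the flowchart suggest that each stage hands its output to the next through files that share a `<hash>` stem and differ only in extension. The following is a minimal Python sketch of that naming scheme, not Onigumo's actual code. The function name `stage_paths`, the working directory argument, and the treatment of the hash as an opaque string are assumptions made for illustration.

```python
from pathlib import Path

# File extensions taken from the flowchart edge labels.
STAGE_SUFFIXES = {
    "operator": ".urls",    # Operator -> Downloader
    "downloader": ".raw",   # Downloader -> Parser
    "parser": ".json",      # Parser -> Operator
}


def stage_paths(work_dir, hash_stem):
    """Return the file each stage would write for one hash stem.

    `hash_stem` is treated as an opaque string; how it is computed
    is deliberately left unspecified in this sketch.
    """
    base = Path(work_dir)
    return {
        stage: base / f"{hash_stem}{suffix}"
        for stage, suffix in STAGE_SUFFIXES.items()
    }
```

Calling `stage_paths("work", "abc123")` would map each stage name to a path such as `work/abc123.urls`.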

Operator

The Operator determines which URLs the Downloader will fetch. The Spider is responsible for supplying those URLs, which it takes from the structured data produced by the Parser.

The Operator’s job is to:

  1. initialize a Spider,
  2. extract new URLs from structured data,
  3. insert those URLs onto the Downloader queue.
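
The three steps above can be sketched roughly as follows. This is a minimal Python illustration, not Onigumo's implementation. The record layout (a dict with a `"urls"` list), the in-memory `deque` standing in for the Downloader queue, and the names `extract_urls` and `operate` are all hypothetical.

```python
from collections import deque


def extract_urls(record):
    """Pull URLs out of one structured record.

    Assumes, hypothetically, that a record is a dict with a "urls" list.
    """
    return list(record.get("urls", []))


def operate(start_urls, records):
    """Build the Downloader queue from a Spider's start URLs and parsed records.

    URLs keep their first-seen order and duplicates are dropped.
    """
    queue = deque()
    seen = set()

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    # 1. Initialize the Spider: seed the queue with its start URLs.
    for url in start_urls:
        enqueue(url)

    # 2. Extract new URLs from structured data and
    # 3. insert them onto the Downloader queue.
    for record in records:
        for url in extract_urls(record):
            enqueue(url)

    return queue
```

Tracking a `seen` set alongside the queue keeps a page that links back to an earlier URL from being queued twice.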

Downloader

The Downloader fetches and saves the contents and metadata from the unprocessed URL addresses.

The Downloader’s job is to:

  1. read URLs for download,
  2. check for already downloaded URLs,
  3. fetch each URL's contents along with its metadata,
  4. save the downloaded data.
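
A rough Python sketch of these four steps, using only the standard library, might look like the following. It is an illustration under stated assumptions, not Onigumo's code: naming files by the SHA-256 digest of the URL, storing metadata in a separate `.meta` file, and the function names `url_stem`, `fetch`, and `download_all` are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlopen


def url_stem(url):
    """Assumed naming scheme: SHA-256 hex digest of the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def fetch(url):
    """Fetch a URL and return (body bytes, metadata dict)."""
    with urlopen(url, timeout=10) as response:
        metadata = {
            "url": url,
            "status": response.status,
            "headers": dict(response.headers.items()),
        }
        return response.read(), metadata


def download_all(urls, out_dir, fetch=fetch):
    """Download every URL that has no saved file yet; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:                          # 1. read URLs for download
        raw_path = out / f"{url_stem(url)}.raw"
        if raw_path.exists():                 # 2. skip already downloaded URLs
            continue
        body, metadata = fetch(url)           # 3. fetch contents and metadata
        raw_path.write_bytes(body)            # 4. save the downloaded data
        meta_path = out / f"{url_stem(url)}.meta"  # hypothetical metadata file
        meta_path.write_text(json.dumps(metadata), encoding="utf-8")
        saved.append(raw_path)
    return saved
```

Passing the fetch function as a parameter lets the skip-if-present logic be exercised without network access, for example with a stub that returns fixed bytes.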

Parser

The Parser processes the downloaded content and metadata into a structured form.

The Parser’s job is to:

  1. check for downloaded URLs awaiting processing,
  2. process the content and metadata of the downloaded URLs into a structured form,
  3. save the structured data.

Applications (spiders)

A spider extracts the needed information from the structured form of the data.

The nature of the output data or information depends on the user’s needs and on the form of the web content. It is impossible to create a universal spider that meets every requirement arising from the combination of the two. For this reason, you need to write your own spider.
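
As an illustration of turning downloaded content and metadata into a structured form, here is a minimal Python sketch built on the standard library's `html.parser`. It is not Onigumo's parser. The choice of extracted fields (page title and link targets), the output keys `"url"`, `"title"`, and `"urls"`, and the names `parse_raw` and `to_json` are assumptions.

```python
import json
from html.parser import HTMLParser


class _LinkAndTitleParser(HTMLParser):
    """Collect link targets and the page title from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_raw(raw_html, metadata):
    """Turn downloaded HTML plus its metadata into one structured record."""
    parser = _LinkAndTitleParser()
    parser.feed(raw_html)
    parser.close()
    return {
        "url": metadata.get("url"),
        "title": parser.title.strip(),
        "urls": parser.links,
    }


def to_json(record):
    """Serialize a structured record for saving."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

The string returned by `to_json` is what a parser stage would write to disk for the next stage to read.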
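
To make the idea of a custom spider concrete, here is a minimal Python sketch of one. Everything in it is hypothetical: the class name `TitleSpider`, the hook names `next_urls` and `materialize`, and the record layout (a dict with `"url"`, `"title"`, and `"urls"` keys) are assumptions for illustration and do not describe Onigumo's spider interface.

```python
from urllib.parse import urljoin


class TitleSpider:
    """Hypothetical spider: follows same-site links and keeps page titles."""

    def __init__(self, start_url):
        self.start_url = start_url

    def next_urls(self, record):
        """Choose which links from a parsed page should be crawled next."""
        absolute = (
            urljoin(record["url"], href) for href in record.get("urls", [])
        )
        # Stay on the site the spider was started on.
        return [url for url in absolute if url.startswith(self.start_url)]

    def materialize(self, records):
        """Reduce parsed pages to the output this spider's user wants."""
        return {r["url"]: r["title"] for r in records if r.get("title")}
```

A different use case would swap in its own link filter and its own output shape, which is why each application needs its own spider.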

Materializer

Usage

Credits

© Glutexo, nappex 2019 – 2022

Licensed under the MIT license.
