A simple distributed web crawler library written in Go.
The library is implemented completely from scratch. As a Golang practice project, it focuses mainly on the distributed structure. Users need to implement their own web parsers, as shown in the examples.
It is the capstone project of imooc's Golang course.
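As a rough illustration of what implementing such a parser might involve, the sketch below assumes a hypothetical `Request`/`ParseResult` pair of types; the actual type and function names are defined by the library and shown in the examples.

```go
package crawler

// Request and ParseResult are hypothetical names illustrating the contract a
// user-supplied parser fulfils: it turns a fetched page into scraped items
// plus follow-up requests for the engine to visit.
type Request struct {
	URL    string
	Parser func(body []byte) ParseResult
}

type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// ParseCountyList is an illustrative parser: extract data from the raw HTML
// (e.g. with regexp or goquery) and return further pages to crawl.
func ParseCountyList(body []byte) ParseResult {
	result := ParseResult{}
	// ... match county rows in body and append parsed items:
	// result.Items = append(result.Items, county)
	// ... append detail-page requests, each carrying its own parser:
	// result.Requests = append(result.Requests, Request{URL: detailURL, Parser: ParseCountyDetail})
	return result
}
```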
As a distributed web crawler, it consists of several components:
- The concurrent engine manages the crawler's core logic and coordinates the other components.
- The queued scheduler manages workers and requests in queues, matching idle workers to pending requests (see the sketch after this list).
- The persistence service saves scraped data. Currently it stores parsed data in Elasticsearch; more databases can be supported.
- The crawler worker service parses websites.
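A minimal sketch of the queued-scheduler idea, assuming channel-based workers (type and method names are illustrative, not necessarily this repo's API):

```go
package scheduler

// Request is an illustrative request type.
type Request struct{ URL string }

// QueuedScheduler buffers incoming requests and idle workers in slices and
// matches them in a single goroutine, so neither side blocks the other.
type QueuedScheduler struct {
	requestChan chan Request
	workerChan  chan chan Request
}

// Submit hands a new request to the scheduler.
func (s *QueuedScheduler) Submit(r Request) { s.requestChan <- r }

// WorkerReady registers an idle worker's input channel.
func (s *QueuedScheduler) WorkerReady(w chan Request) { s.workerChan <- w }

// Run starts the matching loop.
func (s *QueuedScheduler) Run() {
	s.requestChan = make(chan Request)
	s.workerChan = make(chan chan Request)
	go func() {
		var requestQ []Request
		var workerQ []chan Request
		for {
			// Only attempt a pairing when both queues are non-empty; a send on
			// a nil channel is never chosen by select, so the third case stays
			// disabled otherwise.
			var activeRequest Request
			var activeWorker chan Request
			if len(requestQ) > 0 && len(workerQ) > 0 {
				activeRequest = requestQ[0]
				activeWorker = workerQ[0]
			}
			select {
			case r := <-s.requestChan:
				requestQ = append(requestQ, r)
			case w := <-s.workerChan:
				workerQ = append(workerQ, w)
			case activeWorker <- activeRequest:
				requestQ = requestQ[1:]
				workerQ = workerQ[1:]
			}
		}
	}()
}
```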
Components communicate with each other via JSON-RPC.
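As a rough illustration of this wiring (the service and method names below are made up, not the repo's actual ones), a persistence-style service could be exposed over Go's standard `net/rpc/jsonrpc` package:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Item is an illustrative payload for one scraped record.
type Item struct {
	URL     string
	Payload map[string]interface{}
}

// ItemSaverService is an illustrative RPC service for persisting items.
type ItemSaverService struct{}

// Save would persist one scraped item; here it only logs it.
func (s *ItemSaverService) Save(item Item, result *string) error {
	log.Printf("saving item from %s", item.URL)
	*result = "ok"
	return nil
}

func main() {
	// Register the service and serve each connection with the JSON-RPC codec.
	if err := rpc.RegisterName("ItemSaverService", &ItemSaverService{}); err != nil {
		log.Fatal(err)
	}
	listener, err := net.Listen("tcp", ":1234")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := listener.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go jsonrpc.ServeConn(conn)
	}
}
```

A worker would then connect with `jsonrpc.Dial("tcp", address)` and invoke `client.Call("ItemSaverService.Save", item, &reply)`.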
The crawler uses breadth-first search to traverse websites.
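A condensed sketch of what such a breadth-first crawl loop might look like, reusing the hypothetical Request/ParseResult types from the parser sketch above (deduplication and error handling omitted):

```go
package engine

// Request and ParseResult mirror the hypothetical types from the parser
// sketch above.
type Request struct {
	URL    string
	Parser func(body []byte) ParseResult
}

type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// Run sketches a breadth-first crawl: seed requests go into a FIFO queue,
// each fetched page is parsed, and newly discovered requests are appended
// to the back of the queue.
func Run(fetch func(url string) ([]byte, error), seeds ...Request) {
	queue := append([]Request{}, seeds...)
	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]

		body, err := fetch(r.URL)
		if err != nil {
			continue
		}
		result := r.Parser(body)
		// Items in result.Items would be sent to the persistence service here.
		queue = append(queue, result.Requests...) // FIFO append keeps the traversal breadth-first
	}
}
```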
There are two simple examples included:
- Coronazaehler scrapes current COVID-19 data for every county in Germany from coronazaehler.de.
- mockweb scrapes profile data from a mock dating website.
Additional features and planned improvements:
- separate service for saving data
- separate service for parsing web data
- frontend for displaying search results
- use testcontainers in tests
- separate service for checking duplication
- Kubernetes deployment
- gRPC and Protobuf version