A simple distributed web crawler library written in Go.
The library is implemented completely from scratch. As a Golang practice project, it focuses mainly on the distributed structure. Users need to implement their own web parsers, as shown in the examples.
It is the capstone project of imooc's Golang course.
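As a rough illustration of what implementing such a parser might involve, the sketch below assumes a hypothetical `Request`/`ParseResult` pair of types; the actual type and function names are defined by the library and shown in the examples.

```go
package crawler

// Request and ParseResult are hypothetical names illustrating the contract a
// user-supplied parser fulfils: it turns a fetched page into scraped items
// plus follow-up requests for the engine to visit.
type Request struct {
	URL    string
	Parser func(body []byte) ParseResult
}

type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// ParseCountyList is an illustrative parser: extract data from the raw HTML
// (e.g. with regexp or goquery) and return further pages to crawl.
func ParseCountyList(body []byte) ParseResult {
	result := ParseResult{}
	// ... match county rows in body and append parsed items:
	// result.Items = append(result.Items, county)
	// ... append detail-page requests, each carrying its own parser:
	// result.Requests = append(result.Requests, Request{URL: detailURL, Parser: ParseCountyDetail})
	return result
}
```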
As a distributed web crawler, it consists of several components:
- The concurrent engine manages the crawler's core logic and coordinates the other components.
- The queued scheduler manages workers and requests in queues, matching idle workers to pending requests (see the sketch after this list).
- The persistence service saves scraped data. Currently it stores parsed data in Elasticsearch; more databases can be supported.
- The crawler worker service parses websites.
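A minimal sketch of the queued-scheduler idea, assuming channel-based workers (type and method names are illustrative, not necessarily this repo's API):

```go
package scheduler

// Request is an illustrative request type.
type Request struct{ URL string }

// QueuedScheduler buffers incoming requests and idle workers in slices and
// matches them in a single goroutine, so neither side blocks the other.
type QueuedScheduler struct {
	requestChan chan Request
	workerChan  chan chan Request
}

// Submit hands a new request to the scheduler.
func (s *QueuedScheduler) Submit(r Request) { s.requestChan <- r }

// WorkerReady registers an idle worker's input channel.
func (s *QueuedScheduler) WorkerReady(w chan Request) { s.workerChan <- w }

// Run starts the matching loop.
func (s *QueuedScheduler) Run() {
	s.requestChan = make(chan Request)
	s.workerChan = make(chan chan Request)
	go func() {
		var requestQ []Request
		var workerQ []chan Request
		for {
			// Only attempt a pairing when both queues are non-empty; a send on
			// a nil channel is never chosen by select, so the third case stays
			// disabled otherwise.
			var activeRequest Request
			var activeWorker chan Request
			if len(requestQ) > 0 && len(workerQ) > 0 {
				activeRequest = requestQ[0]
				activeWorker = workerQ[0]
			}
			select {
			case r := <-s.requestChan:
				requestQ = append(requestQ, r)
			case w := <-s.workerChan:
				workerQ = append(workerQ, w)
			case activeWorker <- activeRequest:
				requestQ = requestQ[1:]
				workerQ = workerQ[1:]
			}
		}
	}()
}
```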
Components communicate with each other via JSON-RPC.
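As a rough illustration of this wiring (the service and method names below are made up, not the repo's actual ones), a persistence-style service could be exposed over Go's standard `net/rpc/jsonrpc` package:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Item is an illustrative payload for one scraped record.
type Item struct {
	URL     string
	Payload map[string]interface{}
}

// ItemSaverService is an illustrative RPC service for persisting items.
type ItemSaverService struct{}

// Save would persist one scraped item; here it only logs it.
func (s *ItemSaverService) Save(item Item, result *string) error {
	log.Printf("saving item from %s", item.URL)
	*result = "ok"
	return nil
}

func main() {
	// Register the service and serve each connection with the JSON-RPC codec.
	if err := rpc.RegisterName("ItemSaverService", &ItemSaverService{}); err != nil {
		log.Fatal(err)
	}
	listener, err := net.Listen("tcp", ":1234")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := listener.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go jsonrpc.ServeConn(conn)
	}
}
```

A worker would then connect with `jsonrpc.Dial("tcp", address)` and invoke `client.Call("ItemSaverService.Save", item, &reply)`.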
The crawler uses breadth-first search to traverse websites.
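A condensed sketch of what such a breadth-first crawl loop might look like, reusing the hypothetical Request/ParseResult types from the parser sketch above (deduplication and error handling omitted):

```go
package engine

// Request and ParseResult mirror the hypothetical types from the parser
// sketch above.
type Request struct {
	URL    string
	Parser func(body []byte) ParseResult
}

type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// Run sketches a breadth-first crawl: seed requests go into a FIFO queue,
// each fetched page is parsed, and newly discovered requests are appended
// to the back of the queue.
func Run(fetch func(url string) ([]byte, error), seeds ...Request) {
	queue := append([]Request{}, seeds...)
	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]

		body, err := fetch(r.URL)
		if err != nil {
			continue
		}
		result := r.Parser(body)
		// Items in result.Items would be sent to the persistence service here.
		queue = append(queue, result.Requests...) // FIFO append keeps the traversal breadth-first
	}
}
```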
There are two simple examples included:
- Coronazaehler scrapes current COVID-19 data for every county in Germany from coronazaehler.de.
- mockweb scrapes profile data from a mock dating website.
Additional features and planned improvements:
- separate service for saving data
- separate service for parsing web data
- frontend for displaying search results
- use testcontainers in tests
- separate service for checking duplication
- Kubernetes deployment
- gRPC and Protobuf version