Simple web crawler

Given a URL, it should output a simple textual sitemap, showing the links between pages. The crawler should be limited to one subdomain - so when you start with https://example.com/, it would crawl all pages within example.com, but not follow external links, for example to facebook.com or blog.example.com.

Client

The crawler client is a ReactJS SPA. The interface contains a text input for the URL and a numeric input to limit the maximum depth to crawl.

The sitemap is rendered using an unordered list (ul) of items (li) using indentation to show the hierarchy/relationship between them.

Service

The crawler service (Service.API) is a .NET Core web API written in C#. The API contains a single route:

- POST ~/crawl?url=[url]&max-depth=[max-depth]
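
For example, assuming the service is listening over plain HTTP on its default port 3001 (see Launch below), a crawl of example.com to a depth of 2 could be requested with:

    curl -X POST "http://localhost:3001/crawl?url=https://example.com/&max-depth=2"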

BFSCrawler

Breadth-first search approach, going one level of the tree at a time. At each step we spawn crawl tasks in parallel, which generate the nodes for the next level.
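
A minimal sketch of the level-by-level idea is shown below. The SiteNode type, IContentProvider shape and link-extraction delegate used here are illustrative stand-ins, not the actual Service.API types.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    // Illustrative types -- the real Service.API models may differ.
    public record SiteNode(Uri Url)
    {
        public List<SiteNode> Children { get; } = new();
    }

    public interface IContentProvider
    {
        Task<string> GetContentAsync(Uri url);
    }

    public class BfsCrawlerSketch
    {
        private readonly IContentProvider _contentProvider;
        private readonly Func<Uri, string, IEnumerable<Uri>> _extractLinks;

        public BfsCrawlerSketch(IContentProvider contentProvider,
                                Func<Uri, string, IEnumerable<Uri>> extractLinks)
            => (_contentProvider, _extractLinks) = (contentProvider, extractLinks);

        public async Task<SiteNode> CrawlAsync(Uri root, int maxDepth)
        {
            var rootNode = new SiteNode(root);
            var visited = new HashSet<Uri> { root };
            var currentLevel = new List<SiteNode> { rootNode };

            for (var depth = 0; depth < maxDepth && currentLevel.Count > 0; depth++)
            {
                // Fetch every page of the current level in parallel.
                var pages = await Task.WhenAll(
                    currentLevel.Select(node => _contentProvider.GetContentAsync(node.Url)));

                var nextLevel = new List<SiteNode>();
                for (var i = 0; i < currentLevel.Count; i++)
                {
                    foreach (var link in _extractLinks(root, pages[i]))
                    {
                        if (!visited.Add(link)) continue; // already seen on an earlier level
                        var child = new SiteNode(link);
                        currentLevel[i].Children.Add(child);
                        nextLevel.Add(child);
                    }
                }
                currentLevel = nextLevel;
            }
            return rootNode;
        }
    }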

DFSCrawler

Depth-first search approach, recursively descending the sitemap tree from the root.
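
A minimal recursive sketch, reusing the illustrative SiteNode and IContentProvider types from the BFS sketch above:

    public class DfsCrawlerSketch
    {
        private readonly IContentProvider _contentProvider;
        private readonly Func<Uri, string, IEnumerable<Uri>> _extractLinks;
        private readonly HashSet<Uri> _visited = new();

        public DfsCrawlerSketch(IContentProvider contentProvider,
                                Func<Uri, string, IEnumerable<Uri>> extractLinks)
            => (_contentProvider, _extractLinks) = (contentProvider, extractLinks);

        public async Task<SiteNode> CrawlAsync(Uri root, int maxDepth)
        {
            _visited.Add(root);
            var rootNode = new SiteNode(root);
            await CrawlNodeAsync(root, rootNode, maxDepth);
            return rootNode;
        }

        // Crawl one page, then recurse into each newly discovered same-subdomain link.
        private async Task CrawlNodeAsync(Uri root, SiteNode node, int remainingDepth)
        {
            if (remainingDepth <= 0) return;
            var content = await _contentProvider.GetContentAsync(node.Url);
            foreach (var link in _extractLinks(root, content))
            {
                if (!_visited.Add(link)) continue;
                var child = new SiteNode(link);
                node.Children.Add(child);
                await CrawlNodeAsync(root, child, remainingDepth - 1);
            }
        }
    }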

FaultTolerantCrawler

A hardened variant of DFSCrawler that accounts for errors and retries. A queue processor implementation maintains a set of in-flight crawl tasks, bounded by a maximum concurrency level parameter. When a node fails to crawl, it is re-queued up to a maximum number of times.
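
A minimal sketch of that queue processor, with bounded concurrency and re-queuing on failure, again reusing the illustrative types from the BFS sketch; the concurrency and retry limits are placeholder values, not the actual defaults.

    public class FaultTolerantCrawlerSketch
    {
        private record WorkItem(SiteNode Node, int Depth, int Attempts);

        private const int MaxConcurrency = 8; // placeholder limits
        private const int MaxRetries = 3;

        private readonly IContentProvider _contentProvider;
        private readonly Func<Uri, string, IEnumerable<Uri>> _extractLinks;

        public FaultTolerantCrawlerSketch(IContentProvider contentProvider,
                                          Func<Uri, string, IEnumerable<Uri>> extractLinks)
            => (_contentProvider, _extractLinks) = (contentProvider, extractLinks);

        public async Task<SiteNode> CrawlAsync(Uri root, int maxDepth)
        {
            var rootNode = new SiteNode(root);
            var visited = new HashSet<Uri> { root };
            var queue = new Queue<WorkItem>();
            queue.Enqueue(new WorkItem(rootNode, 0, 0));

            var running = new List<(WorkItem Item, Task<List<SiteNode>> Task)>();

            while (queue.Count > 0 || running.Count > 0)
            {
                // Keep at most MaxConcurrency crawl tasks in flight.
                while (running.Count < MaxConcurrency && queue.Count > 0)
                {
                    var item = queue.Dequeue();
                    running.Add((item, ProcessAsync(root, item, maxDepth, visited)));
                }

                var finished = await Task.WhenAny(running.Select(r => r.Task));
                var entry = running.First(r => r.Task == finished);
                running.Remove(entry);

                try
                {
                    // Queue the children discovered by the finished task.
                    foreach (var child in await finished)
                        queue.Enqueue(new WorkItem(child, entry.Item.Depth + 1, 0));
                }
                catch (Exception)
                {
                    // Re-queue a failed node up to MaxRetries times, then give up on that branch.
                    if (entry.Item.Attempts < MaxRetries)
                        queue.Enqueue(entry.Item with { Attempts = entry.Item.Attempts + 1 });
                }
            }
            return rootNode;
        }

        private async Task<List<SiteNode>> ProcessAsync(Uri root, WorkItem item, int maxDepth, HashSet<Uri> visited)
        {
            var children = new List<SiteNode>();
            if (item.Depth >= maxDepth) return children;

            var content = await _contentProvider.GetContentAsync(item.Node.Url);
            foreach (var link in _extractLinks(root, content))
            {
                bool isNew;
                lock (visited) isNew = visited.Add(link); // visited is shared across parallel tasks
                if (!isNew) continue;
                var child = new SiteNode(link);
                item.Node.Children.Add(child);
                children.Add(child);
            }
            return children;
        }
    }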

Unit Tests

A test project (Service.Tests) contains:

  • UtilsTests - Unit tests for the Utils class, which contains helper methods for URL validation, parsing and transformation.
  • BFSCrawlerTests, DFSCrawlerTests and FaultTolerantCrawlerTests - Using a mock implementation of the IContentProvider interface, we test basic functionality of the different crawler strategy implementations (a sketch of this setup follows the list).
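
A minimal xUnit-style sketch of that pattern; the test framework, fake provider and assertions here are illustrative and reuse the stand-in types from the crawler sketches above, not the actual Service.Tests fixtures.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Xunit;

    // Fake content provider serving a tiny in-memory site.
    public class FakeContentProvider : IContentProvider
    {
        private readonly Dictionary<Uri, string> _pages;
        public FakeContentProvider(Dictionary<Uri, string> pages) => _pages = pages;
        public Task<string> GetContentAsync(Uri url) =>
            Task.FromResult(_pages.TryGetValue(url, out var html) ? html : string.Empty);
    }

    public class BfsCrawlerSketchTests
    {
        [Fact]
        public async Task Crawl_BuildsSitemap_ForSameSubdomainLinks()
        {
            var root = new Uri("https://example.com/");
            var about = new Uri("https://example.com/about");
            var provider = new FakeContentProvider(new Dictionary<Uri, string>
            {
                [root] = "<a href=\"https://example.com/about\"></a>",
                [about] = ""
            });

            // Trivial link extractor for the test: only the root page links to /about.
            IEnumerable<Uri> ExtractLinks(Uri _, string html) =>
                html.Contains("/about") ? new[] { about } : Array.Empty<Uri>();

            var crawler = new BfsCrawlerSketch(provider, ExtractLinks);
            var sitemap = await crawler.CrawlAsync(root, maxDepth: 2);

            Assert.Single(sitemap.Children);
            Assert.Equal(about, sitemap.Children[0].Url);
        }
    }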

Launch

The crawler client will start on port 3000 by default (react-scripts) using:

cd crawler-client && yarn install && yarn start

The crawler service will start on port 3001 (crawler-service/Service.API/Properties/launchSettings.json) using:

cd crawler-service && dotnet run --project Service.API

Production

In order to deliver this implementation in a reasonable time, some trade-offs were made. To make it production-ready, some other details should be addressed:

  • Crawler strategy - state management and recovery

    Independently of the strategies implemented to make the crawling process more resource-efficient, scenarios like node crashes and network outages should be addressed by providing state management and recovery.

  • Logs

    All errors should be tracked in a logging system with relevant context information, such as time, error type/description, number of retries, etc.