A configuration describes what you want to crawl and what you want to do with the responses you receive.
Creating a new configuration is easy:
<?php
// myconfig.php
use LastCall\Crawler\Configuration\Configuration;

// The constructor argument is the base URL of the site to crawl.
return new Configuration('http://url.for/my/site');
Configurations are Pimple dependency injection containers. You can use array syntax to define, redefine, or extend services on the configuration. See the Pimple documentation for more information on how to use the container.
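As a rough illustration of the array syntax, the sketch below uses a bare `Pimple\Container` rather than a crawler configuration. It assumes Pimple 3.x is installed through Composer, and the keys `example.extensions` and `example.list` are hypothetical names chosen for the example, not keys defined by this library.

```php
<?php
// Assumes Pimple 3.x was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Pimple\Container;

$container = new Container();

// Parameter: a plain value stored under a key (key name is hypothetical).
$container['example.extensions'] = ['html', 'htm'];

// Service: a closure that builds the object the first time it is requested.
$container['example.list'] = function (Container $c) {
    return new ArrayObject($c['example.extensions']);
};

// Redefine: assigning to the same key again replaces the earlier definition.
$container['example.extensions'] = ['html', 'htm', 'php'];

// Extend: wrap the existing definition and modify what it returns.
// This must happen before the service is first read from the container.
$container->extend('example.list', function (ArrayObject $list, Container $c) {
    $list->append('xhtml');
    return $list;
});

$extensions = $container['example.list']->getArrayCopy();
```

Because a configuration is itself a Pimple container, the same three operations should apply to it, provided you substitute the parameter and service names that the configuration actually registers.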
The following parameters are simple values.
string
- The base URL is a string representing the URL you want to crawl. It will be set when the container is created.
There is no default value for the base_url.
string[]
- An array containing the file extensions that are assumed to contain HTML content.
string[]
- An array containing the file extensions that are assumed to contain asset content (CSS, images, files).
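To make the role of the two extension lists concrete, here is a minimal sketch of an extension check in plain PHP. The function `uriHasExtension` and the extension values are assumptions made for illustration. The library's own defaults and matching logic are not shown in this document and may differ, for example in how URIs without any extension are treated.

```php
<?php
// Hypothetical helper, not part of LastCall\Crawler.
function uriHasExtension(string $uri, array $extensions): bool
{
    // Look only at the path, so query strings do not affect the result.
    $path = parse_url($uri, PHP_URL_PATH);
    if (!is_string($path)) {
        return false;
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $extension !== '' && in_array($extension, $extensions, true);
}

// Assumed example values; the real defaults are not listed here.
$htmlExtensions = ['html', 'htm'];
$assetExtensions = ['css', 'png', 'pdf'];

$isHtml = uriHasExtension('http://url.for/my/site/index.html', $htmlExtensions);
$isAsset = uriHasExtension('http://url.for/my/site/style.css?v=2', $assetExtensions);
```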
The following services are registered with the container and can be replaced or extended.
LastCall\Crawler\Uri\MatcherInterface
- The matcher is used to check whether URIs are considered within the scope of the current crawl.
LastCall\Crawler\Uri\MatcherInterface
- The matcher is used to check whether URIs point to HTML content.
LastCall\Crawler\Uri\MatcherInterface
- The matcher is used to check whether URIs point to asset content.
LastCall\Crawler\Uri\MatcherInterface
- The matcher is used to check whether URIs point to HTML content within the scope of the current crawl.
LastCall\Crawler\Uri\MatcherInterface
- The matcher is used to check whether URIs point to asset content within the scope of the current crawl.
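The last two matchers described above read as combinations of the simpler ones: a URI must be in scope and also point to the right kind of content. The sketch below models that idea with plain closures. These closures are hypothetical stand-ins: the real services implement `LastCall\Crawler\Uri\MatcherInterface`, whose methods are not shown in this document, and the scope and HTML rules used here are assumptions.

```php
<?php
// Hypothetical stand-ins for matcher services.
$baseUrl = 'http://url.for/my/site';

// Assumption: "in scope" means the URI starts with the base URL.
$isInternal = function (string $uri) use ($baseUrl): bool {
    return strpos($uri, $baseUrl) === 0;
};

// Assumption: HTML is detected by a .html or .htm path extension.
$isHtml = function (string $uri): bool {
    $path = (string) parse_url($uri, PHP_URL_PATH);

    return preg_match('/\.html?$/i', $path) === 1;
};

// Combined check: both conditions must hold.
$isInternalHtml = function (string $uri) use ($isInternal, $isHtml): bool {
    return $isInternal($uri) && $isHtml($uri);
};
```

A plain prefix test is deliberately naive: it would also accept a sibling path such as `http://url.for/my/site-other`, so a real scope check would likely compare URI components instead.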
LastCall\Crawler\Uri\NormalizerInterface
- The normalizer is used to "fix" URIs by applying some standard formatting rules. This helps prevent duplicate URIs from being added. For example, if the crawler discovers a link to http://GOOGLE.com and a link to http://google.com, the default normalizer will lowercase the domain name, and these links will be treated as equivalent.
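The lowercasing rule from the example above can be sketched in plain PHP as follows. The function `lowercaseHost` is a hypothetical helper written for this illustration. It is not the library's normalizer, which implements `LastCall\Crawler\Uri\NormalizerInterface` and may apply additional rules.

```php
<?php
// Hypothetical helper, not part of LastCall\Crawler.
function lowercaseHost(string $uri): string
{
    $parts = parse_url($uri);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $uri; // Leave anything we cannot parse untouched.
    }

    // Keep user info as-is; only the scheme and host are case-insensitive.
    $auth = '';
    if (isset($parts['user'])) {
        $auth = $parts['user'];
        if (isset($parts['pass'])) {
            $auth .= ':' . $parts['pass'];
        }
        $auth .= '@';
    }

    $normalized = strtolower($parts['scheme']) . '://' . $auth . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $normalized .= ':' . $parts['port'];
    }
    $normalized .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $normalized .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $normalized .= '#' . $parts['fragment'];
    }

    return $normalized;
}

// Both links collapse to the same key, so only one entry is recorded.
$seen = [];
foreach (['http://GOOGLE.com', 'http://google.com'] as $link) {
    $seen[lowercaseHost($link)] = true;
}
```

The path and query are left in their original case because they can be case-sensitive on the server.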
LastCall\Crawler\Queue\RequestQueueInterface
- The queue is where requests are stored. Initially, the queue contains only a request to the base URL; subscribers add further requests as they process each page.
Psr\Log\LoggerInterface
- A PSR-3 compatible logger instance that will be used for logging request/response events, including exceptions during processing.
Doctrine\DBAL\Connection
- A Doctrine connection object. If the doctrine service exists on the container, it will be used for the queue backend. This is optional but highly recommended, because the default array backend uses a lot of memory when it has to store many requests.
There is no default doctrine definition.
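A configuration file that supplies a connection might look like the following sketch. It assumes Doctrine DBAL and the `pdo_sqlite` PHP extension are installed, and the SQLite file location is an arbitrary choice for the example. Whether the queue tables need to be created in a separate step is not covered in this document, so treat this as a starting point only.

```php
<?php
// myconfig.php
use Doctrine\DBAL\DriverManager;
use LastCall\Crawler\Configuration\Configuration;

$config = new Configuration('http://url.for/my/site');

// Register a Doctrine connection under the "doctrine" key so it can be
// picked up as the queue backend.
$config['doctrine'] = function () {
    return DriverManager::getConnection([
        'driver' => 'pdo_sqlite',
        // Assumed location for the database file.
        'path' => __DIR__ . '/crawler.sqlite',
    ]);
};

return $config;
```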
string[]
- An array of names of logging subscribers that should be activated. Logging subscribers must be available on the container at logger.ID, where ID is the name that is used to activate the subscriber.
string[]
- An array of names of URL discovery subscribers that should be activated. Discovery subscribers must be available on the container at discovery.ID, where ID is the name that is used to activate the subscriber.
string[]
- An array of names of recursor subscribers that should be activated. Recursor subscribers must be available on the container at recursor.ID, where ID is the name that is used to activate the subscriber.
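The naming convention shared by these three lists (a prefix, a dot, then the ID) can be illustrated with a small lookup. This is only a sketch of the convention: a plain array stands in for the container, the function `resolveSubscribers` is hypothetical, and the IDs `request` and `exception` are made-up names.

```php
<?php
// Stand-in for the container: a plain array keyed by service id.
// The ids "request" and "exception" are hypothetical.
$container = [
    'logger.request' => 'request logging subscriber',
    'logger.exception' => 'exception logging subscriber',
];

// Hypothetical helper: map each active name to the service at PREFIX.ID.
function resolveSubscribers(array $container, string $prefix, array $names): array
{
    $subscribers = [];
    foreach ($names as $name) {
        $key = $prefix . '.' . $name;
        if (!array_key_exists($key, $container)) {
            throw new InvalidArgumentException("No subscriber registered at $key");
        }
        $subscribers[$name] = $container[$key];
    }

    return $subscribers;
}

// Only the subscribers named in the list are activated.
$active = resolveSubscribers($container, 'logger', ['request']);
```

The same lookup would apply to the discovery and recursor lists by swapping the prefix.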