Extracting web data at scale is extremely hard. Websites change frequently and are becoming more complex, so the data collected is often inaccurate or incomplete.
The right data extraction method can preempt many data-centered issues such as:
- Data that’s out-of-date
- Data that’s inaccurate
- Data with no context
- Data without sources or lineage
- Data that’s unusable due to structure
The right data extraction method for you should also address practical concerns such as:
- Your level of technical know-how
- Your budget
- The schedule on which you need your data updated
- The quantity of data you need
- Whether you need homogeneous or heterogeneous structured data
- Whether you need data from a single site, a range of sites, or from across the whole web
Websites are becoming more and more complex, so nowadays the only way to scrape them at scale with acceptable accuracy and completeness is to render every page in a real browser, which in turn requires a distributed browser rendering engine. Writing such an engine is very hard, and PulsarRPA is the only open-source solution that provides one.
Here is a complete solution to scrape amazon.com at scale: link:https://github.com/platonai/exotic-amazon[exotic-amazon]
It's fairly simple to sign in to a website before scraping with PulsarRPA:
[source,kotlin]
----
val options = session.options(args)
// the login handler runs right after the browser launches
val loginHandler = TaobaoLoginHandler(username, password, warnUpUrl = portalUrl)
options.event.loadEvent.onAfterBrowserLaunch.addLast(loginHandler)
// load the portal page and its out pages within the signed-in browser session
session.loadOutPages(portalUrl, options)
----
The key point is to add a LoginHandler that runs just after the browser launches. The login handler opens the login page and automatically types the username, password, and any other information required to sign in to the website.
The example code can be found here: TmallCrawler.
There are many cases where we want to download resources directly, without browser rendering:
- The data is returned by an AJAX request and is easy to parse
- Scraping without browser rendering is much faster
However, such resources are often protected within a browsing session, so we cannot simply issue raw HTTP requests for them. To simulate the browsing session, we have to fetch the resources within a browsing context.
There are several ways to scrape data from an SPA:
=== Resource mode
[source,kotlin]
----
session.loadResource(url)
----
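For example, here is a minimal sketch, assuming `session` is the PulsarSession used in the earlier examples, that `loadResource` returns a WebPage, and that the URLs below are placeholders for the real portal page and AJAX endpoint:

[source,kotlin]
----
// load the portal page first so the browsing context (cookies, headers) is established
session.load("https://www.example.com/items")
// then fetch the protected AJAX resource within the same browsing context,
// instead of issuing a raw HTTP request
val page = session.loadResource("https://www.example.com/api/items?page=1")
println(page.url)
----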
=== RPA mode
[source,kotlin]
----
class RPAPaginateHandler(val initPageNumber: Int) : WebPageWebDriverEventHandler() {
    override suspend fun invoke(page: WebPage, driver: WebDriver): Any? {
        // ...
        // extract the text
        val text = driver.firstText(selector)
        // click the next page link
        driver.click(nextPageSelector)
        return null
    }
}
----
. Wait for the page to load completely. There are a number of good strategies to determine whether a page is fully loaded.
. Scroll down to the bottom of the page to make sure all AJAX requests are triggered and loaded (see the sketch after this list).
. Simulate how a human being visits the webpage.
. If some fields are still missing, consider refreshing the page.
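A hedged sketch of these strategies inside a WebDriver event handler, using the same classes as the handler above; the CSS selector and the helper names `waitForSelector` and `scrollToBottom` are assumptions that may differ between PulsarRPA versions:

[source,kotlin]
----
class WaitForContentHandler : WebPageWebDriverEventHandler() {
    override suspend fun invoke(page: WebPage, driver: WebDriver): Any? {
        // wait until a field that is filled by AJAX appears (selector is a placeholder)
        driver.waitForSelector("#price")
        // scroll to the bottom so lazily loaded blocks fire their AJAX requests
        driver.scrollToBottom()
        return null
    }
}
----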
. Construct the URLs.
. Extract the pagination URLs.
Use a ListenableHyperlink to extract links after a referer page has been fetched, as sketched below.
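A hedged sketch of the idea; the ListenableHyperlink constructor, the `onHTMLDocumentParsed` event slot, and the CSS selector are assumptions based on the event pattern shown earlier, and the URLs are placeholders:

[source,kotlin]
----
val portal = ListenableHyperlink("https://www.example.com/list")
portal.event.loadEvent.onHTMLDocumentParsed.addLast { page, document ->
    // the referer page has been fetched and parsed; extract the out links / pagination links
    document.select("a[href~=/item/]")
        .map { it.attr("abs:href") }
        .forEach { session.submit(Hyperlink(it)) }
}
session.submit(portal)
----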
TODO:
Use event handlers and the WebDriver interface to interact with the browser, for example as sketched below.
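A hedged sketch that fills a search form through the WebDriver interface; the selectors are placeholders and the `type` helper is an assumption about the WebDriver API:

[source,kotlin]
----
class SearchHandler(private val keyword: String) : WebPageWebDriverEventHandler() {
    override suspend fun invoke(page: WebPage, driver: WebDriver): Any? {
        // type a keyword into the search box and submit (selectors are placeholders)
        driver.type("input#search", keyword)
        driver.click("button[type=submit]")
        // read back the first result title
        return driver.firstText(".result .title")
    }
}
----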
. Basic: call session.submit() with a priority parameter.
. Advanced: use globalCache.urlPool for complete control (see the sketch after this list).
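A hedged sketch of both approaches; the mutable `priority` field and its sign convention, the access path to the global cache, and the queue names are assumptions about the PulsarRPA API, and the URLs are placeholders:

[source,kotlin]
----
// Basic: submit a link with a priority
val link = Hyperlink("https://www.example.com/item/1")
link.priority = -1000 // assumed: lower values are scheduled earlier
session.submit(link)

// Advanced: put links into the global url pool for complete control over scheduling
// (access path and queue names are assumptions)
val urlPool = session.context.globalCache.urlPool
urlPool.normalCache.reentrantQueue.add(Hyperlink("https://www.example.com/item/2"))
----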
TODO:
In a busy crawl system, millions of pages might be scheduled, and all the tasks have to be finished before midnight because they have to be refreshed the next day. In such a case, every unfinished task has to be dropped before 24:00.
Once a task is configured with the load option `-deadTime`, it will be dropped as soon as possible once the current time passes the dead time.
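A hedged sketch; the ISO-8601 local date-time format of the `-deadTime` value is an assumption, and the URL is a placeholder:

[source,kotlin]
----
import java.time.LocalDate

// drop the task if it is still unfinished at the coming midnight
val deadTime = LocalDate.now().plusDays(1).atStartOfDay()
val options = session.options("-deadTime $deadTime")
session.load("https://www.example.com/item/1", options)
----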
TODO:
. Check the metrics.
. Check the logs.
We will release a complete solution to crawl the entire Amazon website: exotic-amazon.