Releases: apify/crawlee
v0.21.4
The request statistics that you may remember from logs are now persisted in the key-value store,
so you won't lose count when your actor restarts. We've also added many more
stats there, which can be useful to you after a run finishes. Besides that,
we fixed some bugs and annoyances and improved the TypeScript experience a bit.
- Add persistence to `Statistics` class and automatically persist it in `BasicCrawler`.
- Fix issue where inaccessible Apify Proxy would cause `ProxyConfiguration` to throw a timeout error.
- Update default user agent to Chrome 85
- Bump Puppeteer to 5.2.1 which uses Chromium 85
- TypeScript: Fix `RequestAsBrowserOptions` missing some values and add `RequestQueueInfo` as a return value from `requestQueue.getInfo()`
v0.21.3
v0.21.2
v0.21.1
We fixed some bugs, improved a few things and bumped Puppeteer to match latest Chrome 84.
- Allow `Apify.createProxyConfiguration` to be used seamlessly with the proxy component of Actor Input UI.
- Fix integration of plugins into `CheerioCrawler` with the `crawler.use()` function.
- Fix a race condition which caused `RequestQueueLocal` to fail handling requests.
- Fix broken debug logging in `SessionPool`.
- Improve `ProxyConfiguration` error message for missing password / token.
- Update Puppeteer to 5.2.0
- Improve docs, update packages and so on.
v0.21.0
This release comes with breaking changes that will affect most, if not all, of your projects. See the migration guide for more information and examples.
The first large change is a redesigned proxy configuration. `Cheerio` and `Puppeteer` crawlers now accept a `proxyConfiguration` parameter, which is an instance of `ProxyConfiguration`. This class now exclusively manages both Apify Proxy and custom proxies. Visit the new proxy management guide.
We also removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection and changed the default values for empty properties in `Request` instances.
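To make the proxy redesign concrete, here is a minimal sketch of how the removed options could be mapped onto the new configuration. The helper `migrateProxyOptions` is hypothetical and not part of the SDK, and the `groups` option name passed to `Apify.createProxyConfiguration()` is an assumption, so check it against the proxy management guide before relying on it.

```javascript
// Hypothetical migration helper (not part of the SDK): translates the removed
// useApifyProxy / apifyProxyGroups / apifyProxySession options into
// (a) an options object for Apify.createProxyConfiguration(), and
// (b) a session ID that can be passed to proxyConfiguration.newUrl().
function migrateProxyOptions({ useApifyProxy, apifyProxyGroups, apifyProxySession } = {}) {
    // Apify Proxy was opt-in before, so no flag means no proxy configuration.
    if (!useApifyProxy) return null;

    const options = {};
    if (Array.isArray(apifyProxyGroups) && apifyProxyGroups.length > 0) {
        // Assumption: the new option carrying proxy groups is called `groups`.
        options.groups = [...apifyProxyGroups];
    }
    return { options, sessionId: apifyProxySession };
}

// Intended use inside an actor (left as comments because it needs the `apify` package):
// const migrated = migrateProxyOptions(oldOptions);
// const proxyConfiguration = await Apify.createProxyConfiguration(migrated.options);
// const crawler = new Apify.CheerioCrawler({ proxyConfiguration, /* other options */ });
// const proxyUrl = proxyConfiguration.newUrl(migrated.sessionId);
```

The helper returns `null` when Apify Proxy was not enabled, so callers can simply skip creating a `ProxyConfiguration` in that case.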
- BREAKING: Removed `Apify.getApifyProxyUrl()`. To get an Apify Proxy url, use `proxyConfiguration.newUrl([sessionId])`.
- BREAKING: Removed `useApifyProxy`, `apifyProxyGroups` and `apifyProxySession` parameters from all applications in the SDK. Use `proxyConfiguration` in crawlers and `proxyUrl` in `requestAsBrowser` and `Apify.launchPuppeteer`.
- BREAKING: Removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection.
- BREAKING: `Request` instances no longer initialize empty properties with `null`, which means that:
  - empty `errorMessages` are now represented by `[]`, and
  - empty `loadedUrl`, `payload` and `handledAt` are `undefined`.
- Add `Apify.createProxyConfiguration()` `async` function to create `ProxyConfiguration` instances. `ProxyConfiguration` itself is not exposed.
- Add `proxyConfiguration` to `CheerioCrawlerOptions` and `PuppeteerCrawlerOptions`.
- Add `proxyInfo` to `CheerioHandlePageInputs` and `PuppeteerHandlePageInputs`. You can use this object to retrieve information about the currently used proxy in `Puppeteer` and `Cheerio` crawlers.
- Add click buttons and scroll up options to `Apify.utils.puppeteer.infiniteScroll()`.
- Fixed a bug where intercepted requests would never continue.
- Fixed a bug where `Apify.utils.requestAsBrowser()` would get into redirect loops.
- Fix `Apify.utils.getMemoryInfo()` crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.
- Update Puppeteer to 3.3.0.
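Because empty `Request` properties changed from `null` to `undefined` (and empty `errorMessages` to `[]`), code that compares against `null` with strict equality may silently stop matching. The sketch below shows one way to write checks that behave the same for requests created before and after this release. The helpers `isEmptyProp` and `summarizeRequest` are hypothetical names used only for illustration.

```javascript
// Returns true for both the old (null) and the new (undefined) representation
// of an empty Request property.
function isEmptyProp(value) {
    return value == null; // loose equality matches null and undefined only
}

// Hypothetical helper: summarizes a Request-like object in a way that does not
// depend on which SDK version produced it.
function summarizeRequest(request) {
    return {
        loaded: !isEmptyProp(request.loadedUrl),
        handled: !isEmptyProp(request.handledAt),
        // Falls back to an empty array when errorMessages is null or undefined.
        errorCount: (request.errorMessages || []).length,
    };
}
```

For example, `summarizeRequest({ loadedUrl: null, errorMessages: null })` and `summarizeRequest({ errorMessages: [] })` both report an unloaded, unhandled request with zero errors.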
v0.20.4
- Add `Apify.utils.waitForRunToFinish()` which simplifies waiting for an actor run to finish.
- Add standard prefixes to log messages to improve readability and orientation in logs.
- Add support for `async` handlers in `Apify.utils.puppeteer.addInterceptRequestHandler()`
- EXPERIMENTAL: Add `cheerioCrawler.use()` function to enable attaching a `CrawlerExtension` (a plugin that extends functionality) to the crawler to modify its behavior.
- Fix bug with cookie expiry in `SessionPool`.
- Fix issues in documentation.
- Updated `@apify/http-request` to fix issue in the `proxy-agent` package.
- Updated Puppeteer to 3.0.2
v0.20.3
- DEPRECATED: `CheerioCrawlerOptions.requestOptions` is now deprecated. Please use `CheerioCrawlerOptions.prepareRequestFunction` instead.
- Add `limit` option to `Apify.utils.enqueueLinks()` for situations when full crawls are not needed.
- Add `suggestResponseEncoding` and `forceResponseEncoding` options to `CheerioCrawler` to allow users to provide a fall-back or forced encoding of responses in situations where websites serve invalid encoding information in their headers.
- Add a number of new examples and update existing ones to documentation.
- Fix duplicate file extensions in `Apify.utils.puppeteer.saveSnapshot()` when used locally.
- Fix encoding of multi-byte characters in `CheerioCrawler`.
- Fix formatting of navigation buttons in documentation.
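The difference between a fall-back and a forced encoding can be sketched as a small precedence rule. This is only an illustration of the semantics described for `suggestResponseEncoding` and `forceResponseEncoding`, not the SDK's implementation. The function `resolveEncoding`, the `headerEncoding` field, the `isSupported` predicate and the final `utf-8` default are all assumptions made for the example.

```javascript
// Illustrative precedence only (not the SDK's code):
// 1. a forced encoding always wins,
// 2. otherwise a valid encoding from the response headers is used,
// 3. otherwise the suggested encoding acts as a fall-back,
// 4. otherwise an assumed default of 'utf-8' is returned.
function resolveEncoding(
    { headerEncoding, suggestResponseEncoding, forceResponseEncoding } = {},
    // Assumption: "valid" means Node can decode it; a real crawler may support more encodings.
    isSupported = (encoding) => Buffer.isEncoding(encoding),
) {
    if (forceResponseEncoding) return forceResponseEncoding;
    if (headerEncoding && isSupported(headerEncoding)) return headerEncoding;
    if (suggestResponseEncoding) return suggestResponseEncoding;
    return 'utf-8';
}
```

Under these assumptions, a site that declares a bogus charset would be decoded with the suggested encoding, while a forced encoding overrides even a valid header.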
v0.20.2
v0.20.0
- BREAKING: `Apify.utils.requestAsBrowser()` no longer aborts the request on status code 406 or when other than `text/html` type is received. Use `options.abortFunction` if you want to retain this functionality.
- BREAKING: Added `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()` which is `true` by default and forces the function to use an HTTP parser that is less strict than the default Node 12 parser, but also less secure. It is needed to be able to bypass certain anti-scraping walls and fetch websites that do not comply with the HTTP spec.
- BREAKING: `RequestList` now removes all the elements from the `sources` array on initialization. If you need to use the sources somewhere else, make a copy. This change was added as one of several measures to improve memory management of `RequestList` in scenarios with a very large number of `Request` instances.
- DEPRECATED: `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestListOptions.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources and will only be called if persisted `Requests` were not retrieved from key-value store. Use it to reduce memory spikes and also to make sure that your sources are not re-created on actor restarts.
- Updated `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) and supports overriding with custom loggers.
- Improved `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing more than required options to `puppeteer.launch()`.
- Documented `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute. This ensures that after a crawler finishes, the state is correctly persisted.
- Added TypeScript typings and typedef documentation for all entities used throughout SDK.
- Upgraded `proxy-chain` NPM package from 0.2.7 to 0.4.1 and many other dependencies
- Removed all usage of the now deprecated `request` package.
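If you relied on `Apify.utils.requestAsBrowser()` aborting on status code 406 or on non-HTML responses, an `abortFunction` along these lines could restore that behavior. This is a sketch under assumptions: it assumes the function is called with a response object exposing `statusCode` and `headers` (as Node's `http.IncomingMessage` does) and that returning `true` aborts the request. The name `abortOnNonHtml` is made up for the example.

```javascript
// Sketch of an abortFunction that mimics the pre-0.20.0 behavior.
// Assumption: `response` looks like Node's http.IncomingMessage.
function abortOnNonHtml(response) {
    const { statusCode, headers = {} } = response;

    // Abort on "406 Not Acceptable", as the function used to do.
    if (statusCode === 406) return true;

    // Strip parameters such as "; charset=utf-8" before comparing the MIME type.
    const contentType = String(headers['content-type'] || '').toLowerCase();
    const mimeType = contentType.split(';')[0].trim();

    // Design choice of this sketch: a missing Content-Type header is treated
    // as non-HTML, so the request is aborted.
    return mimeType !== 'text/html';
}

// Intended use (left as a comment because it needs the `apify` package):
// const response = await Apify.utils.requestAsBrowser({ url, abortFunction: abortOnNonHtml });
```

Whether a response without a `Content-Type` header should be aborted is a judgment call. Flip the final comparison if you would rather let such responses through.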