- Based on the Amazon Cell Phones Reviews dataset project
- Scrapes multiple categories and saves the results into one combined file and separate per-category files
- Scrapes basic metadata with ratings and reviews
- Uses multiple Puppeteer pages as workers
- Configurable timeout for rate-limit cooldowns (read more below)
Because Lazada's servers limit unusual requests, this scraper uses only one worker to scrape search results, while the review scraping process is set to five workers with a five-second timeout.
More detailed documentation on this issue coming soon...
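The worker-and-cooldown arrangement described above can be sketched as a small task pool. This is a minimal illustration, not the project's actual code: `runWithWorkers`, `sleep`, and their parameters are hypothetical names, and the sketch assumes the timeout is applied by each worker between its own tasks.

```typescript
// Hypothetical helpers for illustration; not taken from the project's source.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `handler` over `tasks` with at most `workerCount` tasks in flight.
// Each worker waits `cooldownMs` after finishing a task before claiming another.
async function runWithWorkers<T, R>(
  tasks: T[],
  workerCount: number,
  cooldownMs: number,
  handler: (task: T, workerId: number) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array<R>(tasks.length);
  let next = 0; // index of the next unclaimed task, shared by all workers

  const worker = async (workerId: number): Promise<void> => {
    while (next < tasks.length) {
      const index = next++; // claimed synchronously, so no two workers share a task
      results[index] = await handler(tasks[index], workerId);
      if (next < tasks.length) {
        await sleep(cooldownMs); // cool down before this worker's next request
      }
    }
  };

  const count = Math.max(1, Math.min(workerCount, tasks.length));
  const workers: Promise<void>[] = [];
  for (let id = 0; id < count; id++) {
    workers.push(worker(id));
  }
  await Promise.all(workers);
  return results; // same order as `tasks`
}
```

With the settings mentioned above, review scraping would correspond to something like `runWithWorkers(itemUrls, 5, 5000, scrapeReviews)` and search scraping to a `workerCount` of `1`, where `itemUrls` and `scrapeReviews` are placeholders. In a Puppeteer setup, each worker would typically own one page opened with `browser.newPage()`.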
You can download pre-scraped datasets at Kaggle (Lazada Indonesian Reviews).
- `puppeteer` for browser-based scraping
- `prettier` for formatting source code
- `ts-node` for running TypeScript scripts
- Make sure the dependencies are downloaded by running `npm install` or `yarn`.
- Copy `config.default.ts` (this file is ignored by Git) to `config.ts` and customize the config variables in `config.ts`.
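For orientation, a `config.ts` could look roughly like the sketch below. The field names and the category value are assumptions made for illustration; the real names are whatever `config.default.ts` defines. Only the worker counts and the five-second timeout come from the description in this README.

```typescript
// Hypothetical shape: the actual fields in config.default.ts may differ.
interface ScraperConfig {
  categories: string[]; // category identifiers to scrape (placeholder values below)
  searchWorkers: number; // Puppeteer pages used for search results
  reviewWorkers: number; // Puppeteer pages used for item reviews
  cooldownTimeoutMs: number; // wait time applied for rate-limit cooldowns
}

export const config: ScraperConfig = {
  categories: ["example-category"], // placeholder, replace with real categories
  searchWorkers: 1, // search results are scraped with a single worker
  reviewWorkers: 5, // reviews are scraped with five workers
  cooldownTimeoutMs: 5000, // five-second timeout
};
```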
- Open the project directory in Visual Studio Code.
- Select and execute Scrape Search Results in the launch options on the Debug tab (exported to `./data/yyyymmdd-category-items.csv` and `./data/yyyymmdd-items.csv`).
- Then select and execute Scrape Item Reviews (exported to `./data/yyyymmdd-category-reviews.csv` and `./data/yyyymmdd-reviews.csv`).
- Run `npm run scrape:items` or `yarn scrape:items` first to scrape initial item results (exported to `./data/yyyymmdd-category-items.csv` and `./data/yyyymmdd-items.csv`).
- Then run `npm run scrape:reviews` or `yarn scrape:reviews` to scrape item reviews (exported to `./data/yyyymmdd-category-reviews.csv` and `./data/yyyymmdd-reviews.csv`).
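The export naming above (one combined file plus one file per category) can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes `yyyymmdd` is the local date of the run and `category` is the category identifier. The project's own implementation may differ.

```typescript
type ExportKind = "items" | "reviews";

// Formats a date as yyyymmdd (local time is an assumption here).
function dateStamp(date: Date): string {
  const mm = ("0" + (date.getMonth() + 1)).slice(-2);
  const dd = ("0" + date.getDate()).slice(-2);
  return `${date.getFullYear()}${mm}${dd}`;
}

// Builds ./data/yyyymmdd-items.csv, or ./data/yyyymmdd-category-items.csv
// when a category is given.
function exportPath(date: Date, kind: ExportKind, category?: string): string {
  const parts = category
    ? [dateStamp(date), category, kind]
    : [dateStamp(date), kind];
  return `./data/${parts.join("-")}.csv`;
}

// Assigns every row to the combined file and to its own category file.
function groupByExportPath<T extends { category: string }>(
  rows: T[],
  date: Date,
  kind: ExportKind
): Record<string, T[]> {
  const files: Record<string, T[]> = {};
  const add = (path: string, row: T): void => {
    if (!files[path]) files[path] = [];
    files[path].push(row);
  };
  for (const row of rows) {
    add(exportPath(date, kind), row);
    add(exportPath(date, kind, row.category), row);
  }
  return files;
}
```

Each key of the returned object is a target file path and each value is the list of rows that would be written to it, so two categories produce three files: two per-category files and one combined file.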
- `scrape:items`: Scrapes and saves entry results for review scraping.
- `scrape:reviews`: Scrapes and saves entry reviews based on `scrape:items` data.
- `format`: Formats all `.ts` files.
- `format:data`: Formats `.json` files in `/data`.