PulsarRPAPro is the professional version of PulsarRPA, featuring an upgraded server, a collection of top e-commerce site scraping examples, and an advanced AI-powered applet for automatic data extraction.
Never write another web scraper. PulsarRPAPro learns from the website and delivers web data completely and accurately at scale.
There are already dozens of scraping cases for the most popular websites, and we are constantly adding more.
Bilibili: https://www.bilibili.com/video/BV1kM2rYrEFC
- Fully Automated Web Data Extraction——No Rules, Just Results!
- Web spider: browser rendering, ajax data crawling
- High performance: optimized for rendering hundreds of pages in parallel on a single machine without being blocked
- Low cost: scrape 100,000 browser-rendered e-commerce webpages or millions of data points daily with only 8-core CPU/32GB memory
- Web UI: a simple yet powerful web interface to manage spiders and download data
- Machine learning: automatically extract every field in webpages using unsupervised machine learning, generating extraction rules and SQLs
- Data quality assurance: smart retry, accurate scheduling, web data lifecycle management
- Large scale: fully distributed, designed for large-scale crawling
- Simple API: a single line of code to scrape, or a single SQL query to turn a website into a table
- X-SQL: extended SQL to manage web data — web crawling, scraping, content mining, and web BI
- Bot stealth: IP rotation, web driver stealth, and anti-ban mechanisms
- RPA: simulate human behaviors, SPA crawling, or perform other advanced tasks
- Big data: supports various backend storage systems like MongoDB, HBase, and Gora
- Logs & metrics: all events are monitored and recorded for detailed tracking
- Memory 4G+
- JDK 17+
- Google Chrome 90+
- MongoDB started
Download the latest executable jar:
wget http://static.platonic.fun/repo/ai/platon/exotic/PulsarRPAPro.jar
# start MongoDB
docker-compose -f docker/docker-compose.yaml up
java -jar PulsarRPAPro.jar
java -jar PulsarRPAPro.jar harvest "https://www.amazon.com/b?node=1292115011" -diagnose -refresh
Add the following lines to your .m2/settings.xml
:
<mirrors>
<mirror>
<id>maven-default-http-blocker</id>
<mirrorOf>dummy</mirrorOf>
<name>Dummy mirror to override default blocking mirror that blocks http</name>
<url>http://0.0.0.0/</url>
</mirror>
</mirrors>
git clone https://github.com/platonai/PulsarRPAPro.git
cd PulsarRPAPro
./mvnw clean && ./mvnw
cd PulsarRPAPro/target/
# Don't forget to start MongoDB
docker-compose -f docker/docker-compose.yaml up
For Chinese developers, we strongly suggest following this guide to accelerate the build process.
java -jar PulsarRPAPro.jar serve
If PulsarRPAPro is running in GUI mode, the web console should open within a few seconds, or you can open it manually at:
http://localhost:2718/exotic/crawl/
You can use the harvest
command to learn from a set of item pages using unsupervised machine learning.
java -jar PulsarRPAPro.jar harvest "https://www.amazon.com/b?node=1292115011" -diagnose -refresh
The URL in the command should be a portal URL, such as a product listing page URL.
PulsarRPAPro will visit the portal, identify the optimal set of links for item pages, retrieve those pages, and analyze them.
Here is the full page of the auto extraction result in HTML format:
Auto Extraction Result of Amazon
Run the executable jar directly for help and to explore more features:
java -jar PulsarRPAPro.jar
This command will print the help message and some of the most useful examples.
Q: How to use proxies?
A: Follow this guide for proxy rotation.