Docs: 1. improve AI-extraction docs
platonai committed Apr 5, 2024
1 parent f01c486 commit b4072af
Showing 4 changed files with 68 additions and 11 deletions.
9 changes: 5 additions & 4 deletions docs/get-started/14AI-extraction.md
@@ -2,9 +2,8 @@ AI Automated Extraction
=======================

Platon.ai's goal is to develop an AI that efficiently collects and reads complex websites, accurately outputting data
and knowledge. We have open-sourced the "efficient collection" component. The "reading comprehension" component is
a long and arduous task. We have released a preview version that "reads and understands webpage structures and
accurately outputs data," which will also be open-sourced in the near future.
and knowledge. We have open-sourced the "efficient collection" component. We have also released a preview version that
"reads and understands webpage structures and accurately outputs data," which will also be open-sourced in the near future.

Platon.ai's algorithm can transform web pages into data with zero human intervention -- without the need for rules,
and even without machine learning training. It is driven by unsupervised machine learning, similar to how humans read
@@ -34,7 +33,9 @@ Furthermore, given any list page, we can evaluate the linked pages to detect whi
java -jar exotic-standalone*.jar arrange https://www.hua.com/flower/
```

In this way, the problem of web page extraction that originally required manually writing several or even dozens of regular expressions or CSS PATHs can now be solved by simply telling the system the list page link, and web pages that meet this requirement account for the vast majority of web pages on the internet.
In this way, web page extraction that originally required manually writing several or even dozens of
regular expressions or CSS paths can now be solved by simply giving the system the link to a list page. Web pages that
meet this requirement account for the vast majority of web pages on the internet.
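
To make the "just give the system a list page link" idea concrete, here is a minimal sketch of a much simpler heuristic. It is not Platon.ai's algorithm: it only collects the links on a list page and keeps the ones whose URL path shape is most common, on the assumption that item pages share one URL pattern. The names `LinkCollector`, `path_shape`, and `dominant_links`, the `example.com` URLs, and the sample HTML are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute link targets from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the list page URL.
                self.links.append(urljoin(self.base_url, href))


def path_shape(url):
    """Replace digit runs with a placeholder so similar item URLs share one shape."""
    return re.sub(r"\d+", "{n}", urlparse(url).path)


def dominant_links(list_page_url, html):
    """Return the links whose path shape is the most common one on the page."""
    collector = LinkCollector(list_page_url)
    collector.feed(html)
    shapes = Counter(path_shape(u) for u in collector.links)
    if not shapes:
        return []
    top_shape, _ = shapes.most_common(1)[0]
    return [u for u in collector.links if path_shape(u) == top_shape]


# Hypothetical list page content, used instead of a real network fetch.
SAMPLE_HTML = """
<a href="/about.html">About</a>
<a href="/product/9010011.html">Rose</a>
<a href="/product/9012345.html">Lily</a>
<a href="/product/9010966.html">Tulip</a>
"""

items = dominant_links("https://example.com/flower/", SAMPLE_HTML)
print(items)
```

A real system would go much further, for example by inspecting the linked pages themselves. The sketch only shows why a single list page URL can be enough input to locate candidate item pages.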

Finally, we have equipped the crawler system and data analysis system with an SQL engine, so we can monitor a website column and extract key data in real-time with just one SQL statement. In fact, with the SQL engine, the internet and local databases can almost be treated as the same (except for the longer response time of internet data).

38 changes: 35 additions & 3 deletions docs/get-started/1home.md
@@ -21,11 +21,43 @@ Catalogue

------

[PulsarRPA](https://github.com/platonai/PulsarRPA) is the ultimate open-source solution for large-scale web data collection, capable of meeting almost all scales and types of web data collection needs.
💖 PulsarRPA is All You Need! 💖

Large-scale extraction of web data is very challenging. **Websites often change and become increasingly complex, which means that collected web data is often inaccurate or incomplete**. PulsarRPA has developed a series of cutting-edge technologies to address these issues.
[PulsarRPA](https://github.com/platonai/PulsarRPA) is a high-performance, distributed, open-source Robotic Process Automation (RPA) framework. It is designed to handle large-scale RPA tasks with ease, providing a comprehensive solution for browser automation, web content understanding, and data extraction.

PulsarRPA represents the pinnacle of open-source solutions for large-scale web data extraction, leveraging the power of high-performance, distributed RPA. It addresses the inherent challenges of browser automation and extracting accurate, comprehensive web data amidst rapidly evolving and increasingly intricate websites.

*Challenges in Large-Scale Web Data Extraction:*

1. Frequent Website Changes: Online platforms continuously update their layouts, structures, and content, making it difficult to maintain reliable extraction processes over time. Traditional scraping tools may struggle to adapt promptly to these changes, leading to outdated or irrelevant data.
2. Complex Website Architecture: Modern websites often employ sophisticated design patterns, dynamic content loading, and advanced security measures, presenting formidable obstacles for conventional scraping techniques. Extracting data from such sites requires deep understanding of their structure and behavior, as well as the ability to interact with them as a human user would.

*PulsarRPA: A Game-Changer in Web Data Collection*

To conquer these challenges, PulsarRPA incorporates a suite of innovative technologies that ensure efficient, accurate, and scalable web data extraction:

1. **Browser Rendering:** Utilizes browser rendering and AJAX data crawling to extract content from websites.
2. **RPA (Robotic Process Automation):** Employs human-like behaviors to interact with webpages, enabling data collection from modern, complex websites.
3. **Intelligent Scraping:** PulsarRPA employs intelligent scraping technology that can automatically recognize and understand web content, ensuring accurate and timely data extraction. Utilizing smart algorithms and machine learning techniques, PulsarRPA can independently learn and apply data extraction models, significantly improving the efficiency and accuracy of data retrieval.
4. **Advanced DOM Parsing:** Leveraging advanced Document Object Model (DOM) parsing techniques, PulsarRPA can navigate complex website architectures with ease. It accurately identifies and extracts data from elements in modern web pages, handles dynamic content rendering, and bypasses anti-scraping measures, delivering complete and accurate datasets despite website intricacies.
5. **Distributed Architecture:** Built on a distributed architecture, PulsarRPA harnesses the combined processing power of multiple nodes to handle large-scale extraction tasks efficiently. This allows for parallel crawling, faster data retrieval, and seamless scalability as your data requirements grow, without compromising performance or reliability.
6. **Open-Source & Customizable:** As an open-source solution, PulsarRPA offers unparalleled flexibility and extensibility. Developers can easily customize its components, integrate with existing systems, or contribute new features to meet specific project requirements.

In summary, PulsarRPA combines web content understanding, intelligent scraping, advanced DOM parsing, distributed processing, and an open-source codebase, which makes it a preferred open-source solution for large-scale web data extraction. This combination helps users handle the complexity of extracting valuable web data at scale, supporting better-informed decisions and a competitive advantage.
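
The parallel crawling mentioned under Distributed Architecture can be illustrated on a small scale. The sketch below is not PulsarRPA code and does not use its API: it runs a stand-in `fetch` function over several URLs with a thread pool on a single machine, whereas a distributed system would spread the same work across multiple nodes. The function names and `example.com` URLs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real page fetch; returns a fake record instead of doing network I/O."""
    return {"url": url, "length": len(url)}


def crawl_in_parallel(urls, max_workers=4):
    """Fetch pages concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        return list(pool.map(fetch, urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
results = crawl_in_parallel(urls)
for record in results:
    print(record["url"], record["length"])
```

Replacing `fetch` with a real browser-driven page load is where most of the actual work lies. The pool only shows how independent pages can be processed side by side.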










We provide a wealth of collection examples for top-tier sites, from beginner to advanced, covering various collection
patterns, including **full-site collection** code for top sites and examples for sites with the toughest anti-crawling
measures. You can find a code example, make some changes, and integrate it into your own project:

We provide a wealth of top-tier site collection examples, from beginner to senior, including various collection patterns, including top-site **full-site collection** code, and collection examples of sites with anti-crawling ceilings. You can find a code example, make some changes, and use it for your own project:

- [Exotic Amazon](https://github.com/platonai/exotic-amazon) - A real project for full-site data collection of a top e-commerce website.
- [Exotic Walmart](https://github.com/platonai/exotic/tree/main/exotic-app/exotic-OCR-examples/src/main/kotlin/ai/platon/exotic/examples/sites/walmart) - A data collection example of a top e-commerce website.
4 changes: 2 additions & 2 deletions docs/get-started/zh/14AI-extraction.md
@@ -1,8 +1,8 @@
AI Automated Extraction
=

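Platon.ai's goal is to develop an AI that efficiently collects, reads, and understands complex websites, outputting data and knowledge completely and accurately. So far we have open-sourced the "efficient collection" part. The "reading comprehension" part is a long-term
and arduous task. We have released a [preview version](https://github.com/platonai/PulsarRPAPro#run-auto-extract) that "reads and understands **webpage structures** and outputs data completely and accurately," and this version will also be open-sourced in the near future.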
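Platon.ai's goal is to develop an AI that efficiently collects, reads, and understands complex websites, outputting data and knowledge completely and accurately. So far we have open-sourced the "efficient collection" part, and we have also released a [preview version](https://github.com/platonai/PulsarRPAPro#run-auto-extract) that "reads and understands
**webpage structures** and outputs data completely and accurately," and this version will also be open-sourced in the near future.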

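Platon.ai's algorithm can turn web pages into data with zero human intervention -- no rules to configure, and not even machine learning training. It is driven by unsupervised machine learning and reads and understands the internet the way a person does.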

28 changes: 26 additions & 2 deletions docs/get-started/zh/1home.md
@@ -21,9 +21,33 @@

------

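[PulsarRPA](https://github.com/platonai/PulsarRPA) ([mirror in China](https://gitee.com/platonai_galaxyeye/PulsarRPA)) is the ultimate open-source solution for large-scale web data collection, able to meet web data collection needs of almost any scale and nature.
💖 PulsarRPA - Your All-Round Automation Solution! 💖

[PulsarRPA](https://github.com/platonai/PulsarRPA) ([mirror in China](https://gitee.com/platonai_galaxyeye/PulsarRPA)) is a high-performance, distributed, open-source Robotic Process Automation (RPA) framework. It is designed to handle large-scale RPA tasks with ease, providing a comprehensive solution for browser automation, web content understanding, and data extraction.

As a top-tier open-source solution for large-scale web data extraction, PulsarRPA combines the strengths of high-performance, distributed RPA. It aims to address the inherent challenges of browser automation and of extracting accurate, comprehensive web data from rapidly evolving and increasingly complex websites.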

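*Challenges in Large-Scale Web Data Extraction*

1. **Frequent Website Changes**: Online platforms continuously update their layouts, structures, and content, making it difficult to maintain reliable extraction processes over time. Traditional scraping tools may struggle to adapt promptly to these changes, leading to outdated or irrelevant data.
2. **Complex Website Architecture**: Modern websites often employ sophisticated design patterns, dynamic content loading, and advanced security measures, presenting formidable obstacles for conventional scraping techniques. Extracting data from such sites requires a deep understanding of their structure and behavior, as well as the ability to interact with them as a human user would.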

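*PulsarRPA: Reinventing Web Data Collection*

To meet these challenges, PulsarRPA integrates a number of innovative technologies that ensure efficient, accurate, and scalable web data extraction:

1. **Browser Rendering:** Utilizes browser rendering and AJAX data crawling to extract content from websites.
2. **RPA (Robotic Process Automation):** Employs human-like behaviors to interact with webpages, enabling data collection from modern, complex websites.
3. **Intelligent Scraping:** PulsarRPA employs intelligent scraping technology that can automatically recognize and understand web content, ensuring accurate and timely data extraction. Utilizing smart algorithms and machine learning techniques, PulsarRPA can independently learn and apply data extraction models, significantly improving the efficiency and accuracy of data retrieval.
4. **Advanced DOM Parsing:** Leveraging advanced Document Object Model (DOM) parsing techniques, PulsarRPA can navigate complex website structures with ease. It accurately identifies and extracts data from elements in modern web pages, handles dynamic content rendering, and bypasses anti-scraping measures, delivering complete and accurate datasets despite website complexity.
5. **Distributed Architecture:** Built on a distributed architecture, PulsarRPA handles large-scale extraction tasks efficiently by using the combined computing power of multiple nodes. This enables parallel crawling and fast data retrieval, and it scales seamlessly as data requirements grow, without compromising performance or reliability.
6. **Open-Source & Customizable:** As an open-source solution, PulsarRPA offers unparalleled flexibility and extensibility. Developers can easily customize its components, integrate with existing systems, or contribute new features to meet specific project requirements.

In summary, with its web content understanding, intelligent scraping, advanced DOM parsing, distributed processing, and open-source nature, PulsarRPA is the preferred open-source solution for large-scale web data extraction. Its unique combination of technologies allows users to handle the complexities and challenges of extracting valuable web data at scale, ultimately supporting better-informed decisions and a competitive advantage.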




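Large-scale extraction of web data is very difficult. **Websites often change and become increasingly complex, which means that collected web data is often inaccurate or incomplete**. PulsarRPA has developed a series of cutting-edge technologies to address these issues.

We provide a large number of collection examples for top-tier sites, from beginner to advanced, covering various collection patterns, including **full-site collection** code for top sites and examples for sites with the toughest anti-crawling measures. You can find a code example, make some changes, and use it for your own project: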
