Commit

Docs: 1. improve AI-extraction docs
platonai committed Apr 5, 2024
1 parent ce1d388 commit f01c486
Showing 2 changed files with 7 additions and 6 deletions.
7 changes: 3 additions & 4 deletions docs/get-started/14AI-extraction.md
@@ -10,10 +10,9 @@ Platon.ai's algorithm can transform web pages into data with 100% zero human int
and even without machine learning training. It is driven by unsupervised machine learning, similar to how humans read
and understand the internet.

-After rendering each web page in a browser, we use JavaScript to calculate a series of properties for each web page
-element, mainly including the element's position and size. At the same time, we construct more interesting implicit
-features of web page elements, such as topological and semantic features.
-Thus, **a web page can be visualized as a geometric graph composed of many rectangles with attributes, and when
+We calculate a series of features for each element on a webpage after rendering it in a browser, including visual,
+geometric, topological, and semantic features.
+**A web page can be considered as a geometric graph composed of many rectangles with attributes, and when
combined, it resembles a bundle of newspapers. The World Wide Web (WWW) can be viewed as a fiber bundle with a
three-dimensional manifold as the base space.**
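
The idea of treating a rendered page as rectangles with attributes could be sketched roughly as follows. This is an illustration only, not Platon.ai's algorithm: the `Box` class, the `features` function, and the handful of features chosen here are all hypothetical, and in a real pipeline the rectangles would come from a page rendered in a browser rather than being built by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rendered element reduced to a rectangle with attributes (hypothetical model)."""
    tag: str
    left: float
    top: float
    width: float
    height: float
    text: str = ""
    children: list["Box"] = field(default_factory=list)


def features(box: Box, depth: int = 0) -> list[dict]:
    """Walk the element tree and emit one feature row per element."""
    row = {
        "tag": box.tag,
        # Geometric features: position and size of the rectangle.
        "left": box.left,
        "top": box.top,
        "width": box.width,
        "height": box.height,
        "area": box.width * box.height,
        # Topological features: where the element sits in the tree.
        "depth": depth,
        "child_count": len(box.children),
        # A crude stand-in for a semantic feature (assumption, for illustration).
        "text_length": len(box.text),
    }
    rows = [row]
    for child in box.children:
        rows.extend(features(child, depth + 1))
    return rows


# A tiny hand-built "page": a body containing a heading and a paragraph.
page = Box("body", 0, 0, 800, 600, children=[
    Box("h1", 20, 10, 760, 40, text="Title"),
    Box("p", 20, 60, 760, 100, text="Some paragraph text"),
])
rows = features(page)
```

Each row is a flat record, so the whole page becomes a table of rectangles that downstream code can compare or cluster without looking at the markup again.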

6 changes: 4 additions & 2 deletions docs/get-started/zh/14AI-extraction.md
@@ -1,11 +1,13 @@
AI Automatic Extraction
=

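-Platon.ai's goal is to develop an AI that efficiently crawls, reads, and understands complex websites, and outputs data and knowledge completely and accurately. So far we have open-sourced the "efficient crawling" part; the "reading comprehension" part is a long-term and arduous task. We have released a [preview version](https://github.com/platonai/PulsarRPAPro#run-auto-extract) that "reads and understands **web page structure** and outputs data completely and accurately", and this version will also be open-sourced in the near future.
+Platon.ai's goal is to develop an AI that efficiently crawls, reads, and understands complex websites, and outputs data and knowledge completely and accurately. So far we have open-sourced the "efficient crawling" part; the "reading comprehension" part is a long-term
+and arduous task. We have released a [preview version](https://github.com/platonai/PulsarRPAPro#run-auto-extract) that "reads and understands **web page structure** and outputs data completely and accurately", and this version will also be open-sourced in the near future.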

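Platon.ai's algorithm can turn web pages into data with 100% no human intervention: no rules need to be configured, and no machine learning training is needed either. It is driven by unsupervised machine learning and reads and understands the internet the way a human does.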

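-After rendering each web page in a browser, we use JS to compute a series of properties for each web page element, mainly the element's position and size. We also construct more interesting implicit features for web page elements, such as topological and semantic features. Currently, including position and size, we construct more than 100 independent features for each web page element. Thus, **a web page can be regarded as a geometric graph composed of many rectangles with attributes; when all web pages are pressed together, like a bundle of newspapers, the World Wide Web (WWW) can be viewed as a fiber bundle with a three-dimensional manifold as its base space.**
+After rendering each web page in a browser, we compute a series of features for each web page element, including visual, geometric, topological, and semantic features. **A web page can be seen as a geometric graph composed of many rectangles with attributes;
+when all web pages are pressed together, like a bundle of newspapers, the World Wide Web (WWW) can be viewed as a fiber bundle with a three-dimensional manifold as its base space.**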

<div style="text-align: center">
<img width="400px" src=https://pica.zhimg.com/80/v2-1262abb4d28b31a00bcf1199b1aba441_1440w.jpeg?source=d16d100b alt="auto extracted chart"/>