From f54f8342bb172cbc6656008b173abe116b424d2f Mon Sep 17 00:00:00 2001
From: Mian Lu
Date: Thu, 13 Jan 2022 15:00:32 +0800
Subject: [PATCH] docs: update README and CHANGELOG (#1058)
---
 CHANGELOG.md | 72 ++++++++++++++++++++++++++----
 README.md | 122 ++++++++++++++++++++++++++++++++++++---------------
 README_cn.md | 96 ++++++++++++++++++++++++----------------
 3 files changed, 207 insertions(+), 83 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a4d4636697d..e0d8e6c7b94 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,14 +1,67 @@
 # Changelog
-## [Unreleased]
-### Feature
-- Support insert multiple rows into a table using a single SQL insert statement. [#398](https://github.com/4paradigm/OpenMLDB/issues/398)
-- Support aggregation function, e.g. `COUNT`, `SUM`, `MIN`, `MAX`, `AVG`, over the whole table [#219](https://github.com/4paradigm/OpenMLDB/issues/219)
-- Enhance plan optimization on `GROUP` and `FILTER` op [#350](https://github.com/4paradigm/OpenMLDB/pull/350)
-- Refactor status code and status macro. Save first message (root message) in `status.msg`. [#430](https://github.com/4paradigm/OpenMLDB/issues/430)
+## [0.4.0] - 2022-01-14
+
+### Highlights
+
+- The SQL-centric feature is enhanced for both standalone and cluster versions. Now you can enjoy the SQL-centric development and deployment experience seamlessly. (#991,#1034,#1071,#1064,#1061,#1049,#1045,#1038,#1034,#1029,#997,#996,#968,#946,#840,#830,#814,#776,#774,#764,#747,#740,#466,#481,#1033,#1027,#966,#951,#950,#932,#853,#835,#804,#800,#596,#595,#568,#873,#1025,#1021,#1019,#994,#991,#987,#912,#896,#894,#893,#873,#778,#777,#745,#737,#701,#570,#559,#558,#553 @tobegit3hub; #1030,#965,#933,#920,#829,#783,#754,#1005,#998 @vagetablechicken)
+- The Chinese documentation has been thoroughly polished and is accessible at https://docs.openmldb.ai/. The documentation repository is available at https://github.com/4paradigm/openmldb-docs-zh, and contributions are welcome.
+- Experimental feature: We have introduced a monitoring module based on Prometheus + Grafana for online feature processing. (#1048 @aceforeverd)
+
+### Other Features
+
+- Support SQL syntax: LIKE, HAVING (#841 @aceforeverd; #927,#698 @jingchen2222)
+- Support new built-in functions: reverse (#1004 @nautaa), dayofyear (#856 @Nicholas-SR)
+- Improve the compilation and installation process, and support building from source (#999,#871,#594,#752,#793,#805,#875,#871,#999 @aceforeverd; #992 @vagetablechicken)
+- Improve the GitHub CI/CD workflow (#842,#884,#875,#919,#1056,#874 @aceforeverd)
+- Support system databases and tables (#773 @dl239)
+- Improve the function `create index` (#828 @dl239)
+- Improve the demo image (#1023,#690,#734,#751 @zhanghaohit)
+- Improve the Python SDK (#913,#906 @tobegit3hub;#949,#909 @HuilinWu2; #838 @dl239)
+- Simplify the concepts of execution modes (#877,#985,#892 @jingchen2222)
+- Add data import and export for the cluster version (#1078 @tobegit3hub)
+- Add new deployment command for the cluster version (#921 @dl239)
+- Support default values when creating a table (#563 @zoyopei)
+- Support string delimiters and quotes (#668 @ZackeryWang)
+- Add a new `lru_cache` to support upsert (#795 @vagetablechicken)
+- Support adding index with any `ts_col` (#828 @dl239)
+- Improve the `ts` packing in `sql_insert_now` (#955,#938 @vagetablechicken)
+- Improve documentation (#952 #885 @mahengyang; #834 @Nicholas-SR; #792,#1058,#1002,#872,#836,#792 @lumianph; #844,#782 @jingchen2222; #1022,#805 @aceforeverd)
+- Other minor updates (#1073 @dl239)
+
+### Bug Fixes
+
+#847, #831, #647, #934, #953, #1015, #982, #927, #994, #1008, #1028, #1019, #779, #855, #350, #631, #1074, #1073, #1081
+
+@nautaa, @Nicholas-SR, @aceforeverd, @dl239, @jingchen2222, @tobegit3hub, @keyu813
+
+
+
+## [0.3.0] - 2021-11-05
+
+### Highlights
+We introduce a new standalone mode that can be deployed on a single node, which is suitable for small businesses and the demonstration
+ purpose. Please read more details from [here](https://github.com/4paradigm/OpenMLDB/blob/v0.3.0/docs/en/standalone.md). The standalone mode is particularly enhanced for ease of use based on the following features that are supported by standalone mode only.
+* The standalone deployment mode https://github.com/4paradigm/OpenMLDB/issues/440
+* Connection establishment by specifying the host name and port in CLI https://github.com/4paradigm/OpenMLDB/issues/441
+* LOAD DATA command for bulk loading https://github.com/4paradigm/OpenMLDB/issues/443
+* SQL syntax support for exporting data: SELECT INTO FILE https://github.com/4paradigm/OpenMLDB/issues/455
+* Deployment commands: DEPLOY, SHOW DEPLOYMENT, and DROP DEPLOYMENT https://github.com/4paradigm/OpenMLDB/issues/460 https://github.com/4paradigm/OpenMLDB/issues/447
+### Other Features
+* A new CLI command to support different levels of performance sensitivity: `SET performance_sensitive=true|false`. When it is set to false, SQL queries can be executed without indexes. Please read [here](https://github.com/4paradigm/OpenMLDB/blob/v0.3.0/docs/en/performance_sensitive_mode.md) for more details about the performance sensitivity configuration https://github.com/4paradigm/OpenMLDB/issues/555
+* Supporting SQL queries over multiple databases https://github.com/4paradigm/OpenMLDB/issues/476
+* Supporting inserting multiple tuples into a table using a single SQL https://github.com/4paradigm/OpenMLDB/issues/398
+* Improvements for Java SDK:
+The new API getTableSchema https://github.com/4paradigm/OpenMLDB/pull/483
+The new API genDDL, which is used to generate DDLs according to a given SQL script https://github.com/4paradigm/OpenMLDB/issues/588
+### Bugfix
+* Exceptions caused by certain physical plans with special structures when performing column resolve for logical plans.
https://github.com/4paradigm/OpenMLDB/issues/437
+* Under specific circumstances, unexpected results produced by SQL queries with a WHERE clause when certain WHERE conditions do not fit into indexes https://github.com/4paradigm/OpenMLDB/issues/599
+* The bug when enabling WindowParallelOpt and WindowSkewOptimization at the same time https://github.com/4paradigm/OpenMLDB/issues/444
+* The bug in the LCA (Lowest Common Ancestor) algorithm used to support WindowParallelOpt for particular SQL statements https://github.com/4paradigm/OpenMLDB/issues/485
+* Workaround for the Spark bug (SPARK-36932) that occurs when LastJoin involves columns with the same names https://github.com/4paradigm/OpenMLDB/issues/484
+### Acknowledgement
+We appreciate the contributions to this release from external contributors who are not from 4Paradigm's core OpenMLDB team, including [Kanekanekane](https://github.com/Kanekanekane), [shawn-happy](https://github.com/shawn-happy), [lotabout](https://github.com/lotabout), [Shouren](https://github.com/Shouren), [zoyopei](https://github.com/zoyopei), [huqianshan](https://github.com/huqianshan)
-### Bug Fix
-- Fix plan error triggered by optimize the same plan node repeatedly.
[#437](https://github.com/4paradigm/OpenMLDB/issues/437) ## [0.2.3] - 2021-08-31 ### Feature @@ -60,7 +113,8 @@ Removed - openmldb-0.2.0-linux.tar.gz targets on x86_64 - aarch64 artifacts consider experimental -[Unreleased]: https://github.com/4paradigm/OpenMLDB/compare/v0.2.3...HEAD +[0.4.0]: https://github.com/4paradigm/OpenMLDB/compare/v0.3.0...v0.4.0 +[0.3.0]: https://github.com/4paradigm/OpenMLDB/compare/v0.2.3...v0.3.0 [0.2.3]: https://github.com/4paradigm/OpenMLDB/compare/0.2.2...v0.2.3 [0.2.2]: https://github.com/4paradigm/OpenMLDB/compare/v0.2.0...0.2.2 [0.2.0]: https://github.com/4paradigm/OpenMLDB/compare/v0.1.5-pre...v0.2.0 diff --git a/README.md b/README.md index e208ffcc2cc..db361cd37eb 100644 --- a/README.md +++ b/README.md @@ -15,77 +15,126 @@ **English version | [中文版](README_cn.md)** +### OpenMLDB is an open-source machine learning database that provides enterprises with a full-stack FeatureOps solution. -## 1. Introduction +## 1. Our Philosophy -OpenMLDB is an open-source database particularly designed to efficiently provide consistent data for machine learning. A database for machine learning consists of two major tasks: feature extraction and feature access, which are served as data provisioning for offline training and online inference. Without OpenMLDB, there are two separate systems for online and offline data provisioning, which cost significant effort to verify the online-offline consistency. On the contrary, OpenMLDB supports the unified SQL programming and its execution engine for both online and offline data provisioning. As a result, the online-offline consistency is inherently guaranteed. Moreover, the system is carefully designed and optimized to ensure the efficiency. By taking advantages of OpenMLDB, database engineers are now able to write SQL scripts only to efficiently provide consistent data to machine learning, and an offline model can be immediately deployed for online serving with little cost involved. -

- image-20211103103052252 -

+In the process of artificial intelligence (AI) engineering, 95% of the time and effort is consumed by data processing, data verification, and other data-related workloads. To tackle this problem, the 1% of enterprises that are tech giants spend thousands of hours building in-house data platforms to address AI engineering challenges such as online-offline consistency, data correctness, and data processing efficiency. The other 99%, small and medium-sized enterprises, purchase expensive SaaS tools and data governance services.
+
+OpenMLDB is an open-source machine learning database that is committed to solving the data governance challenge of AI engineering in a closed loop. OpenMLDB has been deployed in hundreds of real-world enterprise applications. OpenMLDB gives priority to open-sourcing its SQL-based feature engineering capability, which provides enterprises with a full-stack feature engineering solution (aka FeatureOps).
-The above figure illustrates the OpenMLDB workflow. SQL engineers first write SQL scripts for offline feature extraction, which provides data for offline model training. When the model quality is satisfied, the online feature extraction and access can be enabled immediately for online serving without additional efforts involved. Thanks to the unified SQL programming and execution engine, the online-offline consistency verification is eliminated, which is inherently guaranteed by OpenMLDB. Furthermore, certain optimization techniques (e.g., data skew optimization and in-memory indexing for offline and online feature extraction, respectively) are adopted to ensure that the performance requirement can be met for both offline training and online inference. In summary, OpenMLDB enables SQL as the only programming interface for consistent and efficient data provisioning for both offline model training and online inference serving.
+## 2. A Full-Stack FeatureOps Solution for Enterprises
-## 2. Highlight Features
-### 2.1.
SQL Programming APIs
+MLOps provides a set of practices to develop, deploy, and maintain machine learning models in production efficiently and reliably. As a key link, FeatureOps is responsible for feature engineering, bridging DataOps and ModelOps. A closed-loop FeatureOps solution should cover all aspects of feature engineering, including functional requirements (such as feature store, feature extraction, feature serving, feature sharing) and production requirements (such as low latency, high throughput, fault recovery, high availability, monitoring). OpenMLDB provides a full-stack FeatureOps solution for enterprises with great ease of use, so that feature engineering development returns to its essence: focusing only on developing high-quality feature extraction scripts, no longer bound by engineering challenges.
-We believe SQL is the most suitable programming APIs for feature engineering because of its elegant design and popularity. OpenMLDB enables SQL as the programming APIs for developers for both offline and online feature extraction. Besides, we extend the capability of standard SQL and make it more powerful for feature extraction.
+

+ image-20211103103052253 +

-### 2.2 Online-Offline Consistency
+The figure above shows the workflow of FeatureOps based on OpenMLDB. From offline feature development to online serving, it consists of only three steps:
-Based on the SQL programming APIs, we design an unified execution engine for both online and offline feature extraction. As a result, the online-offline consistency is inherently guaranteed by OpenMLDB with no other cost.
+1. Offline development of feature extraction scripts using SQL
+2. One-click deployment of the SQL scripts, switching the system from offline to online mode
+3. Online feature extraction and serving by connecting to real-time data streams
-### 2.3. Efficiency
-We propose a few techniques to improve the performance for both offline and online feature extraction. As a result, our offline feature extraction can be significantly faster than existing opensource bigdata processing frameworks. Moreover, our online service can provide low latency (tens of milliseconds) to meet the performance requirement of online inference.
+## 3. Highlights
-You can read our below section (7. Publications & Blogs) for more technical detail.
+**The Unified Online-Offline Execution Engine**: Offline and real-time online feature extraction use a unified execution engine, so online-offline consistency is inherently guaranteed.
-### 2.4 Integrated CLI
+**SQL-Centric Development and Management**: Feature extraction script development, deployment, and maintenance are all based on SQL with great ease of use.
+**Customized Optimization for Feature Extraction**: Offline feature extraction is performed based on [a tailored Spark version](https://github.com/4paradigm/spark) that is particularly optimized for batch-based feature processing. Online feature extraction delivers latency in the tens of milliseconds under high throughput pressure, which fully meets the online performance requirements.
-We provide a powerful integrated CLI for SQL programming, job management, online and offline deployment, and database administration. Developers who are familiar with database's CLIs should be very comfortable with our tool.
+**Designed for Enterprise**: OpenMLDB implements important production features for large-scale enterprise applications, including fault recovery, high availability, seamless scale-out, smooth upgrade, monitoring, heterogeneous memory support, and so on.
-*Note that, the CLI of current release 0.3.0 supports the cluster mode partially. It will be fully supported in the next release of 0.4.0*
+## 4. FAQ
-## 3. Build & Install
+1. **What are the use cases of OpenMLDB?**
+
+   At present, it is mainly positioned as a full-stack FeatureOps solution for machine learning applications. Its pipeline consists of offline and online feature extraction, feature storage, feature serving, feature sharing, and so on. In addition, OpenMLDB contains an efficient and fully functional time-series database, which is used in finance, IoT and other fields.
+
+2. **How did OpenMLDB evolve?**
+
+   OpenMLDB originated from the commercial product of [4Paradigm](https://www.4paradigm.com/) (a leading artificial intelligence service provider). In 2021, the core team abstracted, enhanced, and developed community-friendly features based on the commercial product, and then made it publicly available as an open-source project to help more enterprises achieve successful digital transformation at low cost. Before OpenMLDB was open-sourced, it had been successfully deployed in hundreds of real-world applications together with 4Paradigm's other commercial products.
+
+3.
**Is OpenMLDB a feature store?**
+
+   OpenMLDB covers all the functions of a feature store, but provides a more complete full-stack FeatureOps solution, which includes a feature store, development using SQL, [a tailored Spark distribution](https://github.com/4paradigm/spark) for offline feature extraction, highly optimized indexing for real-time online feature extraction, feature serving, and other production features for enterprises (such as monitoring, high availability, fault recovery and so on). Beyond FeatureOps, OpenMLDB is also used as a high-performance time-series database.
+
+4. **Why does OpenMLDB choose SQL as the programming language for users?**
+
+   SQL has an elegant syntax yet powerful expressiveness. The SQL-based programming experience flattens the learning curve of OpenMLDB and makes collaboration and sharing easier. In addition, the experience of developing and deploying hundreds of real-world applications with OpenMLDB shows that SQL is functionally complete for expressing feature extraction and has withstood the long-term test of practice.
+
+## 5. Build & Install
 :point_right: [Read more](docs/en/compile.md)
-## 4. Demo & QuickStart
+## 6. QuickStart
+
+**Cluster and Standalone Versions**
+
+OpenMLDB has two deployment versions: the *cluster version* and the *standalone version*. The cluster version is suitable for large-scale applications and provides scalability and high availability. The lightweight standalone version, running on a single node, is ideal for small businesses and demonstrations. The cluster and standalone versions have the same functionality but different limitations for particular functions. Please refer to [this document](https://docs.openmldb.ai/v/0.4/content-2/standalone_vs_cluster) for details.
+
+**Getting Started with OpenMLDB**
+
+:point_right: [OpenMLDB QuickStart](https://docs.openmldb.ai/v/0.4/content-1/openmldb_quickstart)
+
+## 7. Use Cases
-Since OpenMLDB v0.3.0, we have introduced two operating modes, which are cluster mode and standalone mode. The cluster mode is suitable for large-scale datasets and real-world applications, which provides the scalability and high-availability. On the other hand, the lightweight standalone mode running on a single node is ideal for small businesses and demonstration.
+We are making efforts to build a list of real-world use cases based on OpenMLDB to demonstrate how it can fit into your business. Please stay tuned.
-We demonstrate the workflow using the cluster and standalone modes:
+| Application | Tools | Brief Introduction |
+| ------------------------------------------------------------ | ------------------ | ------------------------------------------------------------ |
+| [New York City Taxi Trip Duration](https://docs.openmldb.ai/v/0.4/content-3/taxi_tour_duration_prediction) | OpenMLDB, LightGBM | This is a challenge from Kaggle to predict the total ride duration of taxi trips in New York City. You can read [more detail here](https://www.kaggle.com/c/nyc-taxi-trip-duration/). This use case demonstrates how to use the open-source tools OpenMLDB + LightGBM to easily build an end-to-end machine learning application. |
-- :point_right: [Demo code](demo)
-- :point_right: [QuickStart for the cluster mode](docs/en/cluster.md)
-- :point_right: [QuickStart for the standalone mode](docs/en/standalone.md)
+## 8. Documentation
-## 5. Roadmap
+You can find our detailed [OpenMLDB Documentation](https://docs.openmldb.ai/).
-We list a few highlight features that we have planned in the future releases. Please join our community to understand more about our planning and discuss your ideas.
+## 9. Roadmap
 | Version | Est.
release date | Highlight features | | ------- | ----------------- | ------------------------------------------------------------ | -| 0.4.0 | End of 2021 | - Full support of standalone and cluster modes in the integrated CLI | -| 0.5.0 | 2022 Q1 | - Monitoring APIs and tools for online serving
- Efficient queries over a fairly long period of time by window functions
- Kafka/Pulsar connector support for online data source | +| 0.5.0 | 2022 Q1 | - Monitoring APIs and tools for online serving
- Efficient queries over a fairly long period of time by window functions
- Kafka/Pulsar connector support for online data sources
- The online storage engine supports external storage devices. |
-## 6. Community
+Furthermore, a few important features are on the development roadmap but have not been scheduled yet. We appreciate any feedback on those features.
-You may join our community for feedback and discussion
+- A cloud-native OpenMLDB
+- Adaptors to open-source machine learning lifecycle management platforms, such as MLflow and Airflow
+- Fast recovery based on Intel® Optane™ Persistent Memory
+- Automatic feature extraction
+- Lightweight OpenMLDB for edge computing
+
+## 10. Contributors
+
+We really appreciate the contributions from our community.
+
+- If you are interested in contributing, please read our [Contribution Guideline](CONTRIBUTING.md) for more details.
+- If you are a new contributor, you may get started with [the list of good-first-issue](https://github.com/4paradigm/OpenMLDB/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
+
+Let's clap hands for our community contributors :clap:
+
+
+
+
+
+## 11. Community
+
+- **Website**: [https://openmldb.ai/](https://openmldb.ai) (coming soon)
 - **Email**: [contact@openmldb.ai](mailto:contact@openmldb.ai)
-- **[Slack Workspace](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)**: You may find useful information of release notes, user support, development discussion and even more from our various Slack channels.
+- **[Slack](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)**
-- **GitHub Issues and Discussions**: If you are a serious developer, you are most welcome to join our discussion on GitHub. **GitHub Issues** are used to report bugs and collect new requirements. **GitHub Discussions** are mostly used by our project maintainers to publish and comment RFCs.
+- **[GitHub Issues](https://github.com/4paradigm/OpenMLDB/issues)** and **[GitHub Discussions](https://github.com/4paradigm/OpenMLDB/discussions)**: If you are a serious developer, you are most welcome to join our discussion on GitHub. GitHub Issues is used to report bugs and collect new requirements. GitHub Discussions is mostly used by our project maintainers to publish and comment on RFCs.
 - [**Blogs (Chinese)**](https://www.zhihu.com/column/c_1417199590352916480)
 - **WeChat Groups (Chinese)**:
- img
+ img
-## 7. Publications & Blogs
+## 12. Publications & Blogs
 - Cheng Chen, Jun Yang, Mian Lu, Taize Wang, Zhao Zheng, Yuqiang Chen, Wenyuan Dai, Bingsheng He, Weng-Fai Wong, Guoan Wu, Yuping Zhao, and Andy Rudoff. *[Optimizing in-memory database engine for AI-powered on-line decision augmentation using persistent memory](http://vldb.org/pvldb/vol14/p799-chen.pdf)*. International Conference on Very Large Data Bases (VLDB) 2021.
 - [In-Depth Interpretation of the Latest VLDB 2021 Paper: Artificial Intelligence Driven Real-Time Decision System Database and Optimization Based on Persistent Memory](https://medium.com/@fengxindai0/in-depth-interpretation-of-the-latest-vldb-2021-paper-artificial-intelligence-driven-real-time-f2a818bcf2b2)
@@ -93,5 +142,6 @@ You may join our community for feedback and discussion
 - [Compared to Native Spark 3.0, We Have Achieved Significant Optimization Effects in the AI Application Field](https://towardsdatascience.com/we-have-achieved-significant-optimization-effects-in-the-ai-application-field-compared-to-native-2a055e47250f)
 - [MLOp Practice: Using OpenMLDB in the Real-Time Anti-Fraud Model for the Bank’s Online Transaction](https://towardsdatascience.com/practice-of-openmldbs-transaction-real-time-anti-fraud-model-in-the-bank-s-online-event-40ab41fec6d4)
-## 8.
[User List](https://github.com/4paradigm/OpenMLDB/discussions/707) -We have built [a user list](https://github.com/4paradigm/OpenMLDB/discussions/707) to collect feedback from the community. We really appreciate it if you can provide your use cases, comments, or any feedback when using OpenMLDB. We want to hear from you! +## 13. [The User List](https://github.com/4paradigm/OpenMLDB/discussions/707) + +We are building [a user list](https://github.com/4paradigm/OpenMLDB/discussions/707) to collect feedback from the community. We really appreciate it if you can provide your use cases, comments, or any feedback when using OpenMLDB. We want to hear from you! diff --git a/README_cn.md b/README_cn.md index 171710648f2..af24c8bb5b1 100644 --- a/README_cn.md +++ b/README_cn.md @@ -15,7 +15,7 @@ **[English version](./README.md) | 中文版** -#### OpenMLDB 是一个开源机器学习数据库,提供企业级 FeatureOps 全栈解决方案。 +### OpenMLDB 是一个开源机器学习数据库,提供企业级 FeatureOps 全栈解决方案。 ## 1. 设计理念 @@ -25,26 +25,24 @@ OpenMLDB 致力于闭环解决 AI 工程化落地的数据治理难题,并且 ## 2. 企业级 FeatureOps 全栈解决方案 -MLOps 为人工智能工程化落地提供全栈技术方案,作为其中的关键一环,FeatureOps 负责特征计算和供给,衔接 DataOps 和 ModelOps。一个完整的可工程化落地的 FeatureOps 需要覆盖特征工程的各个方面,包括特征生成、特征计算、特征上线、特征共享、特征服务、灾备和高可用等。OpenMLDB 提供一套全栈 FeatureOps 企业级解决方案,同时拥有低门槛和极简的使用和管理体验,让特征工程开发回归于本质:专注于高质量的特征抽取脚本开发,不再被工程化落地所羁绊。 +MLOps 为人工智能工程化落地提供全栈技术方案,作为其中的关键一环,FeatureOps 负责特征计算和供给,衔接 DataOps 和 ModelOps。一个完整的可高效工程化落地的 FeatureOps 解决方案需要覆盖特征工程的各个方面,包括功能需求(如特征存储、特征计算、特征上线、特征共享、特征服务等)和产品级需求(如低延迟、高并发、灾备、高可用、扩缩容、平滑升级、可监控等)。OpenMLDB 提供一套企业级全栈 FeatureOps 解决方案,以及低门槛的基于 SQL 的开发和管理体验,让特征工程开发回归于本质:专注于高质量的特征计算脚本开发,不再被工程化效率落地所羁绊。

image-20211103103052252

+上图显示了基于 OpenMLDB 的 FeatureOps 的基本使用流程,从离线特征开发到服务上线,只需要三个步骤: +1. 使用 SQL 进行线下特征计算脚本开发 +1. SQL 特征计算脚本一键部署上线,由线下模式切换为线上模式 +3. 接入实时数据流,进行线上实时特征计算和供给服务 -上图显示了基于 OpenMLDB 的 FeatureOps 的基本使用流程,从特征开发到上线,只需要三个步骤: - -1. 线下流程:基于 SQL 的特征脚本开发 -1. SQL 脚本一键部署上线,由线下模式切换为线上模式 -3. 线上流程:接入实时数据流,进行实时特征供给上线服务 - -## 3. 主要特性 +## 3. 核心特性 **线上线下一致性执行引擎:** 离线和实时特征计算使用统一的计算执行引擎,线上线下一致性得到了天然保证。 -**低门槛且功能强大的数据库开发体验:** 低门槛的数据库开发体验,全流程基于 SQL 和 CLI 进行特征抽取脚本开发以及部署上线。 +**以 SQL 为核心的开发和管理体验:** 低门槛且功能强大的数据库开发体验,全流程基于 SQL 进行特征计算脚本开发以及部署上线。 -**面向特征计算的定制化性能优化:** 离线特征计算提供[基于 Spark 的高性能批处理优化版本](https://github.com/4paradigm/spark);线上实时特征计算在高吞吐压力下的复杂查询提供几十毫秒量级的延迟,充分满足高并发、低延迟的性能需求。 +**面向特征计算的定制化性能优化:** 离线特征计算使用[面向特征计算优化的 OpenMLDB Spark 发行版](https://docs.openmldb.ai/v/0.4/content-2/openmldbspark_distribution);线上实时特征计算在高吞吐压力下的复杂查询提供几十毫秒量级的延迟,充分满足高并发、低延迟的性能需求。 **企业级特性:** 为大规模企业级应用而设计,整合诸多企业级特性,包括灾备恢复、高可用、可无缝扩缩容、可平滑升级、可监控、企业级异构内存架构支持等。 @@ -52,63 +50,85 @@ MLOps 为人工智能工程化落地提供全栈技术方案,作为其中的 1. **主要使用场景是什么?** - 目前主要面向人工智能场景,为机器训练模型和推理提供一站式特征供给解决方案,包含特征计算、特征存储、特征访问服务等功能。此外,OpenMLDB 本身也包含了一个高效且功能完备的时序数据库,使用于金融、IoT等领域。 + 目前主要面向人工智能场景,为机器训练模型和推理提供一站式特征供给解决方案,包含特征计算、特征存储、特征访问等功能。此外,OpenMLDB 本身也包含了一个高效且功能完备的时序数据库,使用于金融、IoT、数据标注等领域。 2. **OpenMLDB 是如何发展起来的?** - OpenMLDB 起源于领先的人工智能平台提供商[第四范式](https://www.4paradigm.com/)的商业化平台。我们将商业产品中作为数据供给的若干核心组件进行了抽象、增强、以及社区友好化,将它们形成了一个系统的开源产品,以帮助更多的企业低成本实现数字化转型。在 OpenMLDB 开源之前,已经作为第四范式的商业化组件之一在上百个场景中得到了部署和上线。 + OpenMLDB 起源于领先的人工智能平台提供商[第四范式](https://www.4paradigm.com/)的商业化软件。其核心开发团队在 2021 年将商业产品中作为特征工程的核心组件进行了抽象、增强、以及社区友好化,将它们形成了一个系统的开源产品,以帮助更多的企业低成本实现人工智能转型。在开源之前,OpenMLDB 已经作为第四范式的商业化组件之一在上百个场景中得到了部署和上线。 3. 
**OpenMLDB 是否就是一个 feature store?** - OpenMLDB 包含 feature store 的全部功能但是提供更为完整的 FeatureOps 全栈方案。除了提供特征存储功能,还具有基于 SQL 的数据库开发体验、特征计算、特征上线、企业级运维等功能。 + OpenMLDB 包含 feature store 的全部功能,并且提供更为完整的 FeatureOps 全栈方案。除了提供特征存储功能,还具有基于 SQL 的数据库开发体验、[面向特征计算优化的 OpenMLDB Spark 发行版](https://docs.openmldb.ai/v/0.4/content-2/openmldbspark_distribution),针对实时特征计算优化的索引结构,特征上线服务、企业级运维和管理等功能。此外,OpenMLDB 也被用作一个高性能的时序特征数据库。 -4. **OpenMLDB 为什么选择 SQL 作为开发语言并且提供数据库的开发体验?** +4. **OpenMLDB 为什么选择 SQL 作为开发语言?** SQL 具备表达语法简洁且功能强大的特点,选用 SQL 和数据库开发体验一方面降低开发门槛,另一方面更易于跨部门之间的协作和共享。此外,基于 OpenMLDB 的实践经验表明,SQL 在特征计算的表达上功能完备,已经经受了长时间的实践考验。 - -5. **如何取得技术支持** - - 欢迎加入我们的社区,为你提供使用支持。 ## 5. 编译和安装 :point_right: [点击这里](docs/cn/compile.md) -## 6. Demo & QuickStart +## 6. QuickStart + +**集群版和单机版** + +OpenMLDB 有两种部署模式:集群版(cluster version)和单机版(standalone vesion)。集群版适合于大规模数据的生产环境,提供了良好的可扩展性和高可用性;单机版适合于小数据场景或者试用目的,更加方便部署和使用。集群版和单机版在功能上完全一致,但是在某些具体功能上会有不同限制,详细参阅[此篇说明文档](https://docs.openmldb.ai/v/0.4/content-2/standalone_vs_cluster)。 + +**准备开始体验 OpenMLDB** + +:point_right: [OpenMLDB 快速上手指南](https://docs.openmldb.ai/v/0.4/content-1/openmldb_quickstart) + +## 7. 使用案例 -从 0.3.0 版本开始,OpenMLDB 引入了两种部署模式:集群模式和单机模式。集群模式适合于大规模数据的实际生产环境;单机模式适合于小数据场景或者试用目的,更加方便部署和使用。我们演示基于这两种模式的 demo 和快速上手指南: +我们正在努力构建一个 OpenMLDB 用于实际案例的列表,为 OpenMLDB 如何在你的业务中发挥价值提供参考,请随时关注我们的列表更新。 -- :point_right: [Demo 代码](demo) -- :point_right: [集群模式快速上手指南](docs/cn/cluster.md) -- :point_right: [单机模式快速上手指南](docs/cn/standalone.md) +| 应用 | 所用工具 | 简介 | +| ------------------------------------------------------------ | ------------------ | ------------------------------------------------------------ | +| [New York City Taxi Trip Duration](https://docs.openmldb.ai/v/0.4/content-3/taxi_tour_duration_prediction) | OpenMLDB, LightGBM | 这是个来自 Kaggle 的挑战,用于预测纽约市的出租车行程时间。你可以从这里阅读更多关于[该应用场景的描述](https://www.kaggle.com/c/nyc-taxi-trip-duration/)。本案例展示使用 OpenMLDB + LightGBM 的开源方案,快速搭建完整的机器学习应用。 | -## 7. 开发计划 +## 8. 
OpenMLDB 文档 -OpenMLD 社区持续进行开发迭代,在此列出我们已经初步规划好的在未来版本的主要支持特性,如果想详细了解我们的计划,或者提供任何的建议,请加入我们的社区来参与互动。 +你可以找到我们完整的 [OpenMLDB 使用文档](https://docs.openmldb.ai/)。 + + +## 9. 开发计划 | 版本号 | 预期发布日期 | 主要特性 | | ------ | ------------ | ------------------------------------------------------------ | -| 0.4.0 | End of 2021 | - 完全支持基于 CLI 的数据库开发体验(包括单机和集群版) | -| 0.5.0 | 2022 Q1 | - 在线服务监控模块
- 长时间窗口支持
- 支持第三方在线数据流引入,包括 Kafka 和 Pulsar | +| 0.5.0 | 2022 Q1 | - 在线服务监控模块
- 长时间窗口支持
- 支持第三方在线数据流引入,包括 Kafka 和 Pulsar
- 实时特征计算的存储引擎支持外存设备 | -此外,OpenMLDB roadmap 上有一些规划中的重要特性支持,欢迎给我们任何反馈: +此外,OpenMLDB roadmap 上有一些规划中的重要功能演进,但是尚未具体排期,欢迎给我们任何反馈: - Cloud-native 版本 -- 基于外存(如 SSD)进行优化的低成本版本 +- 适配机器学习全流程管理平台,比如 MLflow, Airflow 等 - 整合基于傲腾持久内存的快速恢复技术 -- 开源整合自动特征生成功能 +- 整合自动特征生成 +- 轻量级 edge 版本 -## 8. 社区 +## 10. 开发贡献者 -- **Email**: [contact@openmldb.ai](mailto:contact@openmldb.ai) -- **[Slack Workspace](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)**: 你可以在 Slack 上找到我们,通过在线聊天的方式,获取关于 OpenMLDB 的使用和开发支持。 +我们非常感谢来自社区的贡献。 + +- 如果你对于加入 OpenMLDB 开发者感兴趣,请阅读我们的 [Contribution Guideline](CONTRIBUTING.md)。 +- 如果你是一位新加入的贡献者,你或许可以从我们的这个 [good-first-issue](https://github.com/4paradigm/OpenMLDB/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) 列表开始。 -- **GitHub Issues 和 Discussions**: 如果你是一个严肃的开发者,我们非常欢迎加入我们 GitHub 上的开发者社区,近距离参与我们的开发迭代。GitHub Issues 主要用来搜集 bugs 以及反馈新特性需求;GitHub Discussions 主要用来给开发团队发布并且讨论 RFCs。 +为我们已有的社区贡献者鼓掌表示感谢 :clap: + + + + + +## 11. 社区 + +- 网站:[https://openmldb.ai/](https://openmldb.ai) (即将上线) +- **Email**: [contact@openmldb.ai](mailto:contact@openmldb.ai) +- **[Slack](https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg)** +- **[GitHub Issues](https://github.com/4paradigm/OpenMLDB/issues) 和 [GitHub Discussions](https://github.com/4paradigm/OpenMLDB/discussions)**: 如果你是一个严肃的开发者,我们非常欢迎加入我们 GitHub 上的开发者社区,近距离参与我们的开发迭代。GitHub Issues 主要用来搜集 bugs 以及反馈新特性需求;GitHub Discussions 主要用来给开发团队发布并且讨论 RFCs。 - [**技术博客**](https://www.zhihu.com/column/c_1417199590352916480) - **微信交流群:** - img + img -## 9. 学术论文和技术博客 +## 12. 学术论文和技术博客 * Cheng Chen, Jun Yang, Mian Lu, Taize Wang, Zhao Zheng, Yuqiang Chen, Wenyuan Dai, Bingsheng He, Weng-Fai Wong, Guoan Wu, Yuping Zhao, and Andy Rudoff. *[Optimizing in-memory database engine for AI-powered on-line decision augmentation using persistent memory](http://vldb.org/pvldb/vol14/p799-chen.pdf)*. International Conference on Very Large Data Bases (VLDB) 2021. 
* [第四范式OpenMLDB优化创新论文被国际数据库顶会VLDB录用](https://zhuanlan.zhihu.com/p/401513878) @@ -116,6 +136,6 @@ OpenMLD 社区持续进行开发迭代,在此列出我们已经初步规划好 * [OpenMLDB在AIOPS领域关于交易系统异常检测应用实践](https://zhuanlan.zhihu.com/p/393602288) * [5分钟完成硬件剩余寿命智能预测](https://zhuanlan.zhihu.com/p/399346826) -## 10. [用户列表](https://github.com/4paradigm/OpenMLDB/discussions/707) +## 13. [用户列表](https://github.com/4paradigm/OpenMLDB/discussions/707) 我们创建了一个用于搜集用户使用反馈意见的[用户列表](https://github.com/4paradigm/OpenMLDB/discussions/707)。我们非常感激我们的社区用户可以留下基于 OpenMLDB 的使用案例、意见、或者任何反馈。我们非常期待听到你的声音!