diff --git a/book/en-us/05-pointers.md b/book/en-us/05-pointers.md
index bffff73d..d60b8733 100644
--- a/book/en-us/05-pointers.md
+++ b/book/en-us/05-pointers.md
@@ -6,7 +6,7 @@ order: 5

 # Chapter 05 Standard Library: Pointers

-[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Standard Library: Regular Expression](./06-regex.md)
+[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Regular Expression](./06-regex.md)

 ## Further Readings

diff --git a/book/en-us/06-regex.md b/book/en-us/06-regex.md
index d33dd80a..808449bf 100644
--- a/book/en-us/06-regex.md
+++ b/book/en-us/06-regex.md
@@ -1,15 +1,148 @@
 ---
-title: "Chapter 06 Standard Library: Regular Expression"
+title: "Chapter 06 Regular Expression"
 type: book-en-us
 order: 6
 ---

-# Chapter 06 Standard Library: Regular Expression
+# Chapter 06 Regular Expression

-[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Standard Library: Threads and Concurrency](./07-thread.md)
+[TOC]
+
+## 6.1 Introduction
+
+Regular expressions are not part of the C++ language itself, so we only
+introduce them briefly here.
+
+A regular expression describes a pattern for matching strings.
+Regular expressions are mainly used to meet
+the following three needs:
+
+1. Check whether a string contains a substring of some form;
+2. Replace the matching substrings;
+3. Extract the matching substrings from a string.
+
+Regular expressions are text patterns consisting of ordinary characters (such as a to z)
+and special characters. A pattern describes one or more strings to match when searching for text.
+A regular expression acts as a template that matches a character pattern against the string being searched.
+
+### Ordinary characters
+
+Ordinary characters include all printable and unprintable characters that
+are not explicitly specified as metacharacters.
+This includes all uppercase
+and lowercase letters, all numbers, all punctuation, and some other symbols.
+
+### Special characters
+
+A special character is a character with special meaning in a regular expression,
+and is also the core matching syntax of a regular expression. See the table below:
+
+|Special characters|Description|
+|:---:|:------------------------------------------------------|
+|`$`| Matches the end position of the input string. |
+|`(`,`)`| Marks the start and end of a subexpression. Matched subexpressions can be captured for later use. |
+|`*`| Matches the previous subexpression zero or more times. |
+|`+`| Matches the previous subexpression one or more times. |
+|`.`| Matches any single character except the newline character `\n`. |
+|`[`| Marks the beginning of a bracket expression. |
+|`?`| Matches the previous subexpression zero or one time, or marks a quantifier as non-greedy. |
+| `\`| Marks the next character as a special character, a literal character, a back-reference, or an octal escape character. For example, `n` matches the character `n`, while `\n` matches a newline character. The sequence `\\` matches the `\` character, and `\(` matches the `(` character.|
+|`^`| Matches the beginning of the input string, unless it is used inside a bracket expression, where it negates the character set. |
+|`{`| Marks the beginning of a quantifier expression. |
+|`\|`| Indicates a choice between two alternatives. |
+
+### Quantifiers
+
+A quantifier specifies how many times a given component of a regular expression must appear for the match to succeed. See the table below:
+
+|Character|Description|
+|:---:|:------------------------------------------------------|
+|`*`| Matches the previous subexpression zero or more times. For example, `foo*` matches `fo` and `foooo`. `*` is equivalent to `{0,}`. |
+|`+`| Matches the previous subexpression one or more times.
For example, `foo+` matches `foo` and `foooo` but does not match `fo`. `+` is equivalent to `{1,}`. |
+|`?`| Matches the previous subexpression zero or one time. For example, `Your(s)?` matches both `Your` and `Yours`. `?` is equivalent to `{0,1}`. |
+|`{n}`| `n` is a non-negative integer. Matches exactly `n` times. For example, `o{2}` cannot match the `o` in `for`, but matches the two `o`s in `foo`. |
+|`{n,}`| `n` is a non-negative integer. Matches at least `n` times. For example, `o{2,}` cannot match the `o` in `for`, but matches all the `o`s in `foooooo`. `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. |
+|`{n,m}`| `m` and `n` are non-negative integers, where `n` is less than or equal to `m`. Matches at least `n` times and at most `m` times. For example, `o{1,3}` matches the first three `o`s in `foooooo`. `o{0,1}` is equivalent to `o?`. Note that there can be no spaces between the comma and the two numbers. |
+
+With these two tables, we can usually read almost all regular expressions.
+
+## 6.2 `std::regex` and Its Related
+
+The most common way to match string content is to use regular expressions. Unfortunately, traditional C++ provided no support for regular expressions, either in the language or in the standard library. Yet C++ is a high-performance language often used to develop backend services, where matching URL resource paths with regular expressions is the most mature and widespread practice in industry.
+
+The usual workaround was to use the regular expression library from `boost`. C++11 officially brought regular expressions into the standard library, so standard support is available without relying on third-party libraries.
+
+The regular expression library provided by C++11 operates on `std::string` objects: a pattern is stored in a `std::regex` (essentially `std::basic_regex<char>`), matched with `std::regex_match`, and the result is produced as a `std::smatch` (essentially a `std::match_results` object).
+
+We use a simple example to briefly introduce the use of this library. Consider the following regular expression:
+
+- `[a-z]+\.txt`: In this regular expression, `[a-z]` matches one lowercase letter and `+` lets the preceding expression match one or more times, so `[a-z]+` matches a string of lowercase letters. In a regular expression, `.` matches any character, while `\.` matches the literal character `.`. The trailing `txt` matches exactly the three letters `txt`. So this regular expression matches the names of text files that consist purely of lowercase letters.
+
+`std::regex_match` matches a whole string against a regular expression and has several overloads. The simplest takes a `std::string` and a `std::regex`; it returns `true` when the entire string matches and `false` otherwise. For example:
+
+```cpp
+#include <iostream>
+#include <string>
+#include <regex>
+
+int main() {
+    std::string fnames[] = {"foo.txt", "bar.txt", "test", "a0.txt", "AAA.txt"};
+    // In C++, `\` will be used as an escape character in the string. In order for `\.` to be passed as a regular expression, it is necessary to perform second escaping of `\`, thus we have `\\.`
+    std::regex txt_regex("[a-z]+\\.txt");
+    for (const auto &fname: fnames)
+        std::cout << fname << ": " << std::regex_match(fname, txt_regex) << std::endl;
+}
+```
+
+Another common form takes three arguments: `std::string`/`std::smatch`/`std::regex`.
+`std::smatch` is essentially a `std::match_results`.
+In the standard library, `std::smatch` is defined as `std::match_results<std::string::const_iterator>`,
+that is, a `match_results` over string iterators.
+Use `std::smatch` to easily get the matching results, for example:
+
+```cpp
+std::regex base_regex("([a-z]+)\\.txt");
+std::smatch base_match;
+for(const auto &fname: fnames) {
+    if (std::regex_match(fname, base_match, base_regex)) {
+        // the first element of std::smatch matches the entire string
+        // the second element of std::smatch matches the first expression with brackets
+        if (base_match.size() == 2) {
+            std::string base = base_match[1].str();
+            std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
+            std::cout << fname << " sub-match[1]: " << base << std::endl;
+        }
+    }
+}
+```
+
+The output of the above two code snippets is:
+
+```
+foo.txt: 1
+bar.txt: 1
+test: 0
+a0.txt: 0
+AAA.txt: 0
+sub-match[0]: foo.txt
+foo.txt sub-match[1]: foo
+sub-match[0]: bar.txt
+bar.txt sub-match[1]: bar
+```
+
+## Conclusion
+
+This chapter briefly introduced regular expressions themselves,
+and then showed how to use the regular expression library
+through a practical example based on the main uses of
+regular expressions.
+
+[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Threads and Concurrency](./07-thread.md)

 ## Further Readings

+1. [Comments from `std::regex`'s author](http://zhihu.com/question/23070203/answer/84248248)
+2. [Library documentation for regular expressions](http://en.cppreference.com/w/cpp/regex)
+
 ## Licenses

 Creative Commons License
This work was written by [Ou Changkun](https://changkun.de) and licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. The code of this repository is open sourced under the [MIT license](../../LICENSE).
\ No newline at end of file
diff --git a/book/en-us/toc.md b/book/en-us/toc.md
index 580bfe42..02f5e88b 100644
--- a/book/en-us/toc.md
+++ b/book/en-us/toc.md
@@ -30,6 +30,7 @@
         - Default template parameters
         - Variadic templates
         - Fold expression
+        - Non-type template parameter deduction
     + 2.6 Object-oriented
         - Delegate constructor
         - Inheritance constructor
@@ -68,11 +69,11 @@
     + 5.2 `std::shared_ptr`
     + 5.3 `std::unique_ptr`
 - [**Chapter 06 Standard Library: Regular Expression**](./06-regex.md)
-    + 6.1 Regular Expression Introduction
-        + Normal characters
+    + 6.1 Introduction
+        + Ordinary characters
         + Special characters
-        + Determinative
-    + 6.2 `std::regex` and related
+        + Quantifiers
+    + 6.2 `std::regex` and its related
         + `std::regex`
         + `std::regex_match`
         + `std::match_results`
diff --git a/book/zh-cn/05-pointers.md b/book/zh-cn/05-pointers.md
index 382e2efe..cd361c96 100644
--- a/book/zh-cn/05-pointers.md
+++ b/book/zh-cn/05-pointers.md
@@ -173,7 +173,7 @@ int main() {

 智能指针这种技术并不新奇,在很多语言中都是一种常见的技术,现代 C++ 将这项技术引进,在一定程度上消除了 `new`/`delete` 的滥用,是一种更加成熟的编程范式。

-[返回目录](./toc.md) | [上一章](./04-containers.md) | [下一章 标准库:正则表达式](./06-regex.md)
+[返回目录](./toc.md) | [上一章](./04-containers.md) | [下一章 正则表达式](./06-regex.md)

 ## 进一步阅读的参考资料

diff --git a/book/zh-cn/06-regex.md b/book/zh-cn/06-regex.md
index e9729b50..0b562f5e 100644
--- a/book/zh-cn/06-regex.md
+++ b/book/zh-cn/06-regex.md
@@ -1,12 +1,10 @@
 ---
-title: 第 6 章 标准库:正则表达式
+title: 第 6 章 正则表达式
 type: book-zh-cn
 order: 6
 ---

-# 第 6 章 标准库:正则表达式
-
-> 内容修订中
+# 第 6 章 正则表达式

 [TOC]

@@ -59,21 +57,31 @@ order: 6
 |`{n,}`| `n` 是一个非负整数。至少匹配 `n` 次。例如,`o{2,}` 不能匹配 `for` 中的 `o`,但能匹配 `foooooo` 中的所有 `o`。`o{1,}` 等价于 `o+`。`o{0,}` 则等价于 `o*`。|
 |`{n,m}`| `m` 和 `n` 均为非负整数,其中 `n` 小于等于 `m`。最少匹配 `n` 次且最多匹配
`m` 次。例如,`o{1,3}` 将匹配 `foooooo` 中的前三个 `o`。`o{0,1}` 等价于 `o?`。注意,在逗号和两个数之间不能有空格。|

-有了这三张表,我们通常就能够读懂几乎所有的正则表达式了。
+有了这两张表,我们通常就能够读懂几乎所有的正则表达式了。

 ## 6.2 std::regex 及其相关

-对字符串内容进行匹配的最常见手段就是使用正则表达式。可惜在传统 C++ 中正则表达式一直没有得到语言层面的支持,没有纳入标准库,而 C++ 作为一门高性能语言,在后台服务的开发中,对 URL 资源链接进行判断时,使用正则表达式也是工业界最为成熟的普遍做法。
+对字符串内容进行匹配的最常见手段就是使用正则表达式。
+可惜在传统 C++ 中正则表达式一直没有得到语言层面的支持,没有纳入标准库,
+而 C++ 作为一门高性能语言,在后台服务的开发中,对 URL 资源链接进行判断时,
+使用正则表达式也是工业界最为成熟的普遍做法。

-一般的解决方案就是使用 `boost` 的正则表达式库。而 C++11 正式将正则表达式的的处理方法纳入标准库的行列,从语言级上提供了标准的支持,不再依赖第三方。
+一般的解决方案就是使用 `boost` 的正则表达式库。
+而 C++11 正式将正则表达式的的处理方法纳入标准库的行列,从语言级上提供了标准的支持,
+不再依赖第三方。

-C++11 提供的正则表达式库操作 `std::string` 对象,模式 `std::regex` (本质是 `std::basic_regex`)进行初始化,通过 `std::regex_match` 进行匹配,从而产生 `std::smatch` (本质是 `std::match_results` 对象)。
+C++11 提供的正则表达式库操作 `std::string` 对象,
+模式 `std::regex` (本质是 `std::basic_regex`)进行初始化,
+通过 `std::regex_match` 进行匹配,
+从而产生 `std::smatch` (本质是 `std::match_results` 对象)。

-我们通过一个简单的例子来简单介绍这个库的使用。考虑下面的正则表达式
+我们通过一个简单的例子来简单介绍这个库的使用。考虑下面的正则表达式:

 - `[a-z]+\.txt`: 在这个正则表达式中, `[a-z]` 表示匹配一个小写字母, `+` 可以使前面的表达式匹配多次,因此 `[a-z]+` 能够匹配一个小写字母组成的字符串。在正则表达式中一个 `.` 表示匹配任意字符,而 `\.` 则表示匹配字符 `.`,最后的 `txt` 表示严格匹配 `txt` 则三个字母。因此这个正则表达式的所要匹配的内容就是由纯小写字母组成的文本文件。

-`std::regex_match` 用于匹配字符串和正则表达式,有很多不同的重载形式。最简单的一个形式就是传入 `std::string` 以及一个 `std::regex` 进行匹配,当匹配成功时,会返回 `true`,否则返回 `false`。例如:
+`std::regex_match` 用于匹配字符串和正则表达式,有很多不同的重载形式。
+最简单的一个形式就是传入 `std::string` 以及一个 `std::regex` 进行匹配,
+当匹配成功时,会返回 `true`,否则返回 `false`。例如:

 ```cpp
 #include <iostream>
@@ -89,15 +97,19 @@ int main() {
 }
 ```

-另一种常用的形式就是依次传入 `std::string`/`std::smatch`/`std::regex` 三个参数,其中 `std::smatch` 的本质其实是 `std::match_results`,在标准库中, `std::smatch` 被定义为了 `std::match_results<std::string::const_iterator>`,也就是一个子串迭代器类型的 `match_results`。使用 `std::smatch` 可以方便的对匹配的结果进行获取,例如:
+另一种常用的形式就是依次传入 `std::string`/`std::smatch`/`std::regex` 三个参数,
+其中 `std::smatch` 的本质其实是 `std::match_results`。
+在标准库中, `std::smatch` 被定义为了 `std::match_results<std::string::const_iterator>`,
+也就是一个子串迭代器类型的 `match_results`。
+使用 `std::smatch` 可以方便的对匹配的结果进行获取,例如:

 ```cpp
 std::regex
base_regex("([a-z]+)\\.txt");
 std::smatch base_match;
 for(const auto &fname: fnames) {
     if (std::regex_match(fname, base_match, base_regex)) {
-        // sub_match 的第一个元素匹配整个字符串
-        // sub_match 的第二个元素匹配了第一个括号表达式
+        // std::smatch 的第一个元素匹配整个字符串
+        // std::smatch 的第二个元素匹配了第一个括号表达式
         if (base_match.size() == 2) {
             std::string base = base_match[1].str();
             std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
@@ -126,9 +138,92 @@ bar.txt sub-match[1]: bar

 本节简单介绍了正则表达式本身,然后根据使用正则表达式的主要需求,通过一个实际的例子介绍了正则表达式库的使用。

-> 本节提到的内容足以让我们开发编写一个简单的 Web 框架中关于URL匹配的功能,请参考习题 TODO
+## 习题
+
+1. 在 Web 服务器开发中,我们通常希望服务某些满足某个条件的路由。正则表达式便是完成这一目标的工具之一。
+
+给定如下请求结构:
+
+```cpp
+struct Request {
+    // request method, POST, GET; path; HTTP version
+    std::string method, path, http_version;
+    // use smart pointer for reference counting of content
+    std::shared_ptr<std::istream> content;
+    // hash container, key-value dict
+    std::unordered_map<std::string, std::string> header;
+    // use regular expression for path match
+    std::smatch path_match;
+};
+```
+
+请求的资源类型:
+
+```cpp
+typedef std::map<
+    std::string, std::unordered_map<
+        std::string,std::function<void(std::ostream&, Request&)>>> resource_type;
+```
+
+以及服务端模板:
+
+```cpp
+template <typename socket_type>
+class ServerBase {
+public:
+    resource_type resource;
+    resource_type default_resource;
+
+    void start() {
+        // TODO
+    }
+protected:
+    Request parse_request(std::istream& stream) const {
+        // TODO
+    }
+};
+```
+
+请实现成员函数 `start()` 与 `parse_request`。使得服务器模板使用者可以如下指定路由:
+
+```cpp
+template<typename SERVER_TYPE>
+void start_server(SERVER_TYPE &server) {
+
+    // process GET request for /match/[digit+numbers], e.g.
+    // GET request is /match/abc123, will return abc123
+    server.resource["^/match/([0-9a-zA-Z]+)/?$"]["GET"] = [](ostream& response, Request& request) {
+        string number=request.path_match[1];
+        response << "HTTP/1.1 200 OK\r\nContent-Length: " << number.length() << "\r\n\r\n" << number;
+    };
+
+    // process default GET request; anonymous function will be called if no other matches
+    // response files in folder web/
+    // default: index.html
+    server.default_resource["^/?(.*)$"]["GET"] = [](ostream& response, Request& request) {
+        string filename = "www/";
+
+        string path = request.path_match[1];
+
+        // forbid using `..` to access content outside folder web/
+        size_t last_pos = path.rfind(".");
+        size_t current_pos = 0;
+        size_t pos;
+        while((pos=path.find('.', current_pos)) != string::npos && pos != last_pos) {
+            current_pos = pos;
+            path.erase(pos, 1);
+            last_pos--;
+        }
+
+        // (...)
+    };
+
+    server.start();
+}
+```
+
+参考答案[见此](../../exercises/6)。

-[返回目录](./toc.md) | [上一章](./05-pointers.md) | [下一章 标准库:线程与并发](./07-thread.md)
+[返回目录](./toc.md) | [上一章](./05-pointers.md) | [下一章 线程与并发](./07-thread.md)

 ## 进一步阅读的参考资料

diff --git a/book/zh-cn/toc.md b/book/zh-cn/toc.md
index 16b1e1fd..db86e69c 100644
--- a/book/zh-cn/toc.md
+++ b/book/zh-cn/toc.md
@@ -30,6 +30,7 @@
         - 默认模板参数
         - 变长参数模板
         - 折叠表达式
+        - 非类型模板参数推导
     + 2.6 面向对象
         - 委托构造
         - 继承构造
diff --git a/code/6/6.1.cpp b/code/6/6.1.cpp
index 3563e99e..fe1f5401 100644
--- a/code/6/6.1.cpp
+++ b/code/6/6.1.cpp
@@ -3,8 +3,9 @@
 // modern c++ tutorial
 //
 // created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial
 //
-// 正则表达式库
+// Regular Expression

 #include <iostream>
 #include <string>
@@ -12,7 +13,9 @@

 int main() {
     std::string fnames[] = {"foo.txt", "bar.txt", "test", "a0.txt", "AAA.txt"};
-    // 在 C++ 中 `\` 会被作为字符串内的转义符,为使 `\.` 作为正则表达式传递进去生效,需要对 `\` 进行二次转义,从而有 `\\.`
+    // In C++, `\` will be used as an escape character in the string.
+    // In order for `\.` to be passed as a regular expression,
+    // it is necessary to perform second escaping of `\`, thus we have `\\.`
     std::regex txt_regex("[a-z]+\\.txt");
     for (const auto &fname: fnames)
         std::cout << fname << ": " << std::regex_match(fname, txt_regex) << std::endl;
@@ -21,8 +24,8 @@ int main() {
     std::smatch base_match;
     for(const auto &fname: fnames) {
         if (std::regex_match(fname, base_match, base_regex)) {
-            // sub_match 的第一个元素匹配整个字符串
-            // sub_match 的第二个元素匹配了第一个括号表达式
+            // the first element of std::smatch matches the entire string
+            // the second element of std::smatch matches the first expression with brackets
             if (base_match.size() == 2) {
                 std::string base = base_match[1].str();
                 std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
diff --git a/code/6/Makefile b/code/6/Makefile
new file mode 100644
index 00000000..642b9bca
--- /dev/null
+++ b/code/6/Makefile
@@ -0,0 +1,14 @@
+#
+# modern cpp tutorial
+#
+# created by changkun at changkun.de
+# https://github.com/changkun/modern-cpp-tutorial
+#
+
+all: $(patsubst %.cpp, %.out, $(wildcard *.cpp))
+
+%.out: %.cpp Makefile
+	clang++ $< -o $@ -std=c++2a -pedantic
+
+clean:
+	rm *.out
\ No newline at end of file
diff --git a/exercises/6/handler.hpp b/exercises/6/handler.hpp
index b4f0e348..7833ce8f 100644
--- a/exercises/6/handler.hpp
+++ b/exercises/6/handler.hpp
@@ -1,7 +1,8 @@
 //
 // handler.hpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #include "server.base.hpp"
diff --git a/exercises/6/main.http.cpp b/exercises/6/main.http.cpp
index f2f86c43..00d65844 100644
--- a/exercises/6/main.http.cpp
+++ b/exercises/6/main.http.cpp
@@ -1,7 +1,8 @@
 //
 // main_http.cpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #include
diff --git a/exercises/6/main.https.cpp
b/exercises/6/main.https.cpp
index e0d7b857..880bb874 100644
--- a/exercises/6/main.https.cpp
+++ b/exercises/6/main.https.cpp
@@ -1,7 +1,8 @@
 //
 // main_https.cpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //
 #include
 #include "server.https.hpp"
diff --git a/exercises/6/server.base.hpp b/exercises/6/server.base.hpp
index 1f18a572..290d8521 100644
--- a/exercises/6/server.base.hpp
+++ b/exercises/6/server.base.hpp
@@ -1,7 +1,8 @@
 //
 // server_base.hpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #ifndef SERVER_BASE_HPP
diff --git a/exercises/6/server.http.hpp b/exercises/6/server.http.hpp
index 76555f8a..94cb0086 100644
--- a/exercises/6/server.http.hpp
+++ b/exercises/6/server.http.hpp
@@ -1,7 +1,8 @@
 //
 // server_http.hpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #ifndef SERVER_HTTP_HPP
diff --git a/exercises/6/server.https.hpp b/exercises/6/server.https.hpp
index bf29dc46..7e4892bd 100644
--- a/exercises/6/server.https.hpp
+++ b/exercises/6/server.https.hpp
@@ -1,7 +1,8 @@
 //
 // server_https.hpp
 // web_server
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #ifndef SERVER_HTTPS_HPP
diff --git a/exercises/7/main.cpp b/exercises/7/main.cpp
index f40e126b..9aba746a 100644
--- a/exercises/7/main.cpp
+++ b/exercises/7/main.cpp
@@ -4,7 +4,8 @@
 // exercise solution - chapter 7
 // modern cpp tutorial
 //
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #include <iostream> // std::cout, std::endl
diff --git a/exercises/7/thread_pool.hpp
b/exercises/7/thread_pool.hpp
index 885a0873..d885140e 100644
--- a/exercises/7/thread_pool.hpp
+++ b/exercises/7/thread_pool.hpp
@@ -4,7 +4,8 @@
 // exercise solution - chapter 7
 // modern cpp tutorial
 //
-// created by changkun at changkun.de/modern-cpp
+// created by changkun at changkun.de
+// https://github.com/changkun/modern-cpp-tutorial/
 //

 #ifndef THREAD_POOL_H