see #12: translate ch06
changkun committed Jul 16, 2019
1 parent 9180aab commit f9d5b44
Showing 16 changed files with 291 additions and 36 deletions.
2 changes: 1 addition & 1 deletion book/en-us/05-pointers.md
@@ -6,7 +6,7 @@ order: 5

# Chapter 05 Standard Library: Pointers

[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Standard Library: Regular Expression](./06-regex.md)
[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Regular Expression](./06-regex.md)

## Further Readings

139 changes: 136 additions & 3 deletions book/en-us/06-regex.md
@@ -1,15 +1,148 @@
---
title: "Chapter 06 Standard Library: Regular Expression"
title: "Chapter 06 Regular Expression"
type: book-en-us
order: 6
---

# Chapter 06 Standard Library: Regular Expression
# Chapter 06 Regular Expression

[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Standard Library: Threads and Concurrency](./07-thread.md)
[TOC]

## 6.1 Introduction

Regular expressions are not part of the C++ language, so we only
introduce them briefly here.

Regular expressions describe a pattern of string matching.
The general use of regular expressions is mainly to achieve
the following three requirements:

1. Check if a string contains some form of substring;
2. Replace the matching substrings;
3. Take the eligible substring from a string.

Regular expressions are text patterns consisting of ordinary characters (such as a to z)
and special characters. A pattern describes one or more strings to match when searching for text.
Regular expressions act as a template to match a character pattern to the string being searched.
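
The three requirements listed above can be sketched with the C++11 `<regex>` library that this chapter introduces later. This is a minimal illustration rather than a recommended design: the helper names (`contains_number`, `mask_numbers`, `first_number`), the demo function, and the pattern `[0-9]+` are assumptions made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// 1. Check whether a string contains a substring of some form (here: digits).
bool contains_number(const std::string& text) {
    static const std::regex digits("[0-9]+");
    return std::regex_search(text, digits);
}

// 2. Replace every matching substring with "#".
std::string mask_numbers(const std::string& text) {
    static const std::regex digits("[0-9]+");
    return std::regex_replace(text, digits, "#");
}

// 3. Take the first eligible substring from a string (empty if there is none).
std::string first_number(const std::string& text) {
    static const std::regex digits("[0-9]+");
    std::smatch match;
    if (std::regex_search(text, match, digits)) {
        return match[0].str();
    }
    return std::string();
}

// Usage: the assertions record what each helper returns for a sample input.
void three_uses_demo() {
    const std::string text = "order 42 of 7";
    assert(contains_number(text));
    assert(mask_numbers(text) == "order # of #");
    assert(first_number(text) == "42");
    assert(first_number("no digits").empty());
}
```

`std::regex_search` looks for the pattern anywhere in the input, whereas `std::regex_match`, used in section 6.2, requires the whole string to match.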

### Ordinary characters

Ordinary characters include all printable and unprintable characters that
are not explicitly specified as metacharacters. This includes all uppercase
and lowercase letters, all digits, all punctuation, and some other symbols.

### Special characters

A special character is a character with special meaning in a regular expression,
and is also the core matching syntax of a regular expression. See the table below:

|Special characters|Description|
|:---:|:------------------------------------------------------|
|`$`| Matches the end position of the input string. |
|`(`,`)`| Marks the start and end of a subexpression. Subexpressions can be obtained for later use. |
|`*`| Matches the previous subexpression zero or more times. |
|`+`| Matches the previous subexpression one or more times. |
|`.`| Matches any single character except the newline character `\n`. |
|`[`| Marks the beginning of a bracket expression. |
|`?`| Matches the previous subexpression zero or one time, or indicates a non-greedy quantifier. |
| `\`| Marks the next character as either a special character, a literal character, a back reference, or an octal escape character. For example, `n` matches the character `n`, while `\n` matches a newline character. The sequence `\\` matches the `'\'` character, while `\(` matches the `'('` character.|
|`^`| Matches the beginning of the input string, unless it is used inside a bracket expression, where it indicates that the listed set of characters is not accepted. |
|`{`| Marks the beginning of a quantifier expression. |
|`\|`| Indicates a choice between two alternatives. |

### Quantifiers

A quantifier specifies how many times a given component of a regular expression must appear to satisfy the match. See the table below:

|Character|Description|
|:---:|:------------------------------------------------------|
|`*`| matches the previous subexpression zero or more times. For example, `foo*` matches `fo` and `foooo`. `*` is equivalent to `{0,}`. |
|`+`| matches the previous subexpression one or more times. For example, `foo+` matches `foo` and `foooo` but does not match `fo`. `+` is equivalent to `{1,}`. |
|`?`| matches the previous subexpression zero or one time. For example, `Your(s)?` matches both `Your` and `Yours`. `?` is equivalent to `{0,1}`. |
|`{n}`| `n` is a non-negative integer. Matches exactly `n` times. For example, `o{2}` cannot match the single `o` in `for`, but matches the two `o`s in `foo`. |
|`{n,}`| `n` is a non-negative integer. Matches at least `n` times. For example, `o{2,}` cannot match the `o` in `for`, but matches all the `o`s in `foooooo`. `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. |
|`{n,m}`| `m` and `n` are non-negative integers, where `n` is less than or equal to `m`. Matches at least `n` times and matches up to `m` times. For example, `o{1,3}` will match the first three `o` in `foooooo`. `o{0,1}` is equivalent to `o?`. Note that there can be no spaces between the comma and the two numbers. |

With these two tables, we can usually read almost all regular expressions.
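
To make the quantifier table concrete, here is a small sketch that checks several of its examples with `std::regex_match` (introduced in the next section). The helper `full_match` and the function `quantifier_examples` are names invented for this illustration, and the patterns are written with a leading `f` so that each one has to cover the whole input string.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true when the whole string `text` matches `pattern`
// (default ECMAScript grammar).
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Each assertion mirrors one row of the quantifier table.
void quantifier_examples() {
    assert(full_match("fo", "foo*"));           // `*`: zero or more `o` after `fo`
    assert(full_match("foooo", "foo*"));
    assert(!full_match("fo", "foo+"));          // `+`: at least one `o` after `fo`
    assert(full_match("foo", "foo+"));
    assert(full_match("Your", "Your(s)?"));     // `?`: the `s` is optional
    assert(full_match("Yours", "Your(s)?"));
    assert(!full_match("fo", "fo{2}"));         // `{n}`: exactly two `o`
    assert(full_match("foo", "fo{2}"));
    assert(full_match("foooooo", "fo{2,}"));    // `{n,}`: two or more `o`
    assert(full_match("fooo", "fo{1,3}"));      // `{n,m}`: one to three `o`
    assert(!full_match("foooooo", "fo{1,3}"));  // six `o` exceed the upper bound
}
```

Because `std::regex_match` must consume the entire input, `fo{1,3}` rejects `foooooo` here; a search-style match would instead stop after the first three `o`s, as the table describes.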

## 6.2 `std::regex` and Its Related

The most common way to match string content is to use regular expressions. Unfortunately, traditional C++ never supported regular expressions at the language level, nor did it include them in the standard library. Yet C++ is a high-performance language, and in the development of back-end services, using regular expressions to check URL resource links is the most mature and common practice in industry.

The general solution is to use the regular expression library of `boost`. C++11 officially incorporates the processing of regular expressions into the standard library, providing standard support from the language level and no longer relying on third parties.

The regular expression library provided by C++11 operates on `std::string` objects: a pattern is initialized as a `std::regex` (essentially `std::basic_regex`) and matched with `std::regex_match`, which produces a `std::smatch` (essentially a `std::match_results` object).

We use a simple example to briefly introduce the use of this library. Consider the following regular expression:

- `[a-z]+\.txt`: In this regular expression, `[a-z]` matches a single lowercase letter, and `+` lets the preceding expression match one or more times, so `[a-z]+` matches a string of lowercase letters. In a regular expression, `.` matches any character, while `\.` matches the literal character `.`, and the final `txt` matches exactly the three letters `txt`. This regular expression therefore matches the names of text files that consist purely of lowercase letters.

`std::regex_match` is used to match strings and regular expressions, and there are many different overloaded forms. The simplest form is to pass `std::string` and a `std::regex` to match. When the match is successful, it will return `true`, otherwise it will return `false`. For example:

```cpp
#include <iostream>
#include <string>
#include <regex>

int main() {
    std::string fnames[] = {"foo.txt", "bar.txt", "test", "a0.txt", "AAA.txt"};
    // In C++, `\` is an escape character inside string literals. For `\.` to
    // reach the regular expression engine, the `\` itself must be escaped,
    // hence `\\.`.
    std::regex txt_regex("[a-z]+\\.txt");
    for (const auto &fname: fnames)
        std::cout << fname << ": " << std::regex_match(fname, txt_regex) << std::endl;
}
```
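
As a side note on the double escaping in the example above, C++11 raw string literals offer an alternative way to write the same pattern. The following is a sketch under that assumption, not part of the original example; the names `is_lowercase_txt` and `raw_literal_demo` are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `fname` consists of lowercase letters followed by ".txt".
bool is_lowercase_txt(const std::string& fname) {
    // Inside R"(...)" a backslash is an ordinary character, so the pattern
    // can be written exactly as the regular expression engine should see it.
    static const std::regex txt_regex(R"([a-z]+\.txt)");
    return std::regex_match(fname, txt_regex);
}

// Usage: the raw literal and the escaped literal spell the same pattern.
void raw_literal_demo() {
    assert(std::string(R"([a-z]+\.txt)") == "[a-z]+\\.txt");
    assert(is_lowercase_txt("foo.txt"));
    assert(!is_lowercase_txt("a0.txt"));   // digits are not in [a-z]
    assert(!is_lowercase_txt("AAA.txt"));  // uppercase letters are not in [a-z]
}
```

Both spellings produce the same `std::regex`; the raw form mainly helps readability once a pattern contains several backslashes.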

Another common form is to pass in the three arguments `std::string`/`std::smatch`/`std::regex`.
The essence of `std::smatch` is actually `std::match_results`.
In the standard library, `std::smatch` is defined as `std::match_results<std::string::const_iterator>`,
which means `match_results` of a substring iterator type.
Use `std::smatch` to easily get the matching results, for example:

```cpp
std::regex base_regex("([a-z]+)\\.txt");
std::smatch base_match;
for (const auto &fname: fnames) {
    if (std::regex_match(fname, base_match, base_regex)) {
        // the first element of std::smatch matches the entire string
        // the second element of std::smatch matches the first expression with brackets
        if (base_match.size() == 2) {
            std::string base = base_match[1].str();
            std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
            std::cout << fname << " sub-match[1]: " << base << std::endl;
        }
    }
}
```
The output of the above two code snippets is:
```
foo.txt: 1
bar.txt: 1
test: 0
a0.txt: 0
AAA.txt: 0
sub-match[0]: foo.txt
foo.txt sub-match[1]: foo
sub-match[0]: bar.txt
bar.txt sub-match[1]: bar
```
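
The indexing of `std::smatch` shown above extends naturally to patterns with more than one bracketed subexpression. The sketch below is an illustrative variation and not part of the original example: the helper `split_filename`, the demo function, and the two-group pattern are assumptions.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Splits "name.ext" into {name, ext}; returns two empty strings when the
// file name does not consist of lowercase letters, a dot, and lowercase letters.
std::pair<std::string, std::string> split_filename(const std::string& fname) {
    static const std::regex pattern("([a-z]+)\\.([a-z]+)");
    std::smatch match;
    // size() is 3 on success: match[0] is the whole string,
    // match[1] and match[2] are the two bracketed subexpressions.
    if (std::regex_match(fname, match, pattern) && match.size() == 3) {
        return std::make_pair(match[1].str(), match[2].str());
    }
    return std::make_pair(std::string(), std::string());
}

// Usage: one matching and one non-matching file name.
void split_filename_demo() {
    const auto parts = split_filename("foo.txt");
    assert(parts.first == "foo");
    assert(parts.second == "txt");
    assert(split_filename("a0.txt").first.empty());  // `0` is not in [a-z]
}
```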
## Conclusion
This section briefly introduces the regular expression itself,
and then introduces the use of the regular expression library
through a practical example based on the main requirements of
using regular expressions.
[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Threads and Concurrency](./07-thread.md)
## Further Readings
1. [Comments from `std::regex`'s author](http://zhihu.com/question/23070203/answer/84248248)
2. [Library document of Regular Expression](http://en.cppreference.com/w/cpp/regex)
## Licenses
<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work was written by [Ou Changkun](https://changkun.de) and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>. The code of this repository is open sourced under the [MIT license](../../LICENSE).
9 changes: 5 additions & 4 deletions book/en-us/toc.md
@@ -30,6 +30,7 @@
- Default template parameters
- Variadic templates
- Fold expression
- Non-type template parameter deduction
+ 2.6 Object-oriented
- Delegate constructor
- Inheritance constructor
@@ -68,11 +69,11 @@
+ 5.2 `std::shared_ptr`
+ 5.3 `std::unique_ptr`
- [**Chapter 06 Standard Library: Regular Expression**](./06-regex.md)
+ 6.1 Regular Expression Introduction
+ Normal characters
+ 6.1 Introduction
+ Ordinary characters
+ Special characters
+ Determinative
+ 6.2 `std::regex` and related
+ Quantifiers
+ 6.2 `std::regex` and its related
+ `std::regex`
+ `std::regex_match`
+ `std::match_results`
2 changes: 1 addition & 1 deletion book/zh-cn/05-pointers.md
@@ -173,7 +173,7 @@ int main() {
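Smart pointers are not a novel technique; they are common in many languages. Modern C++ adopts this technique, which to some extent eliminates the abuse of `new`/`delete` and is a more mature programming paradigm.
[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Standard Library: Regular Expression](./06-regex.md)
[Table of Content](./toc.md) | [Previous Chapter](./04-containers.md) | [Next Chapter: Regular Expression](./06-regex.md)
## Further Readings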
125 changes: 110 additions & 15 deletions book/zh-cn/06-regex.md
@@ -1,12 +1,10 @@
---
title: "Chapter 06 Standard Library: Regular Expression"
title: "Chapter 06 Regular Expression"
type: book-zh-cn
order: 6
---

# Chapter 06 Standard Library: Regular Expression

> Content under revision
# Chapter 06 Regular Expression

[TOC]

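@@ -59,21 +57,31 @@ order: 6
|`{n,}`| `n` is a non-negative integer. Matches at least `n` times. For example, `o{2,}` cannot match the `o` in `for`, but matches all the `o`s in `foooooo`. `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. |
|`{n,m}`| `m` and `n` are non-negative integers, where `n` is less than or equal to `m`. Matches at least `n` times and at most `m` times. For example, `o{1,3}` matches the first three `o`s in `foooooo`. `o{0,1}` is equivalent to `o?`. Note that there can be no spaces between the comma and the two numbers. |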

With these three tables, we can usually read almost all regular expressions.
With these two tables, we can usually read almost all regular expressions.

## 6.2 std::regex and Its Related

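The most common way to match string content is to use regular expressions. Unfortunately, traditional C++ never supported regular expressions at the language level, nor did it include them in the standard library. Yet C++ is a high-performance language, and in the development of back-end services, using regular expressions to check URL resource links is the most mature and common practice in industry.
The most common way to match string content is to use regular expressions.
Unfortunately, traditional C++ never supported regular expressions at the language level, nor did it include them in the standard library.
Yet C++ is a high-performance language, and in the development of back-end services, when checking URL resource links,
using regular expressions is the most mature and common practice in industry.

The general solution is to use the regular expression library of `boost`. C++11 officially incorporates the processing of regular expressions into the standard library, providing standard support at the language level and no longer relying on third parties.
The general solution is to use the regular expression library of `boost`.
C++11 officially incorporates the processing of regular expressions into the standard library, providing standard support at the language level
and no longer relying on third parties.

The regular expression library provided by C++11 operates on `std::string` objects: a pattern is initialized as a `std::regex` (essentially `std::basic_regex`) and matched with `std::regex_match`, which produces a `std::smatch` (essentially a `std::match_results` object).
The regular expression library provided by C++11 operates on `std::string` objects:
a pattern is initialized as a `std::regex` (essentially `std::basic_regex`)
and matched with `std::regex_match`,
which produces a `std::smatch` (essentially a `std::match_results` object).

We use a simple example to briefly introduce the use of this library. Consider the following regular expression
We use a simple example to briefly introduce the use of this library. Consider the following regular expression:

- `[a-z]+\.txt`: In this regular expression, `[a-z]` matches a single lowercase letter, and `+` lets the preceding expression match one or more times, so `[a-z]+` matches a string of lowercase letters. In a regular expression, `.` matches any character, while `\.` matches the literal character `.`, and the final `txt` matches exactly the three letters `txt`. This regular expression therefore matches the names of text files that consist purely of lowercase letters.

`std::regex_match` is used to match strings against regular expressions, and it has many overloaded forms. The simplest form takes a `std::string` and a `std::regex`. When the match succeeds, it returns `true`; otherwise it returns `false`. For example:
`std::regex_match` is used to match strings against regular expressions, and it has many overloaded forms.
The simplest form takes a `std::string` and a `std::regex`.
When the match succeeds, it returns `true`; otherwise it returns `false`. For example: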

```cpp
#include <iostream>
@@ -89,15 +97,19 @@ int main() {
}
```

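Another common form is to pass in the three arguments `std::string`/`std::smatch`/`std::regex` in that order. The essence of `std::smatch` is actually `std::match_results`. In the standard library, `std::smatch` is defined as `std::match_results<std::string::const_iterator>`, which means `match_results` of a substring iterator type. Use `std::smatch` to easily get the matching results, for example:
Another common form is to pass in the three arguments `std::string`/`std::smatch`/`std::regex` in that order.
The essence of `std::smatch` is actually `std::match_results`.
In the standard library, `std::smatch` is defined as `std::match_results<std::string::const_iterator>`,
which means `match_results` of a substring iterator type.
Use `std::smatch` to easily get the matching results, for example: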

```cpp
std::regex base_regex("([a-z]+)\\.txt");
std::smatch base_match;
for(const auto &fname: fnames) {
if (std::regex_match(fname, base_match, base_regex)) {
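// the first element of sub_match matches the entire string
// the second element of sub_match matches the first expression with brackets
// the first element of std::smatch matches the entire string
// the second element of std::smatch matches the first expression with brackets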
if (base_match.size() == 2) {
std::string base = base_match[1].str();
std::cout << "sub-match[0]: " << base_match[0].str() << std::endl;
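@@ -126,9 +138,92 @@ bar.txt sub-match[1]: bar
This section briefly introduces the regular expression itself, and then introduces the use of the regular expression library through a practical example based on the main requirements of using regular expressions.
> The content covered in this section is enough to develop the URL-matching part of a simple Web framework; see exercise TODO
## Exercises
1. In web server development, we usually want to serve certain routes that satisfy some condition. Regular expressions are one of the tools for achieving this goal.
Given the following request structure: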
```cpp
struct Request {
// request method, POST, GET; path; HTTP version
std::string method, path, http_version;
// use smart pointer for reference counting of content
std::shared_ptr<std::istream> content;
// hash container, key-value dict
std::unordered_map<std::string, std::string> header;
// use regular expression for path match
std::smatch path_match;
};
```

The resource type of a request:

```cpp
typedef std::map<
std::string, std::unordered_map<
std::string,std::function<void(std::ostream&, Request&)>>> resource_type;
```

And the server template:

```cpp
template <typename socket_type>
class ServerBase {
public:
    resource_type resource;
    resource_type default_resource;

    void start() {
        // TODO
    }
protected:
    Request parse_request(std::istream& stream) const {
        // TODO
    }
};
```
Please implement the member functions `start()` and `parse_request` so that users of the server template can specify routes as follows:
```cpp
template<typename SERVER_TYPE>
void start_server(SERVER_TYPE &server) {
// process GET request for /match/[digit+numbers], e.g. GET request is /match/abc123, will return abc123
server.resource["^/match/([0-9a-zA-Z]+)/?$"]["GET"] = [](ostream& response, Request& request) {
string number=request.path_match[1];
response << "HTTP/1.1 200 OK\r\nContent-Length: " << number.length() << "\r\n\r\n" << number;
};
// process default GET request; anonymous function will be called if no other matches
// response files in folder web/
// default: index.html
server.default_resource["^/?(.*)$"]["GET"] = [](ostream& response, Request& request) {
string filename = "www/";
string path = request.path_match[1];
// forbid using `..` to access content outside folder web/
size_t last_pos = path.rfind(".");
size_t current_pos = 0;
size_t pos;
while((pos=path.find('.', current_pos)) != string::npos && pos != last_pos) {
current_pos = pos;
path.erase(pos, 1);
last_pos--;
}
// (...)
};
server.start();
}
```

For a reference answer, [see here](../../exercises/6)

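[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Standard Library: Threads and Concurrency](./07-thread.md)
[Table of Content](./toc.md) | [Previous Chapter](./05-pointers.md) | [Next Chapter: Threads and Concurrency](./07-thread.md)

## Further Readings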

1 change: 1 addition & 0 deletions book/zh-cn/toc.md
@@ -30,6 +30,7 @@
- Default template parameters
- Variadic templates
- Fold expression
- Non-type template parameter deduction
+ 2.6 Object-oriented
- Delegate constructor
- Inheritance constructor