From 33f164cc6dad0b06d4b3689da0ea151b911393eb Mon Sep 17 00:00:00 2001 From: XRubberDuck Date: Sat, 2 Dec 2023 18:01:48 +0800 Subject: [PATCH 1/2] modify pandas --- ch-pandas/dataframe-slicing.ipynb | 31 ++++++-- ch-pandas/series-dataframe.ipynb | 128 +++++++++++++++++++++++++----- 2 files changed, 129 insertions(+), 30 deletions(-) diff --git a/ch-pandas/dataframe-slicing.ipynb b/ch-pandas/dataframe-slicing.ipynb index b8552c0..47bfb71 100644 --- a/ch-pandas/dataframe-slicing.ipynb +++ b/ch-pandas/dataframe-slicing.ipynb @@ -81,9 +81,16 @@ "* 使用 `.iloc` 或者 `.loc` 函数\n", "* 使用 `.query` 函数\n", "\n", - "### 使用 [] 进行选择\n", + "### 使用 `[]` 进行选择\n", + "- 选择行\n", "\n", - "选择第 2 行到第 5 行(不包括第 5 行)的数据:" + "直接使用数字索引即可,df[a,b]表示选择 `DataFrame` 的第`a`行到第`b-1`行。\n", + "\n", + "```{note}\n", + "Python 中的索引区间都是左闭右开区间,这意味着左边端点可以取到,而右边端点取不到。\n", + "```\n", + "\n", + "例:对上一章节的PWT案例数据 df 选择第 2 行到第 5 行(不包括第 5 行)的数据。\n" ] }, { @@ -254,7 +261,11 @@ "id": "eb81787d", "metadata": {}, "source": [ - "要选择列,我们可以传递一个列表,其中包含所需列的列名,为字符串形式。" + "- 选择列\n", + "\n", + "我们可以传递一个列表,其中包含所需列的列名,为字符串形式。\n", + "\n", + "例:选择 country 和 tcgdp 两列。" ] }, { @@ -389,7 +400,9 @@ "source": [ "如果只选取一列,`df['country']` 等价于 `df.country`。\n", "\n", - "`[]` 还可以用来选择符合特定条件的数据。 例如,选取 POP 大于 20000 的行。判断语句 `df.POP> 20000` 会返回一系列布尔值,符合 POP 大于 20000 条件的会返回为 `True`。如果想要选择这些符合条件的数据,则需要:" + "- `[]` 选择符合特定条件的数据。 \n", + "\n", + "例如,选取 POP 大于 20000 的行。判断语句 `df.POP> 20000` 会返回一系列布尔值,符合 POP 大于 20000 条件的会返回为 `True`。如果想要选择这些符合条件的数据,则需要:" ] }, { @@ -789,7 +802,7 @@ "id": "9b41ebb1", "metadata": {}, "source": [ - "如果选择 cc 列和 cg 列的和大于 80 并且 POP 小于 20000 的行:" + "例:选择 cc 列和 cg 列的和大于 80 并且 POP 小于 20000 的行。" ] }, { @@ -1287,7 +1300,7 @@ "source": [ "使用 `loc` 函数进行选择,与 `iloc` 的区别在于,`loc` 除了接受整数外,还可以接受标签(`a`、`b` 这样的列名)、表示整数位置的 index、`boolean` 。\n", "\n", - "选择第 2 行到第 5 行(不包括第 5 行),`country` 和 `tcgdp` 列:" + "例:选择第 2 行到第 5 行(不包括第 5 行),country 和 tcgdp 列。" ] }, { @@ -1369,7 +1382,7 @@ "id": "44f9c427", "metadata": {}, "source": [ - "使用 `loc` 函数选择 POP 
列最大值的行:" + "例:使用 `loc` 函数选择 POP 列最大值的行。" ] }, { @@ -1517,7 +1530,9 @@ "id": "97dd2bd9", "metadata": {}, "source": [ - "还可以使用这种形式:`.loc[,]`,两个参数用逗号隔开,第一个参数接受条件,第二个参数接受我们想要返回的列名,得到的是符合条件的特定的列。" + "还可以使用这种形式:`.loc[,]`,两个参数用逗号隔开,第一个参数接受条件,第二个参数接受我们想要返回的列名,得到的是符合条件的特定的列。\n", + "\n", + "例:选择满足 cc 列加 cg 列大于等于80,POP小于等于20000条件的 country, year, POP 三列。" ] }, { diff --git a/ch-pandas/series-dataframe.ipynb b/ch-pandas/series-dataframe.ipynb index d579fa3..7bba9dd 100644 --- a/ch-pandas/series-dataframe.ipynb +++ b/ch-pandas/series-dataframe.ipynb @@ -68,7 +68,8 @@ }, "outputs": [], "source": [ - "s = pd.Series([1, 2, 3, 4], name = 'my_series')" + "s = pd.Series([1, 2, 3, 4], name = 'my_series')\n", + "s" ] }, { @@ -78,7 +79,11 @@ "source": [ "`Series` 是一个数组状数据结构,其实就是 {numref}`numpy-ndarray` 中的 `ndarray`。 数组最重要的结构是索引(Index)。Index 主要用于标记第几个位置存储什么数据。`pd.Series()` 中不指定 Index 参数时,默认从 0 开始,逐一自增,形如: 0,1,...\n", "\n", - "- Series 支持计算操作。" + "- `Series` 支持计算操作。\n", + "\n", + " 可以对Series对象执行基本的数学运算,如加法、减法、乘法和除法。\n", + "\n", + "例:对上述构建的 s,进行乘法操作。" ] }, { @@ -113,12 +118,51 @@ "s * 100" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "也可以对多个 `Series` 对象进行数学操作。\n", + "```{note}\n", + "当多个 Series 对象操作时,如果形状不同,比如 s1 有 4 个数,s2 有 5 个数,s1 + s2 操作后,返回结果会有 5 个数,但是第 5 个数为 NaN 值。\n", + "```\n", + "\n", + "例:构建 s2,与 s 进行加减乘除操作。" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'pd' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m/Users/xu/Downloads/python-data-science/ch-pandas/series-dataframe.ipynb 单元格 9\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> 1\u001b[0m s2 \u001b[39m=\u001b[39m pd\u001b[39m.\u001b[39mSeries([\u001b[39m1\u001b[39m, \u001b[39m2\u001b[39m, \u001b[39m3\u001b[39m, 
\u001b[39m4\u001b[39m], name \u001b[39m=\u001b[39m \u001b[39m'\u001b[39m\u001b[39mmy_series\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 2\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m'\u001b[39m\u001b[39ms+s2结果为\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mformat(s\u001b[39m+\u001b[39ms2))\n\u001b[1;32m 3\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m'\u001b[39m\u001b[39ms-s2结果为\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mformat(s\u001b[39m-\u001b[39ms2))\n", + "\u001b[0;31mNameError\u001b[0m: name 'pd' is not defined" + ] + } + ], + "source": [ + "s2 = pd.Series([1, 2, 3, 4], name = 'my_series')\n", + "print('s+s2结果为\\n{}'.format(s+s2))\n", + "print('s-s2结果为\\n{}'.format(s-s2))\n", + "print('s*s2结果为\\n{}'.format(s*s2))\n", + "print('s/s2结果为\\n{}'.format(s/s2))\n" + ] + }, { "cell_type": "markdown", "id": "ee28bbf0", "metadata": {}, "source": [ - "- Series 支持描述性统计。比如,获得所有统计信息。" + "- `Series` 支持描述性统计,可以使用`.describe()`方法同时获取 [计数、均值、标准差、最小值,25%分位数,50%分位数,75%分位数和最大值] 的统计信息,也可以使用`.max()`等特定的统计量方法单独获取对应的信息。\n", + "\n", + "例:对上例 s 获得所有统计信息。" ] }, { @@ -162,7 +206,7 @@ "id": "fb61cc9f", "metadata": {}, "source": [ - "计算平均值,中位数和标准差。" + "例:单独计算平均值,中位数和标准差。" ] }, { @@ -254,7 +298,11 @@ "id": "b9aafec8", "metadata": {}, "source": [ - "- Series 的索引很灵活。" + "- `Series` 的索引很灵活。\n", + "\n", + "除了上述默认的 index 作为索引,也可以自定义索引方式。\n", + "\n", + "例:将 s 的 0,1,2,3 的索引依次改为 number1, number2, number3,number4。" ] }, { @@ -279,7 +327,9 @@ "id": "c3c4636d", "metadata": {}, "source": [ - "这时,`Series` 就像一个 Python 中的字典 `dict`,可以使用像 `dict` 一样的语法来访问 `Series` 中的元素,其中 `index` 相当于 `dict` 的键 `key`。例如,使用 `[]` 操作符访问 `number1` 对应的值。" + "这时,`Series` 就像一个 Python 中的字典 `dict`,可以使用像 `dict` 一样的语法来访问 `Series` 中的元素,其中 `index` 相当于 `dict` 的键 `key`。\n", + "\n", + "例如,使用 `[]` 操作符访问 `number1` 对应的值。" ] }, { @@ -368,7 +418,9 @@ "\n", "创建一个 `DataFrame` 有很多方式,比如从列表、字典、文件中读取数据,并创建一个 `DataFrame`。\n", "\n", - "- 
基于列表创建" + "- 基于列表创建\n", + "\n", + "例:创建一个第一列为 Name,第二列为 Age,第三列为 City 的 `DataFrame`。" ] }, { @@ -389,7 +441,8 @@ "ages = [25, 30, 22]\n", "cities = ['New York', 'San Francisco', 'Los Angeles']\n", "data = {'Name': names, 'Age': ages, 'City': cities}\n", - "df = pd.DataFrame(data)" + "df = pd.DataFrame(data)\n", + "df" ] }, { @@ -397,7 +450,9 @@ "id": "44c4ceb3", "metadata": {}, "source": [ - "- 基于字典创建" + "- 基于字典创建\n", + "\n", + "例:创建一个第一列为 Column1,第二列为 Column2 的 `DataFrame`。" ] }, { @@ -415,7 +470,8 @@ "outputs": [], "source": [ "data = {'Column1': [1, 2], 'Column2': [3, 4]}\n", - "df = pd.DataFrame(data)" + "df = pd.DataFrame(data)\n", + "df" ] }, { @@ -790,7 +846,7 @@ "outputs": [], "source": [ "import pandas as pd\n", - "\n", + "# 注:直接输入文件绝对路径即可,这里的 os.path.join 是将文件夹的路径和文件名结合一起\n", "df = pd.read_csv(os.path.join(folder_path, \"pwt70_w_country_names.csv\"))" ] }, @@ -1448,18 +1504,46 @@ "source": [ "- `rename()` 函数既可以用于更改行标签,也可以用于列标签。传入一个字典,其中键为当前名称,值为新名称,以更新相应的名称。\n", "\n", - "例:\n", - "1. 将 year 改为 Year,country 改为 Country:\n", - "\n", - "```\n", - "df_renamed = df.rename(columns={'year':Year, 'country':'Country'})\n", - "```\n", - "\n", - "2. 
将所有列名改为小写:\n", - "\n", - "```\n", + "例:将year改为Year,country改为Country。" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'df' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[1;32m/Users/xu/Downloads/python-data-science/ch-pandas/series-dataframe.ipynb 单元格 50\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> 1\u001b[0m df_renamed \u001b[39m=\u001b[39m df\u001b[39m.\u001b[39mrename(columns\u001b[39m=\u001b[39m{\u001b[39m'\u001b[39m\u001b[39myear\u001b[39m\u001b[39m'\u001b[39m:Year,\u001b[39m'\u001b[39m\u001b[39mcountry\u001b[39m\u001b[39m'\u001b[39m:\u001b[39m'\u001b[39m\u001b[39mCountry\u001b[39m\u001b[39m'\u001b[39m})\n\u001b[1;32m 2\u001b[0m df_renamed\u001b[39m.\u001b[39mhead(\u001b[39m5\u001b[39m)\n", + "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined" + ] + } + ], + "source": [ + "df_renamed = df.rename(columns={'year':Year,'country':'Country'})\n", + "df_renamed.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " 例:将所有列名改为小写。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "df_renamed = df.rename(columns=str.lower)\n", - "```" + "df_renamed.head(5)" ] }, { From e61725bc3453120e814593e900ef3a9255d03f7b Mon Sep 17 00:00:00 2001 From: XRubberDuck Date: Wed, 6 Dec 2023 15:54:42 +0800 Subject: [PATCH 2/2] modify pandas --- ch-pandas/data-preprocessing.ipynb | 8 +- ch-pandas/dataframe-slicing.ipynb | 20 +- ch-pandas/series-dataframe.ipynb | 731 +++++++++++++++++++++++++---- 3 files changed, 671 insertions(+), 88 deletions(-) diff --git a/ch-pandas/data-preprocessing.ipynb b/ch-pandas/data-preprocessing.ipynb index a535021..4823db9 100644 --- a/ch-pandas/data-preprocessing.ipynb +++ 
b/ch-pandas/data-preprocessing.ipynb @@ -68,7 +68,7 @@ "source": [ "## 处理重复值\n", "\n", - "检测数据集的记录是否存在重复,可以使用 `.duplicated` 函数进行验证,但是该函数返回的是数据集每一行的检测结果,即 n 行数据会返回 n 个布尔值。为了能够得到最直接的结果,可以使用 `any` 函数。该函数表示的是在多个条件判断中,只要有一个条件为 True,则 `any` 返回的结果为 True。" + "检测数据集的记录是否存在重复,可以使用 `.duplicated()` 函数进行验证,但是该函数返回的是数据集每一行的检测结果,即 n 行数据会返回 n 个布尔值。为了能够得到最直接的结果,可以使用 `any` 函数。该函数表示的是在多个条件判断中,只要有一个条件为 True,则 `any` 返回的结果为 True。" ] }, { @@ -104,7 +104,7 @@ "id": "610f3c20", "metadata": {}, "source": [ - "如果有重复项,可以通过 `.drop_duplicated()` 删除。该函数有 inplace 参数,设置为 True 表示直接在原始数据集上做操作:`df.drop_duplicated(inplace = True)`。\n", + "如果有重复项,可以通过 `.drop_duplicates()` 删除。该函数有 `inplace` 参数,设置为 True 表示直接在原始数据集上做操作:`df.drop_duplicates(inplace = True)`。\n", "\n", "## 处理缺失值\n", "\n", @@ -191,7 +191,7 @@ " \n", "可以使用 `.dropna()` 函数删除有缺失值的行或列。具体形式:`df.dropna(axis=0, how='any', inplace=False)`。\n", "\n", - "这个函数有参数 `axis`,`axis` 用来指定要删除的轴。`axis=0` 表示删除行(默认),axis=1 表示删除列。`how` 用来指定删除的条件。`how='any'` 表示删除包含任何缺失值的行(默认),`how='all'` 表示只删除所有值都是缺失值的行。`inplace` 用于指定是否在原始 `DataFrame` 上进行修改,默认为 False,表示不修改原始 `DataFrame`,而是返回一个新的 `DataFrame`。\n", + "这个函数有参数 `axis`,`axis` 用来指定要删除的轴。`axis=0` 表示删除行(默认),`axis=1` 表示删除列。`how` 用来指定删除的条件。`how='any'` 表示删除包含任何缺失值的行(默认),`how='all'` 表示只删除所有值都是缺失值的行。`inplace` 用于指定是否在原始 `DataFrame` 上进行修改,默认为 False,表示不修改原始 `DataFrame`,而是返回一个新的 `DataFrame`。\n", "\n", "例如,删除包含任何缺失值的行。" ] @@ -697,7 +697,7 @@ "id": "480ca978", "metadata": {}, "source": [ - "在这个例子中,`apply` 有个参数为 `axis`,`axis = 1` 设置函数对每一行操作;`axis = 0`` 设置函数对每一列操作;默认 axis = 0。\n", + "在这个例子中,`apply` 有个参数为 `axis`,`axis = 1` 设置函数对每一行操作;`axis = 0` 设置函数对每一列操作;默认 `axis = 0`。\n", "\n", "例:和 `.loc[]` 一起使用,进行更高级的数据切片。`.apply()` 返回对每一行做条件判断的一系列布尔值,以 `[]` 操作选择部分列。下面的选择条件为:如果 `country` 列属于特定国家,且 `POP > 40000`;如果 `country` 列不属于特定国家,且 `POP < 20000`" ] diff --git a/ch-pandas/dataframe-slicing.ipynb b/ch-pandas/dataframe-slicing.ipynb index 47bfb71..950dc16 100644 --- a/ch-pandas/dataframe-slicing.ipynb +++ b/ch-pandas/dataframe-slicing.ipynb @@ -84,10 
+84,10 @@ "### 使用 `[]` 进行选择\n", "- 选择行\n", "\n", - "直接使用数字索引即可,df[a,b]表示选择 `DataFrame` 的第`a`行到第`b-1`行。\n", + "直接使用数字索引即可,`df[a:b]`表示选择 `DataFrame` 的第`a`行到第`b-1`行。\n", "\n", "```{note}\n", - "Python 中的索引区间都是左闭右开区间,这意味着左边端点可以取到,而右边端点取不到。\n", + "Python中的索引区间都是左闭右开区间,这意味着左边端点可以取到,而右边端点取不到。\n", "```\n", "\n", "例:对上一章节的PWT案例数据 df 选择第 2 行到第 5 行(不包括第 5 行)的数据。\n" @@ -2470,8 +2470,22 @@ } ], "metadata": { + "kernelspec": { + "display_name": "pyds", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" } }, "nbformat": 4, diff --git a/ch-pandas/series-dataframe.ipynb b/ch-pandas/series-dataframe.ipynb index 7bba9dd..b17e0bf 100644 --- a/ch-pandas/series-dataframe.ipynb +++ b/ch-pandas/series-dataframe.ipynb @@ -12,7 +12,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 31, "id": "c3935c76", "metadata": { "execution": { @@ -56,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 32, "id": "9f760577", "metadata": { "execution": { @@ -66,7 +66,22 @@ "shell.execute_reply": "2023-09-11T14:28:40.645276Z" } }, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "0 1\n", + "1 2\n", + "2 3\n", + "3 4\n", + "Name: my_series, dtype: int64" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "s = pd.Series([1, 2, 3, 4], name = 'my_series')\n", "s" @@ -88,7 +103,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 33, "id": "94626599", "metadata": { "execution": { @@ -109,7 +124,7 @@ "Name: my_series, dtype: int64" ] }, - "execution_count": 27, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -132,23 +147,42 @@ }, { "cell_type": "code", - 
"execution_count": 1, + "execution_count": 35, "metadata": {}, "outputs": [ { - "ename": "NameError", - "evalue": "name 'pd' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/Users/xu/Downloads/python-data-science/ch-pandas/series-dataframe.ipynb 单元格 9\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> 1\u001b[0m s2 \u001b[39m=\u001b[39m pd\u001b[39m.\u001b[39mSeries([\u001b[39m1\u001b[39m, \u001b[39m2\u001b[39m, \u001b[39m3\u001b[39m, \u001b[39m4\u001b[39m], name \u001b[39m=\u001b[39m \u001b[39m'\u001b[39m\u001b[39mmy_series\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m 2\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m'\u001b[39m\u001b[39ms+s2结果为\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mformat(s\u001b[39m+\u001b[39ms2))\n\u001b[1;32m 3\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m'\u001b[39m\u001b[39ms-s2结果为\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m{}\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mformat(s\u001b[39m-\u001b[39ms2))\n", - "\u001b[0;31mNameError\u001b[0m: name 'pd' is not defined" + "name": "stdout", + "output_type": "stream", + "text": [ + "s+s2结果为\n", + "0 2\n", + "1 4\n", + "2 6\n", + "3 8\n", + "dtype: int64\n", + "s-s2结果为\n", + "0 0\n", + "1 0\n", + "2 0\n", + "3 0\n", + "dtype: int64\n", + "s*s2结果为\n", + "0 1\n", + "1 4\n", + "2 9\n", + "3 16\n", + "dtype: int64\n", + "s/s2结果为\n", + "0 1.0\n", + "1 1.0\n", + "2 1.0\n", + "3 1.0\n", + "dtype: float64\n" ] } ], "source": [ - "s2 = pd.Series([1, 2, 3, 4], name = 'my_series')\n", + "s2 = pd.Series([1, 2, 3, 4])\n", "print('s+s2结果为\\n{}'.format(s+s2))\n", "print('s-s2结果为\\n{}'.format(s-s2))\n", "print('s*s2结果为\\n{}'.format(s*s2))\n", @@ -167,7 +201,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 36, "id": 
"3ccd82a9", "metadata": { "execution": { @@ -192,7 +226,7 @@ "Name: my_series, dtype: float64" ] }, - "execution_count": 28, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } @@ -211,7 +245,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 37, "id": "4e7d31be", "metadata": { "execution": { @@ -228,7 +262,7 @@ "2.5" ] }, - "execution_count": 29, + "execution_count": 37, "metadata": {}, "output_type": "execute_result" } @@ -239,7 +273,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 38, "id": "2c7599d6", "metadata": { "execution": { @@ -256,7 +290,7 @@ "2.5" ] }, - "execution_count": 30, + "execution_count": 38, "metadata": {}, "output_type": "execute_result" } @@ -267,7 +301,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 39, "id": "0c3aab52", "metadata": { "execution": { @@ -284,7 +318,7 @@ "1.2909944487358056" ] }, - "execution_count": 31, + "execution_count": 39, "metadata": {}, "output_type": "execute_result" } @@ -300,14 +334,14 @@ "source": [ "- `Series` 的索引很灵活。\n", "\n", - "除了上述默认的 index 作为索引,也可以自定义索引方式。\n", + "除了上述默认的序数 index 作为索引,也可以自定义索引方式。\n", "\n", "例:将 s 的 0,1,2,3 的索引依次改为 number1, number2, number3,number4。" ] }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 63, "id": "55e7037b", "metadata": { "execution": { @@ -334,7 +368,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 64, "id": "a9287533", "metadata": { "execution": { @@ -351,7 +385,7 @@ "1" ] }, - "execution_count": 33, + "execution_count": 64, "metadata": {}, "output_type": "execute_result" } @@ -370,7 +404,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 65, "id": "7cd8388f", "metadata": { "execution": { @@ -387,7 +421,7 @@ "True" ] }, - "execution_count": 34, + "execution_count": 65, "metadata": {}, "output_type": "execute_result" } @@ -396,6 +430,41 @@ "'number1' in s" ] }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "- `value_counts()`计算`Series`中唯一值的频率,这个方法返回一个索引为唯一值,值为对应频率的新的`Series`。\n", + "\n", + "这种方法一方面可以快速了解数据集中各个值的分布情况,知道是否有异常值、缺失值或者某些值的频率很低,也有利于后续进行一些可视化处理。\n", + "\n", + "例:对s3序列进行计数,可见A出现的频率最高而C出现的频率最低。" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "A 4\n", + "B 3\n", + "C 1\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s3 = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'A', 'B']) \n", + "s3.value_counts()" + ] + }, { "cell_type": "markdown", "id": "803dff93", @@ -425,7 +494,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 66, "id": "29fda5e0", "metadata": { "execution": { @@ -435,7 +504,68 @@ "shell.execute_reply": "2023-09-11T14:28:40.759646Z" } }, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Name Age City\n", + "0 Alice 25 New York\n", + "1 Bob 30 San Francisco\n", + "2 Charlie 22 Los Angeles" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "names = ['Alice', 'Bob', 'Charlie']\n", "ages = [25, 30, 22]\n", @@ -457,7 +587,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 67, "id": "12cbb41a", "metadata": { "execution": { @@ -467,7 +597,58 @@ "shell.execute_reply": "2023-09-11T14:28:40.765960Z" } }, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Column1 Column2\n", + "0 1 3\n", + "1 2 4" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "data = {'Column1': [1, 2], 'Column2': [3, 4]}\n", "df = pd.DataFrame(data)\n", @@ -534,7 +715,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 68, "metadata": {}, "outputs": [ { @@ -579,7 +760,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 69, "metadata": {}, "outputs": [ { @@ -664,7 +845,7 @@ "max 2.000000 4.000000" ] }, - "execution_count": 38, + "execution_count": 69, "metadata": {}, "output_type": "execute_result" } @@ -686,7 +867,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 70, "metadata": {}, "outputs": [ { @@ -753,7 +934,7 @@ "mean NaN 3.5" ] }, - "execution_count": 39, + "execution_count": 70, "metadata": {}, "output_type": "execute_result" } @@ -776,7 +957,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 71, "id": "ec624d37", "metadata": { "execution": { @@ -833,7 +1014,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 72, "id": "c052691d", "metadata": { "execution": { @@ -860,7 +1041,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 73, "id": "23a5ed20", "metadata": { "execution": { @@ -1066,7 +1247,7 @@ "[5 rows x 37 columns]" ] }, - "execution_count": 42, + "execution_count": 73, "metadata": {}, "output_type": "execute_result" } @@ -1086,7 +1267,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 74, "id": "f84177de", "metadata": { "execution": { @@ -1299,7 +1480,7 @@ "[5 rows x 37 columns]" ] }, - "execution_count": 43, + "execution_count": 74, "metadata": {}, "output_type": "execute_result" } @@ -1318,7 +1499,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 75, "id": "5f0544f4", "metadata": { "execution": { @@ -1394,7 +1575,7 @@ }, { "cell_type": "code", 
- "execution_count": 45, + "execution_count": 76, "id": "821c9174", "metadata": { "execution": { @@ -1448,7 +1629,7 @@ "dtype: object" ] }, - "execution_count": 45, + "execution_count": 76, "metadata": {}, "output_type": "execute_result" } @@ -1467,7 +1648,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 77, "id": "aae06a0c", "metadata": { "execution": { @@ -1489,7 +1670,7 @@ " dtype='object')" ] }, - "execution_count": 46, + "execution_count": 77, "metadata": {}, "output_type": "execute_result" } @@ -1509,38 +1690,426 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 78, "metadata": {}, "outputs": [ { - "ename": "NameError", - "evalue": "name 'df' is not defined", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/Users/xu/Downloads/python-data-science/ch-pandas/series-dataframe.ipynb 单元格 50\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> 1\u001b[0m df_renamed \u001b[39m=\u001b[39m df\u001b[39m.\u001b[39mrename(columns\u001b[39m=\u001b[39m{\u001b[39m'\u001b[39m\u001b[39myear\u001b[39m\u001b[39m'\u001b[39m:Year,\u001b[39m'\u001b[39m\u001b[39mcountry\u001b[39m\u001b[39m'\u001b[39m:\u001b[39m'\u001b[39m\u001b[39mCountry\u001b[39m\u001b[39m'\u001b[39m})\n\u001b[1;32m 2\u001b[0m df_renamed\u001b[39m.\u001b[39mhead(\u001b[39m5\u001b[39m)\n", - "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined" - ] - } - ], - "source": [ - "df_renamed = df.rename(columns={'year':Year,'country':'Country'})\n", - "df_renamed.head(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " 例:将所有列名改为小写。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Country isocode Year POP XRAT Currency_Unit ppp tcgdp cgdp \\\n", + "0 Afghanistan AFG 1950 8150.368 NaN NaN NaN NaN NaN \n", + "1 Afghanistan AFG 1951 8284.473 NaN NaN NaN NaN NaN \n", + "2 Afghanistan AFG 1952 8425.333 NaN NaN NaN NaN NaN \n", + "3 Afghanistan AFG 1953 8573.217 NaN NaN NaN NaN NaN \n", + "4 Afghanistan AFG 1954 8728.408 NaN NaN NaN NaN NaN \n", + "\n", + " cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te \\\n", + "0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "3 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "4 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "\n", + " rgdpl2th rgdptt \n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 NaN NaN \n", + "3 NaN NaN \n", + "4 NaN NaN \n", + "\n", + "[5 rows x 37 columns]" + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_renamed = df.rename(columns={'year':'Year','country':'Country'})\n", + "df_renamed.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " 例:将所有列名改为小写。" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " country isocode year pop xrat currency_unit ppp tcgdp cgdp \\\n", + "0 Afghanistan AFG 1950 8150.368 NaN NaN NaN NaN NaN \n", + "1 Afghanistan AFG 1951 8284.473 NaN NaN NaN NaN NaN \n", + "2 Afghanistan AFG 1952 8425.333 NaN NaN NaN NaN NaN \n", + "3 Afghanistan AFG 1953 8573.217 NaN NaN NaN NaN NaN \n", + "4 Afghanistan AFG 1954 8728.408 NaN NaN NaN NaN NaN \n", + "\n", + " cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te \\\n", + "0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "2 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "3 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "4 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN \n", + "\n", + " rgdpl2th rgdptt \n", + "0 NaN NaN \n", + "1 NaN NaN \n", + "2 NaN NaN \n", + "3 NaN NaN \n", + "4 NaN NaN \n", + "\n", + "[5 rows x 37 columns]" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "df_renamed = df.rename(columns=str.lower)\n", "df_renamed.head(5)" @@ -1556,7 +2125,7 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 80, "id": "9e1b3c2f", "metadata": { "execution": { @@ -1573,7 +2142,7 @@ "RangeIndex(start=0, stop=11400, step=1)" ] }, - "execution_count": 47, + "execution_count": 80, "metadata": {}, "output_type": "execute_result" } @@ -1592,7 +2161,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 81, "id": "ccc0d0c6", "metadata": { "execution": {