diff --git a/notebooks/05_module_pandas.ipynb b/notebooks/05_module_pandas.ipynb index acee643..43e1f76 100644 --- a/notebooks/05_module_pandas.ipynb +++ b/notebooks/05_module_pandas.ipynb @@ -9,9 +9,11 @@ "-----------------------------------------------------------------------------------------------------\n", "\n", "
\n", - "Note: this notebook contains Additional Theory sections (in blue boxes, like this one) that will be skipped during the class due to time constraints. If you are going through this notebook on your own, feel free to\n", + "\n", + "Note: this notebook contains Additional Material sections (in blue boxes, like this one) that will be skipped during the class due to time constraints. If you are going through this notebook on your own, feel free to\n", "read these sections or skip them depending on your interest.\n", - "
" + "\n", + "
" ] }, { @@ -23,51 +25,51 @@ "[**Introduction**](#0)\n", "\n", "[**Pandas data structures: DataFrame and Series**](#1) \n", - "    [Creating a new DataFrame](#2) \n", - "    [[Additional Theory] Creating a pandas Series](#3) \n", - "    [Micro Exercise 1 - Create a DataFrame](#4) \n", + "    [Creating a new DataFrame](#1.1) \n", + "    [[Additional Material] Creating a pandas Series](#1.2) \n", + "    [Micro Exercise 1 - Create a DataFrame](#1.3) \n", "\n", - "[**Reading tabular data from files**](#5) \n", - "    [[Additional Theory] Reading input data with no header](#6) \n", - "    [Dropping rows with missing values (`NaN`)](#7) \n", + "[**Reading tabular data from files**](#2) \n", + "    [[Additional Material] Reading input data with no header](#2.1) \n", + "    [Dropping rows with missing values (`NaN`)](#2.2) \n", "\n", - "[**Writing files to disk**](#8) \n", - "    [Micro Exercise 2 - Read and write data frames](#9) \n", + "[**Writing files to disk**](#3) \n", + "    [Micro Exercise 2 - Read and write data frames](#3.1) \n", "\n", - "[**Accessing, editing and adding columns and rows**](#10) \n", - "    [Accessing, editing and adding columns](#11) \n", - "    [Operations on columns](#12) \n", - "    [Accessing, editing and adding rows](#13) \n", - "    [[Additional Theory] Adding a Series as a row to a DataFrame with `pd.concat()`](#14) \n", + "[**Accessing, editing and adding columns and rows**](#4) \n", + "    [Accessing, editing and adding columns](#4.1) \n", + "    [Operations on columns](#4.2) \n", + "    [Accessing, editing and adding rows](#4.3) \n", + "    [[Additional Material] Adding a Series as a row to a DataFrame with `pd.concat()`](#4.4) \n", "\n", - "[**Deleting columns and rows**](#15) \n", - "    [Micro Exercise 3 - Editing a DataFrame](#16)\n", + "[**Deleting columns and rows**](#5) \n", + "    [Micro Exercise 3 - Editing a DataFrame](#5.1)\n", "\n", - "[**DataFrame subsetting: the `loc[]` and `iloc[]` indexers**](#17) \n", - "    
[Conditional row selection](#18) \n", - "    [[Additional Theory] Mixed selection](#19) \n", - "    [Micro Exercise 3 - DataFrame selection](#20) \n", + "[**DataFrame subsetting: the `loc[]` and `iloc[]` indexers**](#6) \n", + "    [Conditional row selection](#6.1) \n", + "    [[Additional Material] Mixed selection](#6.2) \n", + "    [Micro Exercise 4 - DataFrame selection](#6.3) \n", "\n", - "[**[Additional Theory] Creating copies of a DataFrame**](#21) \n", + "[**[Additional Material] Creating copies of a DataFrame**](#7) \n", "\n", - "[**Summary statistics on DataFrame or Series**](#22) \n", - "    [Micro Exercise 5 - Summary statistics](#23) \n", - "    [[Additional Theory] Applying custom functions by rows or columns](#24) \n", - "    [[Additional Theory] Micro Exercise 6 - applying custom functions](#25) \n", + "[**Summary statistics on DataFrames and Series**](#8) \n", + "    [Micro Exercise 5 - Summary statistics](#8.1) \n", + "    [[Additional Material] Applying custom functions by rows or columns](#8.2) \n", + "    [[Additional Material] Micro Exercise 6 - Applying custom functions](#8.3) \n", "\n", - "[**Grouping data by factor**](#26) \n", - "    [Micro Exercise 7 - Grouping data](#27) \n", + "[**Grouping data by factor**](#9) \n", + "    [Micro Exercise 7 - Grouping data](#9.1) \n", "\n", - "[**Exercises 5.1 - 5.3**](#28)\n", + "[**Exercises 5.1 - 5.3**](#10)\n", "\n", - "[**Additional Theory**](#29) \n", - "    [Selecting/Filtering dataframes](#31) \n", - "    [Sorting operations on dataframes](#32) \n", - "    [Extending a dataframe by adding new columns](#33) \n", - "    [Use of numpy functions with pandas dataframes](#34) \n", - "    [Merge and join DataFrames](#35) \n", - "    [Cross-tabulated tables](#36) \n", - "    [Plotting with pandas and matplotlib](#37) " + "[**Additional Material**](#11) \n", + "    [Selecting/Filtering dataframes](#11.1) \n", + "    [Sorting operations on dataframes](#11.2) \n", + "    [Extending a dataframe by adding new 
columns](#11.3) \n", + "    [Use of numpy functions with pandas dataframes](#11.4) \n", + "    [Merge and join DataFrames](#11.5) \n", + "    [Cross-tabulated tables](#11.6) \n", + "    [Plotting with pandas and matplotlib](#11.7) " ] }, { @@ -136,7 +138,7 @@ "source": [ "
\n", "\n", - "### Creating a new DataFrame \n", + "### Creating a new DataFrame \n", "To create a new pandas DataFrame, we pass a `dict` (dictionary) to **`pd.DataFrame()`**, where:\n", "\n", "* The dictionary's **keys are column names**.\n", @@ -240,7 +242,7 @@ "\n", "
\n", "\n", - "### [Additional Theory] Creating a pandas Series \n", + "### [Additional Material] Creating a pandas Series \n", "To create a new pandas **Series**, we pass a sequence (e.g. list, tuple, dict, generator) to **`pd.Series()`**:\n", - "* The optional `name` argument allows to associate a \"name\" to the Series.\n", + "* The optional `name` argument allows associating a \"name\" with the Series.\n", "* As with `pd.DataFrame()`, an optional `index` argument can be passed (by default the index\n", @@ -253,6 +255,7 @@ "* Its **name**: retrieved with **`s.name`**.\n", "\n", "**Example:**\n", + "\n", "
" ] }, @@ -274,7 +277,7 @@ "\n", "
\n", "\n", - "### Micro Exercise 1 - Create a DataFrame\n", + "### Micro Exercise 1 - Create a DataFrame\n", "* Create a new DataFrame named `df_size` with the following structure:\n", "\n", "Name | Height | Weight\n", @@ -313,7 +316,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## Reading tabular data from files \n", + "## Reading tabular data from files \n", "------------------------------------------\n", "\n", "In the previous section we have seen how to create a DataFrame from scratch. But very frequently our **data is already in a tabular format** in a file, and simply needs to be imported.\n", @@ -423,11 +426,11 @@ "\n", "
\n", "\n", - "### [Additional Theory] Reading input data with no header \n", + "### [Additional Material] Reading input data with no header \n", "\n", - "To better understand how pandas reads in data let's try to import the Titanic data set stripped from its header.\n", + "To better understand how pandas reads in data, let's try to import the Titanic data set stripped of its header.\n", " \n", - "
" + "
" ] }, { @@ -459,7 +462,7 @@ "\n", "To prevent pandas from wrongly using the values from the first line of the file as column name, we must explicitly tell it that the data contains no header by passing the `header=None` argument:\n", "\n", - "
" + "
" ] }, { @@ -481,7 +484,7 @@ "This looks much better as there is no misinterpretation of the actual data.\n", "We can also see that, when there are no column names in a file, pandas default to using numbers (starting from 0) as column names.\n", " \n", - "
" + "
" ] }, { @@ -492,7 +495,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Dropping rows with missing values (`NaN`) \n", + "### Dropping rows with missing values (`NaN`) \n", "\n", "Datasets frequently contain rows with missing data, indicated as `NaN` or `NA` (stands for \"not a number\" - but it is used even if the column type is not numeric).\n", "\n", @@ -540,7 +543,7 @@ "[Back to ToC](#toc)\n", "\n", "\n", - "## Writing files to disk \n", + "## Writing files to disk \n", "----------------------------\n", "\n", "Just like when reading files, pandas provides a number of functions, depending on the format of the output file we wish to write:\n", @@ -620,7 +623,7 @@ "\n", "
\n", "\n", - "### Micro Exercise 2 - Read and write data frames \n", + "### Micro Exercise 2 - Read and write data frames \n", "\n", "* Read the `titanic.csv` file from disk (it is located in the `data` directory) and save it to a variable named `titanic_df`.\n", "* Create a new DataFrame that only contains the first 10 rows of `titanic_df`, name it `titanic_df_head`. **Hint:** remember the `head`/`tail` functions we have seen just earlier.\n", @@ -645,10 +648,10 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## Accessing, editing and adding columns and rows \n", + "## Accessing, editing and adding columns and rows \n", "-----------------------------------------------------------------\n", "\n", - "### Accessing, editing and adding columns \n", + "### Accessing, editing and adding columns \n", "\n", - "DataFrame columns can be accessed, added and modified using the following syntax (here illustrated with a DataFrame is named `df`):\n", + "DataFrame columns can be accessed, added and modified using the following syntax (here illustrated with a DataFrame named `df`):\n", "* `df[\"column name\"]`: returns the content of the specified column (as a Series).\n", @@ -787,9 +790,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "# Reminder: \"position 1\" corresponds to the 2nd column of the DataFrame.\n", @@ -805,7 +806,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Operations on columns \n", + "### Operations on columns \n", "\n", - "Pandas DataFrame allows the use **arithmetic operators on columns**, which are interpreted as applying the operations to each row of the DataFrame:" + "Pandas DataFrames allow the use of **arithmetic operators on columns**, which are interpreted as applying the operations to each row of the DataFrame:" ] }, @@ -851,7 +852,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Accessing, editing and adding rows \n", + "### Accessing, editing and adding rows \n", "\n", - "**Adding a new row** to a DataFrame can be done using the **`loc[]` indexer** (there are other possibilities too - see the Additional Theory section below):\n", + "**Adding a new row** to a DataFrame can be done using the **`loc[]` indexer** (there are other possibilities too - see the Additional Material section below):\n", "* `df.loc[] = values`, where values must be a sequence (e.g. list or tuple) with the\n", @@ -950,13 +951,13 @@ "\n", "
\n", "\n", - "### [Additional Theory] Adding a Series as a row to a DataFrame with `pd.concat()` \n", + "### [Additional Material] Adding a Series as a row to a DataFrame with `pd.concat()` \n", "\n", "`pd.concat()` is a method generally used to concatenate DataFrames (either along rows or columns). It can be used to add a Series as a new row to an existing DataFrame.\n", "\n", "1. Create a new Series object\n", "\n", - "
" + "
" ] }, { @@ -981,15 +982,13 @@ " * For the concatenation to work, the **index of the Series must correspond to the column names of the \n", " DataFrame**.\n", "\n", - "
" + "
" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "df = pd.concat((df, new_passenger.to_frame().T), ignore_index=True)\n", @@ -1005,7 +1004,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## Deleting columns and rows \n", + "## Deleting columns and rows \n", "--------------------------------------\n", "\n", "To remove a column or a row from a DataFrame, the `.drop()` method can be used (illustrated here with a DataFrame named `df`):\n", @@ -1023,14 +1022,14 @@ "\n", "
\n", "\n", - "**[Additional Theory] Deleting columns using `del` or `pop()`:**\n", + "**[Additional Material] Deleting columns using `del` or `pop()`:**\n", "\n", "* `del df[\"column name\"]`: deletes a column from the DataFrame. Note that the syntax is the same as when \n", " removing a key from a `dict`.\n", - "* `df.pop(\"col name\")`: deletes column and returns it as a panda Series. Again the syntax is the same\n", + "* `df.pop(\"col name\")`: deletes the column and returns it as a pandas Series. Again the syntax is the same\n", " as when using the `pop()` method of a dictionary.\n", "\n", - "
" + "
" ] }, { @@ -1136,7 +1135,7 @@ "
\n", "
\n", "\n", - "### Micro Exercise 3 - Editing a DataFrame \n", + "### Micro Exercise 3 - Editing a DataFrame \n", "\n", "Perform the following tasks on the `df_size` data frame that you created in Micro Exercise 1:\n", "* Add an entry to the DataFrame for \"Tim\", who measures 191 cm and weights 95 Kg.\n", @@ -1182,7 +1181,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## DataFrame subsetting: the `loc[]` and `iloc[]` indexers \n", + "## DataFrame subsetting: the `loc[]` and `iloc[]` indexers \n", "----------------------------------------------------------------------\n", "\n", "A very common operation to perform on DataFrames is to create a subset by selecting certain rows and/or columns. \n", @@ -1240,9 +1239,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "df.loc[0:3, : ] # This selects the first 4 rows." @@ -1416,7 +1413,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Conditional row selection (row filtering) \n", + "### Conditional row selection (row filtering) \n", "The **`.loc[]`** indexer allows **row selection based on a boolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n", "* Testing a condition on a **DataFrame** column returns a boolean **Series**: `df[\"age\"] < 35`.\n", "* This Series can then be used to filter the DataFrame: `df.loc[df[\"age\"] < 35, :]`.\n", @@ -1452,9 +1449,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "# Select people with age < 35.\n", @@ -1519,7 +1514,7 @@ "\n", "
\n", " \n", - "### [Additional Theory] Mixed selection \n", + "### [Additional Material] Mixed selection \n", "\n", "A frequent situation is that we want to select rows based on a certain condition, e.g. `df[\"Age\"] <= 35`, \n", - "and columns based on position , e.g. `1:3` to select the second and third columns. The problem is then the following:\n", + "and columns based on position, e.g. `1:3` to select the second and third columns. The problem is then the following:\n", @@ -1547,7 +1542,7 @@ "\n", "**Examples:**\n", " \n", - "
" + "
" ] }, { @@ -1587,7 +1582,7 @@ "\n", "
\n", "\n", - "### Micro Exercise 4 - DataFrame selection\n", + "### Micro Exercise 4 - DataFrame selection\n", "\n", "Using the `df` data frame:\n", "\n", @@ -1640,12 +1635,12 @@ "\n", "
\n", "\n", - "## [Additional Theory] Creating copies of a DataFrame \n", + "## [Additional Material] Creating copies of a DataFrame \n", "-----------------------------------------------------\n", "\n", - "Since DataFrame are mutable, assigning a DataFrame to a new variable does **not create a copy**: it creates a new pointer to the same DataFrame.\n", + "Since DataFrames are mutable, assigning a DataFrame to a new variable does **not create a copy**: it creates a new pointer to the same DataFrame.\n", "\n", - "
" + "
" ] }, { @@ -1669,7 +1664,7 @@ "Assigning the data frame to another variable **does not create a copy** . \n", "Here, modifying `test_copy` will also modify `test_df` (since they point to the same object).\n", "\n", - "
" + "
" ] }, { @@ -1694,7 +1689,7 @@ "\n", "To make a copy, we must instead use the **`.copy()`** method, or perform an indexing operation (here with `.loc[:]`).\n", "\n", - "
" + "
" ] }, { @@ -1721,7 +1716,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## Summary statistics on DataFrames and Series \n", + "## Summary statistics on DataFrames and Series \n", "-------------------------------------------------------------\n", "\n", "When doing exploratory analysis of a dataset, it is often useful get some **basic statistics** on a **per-column** basis (since rows will typically represent the samples and columns the explanatory variables).\n", @@ -1810,9 +1805,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "print(\"Distribution of passengers by class:\", df[\"Pclass\"].value_counts(), \"\\n\", sep=\"\\n\")" @@ -1826,7 +1819,7 @@ "\n", "
\n", "\n", - "### Micro Exercise 5 - summary statistics \n", + "### Micro Exercise 5 - summary statistics \n", "* Which proportion of women and men survived the titanic tragedy?\n", "* **If you have time:** write your answer as a `for` loop.\n", "\n", @@ -1843,7 +1836,10 @@ { "cell_type": "markdown", "metadata": { - "collapsed": true + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } }, "source": [ "
\n", @@ -1852,7 +1848,7 @@ "\n", "
\n", "\n", - "### [Additional Theory] Applying custom functions by rows or columns \n", + "### [Additional Material] Applying custom functions by rows or columns \n", "\n", - "As we have just seen, pandas DataFrame have a number of built-in methods (e.g. `describe()`, `count()`, `mean()`) that can be applied row-wise or column-wise.\n", + "As we have just seen, pandas DataFrames have a number of built-in methods (e.g. `describe()`, `count()`, `mean()`) that can be applied row-wise or column-wise.\n", "\n", @@ -1871,7 +1867,7 @@ "**Examples:**\n", - "* Apply a custom function to each value of a DataFrame's column. In this example, we expand the abbreviated value of the port of embarkation, to its full value.\n", + "* Apply a custom function to each value of a DataFrame's column. In this example, we expand the abbreviated value of the port of embarkation to its full value.\n", "\n", - "
" + "
" ] }, { @@ -1908,7 +1904,8 @@ "
\n", "\n", "* Apply a custom function by columns and rows. Here the custom function simply selects the value with the largest number of characters (in either the column or row).\n", - "
" + "\n", + "
" ] }, { @@ -1962,7 +1959,7 @@ "
\n", "
\n", "\n", - "### [Additional Theory] Micro Exercise 6 - Applying custom functions \n", + "### [Additional Material] Micro Exercise 6 - Applying custom functions \n", "\n", "* Write your own implementation of the `mean()` function, then apply it to the \"Age\" and \"Fare\" columns.\n", "* Verify your results against the result obtained by `df.mean()`.\n", @@ -1973,7 +1970,7 @@ " For this second solution, you can use the `isnan()` function from the `math` module to test whether\n", " a value is `NaN` or not.\n", "\n", - "
" + "
" ] }, { @@ -1992,7 +1989,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "## Grouping data by factor \n", + "## Grouping data by factor \n", "---------------------------------\n", "\n", "When analyzing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n", @@ -2065,7 +2062,7 @@ "\n", "
\n", "\n", - "### Micro Exercise 7 - Grouping data\n", + "### Micro Exercise 7 - Grouping data\n", "\n", "* Make a copy of the `df` data frame with `df.copy()`. Name it `dfc`, as shown here:\n", "```python\n", @@ -2101,7 +2098,7 @@ "
\n", "
\n", "\n", - "## Exercise 5.1 - 5.3 \n", + "## Exercise 5.1 - 5.3 \n", "-------------------------" ] }, @@ -2120,10 +2117,13 @@ "\n", "[Back to ToC](#toc)\n", "\n", + "
\n", "\n", - "## Additional Theory \n", + "## Additional Material \n", "-----------------------------\n", "\n", + "
\n", + "\n", "### About the example dataset used in the Additional Theory section\n", "\n", "To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor co-activator-2).\n", @@ -2167,7 +2167,7 @@ "[Back to ToC](#toc)\n", "\n", "\n", - "### Selecting/Filtering dataframes \n", + "### Selecting/Filtering dataframes \n", "\n", "Lets read the dataframe again." ] @@ -2189,6 +2189,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "
\n", + "\n", "And filter it based on some criteria that we may be interested in. \n", "\n", "For example say we want to find genes that have at least 250 reads in the 'Heart_WT_1' sample. We would do it like this: first we find the genes that satisfy the condition:" @@ -2217,6 +2219,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "
\n", + "\n", "Applying the `>` operator returns a boolean Series with the result of the function on every element of the Series. Then, to select the corresponding elements of the dataframe, we use the boolean Series to slice the original dataframe:" ] }, @@ -2243,6 +2247,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "
\n", + "\n", "We can design more complicated filters, as below, we select genes that have more than 250 reads in WT samples, less than 150 in all KO samples:" ] }, @@ -2265,9 +2271,14 @@ { "cell_type": "markdown", "metadata": { - "collapsed": true + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } }, "source": [ + "
\n", + "\n", "We can also slice the result of filtering. For example, let's say that we want to extract the genes with more than 250 reads in the first WT and less than 50 reads the first KO sample but then also only keep these two columns of the data." ] }, @@ -2339,7 +2350,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Sorting operations on dataframes \n", + "### Sorting operations on dataframes \n", "\n", "DataFrames can be sorted on one or more specific column(s) using `sort_values():" ] @@ -2365,9 +2376,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "df.sort_index(ascending=True).head()" @@ -2399,9 +2408,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "df.min(axis=1).head()" @@ -2469,7 +2476,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Extending a dataframe by adding new columns \n", + "### Extending a dataframe by adding new columns \n", "\n", "We can set up a new dataframe and concatenate it to the original dataframe using the `concat` method:" ] @@ -2528,7 +2535,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Use of numpy functions with pandas dataframes \n", + "### Use of numpy functions with pandas dataframes \n", "\n", "\n", "Let's say we want to calculate the log average expression value. We could do it like this:" @@ -2623,7 +2630,7 @@ "\n", "[Back to ToC](#toc)\n", "\n", - "### Merge and join DataFrames \n", + "### Merge and join DataFrames \n", "\n", "The [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) methods allow to combine DataFrames, linking their rows based on their keys. 
\n", "\n", @@ -2721,9 +2728,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "df.head()" @@ -2773,7 +2778,7 @@ "source": [ "
\n", "\n", - "### Cross-tabulated tables \n", + "### Cross-tabulated tables \n", "\n", "Cross-tabulated tables for two (or more) columns (factors) of a DataFrame can be created with **`pd.crosstab()`**." ] @@ -2800,7 +2805,7 @@ "[Back to ToC](#toc)\n", "\n", "\n", - "### Plotting with pandas and matplotlib \n", + "### Plotting with pandas and matplotlib \n", "\n", "Now let's explore our data a bit. First, a matrix of scatter plots for all pairwise sample comparisons (note: the cell below can take 10-20 seconds to compute):" ] @@ -2808,9 +2813,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": false - }, + "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", @@ -2916,7 +2919,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.10" + "version": "3.10.12" }, "vscode": { "interpreter": { @@ -2925,5 +2928,5 @@ } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 }