diff --git a/notebooks/05_module_pandas.ipynb b/notebooks/05_module_pandas.ipynb
index 721c3da..acee643 100644
--- a/notebooks/05_module_pandas.ipynb
+++ b/notebooks/05_module_pandas.ipynb
@@ -186,7 +186,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# The \"shape\" attribute retruns a tuple with row and column count:\n",
+    "# The \"shape\" attribute returns a tuple with row and column count:\n",
     "df.shape"
    ]
   },
@@ -326,7 +326,7 @@
     "* **`pd.read_excel()`**: import data from Excel files.\n",
     "* ... see [here for an exhaustive list of pandas reader and writer functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).\n",
     "\n",
-    "To illustrate the `read_table()` function, let's try to load the `data/titanic.csv` file. As its name suggest, this table contains data about the ill-fated [Titanic](https://en.wikipedia.org/wiki/Titanic) passengers, travelling from England to New York in April 1912.\n",
+    "To illustrate the `read_table()` function, let's try to load the `data/titanic.csv` file. As its name suggests, this table contains data about the ill-fated [Titanic](https://en.wikipedia.org/wiki/Titanic) passengers, traveling from England to New York in April 1912.\n",
     "\n",
-    "**Tip:** when working with large datasets, it is convenient to be able to look at a fraction of the data only. For this, the methods [**`head()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) and [**`tail()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) are very helpful. Without any argument `head()`/`tail()` display the first/last 5 lines of a DataFrame. A custom number of lines can be displayed by passing a number: e.g. `head(10)` will display the first 10 lines.\n",
+    "**Tip:** when working with large datasets, it is convenient to be able to look at a fraction of the data only. For this, the methods [**`head()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) and [**`tail()`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) are very helpful. Without any argument, `head()`/`tail()` display the first/last 5 lines of a DataFrame. A custom number of lines can be displayed by passing a number: e.g. `head(10)` will display the first 10 lines.\n",
     "\n",
@@ -351,7 +351,7 @@
    "source": [
     "Take a look above at how the data have been read. By default, `read_table()` expects the input data to be **tab-delimited**, but since this is not the case of the `titanic.csv` file, each line was treated as a single field (column), thus creating a DataFrame with a single column.\n",
     "\n",
-    "As implied by its `.csv` extension (for \"comma-separeted values\"), the `titanic.csv` file contains **comma-delimited** values. To load a CSV file, we can either:\n",
+    "As implied by its `.csv` extension (for \"comma-separated values\"), the `titanic.csv` file contains **comma-delimited** values. To load a CSV file, we can either:\n",
     "* Specify the separator value in `read_table(sep=\",\")`.\n",
     "* Use `read_csv()`, a function that will use comma as separator by default.\n",
     "\n",
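+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both options should give the same result. As a quick sketch (assuming `data/titanic.csv` is available, as above), `DataFrame.equals()` can be used to verify that the two calls produce identical DataFrames:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the same comma-delimited file with both functions:\n",
+    "df_csv = pd.read_csv(\"data/titanic.csv\")\n",
+    "df_tab = pd.read_table(\"data/titanic.csv\", sep=\",\")\n",
+    "print(df_csv.equals(df_tab))  # Expected output: True\n",
+    "df_csv.head()"
+   ]
+  },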
@@ -457,7 +457,7 @@
     "\n",
     "\n",
     "\n",
-    "To prevent pandas from wrongly using the values from the first line of the file as column name, we must explicitely tell it that the data contains no header by passing the `header=None` argument:\n",
+    "To prevent pandas from wrongly using the values from the first line of the file as column names, we must explicitly tell it that the data contains no header by passing the `header=None` argument:\n",
     "\n",
     ""
    ]
   },
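+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Relatedly, instead of letting pandas assign numeric column names (0, 1, 2, ...), you can provide your own names at read time with the `names=` argument. A minimal sketch (the file path and names below are placeholders, not actual course files):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Placeholder path and names: adapt them to the actual header-less file.\n",
+    "# Note: the length of \"names\" must match the number of columns.\n",
+    "pd.read_csv(\"data/some_headerless_file.csv\", header=None, names=[\"col_a\", \"col_b\", \"col_c\"]).head()"
+   ]
+  },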
" ] @@ -651,7 +651,7 @@ "### Accessing, editing and adding columns \n", "\n", "DataFrame columns can be accessed, added and modified using the following syntax (here illustrated with a DataFrame is named `df`):\n", - "* `df[\"column name\"]`: returns the content of the specifed column (as a Series).\n", + "* `df[\"column name\"]`: returns the content of the specified column (as a Series).\n", "* `df[\"new column name\"] = value`: creates a new column with the specified values. If the column already\n", " exists, its values are updated.\n", " * When a **single value** is passed as `values`, all rows get that same value.\n", @@ -1417,7 +1417,7 @@ "[Back to ToC](#toc)\n", "\n", "### Conditional row selection (row filtering) \n", - "The **`.loc[]`** indexer allows **row selection based on a bloolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n", + "The **`.loc[]`** indexer allows **row selection based on a boolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n", "* Testing a condition on a **DataFrame** column returns a boolean **Series**: `df[\"age\"] < 35`.\n", "* This Series can then be used to filter the DataFrame: `df.loc[df[\"age\"] < 35, :]`.\n", "* Several **condition can be combined** with the **`&`** (and) and **`|`** (or) operators, e.g.:\n", @@ -1533,7 +1533,7 @@ " ```python\n", " df.loc[df[\"Age\"] <= 35, df.columns[1:3]]\n", " ```\n", - "* If the **index has the same values are row postions (0, 1, 2, ...)**, the `.index` attribute can be \n", + "* If the **index has the same values are row positions (0, 1, 2, ...)**, the `.index` attribute can be \n", " used to get row positions and use them with `.iloc[]`:\n", " ```python\n", " df.iloc[df[df[\"Age\"] <= 35].index, 1:3]\n", @@ -1592,8 +1592,8 @@ "Using the `df` data frame:\n", "\n", "* Select all passengers from the `Barber` family.\n", - "* Select passenger that are either amercian, or older than 30 years.\n", - "* **If you have time:** select british passengers that are either women or men travelling 1st class. The passenger class info is found in the `Pclass` column.\n", + "* Select passenger that are either american, or older than 30 years.\n", + "* **If you have time:** select british passengers that are either women or men traveling 1st class. The passenger class info is found in the `Pclass` column.\n", "\n", "
" ] @@ -1626,7 +1626,7 @@ }, "outputs": [], "source": [ - "# Select passenger that are either amercian, or older than 30 years ...\n" + "# Select passenger that are either american, or older than 30 years ...\n" ] }, { @@ -1995,7 +1995,7 @@ "## Grouping data by factor \n", "---------------------------------\n", "\n", - "When analysing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n", + "When analyzing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n", "\n", "For instance, we earlier computed the proportions of women and men that survived by subsetting the original DataFrame. Using **`groupby()`** can make this a lot easier.\n", "\n", @@ -2013,6 +2013,18 @@ "df.head()" ] }, + { + "cell_type": "markdown", + "metadata": { + "scrolled": true + }, + "source": [ + "* Here we compute mean values of all numeric columns by gender (i.e. the mean value is computed separately\n", + " for \"female\" and \"male\"). \n", + " *Note:* since a mean value can only be computed for numeric values, the argument `numeric_only` must be\n", + " set to `True`." + ] + }, { "cell_type": "code", "execution_count": null, @@ -2021,8 +2033,7 @@ }, "outputs": [], "source": [ - "# Compute means of all numeric columns by gender:\n", - "df.groupby(\"Sex\").mean()" + "df.groupby(\"Sex\").mean(numeric_only=True)" ] }, { @@ -2043,7 +2054,7 @@ "outputs": [], "source": [ "# Compute mean values by gender and passenger class:\n", - "df.groupby([\"Sex\", \"Pclass\"]).mean()" + "df.groupby([\"Sex\", \"Pclass\"]).mean(numeric_only=True)" ] }, { @@ -2115,9 +2126,9 @@ "\n", "### About the example dataset used in the Additional Theory section\n", "\n", - "To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor coactivator-2).\n", + "To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor co-activator-2).\n", "\n", - "The dataset is in the \"tab\" delimited file `data/mouse_heart_gene_expresssion.tsv` and is structured as follows:\n", + "The dataset is in the \"tab\" delimited file `data/mouse_heart_gene_expression.tsv` and is structured as follows:\n", "* Rows contain the expression values of a particular gene (higher values = gene is more expressed).\n", "* Columns corresponds to one sample/condition and contains the expression of values of all genes in that sample.\n", "* The sample names are given in the first row (header). 
@@ -2130,7 +2141,7 @@
     "\n",
     "\n",
     "\n",
-    "Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wildtype) and \"KO\" (knock out), and four replicates for each condition:\n",
+    "Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wild type) and \"KO\" (knock out), and four replicates for each condition:\n",
     "\n",
     "Heart_WT_1 Heart_WT_2 Heart_WT_3 Heart_WT_4 Heart_KO_1 Heart_KO_2 Heart_KO_3 Heart_KO_4"
    ]
   },
@@ -2143,7 +2154,7 @@
    },
    "outputs": [],
    "source": [
-    "df = pd.read_csv(\"data/mouse_heart_gene_expresssion.tsv\", sep='\\t')\n",
+    "df = pd.read_csv(\"data/mouse_heart_gene_expression.tsv\", sep='\\t')\n",
     "df.head()"
    ]
   },
@@ -2189,8 +2200,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "myslice = df['Heart_WT_1']>250\n",
-    "print(type(myslice))"
+    "my_slice = df['Heart_WT_1'] > 250\n",
+    "print(type(my_slice))"
    ]
   },
   {
@@ -2199,7 +2210,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "myslice.head()"
+    "my_slice.head()"
    ]
   },
   {
@@ -2215,8 +2226,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "mymysteriousobj = df[df['Heart_WT_1']>250]\n",
-    "print(type(mymysteriousobj))"
+    "my_mysterious_obj = df[df['Heart_WT_1'] > 250]\n",
+    "print(type(my_mysterious_obj))"
    ]
   },
@@ -2439,7 +2450,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Finding the gene with mininum expression, but on the first 3 rows only\n",
+    "# Finding the gene with minimum expression, but on the first 3 rows only.\n",
     "df[0:3].apply(my_filter)"
    ]
   },
@@ -2469,10 +2480,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "dfavg = pd.DataFrame()\n",
-    "dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
-    "dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
-    "dfavg.head()"
+    "df_avg = pd.DataFrame()\n",
+    "df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+    "df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
+    "df_avg.head()"
    ]
   },
   {
@@ -2483,12 +2494,12 @@
    },
    "outputs": [],
    "source": [
-    "dfavg = pd.DataFrame()\n",
-    "dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
-    "dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
+    "df_avg = pd.DataFrame()\n",
+    "df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+    "df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
     "\n",
-    "dfall = pd.concat([df, dfavg], axis=1)\n",
-    "dfall.head()"
+    "df_all = pd.concat([df, df_avg], axis=1)\n",
+    "df_all.head()"
    ]
   },
   {
@@ -2504,8 +2515,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
-    "df['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
+    "df['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+    "df['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
     "df.head()"
    ]
   },
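+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "An equivalent, more compact sketch: `filter()` with a regular expression selects the replicate columns by name (matching `Heart_WT_1` to `Heart_WT_4` but not the new `Heart_WT_avg` column), and `mean(axis=1)` then averages them row-wise. This should reproduce the same values as above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Select the replicate columns by name pattern, then average row-wise:\n",
+    "wt_avg_check = df.filter(regex=r'Heart_WT_\\\\d').mean(axis=1)\n",
+    "wt_avg_check.head()"
+   ]
+  },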
@@ -2616,7 +2627,7 @@
     "\n",
-    "The [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) methods allow to combine DataFrames, linking their rows based on their keys. \n",
+    "The [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) methods make it possible to combine DataFrames, linking their rows based on their keys. \n",
     "\n",
-    "Here's how we construct a dataframe from a dictionary data structure, where dictionary keys are treated as column names, list of values associated with a key is treated as list of elements in the corresponding column, and rows are contructed based on the index of elements within the list of elements in the column (note however that all columns should have the same length):"
+    "Here's how we construct a DataFrame from a dictionary, where the dictionary keys are treated as column names, the list of values associated with each key provides the elements of the corresponding column, and rows are built by aligning elements by their position within each list (note, however, that all columns must have the same length):"
    ]
   },
@@ -2906,6 +2917,11 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.8.10"
+   },
+   "vscode": {
+    "interpreter": {
+     "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+    }
+   }
   }
  },
  "nbformat": 4,