\n",
"\n",
"DataFrame columns can be accessed, added and modified using the following syntax (here illustrated with a DataFrame named `df`):\n",
- "* `df[\"column name\"]`: returns the content of the specifed column (as a Series).\n",
+ "* `df[\"column name\"]`: returns the content of the specified column (as a Series).\n",
"* `df[\"new column name\"] = value`: creates a new column with the specified values. If the column already\n",
" exists, its values are updated.\n",
"  * When a **single value** is passed as `value`, all rows get that same value.\n",
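A minimal sketch of these access patterns (the column names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Ben", "Carla"], "age": [34, 29, 41]})

ages = df["age"]                        # returns a Series
df["city"] = "Geneva"                   # single value: broadcast to all rows
df["age_in_months"] = df["age"] * 12    # Series: one value per row
print(df)
```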
@@ -1417,7 +1417,7 @@
"[Back to ToC](#toc)\n",
"\n",
"### Conditional row selection (row filtering)\n",
- "The **`.loc[]`** indexer allows **row selection based on a bloolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n",
+ "The **`.loc[]`** indexer allows **row selection based on a boolean (`True`/`False`) vector of values**, returning only rows for which the selection vector values are `True`. This is extremely useful to filter DataFrames.\n",
"* Testing a condition on a **DataFrame** column returns a boolean **Series**: `df[\"age\"] < 35`.\n",
"* This Series can then be used to filter the DataFrame: `df.loc[df[\"age\"] < 35, :]`.\n",
"* Several **conditions can be combined** with the **`&`** (and) and **`|`** (or) operators, e.g.:\n",
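A quick sketch of this filtering pattern (the `age` and `country` columns are invented for the example; note that each combined condition must be wrapped in parentheses):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Anna", "Ben", "Carla"],
    "age": [34, 29, 41],
    "country": ["CH", "US", "CH"],
})

mask = df["age"] < 35                  # boolean Series
young = df.loc[mask, :]                # keeps rows where mask is True
# Combining conditions with & (and): parentheses are required.
young_swiss = df.loc[(df["age"] < 35) & (df["country"] == "CH"), :]
print(young_swiss)
```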
@@ -1533,7 +1533,7 @@
" ```python\n",
" df.loc[df[\"Age\"] <= 35, df.columns[1:3]]\n",
" ```\n",
- "* If the **index has the same values are row postions (0, 1, 2, ...)**, the `.index` attribute can be \n",
+ "* If the **index has the same values as row positions (0, 1, 2, ...)**, the `.index` attribute can be \n",
" used to get row positions and use them with `.iloc[]`:\n",
" ```python\n",
" df.iloc[df[df[\"Age\"] <= 35].index, 1:3]\n",
@@ -1592,8 +1592,8 @@
"Using the `df` DataFrame:\n",
"\n",
"* Select all passengers from the `Barber` family.\n",
- "* Select passenger that are either amercian, or older than 30 years.\n",
- "* **If you have time:** select british passengers that are either women or men travelling 1st class. The passenger class info is found in the `Pclass` column.\n",
+ "* Select passengers that are either American, or older than 30 years.\n",
+ "* **If you have time:** select British passengers that are either women or men traveling 1st class. The passenger class info is found in the `Pclass` column.\n",
"\n",
""
]
@@ -1626,7 +1626,7 @@
},
"outputs": [],
"source": [
- "# Select passenger that are either amercian, or older than 30 years ...\n"
+ "# Select passengers that are either American, or older than 30 years ...\n"
]
},
{
@@ -1995,7 +1995,7 @@
"## Grouping data by factor\n",
"---------------------------------\n",
"\n",
- "When analysing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n",
+ "When analyzing a dataset where some variables (columns) are factors (categorical values), it is often useful to group the samples (rows) by these factors.\n",
"\n",
"For instance, we earlier computed the proportions of women and men that survived by subsetting the original DataFrame. Using **`groupby()`** can make this a lot easier.\n",
"\n",
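As a small sketch of that idea (using a toy stand-in for the Titanic data, with invented values): the mean of a 0/1 `Survived` column per group is exactly the survival proportion.

```python
import pandas as pd

# Toy stand-in for the Titanic DataFrame used in this notebook.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0],
})

# Proportion of survivors per gender, in one call.
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```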
@@ -2013,6 +2013,18 @@
"df.head()"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "scrolled": true
+ },
+ "source": [
+ "* Here we compute mean values of all numeric columns by gender (i.e. the mean value is computed separately\n",
+ " for \"female\" and \"male\"). \n",
+ " *Note:* since a mean value can only be computed for numeric values, the argument `numeric_only` must be\n",
+ " set to `True`."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -2021,8 +2033,7 @@
},
"outputs": [],
"source": [
- "# Compute means of all numeric columns by gender:\n",
- "df.groupby(\"Sex\").mean()"
+ "df.groupby(\"Sex\").mean(numeric_only=True)"
]
},
{
@@ -2043,7 +2054,7 @@
"outputs": [],
"source": [
"# Compute mean values by gender and passenger class:\n",
- "df.groupby([\"Sex\", \"Pclass\"]).mean()"
+ "df.groupby([\"Sex\", \"Pclass\"]).mean(numeric_only=True)"
]
},
{
@@ -2115,9 +2126,9 @@
"\n",
"### About the example dataset used in the Additional Theory section\n",
"\n",
- "To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor coactivator-2).\n",
+ "To illustrate pandas' functionalities, we will here use an example dataset that contains gene expression data. This dataset originates from a [study that investigated stress response in the hearts of mice deficient in the SRC-2 gene](http://www.ncbi.nlm.nih.gov/pubmed/23300926) (transcriptional regulator steroid receptor co-activator-2).\n",
"\n",
- "The dataset is in the \"tab\" delimited file `data/mouse_heart_gene_expresssion.tsv` and is structured as follows:\n",
+ "The dataset is in the tab-delimited file `data/mouse_heart_gene_expression.tsv` and is structured as follows:\n",
"* Each row contains the expression values of a particular gene (higher values = gene is more expressed).\n",
"* Each column corresponds to one sample/condition and contains the expression values of all genes in that sample.\n",
"* The sample names are given in the first row (header). \n",
@@ -2130,7 +2141,7 @@
"\n",
"\n",
"\n",
- "Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wildtype) and \"KO\" (knock out), and four replicates for each condition:\n",
+ "Based on the names, we can guess that we have gene expression values for heart tissue of two types: \"WT\" (wild type) and \"KO\" (knock out), and four replicates for each condition:\n",
"\n",
"Heart_WT_1 Heart_WT_2 Heart_WT_3 Heart_WT_4 Heart_KO_1 Heart_KO_2 Heart_KO_3 Heart_KO_4"
]
@@ -2143,7 +2154,7 @@
},
"outputs": [],
"source": [
- "df = pd.read_csv(\"data/mouse_heart_gene_expresssion.tsv\", sep='\\t')\n",
+ "df = pd.read_csv(\"data/mouse_heart_gene_expression.tsv\", sep='\\t')\n",
"df.head()"
]
},
@@ -2189,8 +2200,8 @@
"metadata": {},
"outputs": [],
"source": [
- "myslice = df['Heart_WT_1']>250\n",
- "print(type(myslice))"
+ "my_slice = df['Heart_WT_1']>250\n",
+ "print(type(my_slice))"
]
},
{
@@ -2199,7 +2210,7 @@
"metadata": {},
"outputs": [],
"source": [
- "myslice.head()"
+ "my_slice.head()"
]
},
{
@@ -2215,8 +2226,8 @@
"metadata": {},
"outputs": [],
"source": [
- "mymysteriousobj = df[df['Heart_WT_1']>250]\n",
- "print(type(mymysteriousobj))"
+ "my_mysterious_obj = df[df['Heart_WT_1']>250]\n",
+ "print(type(my_mysterious_obj))"
]
},
{
@@ -2439,7 +2450,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# Finding the gene with mininum expression, but on the first 3 rows only\n",
+ "# Finding the gene with minimum expression, but on the first 3 rows only.\n",
"df[0:3].apply(my_filter)"
]
},
@@ -2469,10 +2480,10 @@
"metadata": {},
"outputs": [],
"source": [
- "dfavg = pd.DataFrame()\n",
- "dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
- "dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
- "dfavg.head()"
+ "df_avg = pd.DataFrame()\n",
+ "df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+ "df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
+ "df_avg.head()"
]
},
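The same per-row averages can also be computed more compactly with `mean(axis=1)` over a column subset. A sketch, assuming the same `Heart_WT_*` column naming and using toy values:

```python
import pandas as pd

# Toy stand-in for the expression DataFrame used above.
df = pd.DataFrame({
    "Heart_WT_1": [10.0, 20.0], "Heart_WT_2": [12.0, 22.0],
    "Heart_WT_3": [14.0, 18.0], "Heart_WT_4": [12.0, 20.0],
})

df_avg = pd.DataFrame()
# filter(like=...) selects all columns whose name contains the substring;
# mean(axis=1) then averages across those columns, row by row.
df_avg["Heart_WT_avg"] = df.filter(like="Heart_WT").mean(axis=1)
print(df_avg)
```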
{
@@ -2483,12 +2494,12 @@
},
"outputs": [],
"source": [
- "dfavg = pd.DataFrame()\n",
- "dfavg['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
- "dfavg['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
+ "df_avg = pd.DataFrame()\n",
+ "df_avg['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+ "df_avg['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
"\n",
- "dfall = pd.concat([df, dfavg], axis=1)\n",
- "dfall.head()"
+ "df_all = pd.concat([df, df_avg], axis=1)\n",
+ "df_all.head()"
]
},
{
@@ -2504,8 +2515,8 @@
"metadata": {},
"outputs": [],
"source": [
- "df['Heart_WT_avg'] = (df['Heart_WT_1']+df['Heart_WT_2']+df['Heart_WT_3']+df['Heart_WT_4'])/4\n",
- "df['Heart_KO_avg'] = (df['Heart_KO_1']+df['Heart_KO_2']+df['Heart_KO_3']+df['Heart_KO_4'])/4\n",
+ "df['Heart_WT_avg'] = (df['Heart_WT_1'] + df['Heart_WT_2'] + df['Heart_WT_3'] + df['Heart_WT_4'])/4\n",
+ "df['Heart_KO_avg'] = (df['Heart_KO_1'] + df['Heart_KO_2'] + df['Heart_KO_3'] + df['Heart_KO_4'])/4\n",
"df.head()"
]
},
@@ -2616,7 +2627,7 @@
"\n",
"The [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) methods make it possible to combine DataFrames, linking their rows based on their keys. \n",
"\n",
- "Here's how we construct a dataframe from a dictionary data structure, where dictionary keys are treated as column names, list of values associated with a key is treated as list of elements in the corresponding column, and rows are contructed based on the index of elements within the list of elements in the column (note however that all columns should have the same length):"
+ "Here's how we construct a DataFrame from a dictionary: the dictionary keys are treated as column names, the list of values associated with each key becomes the corresponding column, and rows are constructed from the positions of elements within each list (note, however, that all columns must have the same length):"
]
},
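A small sketch of both ideas together, using made-up gene annotation tables (the gene names and values are illustrative only):

```python
import pandas as pd

# Dictionary keys become column names; each list becomes a column
# (all lists must have the same length).
expression = pd.DataFrame({
    "gene": ["Src2", "Actb", "Gapdh"],
    "Heart_WT_avg": [120.5, 980.0, 450.2],
})
annotation = pd.DataFrame({
    "gene": ["Actb", "Gapdh", "Src2"],
    "chromosome": ["5", "6", "1"],
})

# merge() links rows of the two DataFrames on the shared "gene" key.
merged = expression.merge(annotation, on="gene")
print(merged)
```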
{
@@ -2906,6 +2917,11 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
+ }
}
},
"nbformat": 4,