Commit

Merge pull request #34 from fhdsl/S4
S4
caalo authored Nov 13, 2024
2 parents b19a217 + 809de85 commit a31c02d
Showing 160 changed files with 14,068 additions and 33 deletions.
6 changes: 3 additions & 3 deletions 02-data-structures.Rmd
@@ -105,7 +105,7 @@ Object methods are functions that does something with the object you are using i
Here are some more examples of methods with lists:

| Function method | What it takes in | What it does | Returns |
|---------------|---------------|---------------------------|---------------|
|------------------------------------------------------------------------------|------------------------------|-----------------------------------------------------------------------|----------------------------------|
| [`chrNum.count(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Counts the number of instances `x` appears as an element of `chrNum`. | Integer |
| [`chrNum.append(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Appends `x` to the end of the `chrNum`. | None (but `chrNum` is modified!) |
| [`chrNum.sort()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Sorts `chrNum` by ascending order. | None (but `chrNum` is modified!) |
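
The three list methods in the table can be tried on a small list. This is a minimal sketch: the values stored in `chrNum` here are made up for illustration and are not the course's data.

```python
# Hypothetical values; the actual contents of chrNum are not shown in this excerpt.
chrNum = [2, 3, 1, 2, 2]

# .count(x) returns an integer and leaves the list unchanged.
n_twos = chrNum.count(2)

# .append(x) modifies the list in place and returns None.
result = chrNum.append(3)

# .sort() also modifies the list in place (ascending order) and returns None.
chrNum.sort()

print(n_twos)   # number of times 2 appears
print(result)   # None, because append only modifies chrNum
print(chrNum)   # the appended and sorted list
```

Because `.append()` and `.sort()` return None, writing `chrNum = chrNum.sort()` would replace the list with None, which is a common mistake.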
@@ -186,7 +186,7 @@ metadata.tail()

Both of these functions (without input arguments) are considered **methods**: they are functions that do something with the Dataframe you are using them on. You should think about `metadata.head()` as a function that takes in `metadata` as an input. If we had another Dataframe called `my_data` and you wanted to use the same function, you would have to say `my_data.head()`.

## Subsetting Dataframes
## Subsetting Dataframes

Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly as we did for lists.

@@ -207,7 +207,7 @@ Here is how the dataframe looks like with the row and column index numbers:

Subset the first four rows, and the first two columns:

![](images/pandas subset_1.png)
![](images/pandas%20subset_1.png)
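
A subset like the one pictured can be sketched with `.iloc` on a toy Dataframe. The Dataframe `df` and its values below are hypothetical, and the example assumes the intended subset is the first four rows and the first two columns.

```python
import pandas as pd

# Toy Dataframe with made-up values (not the course data).
df = pd.DataFrame({
    "ModelID": ["ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004", "ACH-000005"],
    "Age": [60.0, 36.0, 72.0, 30.0, 30.0],
    "Sex": ["Female", "Male", "Female", "Female", "Male"],
})

# Rows 0 to 3 and columns 0 to 1: as with lists, the end of each slice is excluded.
subset = df.iloc[0:4, 0:2]
print(subset)
```

The slice `0:4` selects four rows (positions 0, 1, 2, 3), which mirrors how slicing works on lists.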

Now, back to the `metadata` Dataframe:

12 changes: 6 additions & 6 deletions 03-data-wrangling1.Rmd
@@ -65,7 +65,7 @@ expression.head()
```

| Dataframe | The observation is | Some variables are | Some values are |
|-----------------|-----------------|--------------------|------------------|
|------------|--------------------|-------------------------------|-----------------------------|
| metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" |
| expression | Cell line | KRAS_Exp | 2.4, .3 |
| mutation | Cell line | KRAS_Mut | TRUE, FALSE |
@@ -82,9 +82,9 @@ Here's a starting prompt:
We have been using **explicit subsetting** with numerical indices, such as "I want to filter for rows 20-50 and select columns 2 and 8". We are now going to switch to **implicit subsetting**, in which we describe the subsetting criteria via comparison operators and column names, such as:

*"I want to subset for rows such that the OncotreeLineage is breast cancer and subset for columns Age and Sex."*
*"I want to subset for rows such that the OncotreeLineage is lung cancer and subset for columns Age and Sex."*

Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns.This is because we are guaranteed to have column names in Dataframes, but not row names.
Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names.

#### Let's convert our implicit subsetting criteria into code!

@@ -94,7 +94,7 @@ To subset for rows implicitly, we will use the conditional operators on Datafram
metadata['OncotreeLineage'] == "Lung"
```
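
It may help to see what the comparison above evaluates to. The sketch below uses a hypothetical toy Dataframe, `metadata_toy`, with made-up values: comparing a column to a value gives a Series of True/False values, one per row.

```python
import pandas as pd

# Toy stand-in for metadata; the values are made up for illustration.
metadata_toy = pd.DataFrame({
    "OncotreeLineage": ["Lung", "Myeloid", "Lung", "Breast"],
    "Age": [60.0, 36.0, 72.0, 30.0],
    "Sex": ["Female", "Male", "Male", "Female"],
})

# The comparison is applied to every row and returns a boolean Series.
is_lung = metadata_toy["OncotreeLineage"] == "Lung"
print(is_lung)

# True counts as 1, so summing the Series counts the matching rows.
print(is_lung.sum())
```

This boolean Series is what tells pandas which rows to keep when it is used for subsetting.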

Then, we will use the [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) operation (which is different than [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time:
Then, we will use the [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) attribute (which is different than [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) attribute!) and subsetting brackets to subset rows and columns Age and Sex at the same time:

```{python}
metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]]
@@ -127,7 +127,7 @@ Now that your Dataframe has be transformed based on your scientific question, yo
If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples:

| Function method | What it takes in | What it does | Returns |
|----------------|----------------|------------------------|----------------|
|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|-------------------------------------------------------------------------------|---------------|
| [`metadata.Age.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html) | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) |
| [`metadata['Age'].median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) |
| [`metadata.Age.max()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. | Float (NumPy) |
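
The methods listed above can be tried on a small example. This is a sketch on a hypothetical Dataframe, `metadata_toy`, with made-up ages. It also shows that `metadata_toy.Age` and `metadata_toy['Age']` refer to the same Series.

```python
import pandas as pd

# Toy stand-in for metadata; the ages are made up for illustration.
metadata_toy = pd.DataFrame({"Age": [60.0, 36.0, 72.0, 30.0]})

# A single column is a Series, not a List.
print(type(metadata_toy["Age"]))

# Dot notation and bracket notation give the same column.
mean_age = metadata_toy.Age.mean()
median_age = metadata_toy["Age"].median()
max_age = metadata_toy.Age.max()

print(mean_age, median_age, max_age)
```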
@@ -147,7 +147,7 @@ Notice that the output of some of these methods are Float (NumPy). This refers t
We will dedicate extensive time later in this course to talk about data visualization, but the Dataframe's column, Series, has a method called [`.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but that is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.

| Plot style | Useful for | kind = | Code |
|-------------|-------------|-------------|---------------------------------|
|------------|------------|--------|--------------------------------------------------------------|
| Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` |
| Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |

4 changes: 2 additions & 2 deletions 04-data-wrangling2.Rmd
@@ -146,7 +146,7 @@ To get there, we need to:

- **Summarize** each group via a summary statistic performed on a column, such as `Age`.

We first subset the two columns we need, and then use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`.
We use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`.

```{python}
metadata_grouped = metadata.groupby("OncotreeLineage")
@@ -155,7 +155,7 @@ metadata_grouped['Age'].mean()

Here's what's going on:

- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped.
- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a **Grouped Dataframe object**. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped.

- We subset to the column `Age`. The grouping information still persists (This is a Grouped Series object).
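
The steps described so far can be followed one object at a time on a toy example. This is a sketch with a hypothetical Dataframe, `metadata_toy`, and made-up values. In pandas, the "Grouped Dataframe" and "Grouped Series" objects are named `DataFrameGroupBy` and `SeriesGroupBy`.

```python
import pandas as pd

# Toy stand-in for metadata; the values are made up for illustration.
metadata_toy = pd.DataFrame({
    "OncotreeLineage": ["Lung", "Myeloid", "Lung", "Myeloid"],
    "Age": [60.0, 36.0, 72.0, 30.0],
})

# Step 1: group by a column. Nothing is computed yet; pandas only records the groups.
grouped = metadata_toy.groupby("OncotreeLineage")
print(type(grouped).__name__)

# Step 2: subset to one column. The grouping information is kept.
grouped_age = grouped["Age"]
print(type(grouped_age).__name__)

# Step 3: summarize each group, giving one value per OncotreeLineage.
mean_age = grouped_age.mean()
print(mean_age)
```

The result of the last step is an ordinary Series whose index holds the group names, so `mean_age["Lung"]` looks up the mean age for that group.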

2 changes: 1 addition & 1 deletion 05-data-visualization.Rmd
@@ -182,4 +182,4 @@ plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=

## Exercises

Exercise for week 5 can be found [here](https://colab.research.google.com/drive/1kT3zzq2rrhL1vHl01IdW5L1V7v0iK0wY?usp=sharing).
Exercise for week 5 can be found [here](https://colab.research.google.com/drive/17iwr8NwLLrmzRj4a6zRZucETXpPkmDNR?usp=sharing).
Binary file added images/student_stickers.jpg