diff --git a/config_automation.yml b/config_automation.yml
index 4224739..0ea533c 100644
--- a/config_automation.yml
+++ b/config_automation.yml
@@ -21,9 +21,9 @@ render-website: rmd
render-leanpub: yes
render-coursera: no
-## Automate the creation of Book.txt file? TRUE/FALSE?
+## Automate the creation of Book.txt file? yes/no
## This is only relevant if render-leanpub is yes, otherwise it will be ignored
-make-book-txt: TRUE
+make-book-txt: yes

# What docker image should be used for rendering?
# The default is jhudsl/base_ottr:main
diff --git a/docs/01-intro-to-computing.md b/docs/01-intro-to-computing.md
index ebcf137..cc8d4ee 100644
--- a/docs/01-intro-to-computing.md
+++ b/docs/01-intro-to-computing.md
@@ -38,11 +38,13 @@ More importantly: **How we organize ideas \<-\> Instructing a computer to do som

Google Colab is an Integrated Development Environment (IDE) on a web browser. Think of it as Microsoft Word compared to a plain text editor. It provides extra bells and whistles that make using Python easier.

-Let's open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named "KRAS Demo" in your Google Classroom workspace. If you are taking this course on your own time, open up...
+Let's open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project is probably named "KRAS Demo" in your Google Classroom workspace. If you are taking this course on your own time, you can view it [here](https://colab.research.google.com/drive/1_77QQcj0mgZOWLlhtkZ-QKWUP1dnSt-_?usp=sharing).
+
+![](images/colab.png){width="800"}

Today, we will pay close attention to:

-- Python Console (Execution): Open it via View -\> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you. 
+- Python Console ("Executions"): Open it via View -\> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

- Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, IPython Notebook), which has chunks of plain text *and* Python code, and it helps us better understand the code we are writing.

@@ -50,7 +52,7 @@ Today, we will pay close attention to:

The first thing we will do is see the different ways we can run Python code. You can do the following:

-1. Type something into the Python Console (Execution) and type enter, such as `2+2`. The Python Console will run it and give you an output.
+1. Type something into the Python Console (Execution) and click the arrow button, such as `2+2`. The Python Console will run it and give you an output.

2. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data.

3. Run every single Python code chunk via Runtime -\> Run all.

@@ -66,13 +68,15 @@ Python Notebook is great for data science work, because:

- It is flexible to use other programming languages, such as R.

+The version of Python used in this course and in Google Colab is Python 3, which is the most widely supported version of Python. Some Python software is written in Python 2, which is very similar but has some [notable differences](https://www.fullstackpython.com/python-2-or-3.html).
+
Now, we will get to the basics of programming grammar.

## Grammar Structure 1: Evaluation of Expressions

- **Expressions** are built out of **operations** or **functions**. 
-- Functions and operations take in **data types**, do something with them, and return another data type.
+- Functions and operations take in **data types** as inputs, do something with them, and **return** another data type as output.

- We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it.

@@ -142,7 +146,19 @@ add(18, add(21, 65))

## 104
```

-Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to *readable* code. Therefore, it is sometimes useful to come up with operations that is easier to read. (Because the `add()` function isn't typically used, it is not automatically available, so we used the import statement to load it in.)
+Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to *readable* code. Therefore, it is sometimes useful to come up with operations that are easier to read. (Most functions in Python are stored in a collection of functions called **modules** that need to be loaded. The `import` statement gives us permission to access the functions in the module "operator".)
+
+### Function machine schema
+
+A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class:
+
+![Function machine from algebra class.](images/function_machine.png)
+
+Here are some aspects of this schema to pay attention to:
+
+- A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.
+
+- A function can have different kinds of inputs and outputs - it doesn't need to be numbers. In the `len()` function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs. 
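To make that last point concrete, here is a small runnable sketch (using only built-in functions) of expressions whose inputs and outputs are not just numbers:

``` python
# len() takes a String as input and returns an Integer.
name_length = len("hello")
print(name_length)  # 5

# str() goes the other direction: it takes an Integer and returns a String.
as_text = str(42)
print(type(as_text))  # <class 'str'>

# Expressions nest: the inner function is evaluated first.
print(len(str(2024)))  # len("2024") -> 4
```

Notice that you can use these functions without knowing anything about how they are implemented internally.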
### Data types

@@ -155,16 +171,6 @@ Here are some common data types we will be using in this course.

| String | str | "hello", "234-234-8594" |
| Boolean | bool | True, False |

-A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class:
-
-![Function machine from algebra class.](images/function_machine.png)
-
-Here are some aspects of this schema to pay attention to:
-
-- A programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.
-
-- A function can have different kinds of inputs and outputs - it doesn't need to be numbers. In the `len()` function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs.
-

## Grammar Structure 2: Storing data types in the Variable Environment

To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows:

@@ -182,11 +188,11 @@ If you enter this in the Console, you will see that in the Variable Environment,

>
> Bind variable to the left of `=` to the resulting value.
>
-> The variable is stored in the Variable Environment.
+> The variable is stored in the **Variable Environment**.

The Variable Environment is where all the variables are stored; once a variable is defined, it can be used in any expression. Each variable name must be unique.

-The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later. 
+The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 Gigabytes of RAM.

Look, now `x` can be reused downstream:

@@ -203,7 +209,7 @@ x - 2

y = x * 2
```

-It is quite common for programmers to not know what data type a variable is while they are coding. To learn about the data type of a variable, use the `type()` function on any variable in Python:
+It is quite common for programmers to have to look up the data type of a variable while they are coding. To learn about the data type of a variable, use the `type()` function on any variable in Python:

``` python
type(y)
```

```
## <class 'int'>
```

-We should give useful variable names so that we know what to expect! Consider `num_sales` instead of `y`.
+We should give useful variable names so that we know what to expect! If you are working with numerical sales data, consider `num_sales` instead of `y`.

## Grammar Structure 3: Evaluation of Functions

@@ -222,22 +228,30 @@ Let's look at functions a little bit more formally: A function has a **function

### Execution rule for functions:

-> Evaluate the function by its arguments, and if the arguments are functions or contains operations, evaluate those functions or operations first.
+> Evaluate the function by its arguments if there are any, and if the arguments are functions or contain operations, evaluate those functions or operations first.
>
> The output of functions is called the **returned value**.

-Often, we will use multiple functions, in a nested way, or use parenthesis to change the order of operation. Being able to read nested operations, nested functions, and parenthesis is very important. 
Think about what the Python is going to do step-by--step in the line of code below:
+Often, we will use multiple functions in a nested way, and it is important to understand how the Python console understands the order of operations. We can also use parentheses to change the order of operations. Think about what Python is going to do step-by-step in the lines of code below:
+
+
+``` python
+max(len("hello"), 4)
+```
+
+```
+## 5
+```

``` python
-(len("hello") + 4) * 2
+(len("pumpkin") - 8) * 2
```

```
-## 18
+## -2
```

-If we don't know how to use a function, such as `pow()` we can ask for help:
+If we don't know how to use a function, such as `pow()`, we can ask for help:

```
?pow
@@ -249,7 +263,9 @@ Some types, such as ints, are able to use a more efficient
algorithm when invoked using the three argument form.
```

-This shows the function takes in three input arguments: `base`, `exp`, and `mod=None`. When an argument has an assigned value of `mod=None`, that means the input argument already has a value, and you don't need to specify anything, unless you want to.
+We can also find a similar help document, in a [nicer rendered form online](https://docs.python.org/3/library/functions.html#pow). We will practice looking at function documentation throughout the course, because that is a fundamental skill for learning more functions on your own.
+
+The documentation shows the function takes in three input arguments: `base`, `exp`, and `mod=None`. When an argument has an assigned value of `mod=None`, that means the input argument already has a value, and you don't need to specify anything, unless you want to.

The following ways are equivalent ways of using the `pow()` function:

@@ -300,13 +316,23 @@ And there is an operational equivalent:

## 8
```

+We will mostly look at functions with input arguments and return values in this course, but not all functions need to have them. 
Let's look at some examples of functions that don't always have an input or output:
+
+| Function call | What it takes in | What it does | Returns |
+|---------------------------------------------------------------------------|--------------------------|---------------------------------------------------------------|---------|
+| [`pow(a, b)`](https://docs.python.org/3/library/functions.html#pow) | integer `a`, integer `b` | Raises `a` to the `b`th power. | Integer |
+| [`time.sleep(x)`](https://docs.python.org/3/library/time.html#time.sleep) | Integer `x` | Waits for `x` seconds. | None |
+| [`dir()`](https://docs.python.org/3/library/functions.html#dir) | Nothing | Gives a list of all the variables defined in the environment. | List |
+

## Tips on writing your first code

`Computer = powerful + stupid`

-Even the smallest spelling and formatting changes will cause unexpected output and errors!
+Computers are excellent at doing something specific over and over again, but are extremely rigid and lack flexibility. Here are some tips that are helpful for beginners:
+
+- Write incrementally, test often.

-- Write incrementally, test often
+- Don't be afraid to break things: it is how we learn how things work in programming.

- Check your assumptions, especially using new functions, operations, and new data types.

@@ -315,3 +341,7 @@ Even the smallest spelling and formatting changes will cause unexpected output a

- Ask for help!

To get more familiar with the errors Python gives you, take a look at this [summary of Python error messages](https://betterstack.com/community/guides/scaling-python/python-errors/).
+
+## Exercises
+
+Exercise for week 1 can be found [here](https://colab.research.google.com/drive/1AqVvktGz3LStUyu6dLJFsU2KoqNxgagT?usp=sharing). 
diff --git a/docs/02-data-structures.md b/docs/02-data-structures.md
new file mode 100644
index 0000000..bd24e0d
--- /dev/null
+++ b/docs/02-data-structures.md
@@ -0,0 +1,408 @@
+
+
+# Working with data structures
+
+In our second lesson, we start to look at two **data structures**, **Lists** and **Dataframes**, that can handle a large amount of data for analysis.
+
+## Lists
+
+In the first exercise, you started to explore **data structures**, which store information about data types. You explored **lists**, which are ordered collections of data types or data structures. Each *element* of a list contains a data type or another data structure.
+
+We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive.
+
+We create a list via the bracket `[ ]` operation.
+
+
+``` python
+staff = ["chris", "ted", "jeff"]
+chrNum = [2, 3, 1, 2, 2]
+mixedList = [False, False, False, "A", "B", 92]
+```
+
+### Subsetting lists
+
+To access an element of a list, you can use the bracket notation `[ ]`. We simply access an element via its "index" number - the location of the data within the list.
+
+*Here's the tricky thing about the index number: it starts at 0!*
+
+1st element of `chrNum`: `chrNum[0]`
+
+2nd element of `chrNum`: `chrNum[1]`
+
+...
+
+5th element of `chrNum`: `chrNum[4]`
+
+With subsetting, you can modify elements of a list or use the element of a list as part of an expression.
+
+### Subsetting multiple elements of lists
+
+Suppose you want to access multiple elements of a list, such as accessing the first three elements of `chrNum`. 
You would use the **slice** operator `:`, which specifies:
+
+- the index number to start
+
+- the index number to stop, *plus one.*
+
+If you want to access the first three elements of `chrNum`:
+
+
+``` python
+chrNum[0:3]
+```
+
+```
+## [2, 3, 1]
+```
+
+The first element's index number is 0, the third element's index number is 2, plus 1, which is 3.
+
+If you want to access the second and third elements of `chrNum`:
+
+
+``` python
+chrNum[1:3]
+```
+
+```
+## [3, 1]
+```
+
+Another way of accessing the first 3 elements of `chrNum`:
+
+
+``` python
+chrNum[:3]
+```
+
+```
+## [2, 3, 1]
+```
+
+Here, the start index number was not specified. When the start or stop index is *not* specified, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively. Here's another example, using negative indices to count 3 elements from the end of the list:
+
+
+``` python
+chrNum[-3:]
+```
+
+```
+## [1, 2, 2]
+```
+
+You can find more discussion of list slicing, using negative indices and incremental slicing, [here](https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf).
+
+## Objects in Python
+
+The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined:
+
+- What does it contain (in terms of data)?
+
+- What can it do (in terms of functions)?
+
+And if it "makes sense" to us, then it is well-designed.
+
+The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: *what does it contain, and what can it do?* It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:
+
+- **Value** that holds the essential data for the object. 
+
+- **Attributes** that hold subset or additional data for the object.
+
+- Functions called **Methods** that belong to the object and *have to* take in the object itself as an input.
+
+This organizing structure on an object applies to pretty much all Python data types and data structures.
+
+Let's see how this applies to the list:
+
+- **Value**: the contents of the list, such as `[2, 3, 4].`
+
+- **Attributes** that store additional values: Not relevant for lists.
+
+- **Methods** that can be used on the object: `chrNum.count(2)` counts the number of instances 2 appears as an element of `chrNum`.
+
+Object methods are functions that do something with the object you are using them on. You should think about `chrNum.count(2)` as a function that takes in `chrNum` and `2` as inputs. If you want to use the count function on list `mixedList`, you would use `mixedList.count(x)`.
+
+Here are some more examples of methods with lists:
+
+| Function method | What it takes in | What it does | Returns |
+|---------------|---------------|---------------------------|---------------|
+| [`chrNum.count(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Counts the number of instances `x` appears as an element of `chrNum`. | Integer |
+| [`chrNum.append(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Appends `x` to the end of `chrNum`. | None (but `chrNum` is modified!) |
+| [`chrNum.sort()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Sorts `chrNum` by ascending order. | None (but `chrNum` is modified!) |
+| [`chrNum.reverse()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Reverses the order of `chrNum`. | None (but `chrNum` is modified!) |
+
+## Methods vs Functions
+
+**Methods** *have to* take in the object of interest as an input: `chrNum.count(2)` automatically treats `chrNum` as an input. 
Methods are built for a specific Object type.
+
+**Functions** do not have an implied input: `len(chrNum)` requires specifying a list in the input.
+
+Otherwise, there is no strong distinction between the two.
+
+## Dataframes
+
+A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does.
+
+The Dataframe data structure is found within a Python module called "Pandas". A Python module is an organized collection of functions and data structures. The `import` statement below gives us permission to access the "Pandas" module via the variable `pd`.
+
+To load in a Dataframe from existing spreadsheet data, we use the function [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):
+
+
+``` python
+import pandas as pd
+
+metadata = pd.read_csv("classroom_data/metadata.csv")
+type(metadata)
+```
+
+```
+## <class 'pandas.core.frame.DataFrame'>
+```
+
+There is a similar function [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for loading in Excel spreadsheets.
+
+Let's investigate the Dataframe as an object:
+
+- What does a Dataframe contain (values, attributes)?
+
+- What can a Dataframe do (methods)?
+
+## What does a Dataframe contain?
+
+We first take a look at the contents:
+
+
+``` python
+metadata
+```
+
+```
+## ModelID ... OncotreeLineage
+## 0 ACH-000001 ... Ovary/Fallopian Tube
+## 1 ACH-000002 ... Myeloid
+## 2 ACH-000003 ... Bowel
+## 3 ACH-000004 ... Myeloid
+## 4 ACH-000005 ... Myeloid
+## ... ... ... ...
+## 1859 ACH-002968 ... Esophagus/Stomach
+## 1860 ACH-002972 ... Esophagus/Stomach
+## 1861 ACH-002979 ... Esophagus/Stomach
+## 1862 ACH-002981 ... Esophagus/Stomach
+## 1863 ACH-003071 ... Lung
+##
+## [1864 rows x 30 columns]
+```
+
+It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, it shows some of the data.
+
+
+``` python
+metadata
+```
+
+```
+## ModelID ... OncotreeLineage
+## 0 ACH-000001 ... Ovary/Fallopian Tube
+## 1 ACH-000002 ... 
Myeloid
+## 2 ACH-000003 ... Bowel
+## 3 ACH-000004 ... Myeloid
+## 4 ACH-000005 ... Myeloid
+## ... ... ... ...
+## 1859 ACH-002968 ... Esophagus/Stomach
+## 1860 ACH-002972 ... Esophagus/Stomach
+## 1861 ACH-002979 ... Esophagus/Stomach
+## 1862 ACH-002981 ... Esophagus/Stomach
+## 1863 ACH-003071 ... Lung
+##
+## [1864 rows x 30 columns]
+```
+
+We can look at specific columns by looking at **attributes** via the dot operation. We can also look at the columns via the bracket operation.
+
+
+``` python
+metadata.ModelID
+```
+
+```
+## 0 ACH-000001
+## 1 ACH-000002
+## 2 ACH-000003
+## 3 ACH-000004
+## 4 ACH-000005
+## ...
+## 1859 ACH-002968
+## 1860 ACH-002972
+## 1861 ACH-002979
+## 1862 ACH-002981
+## 1863 ACH-003071
+## Name: ModelID, Length: 1864, dtype: object
+```
+
+``` python
+metadata['ModelID']
+```
+
+```
+## 0 ACH-000001
+## 1 ACH-000002
+## 2 ACH-000003
+## 3 ACH-000004
+## 4 ACH-000005
+## ...
+## 1859 ACH-002968
+## 1860 ACH-002972
+## 1861 ACH-002979
+## 1862 ACH-002981
+## 1863 ACH-003071
+## Name: ModelID, Length: 1864, dtype: object
+```
+
+The names of all columns are stored as an attribute, which can be accessed via the dot operation.
+
+
+``` python
+metadata.columns
+```
+
+```
+## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age',
+## 'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory',
+## 'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis',
+## 'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype',
+## 'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments',
+## 'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus',
+## 'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype',
+## 'OncotreePrimaryDisease', 'OncotreeLineage'],
+## dtype='object')
+```
+
+The number of rows and columns is also stored as an attribute:
+
+
+``` python
+metadata.shape
+```
+
+```
+## (1864, 30)
+```
+
+## What can a Dataframe do? 
+
+We can use the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods to look at the first few rows and last few rows of `metadata`, respectively:
+
+
+``` python
+metadata.head()
+```
+
+```
+## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage
+## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube
+## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid
+## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel
+## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid
+## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid
+##
+## [5 rows x 30 columns]
+```
+
+``` python
+metadata.tail()
+```
+
+```
+## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage
+## 1859 ACH-002968 PT-pjhrsc ... Esophagogastric Adenocarcinoma Esophagus/Stomach
+## 1860 ACH-002972 PT-dkXZB1 ... Esophagogastric Adenocarcinoma Esophagus/Stomach
+## 1861 ACH-002979 PT-lyHTzo ... Esophagogastric Adenocarcinoma Esophagus/Stomach
+## 1862 ACH-002981 PT-Z9akXf ... Esophagogastric Adenocarcinoma Esophagus/Stomach
+## 1863 ACH-003071 PT-LAGmLq ... Lung Neuroendocrine Tumor Lung
+##
+## [5 rows x 30 columns]
+```
+
+Both of these functions (without input arguments) are considered **methods**: they are functions that do something with the Dataframe you are using them on. You should think about `metadata.head()` as a function that takes in `metadata` as an input. If we had another Dataframe called `my_data` and you want to use the same function, you will have to say `my_data.head()`.
+
+## Subsetting Dataframes
+
+Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists. 
+
+You will use the [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) attribute and bracket operations, and you give two slices: one for the row, and one for the column.
+
+Let's start with a small dataframe to see how it works before returning to `metadata`:
+
+
+``` python
+df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
+                        'age_case': [25, 43, 21, 65, 7],
+                        'age_control': [49, 20, 32, 25, 32]})
+df
+```
+
+```
+## status age_case age_control
+## 0 treated 25 49
+## 1 untreated 43 20
+## 2 untreated 21 32
+## 3 discharged 65 25
+## 4 treated 7 32
+```
+
+Here is what the dataframe looks like with the row and column index numbers:
+
+![](images/pandas_subset_0.png)
+
+Subset the first four rows, and the first two columns:
+
+![](images/pandas_subset_1.png)
+
+Now, back to the `metadata` dataframe:
+
+Subset the first 5 rows, and first two columns:
+
+
+``` python
+metadata.iloc[:5, :2]
+```
+
+```
+## ModelID PatientID
+## 0 ACH-000001 PT-gj46wT
+## 1 ACH-000002 PT-5qa3uk
+## 2 ACH-000003 PT-puKIyc
+## 3 ACH-000004 PT-q4K2cp
+## 4 ACH-000005 PT-q4K2cp
+```
+
+If we want a custom slice that is not sequential, we can use an integer list. Subset all rows except the first 5, and the columns at index 1, 10, and 21:
+
+
+``` python
+metadata.iloc[5:, [1, 10, 21]]
+```
+
+```
+## PatientID GrowthPattern WTSIMasterCellID
+## 5 PT-ej13Dz Suspension 2167.0
+## 6 PT-NOXwpH Adherent 569.0
+## 7 PT-fp8PeY Adherent 1806.0
+## 8 PT-puKIyc Adherent 2104.0
+## 9 PT-AR7W9o Adherent NaN
+## ... ... ... ...
+## 1859 PT-pjhrsc Organoid NaN
+## 1860 PT-dkXZB1 Organoid NaN
+## 1861 PT-lyHTzo Organoid NaN
+## 1862 PT-Z9akXf Organoid NaN
+## 1863 PT-LAGmLq Suspension NaN
+##
+## [1859 rows x 3 columns]
+```
+
+When we subset via numerical indices, it's called **explicit subsetting**. 
This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the columns. Then your code that subsets rows and columns by position will give you a different answer once the spreadsheet is changed.
+
+The second way is to subset by the column name and comparison operators, also known as **implicit subsetting**. This is much more robust in data analysis practice. You will learn about it next week!
+
+## Exercises
+
+Exercise for week 2 can be found [here](https://colab.research.google.com/drive/1oIL3gKEZR2Lq16k6XY0HXIhjYl34pEjr?usp=sharing).
diff --git a/docs/03-data-wrangling1.md b/docs/03-data-wrangling1.md
new file mode 100644
index 0000000..7e8578e
--- /dev/null
+++ b/docs/03-data-wrangling1.md
@@ -0,0 +1,354 @@
+
+
+# Data Wrangling, Part 1
+
+From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.
+
+![Data science workflow. Image source: R for Data Science.](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"}
+
+For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for data to be "Tidy".
+
+## Tidy Data
+
+Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of **tidy data**, developed by Hadley Wickham:
+
+1. Each variable must have its own column.
+
+2. Each observation must have its own row.
+
+3. Each value must have its own cell. 
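The three principles above can be sketched with a toy dataframe (the rows and values below are made up for illustration, borrowing column names from our `metadata` dataset): each variable gets a column, each observation (here, a cell line) gets a row, and each cell holds one value.

``` python
import pandas as pd

# A tiny tidy dataframe: one column per variable, one row per
# observation (a cell line), one value per cell.
tidy = pd.DataFrame(data={'ModelID': ["ACH-000001", "ACH-000002"],
                          'Age': [60, 44],
                          'OncotreeLineage': ["Ovary/Fallopian Tube", "Myeloid"]})
print(tidy.shape)  # (2, 3): 2 observations, 3 variables
```

If, say, `Age` were instead spread across multiple columns (one per year of measurement), the data would no longer be tidy under principle 1.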
+ +If you want to be technical about what variables and observations are, Hadley Wickham describes: + +> A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes. + +![A tidy dataframe. Image source: R for Data Science.](https://r4ds.hadley.nz/images/tidy-1.png){width="800"} + +## Our working Tidy Data: DepMap Project + +The [Dependency Map project](https://depmap.org/portal/) is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session. + +- Metadata + +- Somatic mutations + +- Gene expression + +- Drug sensitivity + +- CRISPR knockout + +- and more... + +Let's load these datasets in, and see how these datasets fit the definition of Tidy data: + + +``` python +import pandas as pd + +metadata = pd.read_csv("classroom_data/metadata.csv") +mutation = pd.read_csv("classroom_data/mutation.csv") +expression = pd.read_csv("classroom_data/expression.csv") +``` + + +``` python +metadata.head() +``` + +``` +## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage +## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube +## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid +## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel +## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## +## [5 rows x 30 columns] +``` + + +``` python +mutation.head() +``` + +``` +## ModelID CACNA1D_Mut CYP2D6_Mut ... CCDC28A_Mut C1orf194_Mut U2AF1_Mut +## 0 ACH-000001 False False ... False False False +## 1 ACH-000002 False False ... 
False False False
+## 2 ACH-000004 False False ... False False False
+## 3 ACH-000005 False False ... False False False
+## 4 ACH-000006 False False ... False False False
+##
+## [5 rows x 540 columns]
+```
+
+
+``` python
+expression.head()
+```
+
+```
+## ModelID ENPP4_Exp CREBBP_Exp ... OR5D13_Exp C2orf81_Exp OR8S1_Exp
+## 0 ACH-001113 2.280956 4.094236 ... 0.0 1.726831 0.0
+## 1 ACH-001289 3.622930 3.606442 ... 0.0 0.790772 0.0
+## 2 ACH-001339 0.790772 2.970854 ... 0.0 0.575312 0.0
+## 3 ACH-001538 3.485427 2.801159 ... 0.0 1.077243 0.0
+## 4 ACH-000242 0.879706 3.327687 ... 0.0 0.722466 0.0
+##
+## [5 rows x 536 columns]
+```
+
+| Dataframe | The observation is | Some variables are | Some values are |
+|-----------------|-----------------|--------------------|------------------|
+| metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" |
+| expression | Cell line | KRAS_Exp | 2.4, .3 |
+| mutation | Cell line | KRAS_Mut | True, False |
+
+## Transform: "What do you want to do with this Dataframe"?
+
+Remember that a major theme of the course is: **How we organize ideas \<-\> Instructing a computer to do something.**
+
+Until now, we haven't focused much on how our scientific ideas connect with what we can do in code. Let's pivot to writing code driven by our scientific curiosity. Once we are sure that we are working with Tidy data, we can ponder how we want to transform our data to answer our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows.
+
+Here's a starting prompt:
+
+> In the `metadata` dataframe, which rows and which columns would you subset for to address a scientific question?
+
+We have been using **explicit subsetting** with numerical indices, such as "I want to filter for rows 20-50 and select columns 2 and 8". 
We are now going to switch to **implicit subsetting** in which we describe the subsetting criteria via comparision operators and column names, such as: + +*"I want to subset for rows such that the OncotreeLineage is breast cancer and subset for columns Age and Sex."* + +Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns.This is because we are guaranteed to have column names in Dataframes, but not row names. + +#### Let's convert our implicit subsetting criteria into code! + +To subset for rows implicitly, we will use the conditional operators on Dataframe columns you used in Exercise 2. To formulate a conditional operator expression that OncotreeLineage is breast cancer: + + +``` python +metadata['OncotreeLineage'] == "Lung" +``` + +``` +## 0 False +## 1 False +## 2 False +## 3 False +## 4 False +## ... +## 1859 False +## 1860 False +## 1861 False +## 1862 False +## 1863 True +## Name: OncotreeLineage, Length: 1864, dtype: bool +``` + +Then, we will use the [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) operation (which is different than [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time: + + +``` python +metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]] +``` + +``` +## Age Sex +## 10 39.0 Female +## 13 44.0 Male +## 19 55.0 Female +## 27 39.0 Female +## 28 45.0 Male +## ... ... ... +## 1745 52.0 Male +## 1819 84.0 Male +## 1820 57.0 Female +## 1822 53.0 Male +## 1863 62.0 Male +## +## [241 rows x 2 columns] +``` + +What's going on here? The first component of the subset, `metadata['OncotreeLineage'] == "Lung"`, subsets for the rows. It gives us a column of `True` and `False` values, and we keep rows that correspond to `True` values. Then, we specify the column names we want to subset for via a list. 
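Since `.loc` (label- and condition-based) and `.iloc` (integer-position-based) are easy to mix up, here is a minimal sketch contrasting the two on a small made-up Dataframe (the `df` values below are hypothetical, for illustration only):

``` python
import pandas as pd

# A small hypothetical dataframe for illustration
df = pd.DataFrame({'Age': [25, 43, 21], 'Sex': ["Female", "Male", "Female"]})

# .loc subsets with labels and boolean Series:
# rows where Age > 22, and the column named "Sex"
by_label = df.loc[df['Age'] > 22, ["Sex"]]

# .iloc subsets with integer positions:
# the first two rows and the first column (Age)
by_position = df.iloc[0:2, 0]
```

Both happen to pick out the first two rows here, but `.loc` found them via a condition and column names, while `.iloc` used raw positions.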
+
+Here's another example:
+
+
+``` python
+df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
+                        'age_case': [25, 43, 21, 65, 7],
+                        'age_control': [49, 20, 32, 25, 32]})
+
+df
+```
+
+```
+## status age_case age_control
+## 0 treated 25 49
+## 1 untreated 43 20
+## 2 untreated 21 32
+## 3 discharged 65 25
+## 4 treated 7 32
+```
+
+*"I want to subset for rows such that the status is "treated" and subset for columns status and age_case."*
+
+
+``` python
+df.loc[df.status == "treated", ["status", "age_case"]]
+```
+
+```
+## status age_case
+## 0 treated 25
+## 4 treated 7
+```
+
+![](images/pandas_subset_2.png)
+
+## Summary Statistics
+
+Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarize all the values of a variable into a single numeric summary, such as mean, median, or mode.
+
+If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called a Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples:
+
+| Function method | What it takes in | What it does | Returns |
+|----------------|----------------|------------------------|----------------|
+| [`metadata.Age.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html) | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) |
+| [`metadata['Age'].median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) |
+| [`metadata.Age.max()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. | Float (NumPy) |
+| [`metadata.OncotreeSubtype.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in the `OncotreeSubtype` column. | Series |
+
+Let's try it out, with some nice print formatting:
+
+
+``` python
+print("Mean value of Age column:", metadata['Age'].mean())
+```
+
+```
+## Mean value of Age column: 47.45187165775401
+```
+
+``` python
+print("Frequency of column", metadata.OncotreeLineage.value_counts())
+```
+
+```
+## Frequency of column OncotreeLineage
+## Lung 241
+## Lymphoid 209
+## CNS/Brain 123
+## Skin 118
+## Esophagus/Stomach 95
+## Breast 92
+## Bowel 87
+## Head and Neck 81
+## Myeloid 77
+## Bone 75
+## Ovary/Fallopian Tube 74
+## Pancreas 65
+## Kidney 64
+## Peripheral Nervous System 55
+## Soft Tissue 54
+## Uterus 41
+## Fibroblast 41
+## Biliary Tract 40
+## Bladder/Urinary Tract 39
+## Normal 39
+## Pleura 35
+## Liver 28
+## Cervix 25
+## Eye 19
+## Thyroid 18
+## Prostate 14
+## Vulva/Vagina 5
+## Ampulla of Vater 4
+## Testis 4
+## Adrenal Gland 1
+## Other 1
+## Name: count, dtype: int64
+```
+
+Notice that the output of some of these methods is a Float (NumPy). This refers to a numeric type from NumPy, a module that is extremely popular for scientific computing, but we're not focused on it in this course.
+
+## Simple data visualization
+
+We will dedicate extensive time later in this course to data visualization, but a Dataframe's column, a Series, has a method called [`.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but that is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.
+
+| Plot style | Useful for | kind = | Code |
+|-------------|-------------|-------------|---------------------------------|
+| Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` |
+| Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
+
+Let's look at a histogram:
+
+
+``` python
+import matplotlib.pyplot as plt
+
+plt.figure()
+metadata.Age.plot(kind = "hist")
+plt.show()
+```
+
+![](resources/images/03-data-wrangling1_files/figure-docx/unnamed-chunk-11-1.png)
+
+Let's look at a bar plot:
+
+
+``` python
+plt.figure()
+metadata.OncotreeLineage.value_counts().plot(kind = "bar")
+plt.show()
+```
+
+![](resources/images/03-data-wrangling1_files/figure-docx/unnamed-chunk-12-3.png)
+
+(The `plt.figure()` and `plt.show()` functions are used to render the plots on the website, but you don't need to use them for your exercises. We will discuss this in more detail during our week of data visualization.)
+
+#### Chained function calls
+
+Let's look at our bar plot syntax more carefully. We start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a *Series* of a frequency table. Then, we take the frequency table Series and use the `.plot()` method.
+
+It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used as the input of `.plot()`, all in one line of code. It takes a bit of time to get used to this!
+
+Here's another example of a chained function call, which looks quite complex, but let's break it down:
+
+
+``` python
+plt.figure()
+
+metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar")
+
+plt.show()
+```
+
+![](resources/images/03-data-wrangling1_files/figure-docx/unnamed-chunk-13-5.png)
+
+1. We first take the entire `metadata` and do some subsetting, which outputs a Dataframe.
+2. We access the `OncotreeLineage` column, which outputs a Series.
+3. We use the method `.value_counts()`, which outputs a Series.
+4. We make a plot out of it!
+
+We could have, alternatively, done this in several lines of code:
+
+
+``` python
+plt.figure()
+
+metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ]
+metadata_subset_lineage = metadata_subset.OncotreeLineage
+lineage_freq = metadata_subset_lineage.value_counts()
+lineage_freq.plot(kind = "bar")
+
+plt.show()
+```
+
+![](resources/images/03-data-wrangling1_files/figure-docx/unnamed-chunk-14-7.png)
+
+These are two different *styles* of code, but they do the exact same thing. It's up to you to decide what is easier for you to understand.
+
+## Exercises
+
+Exercise for week 3 can be found [here](https://colab.research.google.com/drive/1ClNOJviyrcaaoVq5F-YtsO7NhMqn315c?usp=sharing).
diff --git a/docs/04-data-wrangling2.md b/docs/04-data-wrangling2.md
new file mode 100644
index 0000000..77cb01c
--- /dev/null
+++ b/docs/04-data-wrangling2.md
@@ -0,0 +1,334 @@
+
+
+# Data Wrangling, Part 2
+
+We will continue to learn about data analysis with Dataframes. Let's load our three Dataframes from the DepMap project in again:
+
+
+``` python
+import pandas as pd
+import numpy as np
+
+metadata = pd.read_csv("classroom_data/metadata.csv")
+mutation = pd.read_csv("classroom_data/mutation.csv")
+expression = pd.read_csv("classroom_data/expression.csv")
+```
+
+## Creating new columns
+
+Often, we want to perform some kind of transformation on our data's columns: perhaps we want to add the values of columns together, or represent a column on a different scale.
+
+To create a new column, you simply modify it as if it already exists using the bracket operation `[ ]`, and the column will be created:
+
+
+``` python
+metadata['AgePlusTen'] = metadata['Age'] + 10
+expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp']
+expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp'])
+```
+
+where [`np.log(x)`](https://numpy.org/doc/stable/reference/generated/numpy.log.html) is a function imported from the module NumPy that takes in a numeric and returns the log-transformed value.
+
+Note: you cannot create a new column using the attribute syntax of the Dataframe, such as: `expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)`.
+
+## Merging two Dataframes together
+
+Suppose we have the following Dataframes:
+
+`expression`
+
+| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
+|--------------|------------|----------------|
+| "ACH-001113" | 5.138733 | 1.636806 |
+| "ACH-001289" | 3.184280 | 1.158226 |
+| "ACH-001339" | 3.165108 | 1.152187 |
+
+`metadata`
+
+| ModelID | OncotreeLineage | Age |
+|--------------|-----------------|-----|
+| "ACH-001113" | "Lung" | 69 |
+| "ACH-001289" | "CNS/Brain" | NaN |
+| "ACH-001339" | "Skin" | 14 |
+
+Suppose that I want to compare the relationship between `OncotreeLineage` and `PIK3CA_Exp`, but they are columns in different Dataframes. We want a new Dataframe that looks like this:
+
+| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
+|--------------|------------|----------------|-----------------|-----|
+| "ACH-001113" | 5.138733 | 1.636806 | "Lung" | 69 |
+| "ACH-001289" | 3.184280 | 1.158226 | "CNS/Brain" | NaN |
+| "ACH-001339" | 3.165108 | 1.152187 | "Skin" | 14 |
+
+We see that in both dataframes,
+
+- the rows (observations) represent cell lines.
+
+- there is a common column `ModelID`, with shared values between the two dataframes that can facilitate the merging process. We call this an **index**.
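As a concrete starting point, the toy tables above can be reconstructed as small Dataframes to experiment with. This is a sketch for illustration: the values are copied from the tables, and the variable names `toy_expression` and `toy_metadata` are made up:

``` python
import pandas as pd

# Hypothetical toy versions of the two dataframes shown above
toy_expression = pd.DataFrame({
    'ModelID': ["ACH-001113", "ACH-001289", "ACH-001339"],
    'PIK3CA_Exp': [5.138733, 3.184280, 3.165108],
    'log_PIK3CA_Exp': [1.636806, 1.158226, 1.152187]})

toy_metadata = pd.DataFrame({
    'ModelID': ["ACH-001113", "ACH-001289", "ACH-001339"],
    'OncotreeLineage': ["Lung", "CNS/Brain", "Skin"],
    'Age': [69.0, float("nan"), 14.0]})

# Combining them on the shared "ModelID" index column
# yields the desired five-column table
toy_merged = toy_metadata.merge(toy_expression, on='ModelID')
```

The combined result has one row per shared `ModelID` value and the columns of both Dataframes.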
+
+We will use the method [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index.
+
+
+``` python
+merged = metadata.merge(expression)
+```
+
+It's usually better to specify the index column explicitly to avoid ambiguity, using the `on` optional argument:
+
+
+``` python
+merged = metadata.merge(expression, on='ModelID')
+```
+
+If the index columns for the two Dataframes are named differently, you can specify the column name for each Dataframe:
+
+
+``` python
+merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID')
+```
+
+One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging to see whether it makes sense or not:
+
+The number of rows and columns of `metadata`:
+
+
+``` python
+metadata.shape
+```
+
+```
+## (1864, 31)
+```
+
+The number of rows and columns of `expression`:
+
+
+``` python
+expression.shape
+```
+
+```
+## (1450, 538)
+```
+
+The number of rows and columns of `merged`:
+
+
+``` python
+merged.shape
+```
+
+```
+## (1450, 568)
+```
+
+We see that the number of *columns* in `merged` combines the number of columns in `metadata` and `expression`, while the number of *rows* in `merged` is at most the smaller of the number of rows in `metadata` and `expression`: it only keeps rows whose index values are found in both Dataframes' index columns. This kind of join is called an "inner join", because in the Venn diagram of the elements of the two index columns, we keep the inner overlap:
+
+![](images/join.png)
+
+You can specify the join style by changing the optional input argument `how`.
+
+- `how = "outer"` keeps all observations - also known as a "full join"
+
+- `how = "left"` keeps all observations in the left Dataframe.
+
+- `how = "right"` keeps all observations in the right Dataframe.
+
+- `how = "inner"` keeps observations common to both Dataframes. This is the default value of `how`.
+
+## Grouping and summarizing Dataframes
+
+In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in `metadata`, the observations are cell lines, and perhaps we want to group cell lines into their respective cancer type, `OncotreeLineage`, and look at the mean age for each cancer type.
+
+We want to take `metadata`:
+
+| ModelID | OncotreeLineage | Age |
+|--------------|-----------------|-----|
+| "ACH-001113" | "Lung" | 69 |
+| "ACH-001289" | "Lung" | 23 |
+| "ACH-001339" | "Skin" | 14 |
+| "ACH-002342" | "Brain" | 23 |
+| "ACH-004854" | "Brain" | 56 |
+| "ACH-002921" | "Brain" | 67 |
+
+into:
+
+| OncotreeLineage | MeanAge |
+|-----------------|---------|
+| "Lung" | 46 |
+| "Skin" | 14 |
+| "Brain" | 48.67 |
+
+To get there, we need to:
+
+- **Group** the data based on some criteria, such as the elements of `OncotreeLineage`
+
+- **Summarize** each group via a summary statistic performed on a column, such as `Age`.
+
+We first subset the two columns we need, and then use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`.
+
+
+``` python
+metadata_grouped = metadata.groupby("OncotreeLineage")
+metadata_grouped['Age'].mean()
+```
+
+```
+## OncotreeLineage
+## Adrenal Gland 55.000000
+## Ampulla of Vater 65.500000
+## Biliary Tract 58.450000
+## Bladder/Urinary Tract 65.166667
+## Bone 20.854545
+## Bowel 58.611111
+## Breast 50.961039
+## CNS/Brain 43.849057
+## Cervix 47.136364
+## Esophagus/Stomach 57.855556
+## Eye 51.100000
+## Fibroblast 38.194444
+## Head and Neck 60.149254
+## Kidney 46.193548
+## Liver 43.928571
+## Lung 55.444444
+## Lymphoid 38.916667
+## Myeloid 38.810811
+## Normal 52.370370
+## Other 46.000000
+## Ovary/Fallopian Tube 51.980769
+## Pancreas 60.226415
+## Peripheral Nervous System 5.480000
+## Pleura 61.000000
+## Prostate 61.666667
+## Skin 49.033708
+## Soft Tissue 27.500000
+## Testis 25.000000
+## Thyroid 63.235294
+## Uterus 62.060606
+## Vulva/Vagina 75.400000
+## Name: Age, dtype: float64
+```
+
+Here's what's going on:
+
+- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped.
+
+- We subset to the column `Age`. The grouping information still persists (this is a Grouped Series object).
+
+- We use the method `.mean()` to calculate the mean value of `Age` within each group defined by `OncotreeLineage`.
+ +Alternatively, this could have been done in a chain of methods: + + +``` python +metadata.groupby("OncotreeLineage")["Age"].mean() +``` + +``` +## OncotreeLineage +## Adrenal Gland 55.000000 +## Ampulla of Vater 65.500000 +## Biliary Tract 58.450000 +## Bladder/Urinary Tract 65.166667 +## Bone 20.854545 +## Bowel 58.611111 +## Breast 50.961039 +## CNS/Brain 43.849057 +## Cervix 47.136364 +## Esophagus/Stomach 57.855556 +## Eye 51.100000 +## Fibroblast 38.194444 +## Head and Neck 60.149254 +## Kidney 46.193548 +## Liver 43.928571 +## Lung 55.444444 +## Lymphoid 38.916667 +## Myeloid 38.810811 +## Normal 52.370370 +## Other 46.000000 +## Ovary/Fallopian Tube 51.980769 +## Pancreas 60.226415 +## Peripheral Nervous System 5.480000 +## Pleura 61.000000 +## Prostate 61.666667 +## Skin 49.033708 +## Soft Tissue 27.500000 +## Testis 25.000000 +## Thyroid 63.235294 +## Uterus 62.060606 +## Vulva/Vagina 75.400000 +## Name: Age, dtype: float64 +``` + +Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned from last week, such as `.mean()`, `.median()`, `.max()`, can be used. One new summary statistics method that is useful for this grouping and summarizing analysis is [`.count()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html) which tells you how many entries are counted within each group. + +### Optional: Multiple grouping, Multiple columns, Multiple summary statistics + +Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously. + +For example, you may want to group by a combination of `OncotreeLineage` and `AgeCategory`, such as "Lung" and "Adult" as one grouping. 
You can do so like this:
+
+
+``` python
+metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"])
+metadata_grouped['Age'].mean()
+```
+
+```
+## OncotreeLineage AgeCategory
+## Adrenal Gland Adult 55.000000
+## Ampulla of Vater Adult 65.500000
+## Unknown NaN
+## Biliary Tract Adult 58.450000
+## Unknown NaN
+## ...
+## Thyroid Unknown NaN
+## Uterus Adult 62.060606
+## Fetus NaN
+## Unknown NaN
+## Vulva/Vagina Adult 75.400000
+## Name: Age, Length: 72, dtype: float64
+```
+
+You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the [`.agg(x)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) method on a Grouped Dataframe.
+
+For example, coming back to our age case-control Dataframe:
+
+
+``` python
+df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
+                        'age_case': [25, 43, 21, 65, 7],
+                        'age_control': [49, 20, 32, 25, 32]})
+
+df
+```
+
+```
+## status age_case age_control
+## 0 treated 25 49
+## 1 untreated 43 20
+## 2 untreated 21 32
+## 3 discharged 65 25
+## 4 treated 7 32
+```
+
+We group by `status` and summarize `age_case` and `age_control` with a few summary statistics each:
+
+
+``` python
+df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]})
+```
+
+```
+## age_case age_control
+## mean min max mean
+## status
+## discharged 65.0 25 25 25.0
+## treated 16.0 32 49 40.5
+## untreated 32.0 20 32 26.0
+```
+
+The input argument to the `.agg(x)` method is called a [Dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries), which lets you structure information in a paired, key-value relationship; you can learn more about dictionaries at the linked Python tutorial.
+
+## Exercises
+
+Exercise for week 4 can be found [here](https://colab.research.google.com/drive/1ntkUdKQ209vu1M89rcsBst-pKKuwzdwX?usp=sharing).
diff --git a/docs/05-data-visualization.md b/docs/05-data-visualization.md
new file mode 100644
index 0000000..a48df0f
--- /dev/null
+++ b/docs/05-data-visualization.md
@@ -0,0 +1,226 @@
+
+
+# Data Visualization
+
+In our second-to-last week together, we will learn how to visualize our data.
+
+There are several different data visualization modules in Python:
+
+- [matplotlib](https://matplotlib.org/) is a general-purpose plotting module that is commonly used.
+
+- [seaborn](https://seaborn.pydata.org/) is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course.
+
+- [plotnine](https://plotnine.org/) is a plotting module based on the grammar of graphics approach to making plots. This is very similar to the R package "ggplot2".
+
+To get started, we will consider the simplest and most common plots:
+
+Distributions (one variable)
+
+- Histograms
+
+Relational (between 2 continuous variables)
+
+- Scatterplots
+
+- Line plots
+
+Categorical (between 1 categorical and 1 continuous variable)
+
+- Bar plots
+
+- Violin plots
+
+[![Image source: Seaborn's overview of plotting functions](https://seaborn.pydata.org/_images/function_overview_8_0.png)](https://seaborn.pydata.org/tutorial/function_overview.html)
+
+Why do we focus on these common plots? Our eyes are better at distinguishing certain visual features than others. All of these plots use position to depict data, which is among the most effective visual channels.
+
+[![Image Source: Visualization Analysis and Design by [Tamara Munzner](https://www.oreilly.com/search?q=author:%22Tamara%20Munzner%22)](https://www.oreilly.com/api/v2/epubs/9781466508910/files/image/fig5-1.png)](https://www.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C005.xhtml)
+
+Let's load in our genomics datasets and start making some plots from them.
+
+
+``` python
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+
+metadata = pd.read_csv("classroom_data/metadata.csv")
+mutation = pd.read_csv("classroom_data/mutation.csv")
+expression = pd.read_csv("classroom_data/expression.csv")
+```
+
+## Distributions (one variable)
+
+To create a histogram, we use the function [`sns.displot()`](https://seaborn.pydata.org/generated/seaborn.displot.html) and we specify the input argument `data` as our dataframe, and the input argument `x` as the column name in a String.
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-3-1.png)
+
+(For the webpage's purposes, we assign the plot to a variable `plot`. In practice, you don't need to do that. You can just write `sns.displot(data=metadata, x="Age")`.)
+
+A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the `binwidth` argument, or the number of bins via the `bins` argument.
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age", binwidth = 10)
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-4-3.png)
+
+Our histogram also works for categorical variables, such as "Sex".
+
+
+``` python
+plot = sns.displot(data=metadata, x="Sex")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-5-5.png)
+
+**Conditioning on other variables**
+
+Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age when compared with sex? There are several ways of doing this.
First, you could color the groups differently, using the `hue` input argument:
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age", hue="Sex")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-6-7.png)
+
+It is rather hard to tell the groups apart from the coloring. So, we add an option to separate each bar category via the `multiple="dodge"` input argument:
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-7-9.png)
+
+Lastly, as an alternative to using colors to display the conditional variable, we could make a subplot for each of the conditional variable's values via `col="Sex"` or `row="Sex"`:
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age", col="Sex")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-8-11.png)
+
+You can find a lot more details about distributions and histograms in [the Seaborn tutorial](https://seaborn.pydata.org/tutorial/distributions.html).
+
+## Relational (between 2 continuous variables)
+
+To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function [`sns.relplot()`](https://seaborn.pydata.org/generated/seaborn.relplot.html) and we specify the input argument `data` as our dataframe, and the input arguments `x` and `y` as the column names in a String:
+
+
+``` python
+plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-9-13.png)
+
+To condition on other variables, additional plotting features are used to distinguish conditional variable values:
+
+- `hue` (similar to the histogram)
+
+- `style`
+
+- `size`
+
+Let's merge `expression` and `metadata` together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status.
Here is the scatterplot with different colors:
+
+
+``` python
+expression_metadata = expression.merge(metadata)
+
+plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-10-15.png)
+
+Here is the scatterplot with different shapes:
+
+
+``` python
+plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-11-17.png)
+
+You can also try plotting with `size="PrimaryOrMetastasis"` if you like. None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram:
+
+
+``` python
+plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-12-19.png)
+
+You can also condition on multiple variables by assigning a different variable to each of the conditioning options:
+
+
+``` python
+plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-13-21.png)
+
+You can find a lot more details about relational plots such as scatterplots and lineplots [in the Seaborn tutorial](https://seaborn.pydata.org/tutorial/relational.html).
+
+## Categorical (between 1 categorical and 1 continuous variable)
+
+A very similar pattern follows for categorical plots. We start with [sns.catplot()](https://seaborn.pydata.org/generated/seaborn.catplot.html) as our main plotting function, with the basic input arguments:
+
+- `data`
+
+- `x`
+
+- `y`
+
+You can change the plot styles via the input arguments:
+
+- `kind`: "strip", "box", "swarm", etc.
+
+You can add additional conditional variables via the input arguments:
+
+- `hue`
+
+- `col`
+
+- `row`
+
+See categorical plots [in the Seaborn tutorial.](https://seaborn.pydata.org/tutorial/categorical.html)
+
+## Basic plot customization
+
+You can easily change the axis labels and title if you modify the plot object, using the method `.set()`:
+
+
+``` python
+exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
+exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship")
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-14-23.png)![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-14-24.png)
+
+You can change the color palette by adding the `palette` input argument to any of the plots. You can explore available color palettes [here](https://www.practicalpythonfordatascience.com/ap_seaborn_palette):
+
+
+``` python
+plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow')
+)
+```
+
+```
+## :1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended.
+```
+
+![](resources/images/05-data-visualization_files/figure-docx/unnamed-chunk-15-27.png)
+
+## Exercises
+
+Exercise for week 5 can be found [here](https://colab.research.google.com/drive/1kT3zzq2rrhL1vHl01IdW5L1V7v0iK0wY?usp=sharing).
diff --git a/docs/404.html b/docs/404.html
index 2c73718..ac6d572 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -152,7 +152,8 @@
  • 1.4 Google Colab Setup
  • 1.5 Grammar Structure 1: Evaluation of Expressions
  • 1.6 Grammar Structure 2: Storing data types in the Variable Environment
  • 1.8 Tips on writing your first code
  • +
  • 1.9 Exercises
  • + +
  • 2 Working with data structures +
  • +
  • 3 Data Wrangling, Part 1 +
  • +
  • 4 Data Wrangling, Part 2 +
  • +
  • 5 Data Visualization +
  • About the Authors
  • -
  • 2 References
  • +
  • 6 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/About.md b/docs/About.md index 121b3d7..6c7533a 100644 --- a/docs/About.md +++ b/docs/About.md @@ -51,7 +51,7 @@ These credits are based on our [course contributors table guidelines](https://ww ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2024-08-07 +## date 2024-09-26 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── diff --git a/docs/Introduction-to-Python.docx b/docs/Introduction-to-Python.docx index 8560319..ced8ff2 100644 Binary files a/docs/Introduction-to-Python.docx and b/docs/Introduction-to-Python.docx differ diff --git a/docs/about-the-authors.html b/docs/about-the-authors.html index 2a814fd..b05d8fa 100644 --- a/docs/about-the-authors.html +++ b/docs/about-the-authors.html @@ -28,7 +28,7 @@ - + @@ -152,7 +152,8 @@
  • 1.4 Google Colab Setup
  • 1.5 Grammar Structure 1: Evaluation of Expressions
  • 1.6 Grammar Structure 2: Storing data types in the Variable Environment
  • 1.8 Tips on writing your first code
  • +
  • 1.9 Exercises
  • + +
  • 2 Working with data structures +
  • +
  • 3 Data Wrangling, Part 1 +
  • +
  • 4 Data Wrangling, Part 2 +
  • +
  • 5 Data Visualization +
  • About the Authors
  • -
  • 2 References
  • +
  • 6 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -342,7 +386,7 @@

    About the Authors + diff --git a/docs/data-visualization.html b/docs/data-visualization.html new file mode 100644 index 0000000..671f810 --- /dev/null +++ b/docs/data-visualization.html @@ -0,0 +1,443 @@ + + + + + + + Chapter 5 Data Visualization | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 5 Data Visualization

    +

    In our final to last week together, we learn about how to visualize our data.

    +

    There are several different data visualization modules in Python:

    +
      +
    • matplotlib is a general purpose plotting module that is commonly used.

    • +
    • seaborn is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course.

    • +
    • plotnine is a plotting module based on the grammar of graphics organization of making plots. This is very similar to the R package “ggplot”.

    • +
    +

To get started, we will consider the simplest and most common plots:

    +

    Distributions (one variable)

    +
      +
    • Histograms
    • +
    +

    Relational (between 2 continuous variables)

    +
      +
    • Scatterplots

    • +
    • Line plots

    • +
    +

    Categorical (between 1 categorical and 1 continuous variable)

    +
      +
    • Bar plots

    • +
    • Violin plots

    • +
    +

    Image source: Seaborn’s overview of plotting functions

    +

Why do we focus on these common plots? Our eyes are better at distinguishing some visual features than others. All of these plots use position to depict data, which is the most effective visual channel.

    +

    Image Source: Visualization Analysis and Design by [Tamara Munzner](https://www.oreilly.com/search?q=author:%22Tamara%20Munzner%22)

    +

    Let’s load in our genomics datasets and start making some plots from them.

    +
    import pandas as pd
    +import seaborn as sns
    +import matplotlib.pyplot as plt
    +
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +mutation = pd.read_csv("classroom_data/mutation.csv")
    +expression = pd.read_csv("classroom_data/expression.csv")
    +
    +

    5.1 Distributions (one variable)

    +

    To create a histogram, we use the function sns.displot() and we specify the input argument data as our dataframe, and the input argument x as the column name in a String.

    +
    plot = sns.displot(data=metadata, x="Age")
    +

    +

(For the webpage’s purposes, we assign the plot to a variable plot. In practice, you don’t need to do that; you can just write sns.displot(data=metadata, x="Age").)

    +

A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the binwidth argument, or the number of bins via the bins argument.

    +
    plot = sns.displot(data=metadata, x="Age", binwidth = 10)
    +

    +

The sns.displot() function also works for categorical variables, such as “Sex”.

    +
    plot = sns.displot(data=metadata, x="Sex")
    +

    +

    Conditioning on other variables

    +

Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age for each sex? There are several ways to do this. First, you could color the bars by group, using the hue input argument:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex")
    +

    +

It is rather hard to tell the groups apart from the coloring alone. So, we separate each bar category via the multiple="dodge" input argument:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge")
    +

    +

Lastly, as an alternative to using colors to display the conditional variable, we could make a subplot for each of the conditional variable’s values via col="Sex" or row="Sex":

    +
    plot = sns.displot(data=metadata, x="Age", col="Sex")
    +

    +

    You can find a lot more details about distributions and histograms in the Seaborn tutorial.

    +
    +
    +

    5.2 Relational (between 2 continuous variables)

    +

    To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function sns.relplot() and we specify the input argument data as our dataframe, and the input arguments x and y as the column names in a String:

    +
    plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
    +

    +

To condition on other variables, different plotting features are used to distinguish the conditional variable’s values:

    +
      +
    • hue (similar to the histogram)

    • +
    • style

    • +
    • size

    • +
    +

    Let’s merge expression and metadata together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status. Here is the scatterplot with different color:

    +
    expression_metadata = expression.merge(metadata)
    +
    +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis")
    +

    +

    Here is the scatterplot with different shapes:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis")
    +

    +

You can also try plotting with size="PrimaryOrMetastasis" if you like. None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis")
    +

    +

You can also condition on multiple variables by assigning a different variable to each conditioning option:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory")
    +

    +

    You can find a lot more details about relational plots such as scatterplots and lineplots in the Seaborn tutorial.

    +
    +
    +

    5.3 Categorical (between 1 categorical and 1 continuous variable)

    +

    A very similar pattern follows for categorical plots. We start with sns.catplot() as our main plotting function, with the basic input arguments:

    +
      +
    • data

    • +
    • x

    • +
    • y

    • +
    +

    You can change the plot styles via the input arguments:

    +
      +
    • kind: “strip”, “box”, “swarm”, etc.
    • +
    +

    You can add additional conditional variables via the input arguments:

    +
      +
    • hue

    • +
    • col

    • +
    • row

    • +
    +
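Section 5.3 describes sns.catplot() without showing it in action, so here is a minimal sketch. The tiny Dataframe below is a made-up stand-in for the course’s metadata (only the Sex and Age columns are assumed); in practice you would pass the real metadata loaded earlier.

```python
import pandas as pd
import seaborn as sns

# A made-up miniature stand-in for the course's metadata Dataframe.
metadata = pd.DataFrame({"Sex": ["Female", "Male", "Female", "Male"],
                         "Age": [39.0, 44.0, 55.0, 45.0]})

# One categorical variable (x) and one continuous variable (y),
# drawn as a box plot via the kind argument.
plot = sns.catplot(data=metadata, x="Sex", y="Age", kind="box")
```

As with sns.displot() and sns.relplot(), you could add hue="..." or col="..." to condition on further variables.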

    See categorical plots in the Seaborn tutorial.

    +
    +
    +

    5.4 Basic plot customization

    +

    You can easily change the axis labels and title if you modify the plot object, using the method .set():

    +
    exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
+exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship")
    +

    +

You can change the color palette by adding the palette input argument to any of the plots. You can explore the available color palettes here:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow')
    +)
    +
    ## <string>:1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended.
    +

    +
    +
    +

    5.5 Exercises

    +

    Exercise for week 5 can be found here.

    + +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/data-wrangling-part-1.html b/docs/data-wrangling-part-1.html new file mode 100644 index 0000000..0ece4cc --- /dev/null +++ b/docs/data-wrangling-part-1.html @@ -0,0 +1,655 @@ + + + + + + + Chapter 3 Data Wrangling, Part 1 | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 3 Data Wrangling, Part 1

    +

From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to the various steps of the data science workflow, which is a natural cycle that occurs in data analysis.

    +
    +Data science workflow. Image source: R for Data Science. +
    Data science workflow. Image source: R for Data Science.
    +
    +

For the rest of the course, we focus on Transform and Visualize, with the assumption that our data is in a nice, “Tidy” format. First, we need to understand what it means for data to be “Tidy”.

    +
    +

    3.1 Tidy Data

    +

Here, we describe a standard of organizing data. It is important to have standards, as they facilitate a consistent way of thinking about data organization and of building tools (functions) that make use of that standard. The principles of tidy data, as developed by Hadley Wickham, are:

    +
      +
    1. Each variable must have its own column.

    2. +
    3. Each observation must have its own row.

    4. +
    5. Each value must have its own cell.

    6. +
    +
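The three rules above can be illustrated with a tiny made-up dataset (not from the course data): each person is an observation in its own row, each measured attribute is a variable in its own column, and each cell holds a single value.

```python
import pandas as pd

# Tidy: each variable is a column, each observation (a person) is a row,
# and each cell holds exactly one value.
tidy = pd.DataFrame({"person": ["A", "B", "C"],
                     "height_cm": [160, 175, 168],
                     "temperature_c": [36.5, 37.1, 36.8]})
```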

    If you want to be technical about what variables and observations are, Hadley Wickham describes:

    +
    +

    A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

    +
    +
    +A tidy dataframe. Image source: R for Data Science. +
    A tidy dataframe. Image source: R for Data Science.
    +
    +
    +
    +

    3.2 Our working Tidy Data: DepMap Project

    +

    The Dependency Map project is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session.

    +
      +
    • Metadata

    • +
    • Somatic mutations

    • +
    • Gene expression

    • +
    • Drug sensitivity

    • +
    • CRISPR knockout

    • +
    • and more…

    • +
    +

    Let’s load these datasets in, and see how these datasets fit the definition of Tidy data:

    +
    import pandas as pd
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +mutation = pd.read_csv("classroom_data/mutation.csv")
    +expression = pd.read_csv("classroom_data/expression.csv")
    +
    metadata.head()
    +
    ##       ModelID  PatientID  ...     OncotreePrimaryDisease       OncotreeLineage
    +## 0  ACH-000001  PT-gj46wT  ...   Ovarian Epithelial Tumor  Ovary/Fallopian Tube
    +## 1  ACH-000002  PT-5qa3uk  ...     Acute Myeloid Leukemia               Myeloid
    +## 2  ACH-000003  PT-puKIyc  ...  Colorectal Adenocarcinoma                 Bowel
    +## 3  ACH-000004  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
    +## 4  ACH-000005  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
    +## 
    +## [5 rows x 30 columns]
    +
    mutation.head()
    +
    ##       ModelID  CACNA1D_Mut  CYP2D6_Mut  ...  CCDC28A_Mut  C1orf194_Mut  U2AF1_Mut
    +## 0  ACH-000001        False       False  ...        False         False      False
    +## 1  ACH-000002        False       False  ...        False         False      False
    +## 2  ACH-000004        False       False  ...        False         False      False
    +## 3  ACH-000005        False       False  ...        False         False      False
    +## 4  ACH-000006        False       False  ...        False         False      False
    +## 
    +## [5 rows x 540 columns]
    +
    expression.head()
    +
    ##       ModelID  ENPP4_Exp  CREBBP_Exp  ...  OR5D13_Exp  C2orf81_Exp  OR8S1_Exp
    +## 0  ACH-001113   2.280956    4.094236  ...         0.0     1.726831        0.0
    +## 1  ACH-001289   3.622930    3.606442  ...         0.0     0.790772        0.0
    +## 2  ACH-001339   0.790772    2.970854  ...         0.0     0.575312        0.0
    +## 3  ACH-001538   3.485427    2.801159  ...         0.0     1.077243        0.0
    +## 4  ACH-000242   0.879706    3.327687  ...         0.0     0.722466        0.0
    +## 
    +## [5 rows x 536 columns]
    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    DataframeThe observation isSome variables areSome values are
    metadataCell lineModelID, Age, OncotreeLineage“ACH-000001”, 60, “Myeloid”
    expressionCell lineKRAS_Exp2.4, .3
    mutationCell lineKRAS_MutTRUE, FALSE
    +
    +
    +

    3.3 Transform: “What do you want to do with this Dataframe”?

    +

    Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something.

    +

Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s pivot to writing code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform the data to answer our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows.

    +

    Here’s a starting prompt:

    +
    +

    In the metadata dataframe, which rows would you subset for and columns would you subset for that relate to a scientific question?

    +
    +

We have been using explicit subsetting with numerical indices, such as “I want to filter for rows 20-50 and select columns 2 and 8”. We are now going to switch to implicit subsetting, in which we describe the subsetting criteria via comparison operators and column names, such as:

    +

“I want to subset for rows such that the OncotreeLineage is lung cancer and subset for columns Age and Sex.”

    +

Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names.

    +
    +

    3.3.0.1 Let’s convert our implicit subsetting criteria into code!

    +

To subset for rows implicitly, we will use the conditional operators on Dataframe columns you used in Exercise 2. To formulate a conditional operator expression that checks whether OncotreeLineage is lung cancer:

    +
    metadata['OncotreeLineage'] == "Lung"
    +
    ## 0       False
    +## 1       False
    +## 2       False
    +## 3       False
    +## 4       False
    +##         ...  
    +## 1859    False
    +## 1860    False
    +## 1861    False
    +## 1862    False
    +## 1863     True
    +## Name: OncotreeLineage, Length: 1864, dtype: bool
    +

    Then, we will use the .loc operation (which is different than .iloc operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time:

    +
    metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]]
    +
    ##        Age     Sex
    +## 10    39.0  Female
    +## 13    44.0    Male
    +## 19    55.0  Female
    +## 27    39.0  Female
    +## 28    45.0    Male
    +## ...    ...     ...
    +## 1745  52.0    Male
    +## 1819  84.0    Male
    +## 1820  57.0  Female
    +## 1822  53.0    Male
    +## 1863  62.0    Male
    +## 
    +## [241 rows x 2 columns]
    +

    What’s going on here? The first component of the subset, metadata['OncotreeLineage'] == "Lung", subsets for the rows. It gives us a column of True and False values, and we keep rows that correspond to True values. Then, we specify the column names we want to subset for via a list.

    +

    Here’s another example:

    +
    df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
    +                            'age_case': [25, 43, 21, 65, 7],
    +                            'age_control': [49, 20, 32, 25, 32]})
    +                            
    +df
    +
    ##        status  age_case  age_control
    +## 0     treated        25           49
    +## 1   untreated        43           20
    +## 2   untreated        21           32
    +## 3  discharged        65           25
    +## 4     treated         7           32
    +

“I want to subset for rows such that the status is "treated" and subset for columns status and age_case.”

    +
    df.loc[df.status == "treated", ["status", "age_case"]]
    +
    ##     status  age_case
    +## 0  treated        25
    +## 4  treated         7
    +

    +
    +
    +
    +

    3.4 Summary Statistics

    +

Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarize all the values of a variable in a single numeric summary, such as the mean, median, or mode.

    +

If we look at the data structure of a Dataframe’s column, it is actually not a List, but an object called a Series. It has methods that can compute summary statistics for us. Let’s take a look at a few popular examples:

    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Function methodWhat it takes inWhat it doesReturns
    metadata.Age.mean()metadata.Age as a numeric SeriesComputes the mean value of the Age column.Float (NumPy)
    metadata['Age'].median()metadata['Age'] as a numeric SeriesComputes the median value of the Age column.Float (NumPy)
    metadata.Age.max()metadata.Age as a numeric SeriesComputes the max value of the Age column.Float (NumPy)
    metadata.OncotreeSubtype.value_counts()metadata.OncotreeSubtype as a string SeriesCreates a frequency table of all unique elements in OncotreeSubtype column.Series
    +

    Let’s try it out, with some nice print formatting:

    +
    print("Mean value of Age column:", metadata['Age'].mean())
    +
    ## Mean value of Age column: 47.45187165775401
    +
    print("Frequency of column", metadata.OncotreeLineage.value_counts())
    +
    ## Frequency of column OncotreeLineage
    +## Lung                         241
    +## Lymphoid                     209
    +## CNS/Brain                    123
    +## Skin                         118
    +## Esophagus/Stomach             95
    +## Breast                        92
    +## Bowel                         87
    +## Head and Neck                 81
    +## Myeloid                       77
    +## Bone                          75
    +## Ovary/Fallopian Tube          74
    +## Pancreas                      65
    +## Kidney                        64
    +## Peripheral Nervous System     55
    +## Soft Tissue                   54
    +## Uterus                        41
    +## Fibroblast                    41
    +## Biliary Tract                 40
    +## Bladder/Urinary Tract         39
    +## Normal                        39
    +## Pleura                        35
    +## Liver                         28
    +## Cervix                        25
    +## Eye                           19
    +## Thyroid                       18
    +## Prostate                      14
    +## Vulva/Vagina                   5
    +## Ampulla of Vater               4
    +## Testis                         4
    +## Adrenal Gland                  1
    +## Other                          1
    +## Name: count, dtype: int64
    +

Notice that the output of some of these methods is a Float (NumPy). This refers to a numeric type from the Python module NumPy, which is extremely popular for scientific computing, but we’re not focused on that in this course.

    +
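A quick check (using a toy Series, not the course data) confirms where the returned value comes from:

```python
import pandas as pd

ages = pd.Series([39.0, 44.0, 55.0])
result = ages.mean()

# The mean is a NumPy float, not a plain Python float.
print(type(result))  # e.g. <class 'numpy.float64'>
```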
    +
    +

    3.5 Simple data visualization

    +

We will dedicate extensive time later in this course to data visualization, but the Dataframe’s column, a Series, has a method called .plot() that can help us make simple plots for one variable. The .plot() method makes a line plot by default, but that is not necessarily the plot style we want, so we can give the optional argument kind a String value to specify the plot style. We use it for making a histogram or bar plot.

    + ++++++ + + + + + + + + + + + + + + + + + + + + + + +
    Plot styleUseful forkind =Code
    HistogramNumerics“hist”metadata.Age.plot(kind = "hist")
    Bar plotStrings“bar”metadata.OncotreeSubtype.value_counts().plot(kind = "bar")
    +

    Let’s look at a histogram:

    +
    import matplotlib.pyplot as plt
    +
    +plt.figure()
    +metadata.Age.plot(kind = "hist")
    +plt.show()
    +

    +

    Let’s look at a bar plot:

    +
    plt.figure()
    +metadata.OncotreeLineage.value_counts().plot(kind = "bar")
    +plt.show()
    +

    +

(The plt.figure() and plt.show() functions are used to render the plots on the website, but you don’t need to use them for your exercises. We will discuss this in more detail during our week of data visualization.)

    +
    +

    3.5.0.1 Chained function calls

    +

    Let’s look at our bar plot syntax more carefully. We start with the column metadata.OncotreeLineage, and then we first use the method .value_counts() to get a Series of a frequency table. Then, we take the frequency table Series and use the .plot() method.

    +

    It is quite common in Python to have multiple “chained” function calls, in which the output of .value_counts() is used for the input of .plot() all in one line of code. It takes a bit of time to get used to this!

    +

    Here’s another example of a chained function call, which looks quite complex, but let’s break it down:

    +
    plt.figure()
    +
    +metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar")
    +
    +plt.show()
    +

    +
      +
    1. We first take the entire metadata and do some subsetting, which outputs a Dataframe.
    2. +
    3. We access the OncotreeLineage column, which outputs a Series.
    4. +
    5. We use the method .value_counts(), which outputs a Series.
    6. +
    7. We make a plot out of it!
    8. +
    +

    We could have, alternatively, done this in several lines of code:

    +
    plt.figure()
    +
    +metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ]
    +metadata_subset_lineage = metadata_subset.OncotreeLineage
    +lineage_freq = metadata_subset_lineage.value_counts()
    +lineage_freq.plot(kind = "bar")
    +
    +plt.show()
    +

    +

    These are two different styles of code, but they do the exact same thing. It’s up to you to decide what is easier for you to understand.

    +
    +
    +
    +

    3.6 Exercises

    +

    Exercise for week 3 can be found here.

    + +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/data-wrangling-part-2.html b/docs/data-wrangling-part-2.html new file mode 100644 index 0000000..3e87bd8 --- /dev/null +++ b/docs/data-wrangling-part-2.html @@ -0,0 +1,660 @@ + + + + + + + Chapter 4 Data Wrangling, Part 2 | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 4 Data Wrangling, Part 2

    +

    We will continue to learn about data analysis with Dataframes. Let’s load our three Dataframes from the Depmap project in again:

    +
    import pandas as pd
    +import numpy as np
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +mutation = pd.read_csv("classroom_data/mutation.csv")
    +expression = pd.read_csv("classroom_data/expression.csv")
    +
    +

    4.1 Creating new columns

    +

    Often, we want to perform some kind of transformation on our data’s columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column in a different scale.

    +

To create a new column, you simply assign to it as if it already exists, using the bracket operation [ ], and the column will be created:

    +
    metadata['AgePlusTen'] = metadata['Age'] + 10
    +expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp']
    +expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp'])
    +

    where np.log(x) is a function imported from the module NumPy that takes in a numeric and returns the log-transformed value.

    +

Note: you cannot create a new column via the Dataframe’s attribute syntax, such as: expression.KRAS_Exp_log = np.log(expression.KRAS_Exp).
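A minimal sketch of this pitfall, using a made-up Dataframe: bracket assignment creates a real column, while attribute assignment only sets a Python attribute on the object (pandas emits a warning and no column appears).

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# Bracket assignment creates a real column.
df["log_a"] = np.log(df["a"])

# Attribute assignment does NOT create a column; pandas warns about it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.log_a_attr = np.log(df.a)

print("log_a" in df.columns)       # True
print("log_a_attr" in df.columns)  # False
```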

    +
    +
    +

    4.2 Merging two Dataframes together

    +

    Suppose we have the following Dataframes:

    +

    expression

    + + + + + + + + + + + + + + + + + + + + + + + + + +
    ModelIDPIK3CA_Explog_PIK3CA_Exp
    “ACH-001113”5.1387331.636806
    “ACH-001289”3.1842801.158226
    “ACH-001339”3.1651081.152187
    +

    metadata

    + + + + + + + + + + + + + + + + + + + + + + + + + +
    ModelIDOncotreeLineageAge
    “ACH-001113”“Lung”69
    “ACH-001289”“CNS/Brain”NaN
    “ACH-001339”“Skin”14
    +

    Suppose that I want to compare the relationship between OncotreeLineage and PIK3CA_Exp, but they are columns in different Dataframes. We want a new Dataframe that looks like this:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    ModelIDPIK3CA_Explog_PIK3CA_ExpOncotreeLineageAge
    “ACH-001113”5.1387331.636806“Lung”69
    “ACH-001289”3.1842801.158226“CNS/Brain”NaN
    “ACH-001339”3.1651081.152187“Skin”14
    +

    We see that in both dataframes,

    +
      +
    • the rows (observations) represent cell lines.

    • +
• there is a common column, ModelID, with shared values between the two dataframes that can facilitate the merging process. We call this an index.

    • +
    +

We will use the method .merge() for Dataframes. It takes the Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index.

    +
    merged = metadata.merge(expression)
    +

It’s usually better to specify that index column explicitly to avoid ambiguity, using the on optional argument:

    +
    merged = metadata.merge(expression, on='ModelID')
    +

If the index columns of the two Dataframes are named differently, you can specify the column name for each Dataframe:

    +
    merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID')
    +

One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging to see whether the result makes sense:

    +

    The number of rows and columns of metadata:

    +
    metadata.shape
    +
    ## (1864, 31)
    +

    The number of rows and columns of expression:

    +
    expression.shape
    +
    ## (1450, 538)
    +

    The number of rows and columns of merged:

    +
    merged.shape
    +
    ## (1450, 568)
    +

We see that the number of columns in merged combines the number of columns in metadata and expression, while merged only keeps the rows whose index values are found in both Dataframes’ index columns. This kind of join is called an “inner join”, because in the Venn Diagram of elements common to both index columns, we keep the inner overlap:

    +

    +

You can specify the join style by changing the optional input argument how.

    +
      +
    • how = "outer" keeps all observations - also known as a “full join”

    • +
    • how = "left" keeps all observations in the left Dataframe.

    • +
    • how = "right" keeps all observations in the right Dataframe.

    • +
    • how = "inner" keeps observations common to both Dataframe. This is the default value of how.

    • +
    +
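The effect of the how argument can be sketched with two small made-up Dataframes that share only some ModelID values:

```python
import pandas as pd

left = pd.DataFrame({"ModelID": ["ACH-1", "ACH-2", "ACH-3"],
                     "Age": [60, 23, 41]})
right = pd.DataFrame({"ModelID": ["ACH-2", "ACH-3", "ACH-4"],
                      "KRAS_Exp": [2.4, 0.3, 1.1]})

# Inner join keeps only the IDs found in both Dataframes.
inner = left.merge(right, on="ModelID", how="inner")

# Outer join keeps every ID, filling missing values with NaN.
outer = left.merge(right, on="ModelID", how="outer")

print(inner.shape)  # (2, 3)
print(outer.shape)  # (4, 3)
```

Checking the shapes before and after the merge, as the text suggests, is a quick way to spot an unintended join style.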
    +
    +

    4.3 Grouping and summarizing Dataframes

    +

    In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in metadata, the observation is cell lines, and perhaps we want to group cell lines into their respective cancer type, OncotreeLineage, and look at the mean age for each cancer type.

    +

    We want to take metadata:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    ModelIDOncotreeLineageAge
    “ACH-001113”“Lung”69
    “ACH-001289”“Lung”23
    “ACH-001339”“Skin”14
    “ACH-002342”“Brain”23
    “ACH-004854”“Brain”56
    “ACH-002921”“Brain”67
    +

    into:

    + + + + + + + + + + + + + + + + + + + + + +
    OncotreeLineageMeanAge
    “Lung”46
    “Skin”14
    “Brain”48.67
    +

    To get there, we need to:

    +
      +
• Group the data based on some criteria, such as the elements of OncotreeLineage

    • +
    • Summarize each group via a summary statistic performed on a column, such as Age.

    • +
    +

We first subset to the two columns we need, and then use the methods .groupby(x) and .mean().

    +
    metadata_grouped = metadata.groupby("OncotreeLineage")
    +metadata_grouped['Age'].mean()
    +
    ## OncotreeLineage
    +## Adrenal Gland                55.000000
    +## Ampulla of Vater             65.500000
    +## Biliary Tract                58.450000
    +## Bladder/Urinary Tract        65.166667
    +## Bone                         20.854545
    +## Bowel                        58.611111
    +## Breast                       50.961039
    +## CNS/Brain                    43.849057
    +## Cervix                       47.136364
    +## Esophagus/Stomach            57.855556
    +## Eye                          51.100000
    +## Fibroblast                   38.194444
    +## Head and Neck                60.149254
    +## Kidney                       46.193548
    +## Liver                        43.928571
    +## Lung                         55.444444
    +## Lymphoid                     38.916667
    +## Myeloid                      38.810811
    +## Normal                       52.370370
    +## Other                        46.000000
    +## Ovary/Fallopian Tube         51.980769
    +## Pancreas                     60.226415
    +## Peripheral Nervous System     5.480000
    +## Pleura                       61.000000
    +## Prostate                     61.666667
    +## Skin                         49.033708
    +## Soft Tissue                  27.500000
    +## Testis                       25.000000
    +## Thyroid                      63.235294
    +## Uterus                       62.060606
    +## Vulva/Vagina                 75.400000
    +## Name: Age, dtype: float64
    +

    Here’s what’s going on:

    +
      +
• We use the Dataframe method .groupby(x) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the metadata Dataframe, but it makes a note that it’s been grouped.

    • +
    • We subset to the column Age. The grouping information still persists (This is a Grouped Series object).

    • +
    • We use the method .mean() to calculate the mean value of Age within each group defined by OncotreeLineage.

    • +
    +

    Alternatively, this could have been done in a chain of methods:

``` python
metadata.groupby("OncotreeLineage")["Age"].mean()
```

```
## OncotreeLineage
## Adrenal Gland                55.000000
## Ampulla of Vater             65.500000
## Biliary Tract                58.450000
## Bladder/Urinary Tract        65.166667
## Bone                         20.854545
## Bowel                        58.611111
## Breast                       50.961039
## CNS/Brain                    43.849057
## Cervix                       47.136364
## Esophagus/Stomach            57.855556
## Eye                          51.100000
## Fibroblast                   38.194444
## Head and Neck                60.149254
## Kidney                       46.193548
## Liver                        43.928571
## Lung                         55.444444
## Lymphoid                     38.916667
## Myeloid                      38.810811
## Normal                       52.370370
## Other                        46.000000
## Ovary/Fallopian Tube         51.980769
## Pancreas                     60.226415
## Peripheral Nervous System     5.480000
## Pleura                       61.000000
## Prostate                     61.666667
## Skin                         49.033708
## Soft Tissue                  27.500000
## Testis                       25.000000
## Thyroid                      63.235294
## Uterus                       62.060606
## Vulva/Vagina                 75.400000
## Name: Age, dtype: float64
```

Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned last week, such as .mean(), .median(), and .max(), can be used. One new summary statistics method that is useful for this grouping and summarizing analysis is .count(), which tells you how many entries are counted within each group.
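As a quick sketch of .count() on a small, made-up Dataframe (not the DepMap metadata):

``` python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({"OncotreeLineage": ["Lung", "Lung", "Skin"],
                    "Age": [55, 60, 49]})

# .count() reports the number of non-missing Age entries per group
counts = toy.groupby("OncotreeLineage")["Age"].count()
print(counts["Lung"])  # 2
print(counts["Skin"])  # 1
```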


    4.3.1 Optional: Multiple grouping, Multiple columns, Multiple summary statistics


    Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously.


    For example, you may want to group by a combination of OncotreeLineage and AgeCategory, such as “Lung” and “Adult” as one grouping. You can do so like this:

``` python
metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"])
metadata_grouped['Age'].mean()
```

```
## OncotreeLineage   AgeCategory
## Adrenal Gland     Adult          55.000000
## Ampulla of Vater  Adult          65.500000
##                   Unknown              NaN
## Biliary Tract     Adult          58.450000
##                   Unknown              NaN
##                                    ...
## Thyroid           Unknown              NaN
## Uterus            Adult          62.060606
##                   Fetus                NaN
##                   Unknown              NaN
## Vulva/Vagina      Adult          75.400000
## Name: Age, Length: 72, dtype: float64
```
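If you prefer the group labels as ordinary columns rather than a hierarchical index, pandas offers the .reset_index() method; a minimal sketch on made-up data:

``` python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({"OncotreeLineage": ["Lung", "Lung", "Skin"],
                    "AgeCategory": ["Adult", "Adult", "Adult"],
                    "Age": [55, 60, 49]})

# .reset_index() turns the two grouping levels back into regular columns
means = toy.groupby(["OncotreeLineage", "AgeCategory"])["Age"].mean().reset_index()
print(list(means.columns))  # ['OncotreeLineage', 'AgeCategory', 'Age']
```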

    You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the .agg(x) method on a Grouped Dataframe.


    For example, coming back to our age case-control Dataframe,

``` python
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

df
```

```
##        status  age_case  age_control
## 0     treated        25           49
## 1   untreated        43           20
## 2   untreated        21           32
## 3  discharged        65           25
## 4     treated         7           32
```

    We group by status and summarize age_case and age_control with a few summary statistics each:

``` python
df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]})
```

```
##            age_case age_control
##                mean         min max  mean
## status
## discharged     65.0          25  25  25.0
## treated        16.0          32  49  40.5
## untreated      32.0          20  32  26.0
```
    +

The input argument to the .agg(x) method is called a Dictionary, which lets you structure information in a paired relationship. You can learn more about dictionaries here.
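A short sketch of a dictionary's key-value pairing, using the same column names as above:

``` python
# keys are column names; values name the summary statistic(s) to apply
summary_spec = {"age_case": "mean", "age_control": ["min", "max", "mean"]}

print(summary_spec["age_case"])     # mean
print(summary_spec["age_control"])  # ['min', 'max', 'mean']
```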


    4.4 Exercises


    Exercise for week 4 can be found here.

diff --git a/docs/data-wrangling.html b/docs/data-wrangling.html
new file mode 100644
index 0000000..b3024ac
--- /dev/null
+++ b/docs/data-wrangling.html
@@ -0,0 +1,303 @@
(new rendered page: "Chapter 3 Data Wrangling | Introduction to Python")

diff --git a/docs/images/colab.png b/docs/images/colab.png
new file mode 100644
index 0000000..ccd9004
Binary files /dev/null and b/docs/images/colab.png differ

diff --git a/docs/images/join.png b/docs/images/join.png
new file mode 100644
index 0000000..d408d6b
Binary files /dev/null and b/docs/images/join.png differ

diff --git a/docs/images/pandas subset_1.png b/docs/images/pandas subset_1.png
new file mode 100644
index 0000000..45376f2
Binary files /dev/null and b/docs/images/pandas subset_1.png differ

diff --git a/docs/images/pandas_subset_0.png b/docs/images/pandas_subset_0.png
new file mode 100644
index 0000000..2a37d28
Binary files /dev/null and b/docs/images/pandas_subset_0.png differ

diff --git a/docs/images/pandas_subset_2.png b/docs/images/pandas_subset_2.png
new file mode 100644
index 0000000..eecc68e
Binary files /dev/null and b/docs/images/pandas_subset_2.png differ

diff --git a/docs/index.html b/docs/index.html
index 9e6b078..8204dc7 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -152,7 +152,8 @@
  • 1.4 Google Colab Setup
  • 1.5 Grammar Structure 1: Evaluation of Expressions
  • 1.6 Grammar Structure 2: Storing data types in the Variable Environment
  • 1.8 Tips on writing your first code
+ • 1.9 Exercises
+ • 2 Working with data structures
+ • 3 Data Wrangling, Part 1
+ • 4 Data Wrangling, Part 2
+ • 5 Data Visualization
  • About the Authors
- • 2 References
+ • 6 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -202,7 +246,7 @@

    About this Course

diff --git a/docs/index.md b/docs/index.md
index bdb6e62..c4161aa 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,6 +1,6 @@
 ---
 title: "Introduction to Python"
-date: "August, 2024"
+date: "September, 2024"
 site: bookdown::bookdown_site
 documentclass: book
 bibliography: [book.bib]

diff --git a/docs/intro-to-computing.html b/docs/intro-to-computing.html
index dcc75b6..6d0ee0a 100644
--- a/docs/intro-to-computing.html
+++ b/docs/intro-to-computing.html
@@ -152,7 +152,8 @@

    @@ -236,16 +280,17 @@


    1.4 Google Colab Setup

Google Colab is an Integrated Development Environment (IDE) in a web browser. Think of it as Microsoft Word compared to a plain text editor: it provides extra bells and whistles that make using Python easier for the user.

Let's open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named "KRAS Demo" in your Google Classroom workspace. If you are taking this course on your own time, you can view it here.

    Today, we will pay close attention to:

  • Python Console ("Executions"): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

  • Notebook: in the central panel of the website, you will see Python code interspersed with word-document-style text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us understand better the code we are writing.

  • Variable Environment: Open it by clicking on the "{x}" button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout the rest of your Python code.

    The first thing we will do is see the different ways we can run Python code. You can do the following:

    1. Type something into the Python Console (Execution) and click the arrow button, such as 2+2. The Python Console will run it and give you an output.
    2. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data.
    3. Run every single Python code chunk via Runtime -> Run all.
    @@ -257,13 +302,14 @@

The version of Python used in this course and in Google Colab is Python 3, which is the version of Python that is most supported. Some Python software is written in Python 2, which is very similar but has some notable differences.

    Now, we will get to the basics of programming grammar.

    1.5 Grammar Structure 1: Evaluation of Expressions

• Expressions are built out of operations or functions.

    • Functions and operations take in data types as inputs, do something with them, and return another data type as output.

    • We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it.

For instance, consider the following expressions entered into the Python Console:

    @@ -285,9 +331,22 @@

    ## 39
    add(18, add(21, 65))
    ## 104
Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that are easier to read. (Most functions in Python are stored in collections of functions called modules that need to be loaded. The import statement gives us access to the functions in the module "operator".)

    1.5.1 Function machine schema

    +

    A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class:

[Figure: Function machine from algebra class.]

    Here are some aspects of this schema to pay attention to:

    • A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.

    • A function can have different kinds of inputs and outputs - it doesn't need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs.
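The function machine idea can be sketched directly in the console; here the same function accepts different kinds of inputs:

``` python
# input goes in, the function does its work, an output comes back
print(len("hello"))    # 5: String in, Integer out
print(len([1, 2, 3]))  # 3: List in, Integer out
```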

    1.5.2 Data types

    Here are some common data types we will be using in this course.
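As a quick sketch, the type() function (covered more formally later) shows which of these data types a value belongs to:

``` python
# each literal below belongs to one of the common data types
print(type(2))        # <class 'int'>
print(type(2.5))      # <class 'float'>
print(type("hello"))  # <class 'str'>
print(type(True))     # <class 'bool'>
```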

    @@ -320,16 +379,6 @@

    @@ -342,18 +391,18 @@

1.6.1 Execution rule for variable assignment:

    Evaluate the expression to the right of =.

    Bind variable to the left of = to the resulting value.

The variable is stored in the Variable Environment.

    The Variable Environment is where all the variables are stored, and can be used for an expression anytime once it is defined. Only one unique variable name can be defined.

The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 Gigabytes of RAM.
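If you are curious how much of that memory a particular variable occupies, the standard library's sys.getsizeof() gives a rough, implementation-dependent answer (a sketch):

``` python
import sys

x = 39
# size of the integer object in bytes; the exact number varies by Python version
print(sys.getsizeof(x) > 0)  # True
```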

    Look, now x can be reused downstream:

    x - 2
    ## 37
    y = x * 2
It is quite common for programmers to have to look up the data type of a variable while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python:

    type(y)
    ## <class 'int'>
We should give useful variable names so that we know what to expect! If you are working with numerical sales data, consider num_sales instead of y.

    @@ -362,13 +411,15 @@

1.7 Grammar Structure 3: Evaluation of Functions

    1.7.1 Execution rule for functions:

Evaluate the function by its arguments if there are any, and if the arguments are functions or contain operations, evaluate those functions or operations first.

    The output of functions is called the returned value.

Often, we will use multiple functions in a nested way, and it is important to understand how the Python console determines the order of operations. We can also use parentheses to change the order of operations. Think about what Python is going to do step-by-step in the lines of code below:

``` python
max(len("hello"), 4)
```

```
## 5
```

``` python
(len("pumpkin") - 8) * 2
```

```
## -2
```

    If we don’t know how to use a function, such as pow(), we can ask for help:

    ?pow
     
     pow(base, exp, mod=None)
    @@ -376,33 +427,76 @@ 

Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form.

We can also find a similar help document, in a nicer rendered form online. We will practice looking at function documentation throughout the course, because that is a fundamental skill for learning more functions on your own.

The documentation shows the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value of mod=None, that means the input argument already has a default value, and you don't need to specify anything, unless you want to.
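A small sketch of the optional mod argument in action (values chosen for illustration):

``` python
# mod defaults to None, so pow(7, 2) is plain exponentiation
print(pow(7, 2))     # 49
# supplying mod computes (7 ** 2) % 5
print(pow(7, 2, 5))  # 4
```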

    The following ways are equivalent ways of using the pow() function:

``` python
pow(2, 3)
## 8
pow(base=2, exp=3)
## 8
pow(exp=3, base=2)
## 8
```

but this will give you something different:

``` python
pow(3, 2)
## 9
```

And there is an operational equivalent:

``` python
2 ** 3
## 8
```

We will mostly look at functions with input arguments and return types in this course, but not all functions need to have input arguments or return an output. Let's look at some examples of functions that don't always have an input or output:

| Function call | What it takes in | What it does | Returns |
|---|---|---|---|
| pow(a, b) | integer a, integer b | Raises a to the bth power. | Integer |
| time.sleep(x) | Integer x | Waits for x seconds. | None |
| dir() | Nothing | Gives a list of all the variables defined in the environment. | List |
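A minimal sketch of the last two rows of the table: a function that returns None, and a function that takes no inputs:

``` python
import time

ret = time.sleep(0)  # waits 0 seconds; the returned value is None
print(ret is None)   # True

names = dir()                   # no input; returns the names defined in scope
print(isinstance(names, list))  # True
```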

    1.8 Tips on writing your first code

    Computer = powerful + stupid

Computers are excellent at doing something specific over and over again, but are extremely rigid and lack flexibility. Here are some tips that are helpful for beginners:

    • Write incrementally, test often.

    • Don't be afraid to break things: it is how we learn how things work in programming.

    • Check your assumptions, especially using new functions, operations, and new data types.

    • Live environments are great for testing, but not great for reproducibility.

    • Ask for help!
    To get more familiar with the errors Python gives you, take a look at this summary of Python error messages.
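As a small, hypothetical taste of reading an error message, a misspelled function name raises a NameError:

``` python
# 'lenn' is a deliberate typo of 'len'
try:
    lenn("hello")
except NameError as e:
    print(type(e).__name__)  # NameError
```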

    1.9 Exercises


    Exercise for week 1 can be found here.


diff --git a/docs/no_toc/01-intro-to-computing.md b/docs/no_toc/01-intro-to-computing.md
index ebcf137..cc8d4ee 100644
diff --git a/docs/no_toc/02-data-structures.md b/docs/no_toc/02-data-structures.md new file mode 100644 index 0000000..bd24e0d --- /dev/null +++ b/docs/no_toc/02-data-structures.md @@ -0,0 +1,408 @@ + + +# Working with data structures + +In our second lesson, we start to look at two **data structures**, **Lists** and **Dataframes**, that can handle a large amount of data for analysis. + +## Lists + +In the first exercise, you started to explore **data structures**, which store information about data types. You explored **lists**, which is an ordered collection of data types or data structures. Each *element* of a list contains a data type or another data structure. + +We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive. + +We create a list via the bracket `[ ]` operation. + + +``` python +staff = ["chris", "ted", "jeff"] +chrNum = [2, 3, 1, 2, 2] +mixedList = [False, False, False, "A", "B", 92] +``` + +### Subsetting lists + +To access an element of a list, you can use the bracket notation `[ ]` to access the elements of the list. We simply access an element via the "index" number - the location of the data within the list. + +*Here's the tricky thing about the index number: it starts at 0!* + +1st element of `chrNum`: `chrNum[0]` + +2nd element of `chrNum`: `chrNum[1]` + +... + +5th element of `chrNum`: `chrNum[4]` + +With subsetting, you can modify elements of a list or use the element of a list as part of an expression. + +### Subsetting multiple elements of lists + +Suppose you want to access multiple elements of a list, such as accessing the first three elements of `chrNum`. 
You would use the **slice** operator `:`, which specifies: + +- the index number to start + +- the index number to stop, *plus one.* + +If you want to access the first three elements of `chrNum`: + + +``` python +chrNum[0:3] +``` + +``` +## [2, 3, 1] +``` + +The first element's index number is 0, the third element's index number is 2, plus 1, which is 3. + +If you want to access the second and third elements of `chrNum`: + + +``` python +chrNum[1:3] +``` + +``` +## [3, 1] +``` + +Another way of accessing the first 3 elements of `chrNum`: + + +``` python +chrNum[:3] +``` + +``` +## [2, 3, 1] +``` + +Here, the start index number was not specified. When the start or stop index is *not* specified, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively. Here's another example, using negative indices to count 3 elements from the end of the list: + + +``` python +chrNum[-3:] +``` + +``` +## [1, 2, 2] +``` + +You can find more discussion of list slicing, using negative indices and incremental slicing, [here](https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf). + +## Objects in Python + +The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined: + +- What does it contain (in terms of data)? + +- What can it do (in terms of functions)? + +And if it "makes sense" to us, then it is well-designed. + +The list data structure we have been working with is an example of an **Object**. The definition of an object allows us to ask the questions above: *what does it contain, and what can it do?* It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following: + +- **Value** that holds the essential data for the object. 
+ +- **Attributes** that hold subset or additional data for the object. + +- Functions called **Methods** that belong to the object and *have to* take in the object itself as an input + +This organizing structure on an object applies to pretty much all Python data types and data structures. + +Let's see how this applies to the list: + +- **Value**: the contents of the list, such as `[2, 3, 4].` + +- **Attributes** that store additional values: Not relevant for lists. + +- **Methods** that can be used on the object: `chrNum.count(2)` counts the number of instances 2 appears as an element of `chrNum`. + +Object methods are functions that do something with the object you are using them on. You should think about `chrNum.count(2)` as a function that takes in `chrNum` and `2` as inputs. If you want to use the count function on list `mixedList`, you would use `mixedList.count(x)`. + +Here are some more examples of methods with lists: + +| Function method | What it takes in | What it does | Returns | +|---------------|---------------|---------------------------|---------------| +| [`chrNum.count(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Counts the number of instances `x` appears as an element of `chrNum`. | Integer | +| [`chrNum.append(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Appends `x` to the end of `chrNum`. | None (but `chrNum` is modified!) | +| [`chrNum.sort()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Sorts `chrNum` by ascending order. | None (but `chrNum` is modified!) | +| [`chrNum.reverse()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Reverses the order of `chrNum`. | None (but `chrNum` is modified!) | + +## Methods vs Functions + +**Methods** *have to* take in the object of interest as an input: `chrNum.count(2)` automatically treats `chrNum` as an input. 
Methods are built for a specific Object type. + +**Functions** do not have an implied input: `len(chrNum)` requires specifying a list in the input. + +Otherwise, there is no strong distinction between the two. + +## Dataframes + +A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does. + +The Dataframe data structure is found within a Python module called "Pandas". A Python module is an organized collection of functions and data structures. The `import` statement below gives us access to the "Pandas" module via the variable `pd`. + +To load in a Dataframe from existing spreadsheet data, we use the function [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html): + + +``` python +import pandas as pd + +metadata = pd.read_csv("classroom_data/metadata.csv") +type(metadata) +``` + +``` +## <class 'pandas.core.frame.DataFrame'> +``` + +There is a similar function [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for loading in Excel spreadsheets. + +Let's investigate the Dataframe as an object: + +- What does a Dataframe contain (values, attributes)? + +- What can a Dataframe do (methods)? + +## What does a Dataframe contain? + +We first take a look at the contents: + + +``` python +metadata +``` + +``` +## ModelID ... OncotreeLineage +## 0 ACH-000001 ... Ovary/Fallopian Tube +## 1 ACH-000002 ... Myeloid +## 2 ACH-000003 ... Bowel +## 3 ACH-000004 ... Myeloid +## 4 ACH-000005 ... Myeloid +## ... ... ... ... +## 1859 ACH-002968 ... Esophagus/Stomach +## 1860 ACH-002972 ... Esophagus/Stomach +## 1861 ACH-002979 ... Esophagus/Stomach +## 1862 ACH-002981 ... Esophagus/Stomach +## 1863 ACH-003071 ... Lung +## +## [1864 rows x 30 columns] +``` + +It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, it shows only some of the data. + +We can look at specific columns by looking at **attributes** via the dot operation. We can also look at the columns via the bracket operation. + + +``` python +metadata.ModelID +``` + +``` +## 0 ACH-000001 +## 1 ACH-000002 +## 2 ACH-000003 +## 3 ACH-000004 +## 4 ACH-000005 +## ... +## 1859 ACH-002968 +## 1860 ACH-002972 +## 1861 ACH-002979 +## 1862 ACH-002981 +## 1863 ACH-003071 +## Name: ModelID, Length: 1864, dtype: object +``` + +``` python +metadata['ModelID'] +``` + +``` +## 0 ACH-000001 +## 1 ACH-000002 +## 2 ACH-000003 +## 3 ACH-000004 +## 4 ACH-000005 +## ... +## 1859 ACH-002968 +## 1860 ACH-002972 +## 1861 ACH-002979 +## 1862 ACH-002981 +## 1863 ACH-003071 +## Name: ModelID, Length: 1864, dtype: object +``` + +The names of all columns are stored as an attribute, which can be accessed via the dot operation. + + +``` python +metadata.columns +``` + +``` +## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age', +## 'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory', +## 'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis', +## 'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype', +## 'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments', +## 'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus', +## 'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype', +## 'OncotreePrimaryDisease', 'OncotreeLineage'], +## dtype='object') +``` + +The number of rows and columns are also stored as an attribute: + + +``` python +metadata.shape +``` + +``` +## (1864, 30) +``` + +## What can a Dataframe do? 
+ +We can use the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods to look at the first few rows and last few rows of `metadata`, respectively: + + +``` python +metadata.head() +``` + +``` +## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage +## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube +## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid +## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel +## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## +## [5 rows x 30 columns] +``` + +``` python +metadata.tail() +``` + +``` +## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage +## 1859 ACH-002968 PT-pjhrsc ... Esophagogastric Adenocarcinoma Esophagus/Stomach +## 1860 ACH-002972 PT-dkXZB1 ... Esophagogastric Adenocarcinoma Esophagus/Stomach +## 1861 ACH-002979 PT-lyHTzo ... Esophagogastric Adenocarcinoma Esophagus/Stomach +## 1862 ACH-002981 PT-Z9akXf ... Esophagogastric Adenocarcinoma Esophagus/Stomach +## 1863 ACH-003071 PT-LAGmLq ... Lung Neuroendocrine Tumor Lung +## +## [5 rows x 30 columns] +``` + +Both of these functions (without input arguments) are considered **methods**: they are functions that do something with the Dataframe you are using them on. You should think about `metadata.head()` as a function that takes in `metadata` as an input. If we had another Dataframe called `my_data` and you want to use the same function, you will have to say `my_data.head()`. + +## Subsetting Dataframes + +Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists. 
+ +You will use the [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) attribute and bracket operations, and you give two slices: one for the row, and one for the column. + +Let's start with a small dataframe to see how it works before returning to `metadata`: + + +``` python +df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], + 'age_case': [25, 43, 21, 65, 7], + 'age_control': [49, 20, 32, 25, 32]}) +df +``` + +``` +## status age_case age_control +## 0 treated 25 49 +## 1 untreated 43 20 +## 2 untreated 21 32 +## 3 discharged 65 25 +## 4 treated 7 32 +``` + +Here is what the dataframe looks like with the row and column index numbers: + +![](images/pandas_subset_0.png) + +Subset the first four rows, and the first two columns: + +![](images/pandas_subset_1.png) + +Now, back to `metadata` dataframe: + +Subset the first 5 rows, and first two columns: + + +``` python +metadata.iloc[:5, :2] +``` + +``` +## ModelID PatientID +## 0 ACH-000001 PT-gj46wT +## 1 ACH-000002 PT-5qa3uk +## 2 ACH-000003 PT-puKIyc +## 3 ACH-000004 PT-q4K2cp +## 4 ACH-000005 PT-q4K2cp +``` + +If we want a custom slice that is not sequential, we can use an integer list. Subset all but the first 5 rows, and the columns at index 1, 10, and 21: + + +``` python +metadata.iloc[5:, [1, 10, 21]] +``` + +``` +## PatientID GrowthPattern WTSIMasterCellID +## 5 PT-ej13Dz Suspension 2167.0 +## 6 PT-NOXwpH Adherent 569.0 +## 7 PT-fp8PeY Adherent 1806.0 +## 8 PT-puKIyc Adherent 2104.0 +## 9 PT-AR7W9o Adherent NaN +## ... ... ... ... +## 1859 PT-pjhrsc Organoid NaN +## 1860 PT-dkXZB1 Organoid NaN +## 1861 PT-lyHTzo Organoid NaN +## 1862 PT-Z9akXf Organoid NaN +## 1863 PT-LAGmLq Suspension NaN +## +## [1859 rows x 3 columns] +``` + +When we subset via numerical indices, it's called **explicit subsetting**. 
This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the columns. Then your code that subsets rows and columns by numerical indices will get you a different answer once the spreadsheet is changed. + +The second way is to subset by the column name and comparison operators, also known as **implicit subsetting**. This is much more robust in data analysis practice. You will learn about it next week! + +## Exercises + +Exercise for week 2 can be found [here](https://colab.research.google.com/drive/1oIL3gKEZR2Lq16k6XY0HXIhjYl34pEjr?usp=sharing). diff --git a/docs/no_toc/03-data-wrangling1.md b/docs/no_toc/03-data-wrangling1.md new file mode 100644 index 0000000..50ee0c1 --- /dev/null +++ b/docs/no_toc/03-data-wrangling1.md @@ -0,0 +1,354 @@ + + +# Data Wrangling, Part 1 + +From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to various steps in the data science workflow, which is a natural cycle that occurs in data analysis. + +![Data science workflow. Image source: R for Data Science.](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png){width="550"} + +For the rest of the course, we focus on *Transform* and *Visualize* with the assumption that our data is in a nice, "Tidy format". First, we need to understand what it means for data to be "Tidy". + +## Tidy Data + +Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of **tidy data**, developed by Hadley Wickham, are: + +1. Each variable must have its own column. + +2. Each observation must have its own row. + +3. 
Each value must have its own cell. + +If you want to be technical about what variables and observations are, Hadley Wickham describes: + +> A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes. + +![A tidy dataframe. Image source: R for Data Science.](https://r4ds.hadley.nz/images/tidy-1.png){width="800"} + +## Our working Tidy Data: DepMap Project + +The [Dependency Map project](https://depmap.org/portal/) is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session. + +- Metadata + +- Somatic mutations + +- Gene expression + +- Drug sensitivity + +- CRISPR knockout + +- and more... + +Let's load these datasets in and see how they fit the definition of Tidy data: + + +``` python +import pandas as pd + +metadata = pd.read_csv("classroom_data/metadata.csv") +mutation = pd.read_csv("classroom_data/mutation.csv") +expression = pd.read_csv("classroom_data/expression.csv") +``` + + +``` python +metadata.head() +``` + +``` +## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage +## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube +## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid +## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel +## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid +## +## [5 rows x 30 columns] +``` + + +``` python +mutation.head() +``` + +``` +## ModelID CACNA1D_Mut CYP2D6_Mut ... CCDC28A_Mut C1orf194_Mut U2AF1_Mut +## 0 ACH-000001 False False ... False False False +## 1 ACH-000002 False False ... 
False False False +## 2 ACH-000004 False False ... False False False +## 3 ACH-000005 False False ... False False False +## 4 ACH-000006 False False ... False False False +## +## [5 rows x 540 columns] +``` + + +``` python +expression.head() +``` + +``` +## ModelID ENPP4_Exp CREBBP_Exp ... OR5D13_Exp C2orf81_Exp OR8S1_Exp +## 0 ACH-001113 2.280956 4.094236 ... 0.0 1.726831 0.0 +## 1 ACH-001289 3.622930 3.606442 ... 0.0 0.790772 0.0 +## 2 ACH-001339 0.790772 2.970854 ... 0.0 0.575312 0.0 +## 3 ACH-001538 3.485427 2.801159 ... 0.0 1.077243 0.0 +## 4 ACH-000242 0.879706 3.327687 ... 0.0 0.722466 0.0 +## +## [5 rows x 536 columns] +``` + +| Dataframe | The observation is | Some variables are | Some values are | +|-----------------|-----------------|--------------------|------------------| +| metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" | +| expression | Cell line | KRAS_Exp | 2.4, .3 | +| mutation | Cell line | KRAS_Mut | TRUE, FALSE | + +## Transform: "What do you want to do with this Dataframe"? + +Remember that a major theme of the course is about: **How we organize ideas \<-\> Instructing a computer to do something.** + +Until now, we haven't focused too much on how we organize our scientific ideas and translate them into code. Let's pivot to write our code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform our data in a way that satisfies our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows. + +Here's a starting prompt: + +> In the `metadata` dataframe, which rows and which columns would you subset for to address a scientific question? + +We have been using **explicit subsetting** with numerical indices, such as "I want to filter for rows 20-50 and select columns 2 and 8". 
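A request like that could be sketched as follows on a small, made-up dataframe (the column names and positions here are invented purely for illustration):

``` python
import pandas as pd

# A small, made-up dataframe standing in for real data
df = pd.DataFrame({'col0': range(100, 160, 10),
                   'col1': range(10, 70, 10),
                   'col2': range(1, 7)})

# Explicit subsetting: rows and columns are named by position only
subset = df.iloc[2:5, [0, 2]]
print(subset)
```

Notice that nothing in the request refers to what the data *means* - only to where it happens to sit.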
We are now going to switch to **implicit subsetting** in which we describe the subsetting criteria via comparison operators and column names, such as: + +*"I want to subset for rows such that the OncotreeLineage is lung cancer and subset for columns Age and Sex."* + +Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names. + +#### Let's convert our implicit subsetting criteria into code! + +To subset for rows implicitly, we will use the conditional operators on Dataframe columns you used in Exercise 2. To formulate a conditional operator expression checking that OncotreeLineage is lung cancer: + + +``` python +metadata['OncotreeLineage'] == "Lung" +``` + +``` +## 0 False +## 1 False +## 2 False +## 3 False +## 4 False +## ... +## 1859 False +## 1860 False +## 1861 False +## 1862 False +## 1863 True +## Name: OncotreeLineage, Length: 1864, dtype: bool +``` + +Then, we will use the [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) operation (which is different than [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time: + + +``` python +metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]] +``` + +``` +## Age Sex +## 10 39.0 Female +## 13 44.0 Male +## 19 55.0 Female +## 27 39.0 Female +## 28 45.0 Male +## ... ... ... +## 1745 52.0 Male +## 1819 84.0 Male +## 1820 57.0 Female +## 1822 53.0 Male +## 1863 62.0 Male +## +## [241 rows x 2 columns] +``` + +What's going on here? The first component of the subset, `metadata['OncotreeLineage'] == "Lung"`, subsets for the rows. It gives us a column of `True` and `False` values, and we keep rows that correspond to `True` values. Then, we specify the column names we want to subset for via a list. 
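The two-step logic can be seen on a small, self-contained dataframe (the column names and values here are made up for illustration):

``` python
import pandas as pd

df = pd.DataFrame({'lineage': ["Lung", "Skin", "Lung", "Bowel"],
                   'age': [60.0, 45.0, 52.0, 70.0],
                   'sex': ["Female", "Male", "Male", "Female"]})

# Step 1: the comparison returns a boolean Series, one True/False per row
mask = df['lineage'] == "Lung"

# Step 2: .loc keeps only the True rows, and only the listed columns
result = df.loc[mask, ["age", "sex"]]
print(result)
```

If the rows are reordered or new rows are added, the same code still selects exactly the lung cell lines, which is what makes implicit subsetting robust.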
+ +Here's another example: + + +``` python +df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], + 'age_case': [25, 43, 21, 65, 7], + 'age_control': [49, 20, 32, 25, 32]}) + +df +``` + +``` +## status age_case age_control +## 0 treated 25 49 +## 1 untreated 43 20 +## 2 untreated 21 32 +## 3 discharged 65 25 +## 4 treated 7 32 +``` + +*"I want to subset for rows such that the status is "treated" and subset for columns status and age_case."* + + +``` python +df.loc[df.status == "treated", ["status", "age_case"]] +``` + +``` +## status age_case +## 0 treated 25 +## 4 treated 7 +``` + +![](images/pandas_subset_2.png) + +## Summary Statistics + +Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarize all the values from a variable in a numeric summary, such as mean, median, or mode. + +If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples: + +| Function method | What it takes in | What it does | Returns | +|----------------|----------------|------------------------|----------------| +| [`metadata.Age.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html) | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) | +| [`metadata['Age'].median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) | +| [`metadata.Age.max()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. 
| Float (NumPy) | +| [`metadata.OncotreeSubtype.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series | + +Let's try it out, with some nice print formatting: + + +``` python +print("Mean value of Age column:", metadata['Age'].mean()) +``` + +``` +## Mean value of Age column: 47.45187165775401 +``` + +``` python +print("Frequency of column", metadata.OncotreeLineage.value_counts()) +``` + +``` +## Frequency of column OncotreeLineage +## Lung 241 +## Lymphoid 209 +## CNS/Brain 123 +## Skin 118 +## Esophagus/Stomach 95 +## Breast 92 +## Bowel 87 +## Head and Neck 81 +## Myeloid 77 +## Bone 75 +## Ovary/Fallopian Tube 74 +## Pancreas 65 +## Kidney 64 +## Peripheral Nervous System 55 +## Soft Tissue 54 +## Uterus 41 +## Fibroblast 41 +## Biliary Tract 40 +## Bladder/Urinary Tract 39 +## Normal 39 +## Pleura 35 +## Liver 28 +## Cervix 25 +## Eye 19 +## Thyroid 18 +## Prostate 14 +## Vulva/Vagina 5 +## Ampulla of Vater 4 +## Testis 4 +## Adrenal Gland 1 +## Other 1 +## Name: count, dtype: int64 +``` + +Notice that the output of some of these methods is a Float (NumPy). This refers to a Python module called NumPy that is extremely popular for scientific computing, but we're not focused on that in this course. + +## Simple data visualization + +We will dedicate extensive time later this course to talk about data visualization, but the Dataframe's column, Series, has a method called [`.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but it is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot. 
+ +| Plot style | Useful for | kind = | Code | +|-------------|-------------|-------------|---------------------------------| +| Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` | +| Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` | + +Let's look at a histogram: + + +``` python +import matplotlib.pyplot as plt + +plt.figure() +metadata.Age.plot(kind = "hist") +plt.show() +``` + + + +Let's look at a bar plot: + + +``` python +plt.figure() +metadata.OncotreeLineage.value_counts().plot(kind = "bar") +plt.show() +``` + + + +(The `plt.figure()` and `plt.show()` functions are used to render the plots on the website, but you don't need to use them for your exercises. We will discuss this in more detail during our week of data visualization.) + +#### Chained function calls + +Let's look at our bar plot syntax more carefully. We start with the column `metadata.OncotreeLineage`, and then we first use the method `.value_counts()` to get a frequency table as a *Series*. Then, we take the frequency table Series and use the `.plot()` method. + +It is quite common in Python to have multiple "chained" function calls, in which the output of `.value_counts()` is used as the input of `.plot()` all in one line of code. It takes a bit of time to get used to this! + +Here's another example of a chained function call, which looks quite complex, but let's break it down: + + +``` python +plt.figure() + +metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar") + +plt.show() +``` + + + +1. We first take the entire `metadata` and do some subsetting, which outputs a Dataframe. +2. We access the `OncotreeLineage` column, which outputs a Series. +3. We use the method `.value_counts()`, which outputs a Series. +4. We make a plot out of it! 
+ +We could have, alternatively, done this in several lines of code: + + +``` python +plt.figure() + +metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ] +metadata_subset_lineage = metadata_subset.OncotreeLineage +lineage_freq = metadata_subset_lineage.value_counts() +lineage_freq.plot(kind = "bar") + +plt.show() +``` + + + +These are two different *styles* of code, but they do the exact same thing. It's up to you to decide what is easier for you to understand. + +## Exercises + +Exercise for week 3 can be found [here](https://colab.research.google.com/drive/1ClNOJviyrcaaoVq5F-YtsO7NhMqn315c?usp=sharing). diff --git a/docs/no_toc/04-data-wrangling2.md b/docs/no_toc/04-data-wrangling2.md new file mode 100644 index 0000000..77cb01c --- /dev/null +++ b/docs/no_toc/04-data-wrangling2.md @@ -0,0 +1,334 @@ + + +# Data Wrangling, Part 2 + +We will continue to learn about data analysis with Dataframes. Let's load our three Dataframes from the Depmap project in again: + + +``` python +import pandas as pd +import numpy as np + +metadata = pd.read_csv("classroom_data/metadata.csv") +mutation = pd.read_csv("classroom_data/mutation.csv") +expression = pd.read_csv("classroom_data/expression.csv") +``` + +## Creating new columns + +Often, we want to perform some kind of transformation on our data's columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column in a different scale. + +To create a new column, you simply assign to it as if it already exists, using the bracket operation `[ ]`, and the column will be created: + + +``` python +metadata['AgePlusTen'] = metadata['Age'] + 10 +expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp'] +expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp']) +``` + +where [`np.log(x)`](https://numpy.org/doc/stable/reference/generated/numpy.log.html) is a function imported from the module NumPy that takes in a numeric and returns the log-transformed value. 
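The same column-creation pattern can be sketched on a small, self-contained dataframe (the column names here are made up for illustration):

``` python
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, 2.0, 4.0]})

# Assigning to a column name that doesn't exist yet creates that column
df['x_plus_ten'] = df['x'] + 10
df['log_x'] = np.log(df['x'])

print(df)
```

The new columns are computed element-by-element, so every row gets its own transformed value at once.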
+ +Note: you cannot create a new column by referring to it as an attribute of the Dataframe, such as: `expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)`. + +## Merging two Dataframes together + +Suppose we have the following Dataframes: + +`expression` + +| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | +|--------------|------------|----------------| +| "ACH-001113" | 5.138733 | 1.636806 | +| "ACH-001289" | 3.184280 | 1.158226 | +| "ACH-001339" | 3.165108 | 1.152187 | + +`metadata` + +| ModelID | OncotreeLineage | Age | +|--------------|-----------------|-----| +| "ACH-001113" | "Lung" | 69 | +| "ACH-001289" | "CNS/Brain" | NaN | +| "ACH-001339" | "Skin" | 14 | + +Suppose that I want to compare the relationship between `OncotreeLineage` and `PIK3CA_Exp`, but they are columns in different Dataframes. We want a new Dataframe that looks like this: + +| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age | +|--------------|------------|----------------|-----------------|-----| +| "ACH-001113" | 5.138733 | 1.636806 | "Lung" | 69 | +| "ACH-001289" | 3.184280 | 1.158226 | "CNS/Brain" | NaN | +| "ACH-001339" | 3.165108 | 1.152187 | "Skin" | 14 | + +We see that in both dataframes, + +- the rows (observations) represent cell lines. + +- there is a common column `ModelID`, with shared values between the two dataframes that can facilitate the merging process. We call this an **index**. + +We will use the method [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index. 
+ + +``` python +merged = metadata.merge(expression) +``` + +It's usually better to specify what that index column is, to avoid ambiguity, using the `on` optional argument: + + +``` python +merged = metadata.merge(expression, on='ModelID') +``` + +If the index columns for the two Dataframes are named differently, you can specify the column name for each Dataframe: + + +``` python +merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID') +``` + +One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging to see whether it makes sense or not: + +The number of rows and columns of `metadata`: + + +``` python +metadata.shape +``` + +``` +## (1864, 31) +``` + +The number of rows and columns of `expression`: + + +``` python +expression.shape +``` + +``` +## (1450, 538) +``` + +The number of rows and columns of `merged`: + + +``` python +merged.shape +``` + +``` +## (1450, 568) +``` + +We see that the number of *columns* in `merged` combines the number of columns in `metadata` and `expression`, while the number of *rows* in `merged` is the smaller of the number of rows in `metadata` and `expression`: it only keeps rows that are found in both Dataframes' index columns. This kind of join is called "inner join", because in the Venn Diagram of elements common to both index columns, we keep the inner overlap: + +![](images/join.png) + +You can specify the join style by changing the optional input argument `how`. + +- `how = "outer"` keeps all observations - also known as a "full join" + +- `how = "left"` keeps all observations in the left Dataframe. + +- `how = "right"` keeps all observations in the right Dataframe. + +- `how = "inner"` keeps observations common to both Dataframes. This is the default value of `how`. + +## Grouping and summarizing Dataframes + +In a dataset, there may be groups of observations that we want to understand, such as case vs. 
control, or comparing different cancer subtypes. For example, in `metadata`, the observation is cell lines, and perhaps we want to group cell lines into their respective cancer type, `OncotreeLineage`, and look at the mean age for each cancer type. + +We want to take `metadata`: + +| ModelID | OncotreeLineage | Age | +|--------------|-----------------|-----| +| "ACH-001113" | "Lung" | 69 | +| "ACH-001289" | "Lung" | 23 | +| "ACH-001339" | "Skin" | 14 | +| "ACH-002342" | "Brain" | 23 | +| "ACH-004854" | "Brain" | 56 | +| "ACH-002921" | "Brain" | 67 | + +into: + +| OncotreeLineage | MeanAge | +|-----------------|---------| +| "Lung" | 46 | +| "Skin" | 14 | +| "Brain" | 48.67 | + +To get there, we need to: + +- **Group** the data based on some criteria, such as the elements of `OncotreeLineage` + +- **Summarize** each group via a summary statistic performed on a column, such as `Age`. + +We first subset the two columns we need, and then use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`. 
+ + +``` python +metadata_grouped = metadata.groupby("OncotreeLineage") +metadata_grouped['Age'].mean() +``` + +``` +## OncotreeLineage +## Adrenal Gland 55.000000 +## Ampulla of Vater 65.500000 +## Biliary Tract 58.450000 +## Bladder/Urinary Tract 65.166667 +## Bone 20.854545 +## Bowel 58.611111 +## Breast 50.961039 +## CNS/Brain 43.849057 +## Cervix 47.136364 +## Esophagus/Stomach 57.855556 +## Eye 51.100000 +## Fibroblast 38.194444 +## Head and Neck 60.149254 +## Kidney 46.193548 +## Liver 43.928571 +## Lung 55.444444 +## Lymphoid 38.916667 +## Myeloid 38.810811 +## Normal 52.370370 +## Other 46.000000 +## Ovary/Fallopian Tube 51.980769 +## Pancreas 60.226415 +## Peripheral Nervous System 5.480000 +## Pleura 61.000000 +## Prostate 61.666667 +## Skin 49.033708 +## Soft Tissue 27.500000 +## Testis 25.000000 +## Thyroid 63.235294 +## Uterus 62.060606 +## Vulva/Vagina 75.400000 +## Name: Age, dtype: float64 +``` + +Here's what's going on: + +- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped. + +- We subset to the column `Age`. The grouping information still persists (This is a Grouped Series object). + +- We use the method `.mean()` to calculate the mean value of `Age` within each group defined by `OncotreeLineage`. 
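The group-then-summarize steps can be sketched on a small, made-up dataframe, mirroring the `metadata` table shown earlier in this section:

``` python
import pandas as pd

df = pd.DataFrame({'lineage': ["Lung", "Lung", "Skin", "Brain", "Brain"],
                   'age': [69, 23, 14, 56, 67]})

grouped = df.groupby("lineage")    # a Grouped Dataframe object; nothing is computed yet
mean_age = grouped['age'].mean()   # the mean of age within each group
print(mean_age)
```

The exact same pattern applies to `metadata`, just with many more rows and groups.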
+ +Alternatively, this could have been done in a chain of methods: + + +``` python +metadata.groupby("OncotreeLineage")["Age"].mean() +``` + +``` +## OncotreeLineage +## Adrenal Gland 55.000000 +## Ampulla of Vater 65.500000 +## Biliary Tract 58.450000 +## Bladder/Urinary Tract 65.166667 +## Bone 20.854545 +## Bowel 58.611111 +## Breast 50.961039 +## CNS/Brain 43.849057 +## Cervix 47.136364 +## Esophagus/Stomach 57.855556 +## Eye 51.100000 +## Fibroblast 38.194444 +## Head and Neck 60.149254 +## Kidney 46.193548 +## Liver 43.928571 +## Lung 55.444444 +## Lymphoid 38.916667 +## Myeloid 38.810811 +## Normal 52.370370 +## Other 46.000000 +## Ovary/Fallopian Tube 51.980769 +## Pancreas 60.226415 +## Peripheral Nervous System 5.480000 +## Pleura 61.000000 +## Prostate 61.666667 +## Skin 49.033708 +## Soft Tissue 27.500000 +## Testis 25.000000 +## Thyroid 63.235294 +## Uterus 62.060606 +## Vulva/Vagina 75.400000 +## Name: Age, dtype: float64 +``` + +Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned from last week, such as `.mean()`, `.median()`, `.max()`, can be used. One new summary statistics method that is useful for this grouping and summarizing analysis is [`.count()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html) which tells you how many entries are counted within each group. + +### Optional: Multiple grouping, Multiple columns, Multiple summary statistics + +Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously. + +For example, you may want to group by a combination of `OncotreeLineage` and `AgeCategory`, such as "Lung" and "Adult" as one grouping. 
You can do so like this: + + +``` python +metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"]) +metadata_grouped['Age'].mean() +``` + +``` +## OncotreeLineage AgeCategory +## Adrenal Gland Adult 55.000000 +## Ampulla of Vater Adult 65.500000 +## Unknown NaN +## Biliary Tract Adult 58.450000 +## Unknown NaN +## ... +## Thyroid Unknown NaN +## Uterus Adult 62.060606 +## Fetus NaN +## Unknown NaN +## Vulva/Vagina Adult 75.400000 +## Name: Age, Length: 72, dtype: float64 +``` + +You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the [`.agg(x)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) method on a Grouped Dataframe. + +For example, coming back to our age case-control Dataframe, + + +``` python +df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], + 'age_case': [25, 43, 21, 65, 7], + 'age_control': [49, 20, 32, 25, 32]}) + +df +``` + +``` +## status age_case age_control +## 0 treated 25 49 +## 1 untreated 43 20 +## 2 untreated 21 32 +## 3 discharged 65 25 +## 4 treated 7 32 +``` + +We group by `status` and summarize `age_case` and `age_control` with a few summary statistics each: + + +``` python +df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]}) +``` + +``` +## age_case age_control +## mean min max mean +## status +## discharged 65.0 25 25 25.0 +## treated 16.0 32 49 40.5 +## untreated 32.0 20 32 26.0 +``` + +The input argument to the `.agg(x)` method is called a [Dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries), which lets you structure information in a paired relationship; you can learn more about dictionaries via the link above. + +## Exercises + +Exercise for week 4 can be found [here](https://colab.research.google.com/drive/1ntkUdKQ209vu1M89rcsBst-pKKuwzdwX?usp=sharing).
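Returning to the Dictionary passed to `.agg(x)` above: a dictionary pairs keys with values, and here the keys are column names while the values name the summary statistic(s) to compute. A minimal sketch on the same case-control Dataframe, including how to look up one cell of the result:


``` python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

# Build the dictionary as a named object first: keys are column names,
# values are the summary statistic(s) to apply to that column
spec = {"age_case": "mean", "age_control": ["min", "max", "mean"]}
summary = df.groupby("status").agg(spec)

# The result's columns are a MultiIndex of (column, statistic) pairs,
# so an individual cell is looked up with a tuple
treated_mean_case = summary.loc["treated", ("age_case", "mean")]
# (25 + 7) / 2 = 16.0
```

Naming the dictionary `spec` is just for illustration; it behaves identically to writing the dictionary inline as in the example above.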
diff --git a/docs/no_toc/05-data-visualization.md b/docs/no_toc/05-data-visualization.md new file mode 100644 index 0000000..71d4a68 --- /dev/null +++ b/docs/no_toc/05-data-visualization.md @@ -0,0 +1,226 @@ + + +# Data Visualization + +In our second-to-last week together, we learn how to visualize our data. + +There are several different data visualization modules in Python: + +- [matplotlib](https://matplotlib.org/) is a general-purpose plotting module that is commonly used. + +- [seaborn](https://seaborn.pydata.org/) is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course. + +- [plotnine](https://plotnine.org/) is a plotting module based on the grammar of graphics approach to making plots. This is very similar to the R package "ggplot2". + +To get started, we will consider the simplest and most common plots: + +Distributions (one variable) + +- Histograms + +Relational (between 2 continuous variables) + +- Scatterplots + +- Line plots + +Categorical (between 1 categorical and 1 continuous variable) + +- Bar plots + +- Violin plots + +[![Image source: Seaborn's overview of plotting functions](https://seaborn.pydata.org/_images/function_overview_8_0.png)](https://seaborn.pydata.org/tutorial/function_overview.html) + +Why do we focus on these common plots? Our eyes are better at distinguishing certain visual features than others. All of these plots use position to depict data, which is the most effective visual channel. + +[![Image Source: Visualization Analysis and Design by Tamara Munzner](https://www.oreilly.com/api/v2/epubs/9781466508910/files/image/fig5-1.png)](https://www.oreilly.com/library/view/visualization-analysis-and/9781466508910/K14708_C005.xhtml) + +Let's load in our genomics datasets and start making some plots from them.
+ + +``` python +import pandas as pd +import seaborn as sns +import matplotlib.pyplot as plt + + +metadata = pd.read_csv("classroom_data/metadata.csv") +mutation = pd.read_csv("classroom_data/mutation.csv") +expression = pd.read_csv("classroom_data/expression.csv") +``` + +## Distributions (one variable) + +To create a histogram, we use the function [`sns.displot()`](https://seaborn.pydata.org/generated/seaborn.displot.html) and we specify the input argument `data` as our dataframe, and the input argument `x` as the column name in a String. + + +``` python +plot = sns.displot(data=metadata, x="Age") +``` + + + +(For rendering this webpage, we assign the plot to a variable `plot`. In practice, you don't need to do that; you can just write `sns.displot(data=metadata, x="Age")`.) + +A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the `binwidth` argument, or the number of bins via the `bins` argument. + + +``` python +plot = sns.displot(data=metadata, x="Age", binwidth = 10) +``` + + + +Our histogram also works for categorical variables, such as "Sex". + + +``` python +plot = sns.displot(data=metadata, x="Sex") +``` + + + +**Conditioning on other variables** + +Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age when compared with sex? There are several ways of doing it. First, you could color the bars by group, using the `hue` input argument: + + +``` python +plot = sns.displot(data=metadata, x="Age", hue="Sex") +``` + + + +It is rather hard to tell the groups apart from the coloring.
So, we separate each bar category via the `multiple="dodge"` input argument: + + +``` python +plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge") +``` + + + +Lastly, as an alternative to using colors to display the conditional variable, we could make a subplot for each conditional variable's value via `col="Sex"` or `row="Sex"`: + + +``` python +plot = sns.displot(data=metadata, x="Age", col="Sex") +``` + + + +You can find a lot more details about distributions and histograms in [the Seaborn tutorial](https://seaborn.pydata.org/tutorial/distributions.html). + +## Relational (between 2 continuous variables) + +To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function [`sns.relplot()`](https://seaborn.pydata.org/generated/seaborn.relplot.html) and we specify the input argument `data` as our dataframe, and the input arguments `x` and `y` as the column names in a String: + + +``` python +plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") +``` + + + +To condition on other variables, plotting features are used to distinguish conditional variable values: + +- `hue` (similar to the histogram) + +- `style` + +- `size` + +Let's merge `expression` and `metadata` together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status. Here is the scatterplot with different colors: + + +``` python +expression_metadata = expression.merge(metadata) + +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis") +``` + + + +Here is the scatterplot with different shapes: + + +``` python +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis") +``` + + + +You can also try plotting with `size="PrimaryOrMetastasis"` if you like.
None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram: + + +``` python +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis") +``` + + + +You can also condition on multiple variables by assigning a different variable to the conditioning options: + + +``` python +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory") +``` + + + +You can find a lot more details about relational plots such as scatterplots and lineplots [in the Seaborn tutorial](https://seaborn.pydata.org/tutorial/relational.html). + +## Categorical (between 1 categorical and 1 continuous variable) + +A very similar pattern follows for categorical plots. We start with [sns.catplot()](https://seaborn.pydata.org/generated/seaborn.catplot.html) as our main plotting function, with the basic input arguments: + +- `data` + +- `x` + +- `y` + +You can change the plot styles via the input arguments: + +- `kind`: "strip", "box", "swarm", etc. + +You can add additional conditional variables via the input arguments: + +- `hue` + +- `col` + +- `row` + +See categorical plots [in the Seaborn tutorial](https://seaborn.pydata.org/tutorial/categorical.html). + +## Basic plot customization + +You can easily change the axis labels and title if you modify the plot object, using the method `.set()`: + + +``` python +exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") +exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship") +``` + + + +You can change the color palette by adding the `palette` input argument to any of the plots.
You can explore available color palettes [here](https://www.practicalpythonfordatascience.com/ap_seaborn_palette): + + +``` python +plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow') +) +``` + +``` +## :1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended. +``` + + + +## Exercises + +Exercise for week 5 can be found [here](https://colab.research.google.com/drive/1kT3zzq2rrhL1vHl01IdW5L1V7v0iK0wY?usp=sharing). diff --git a/docs/no_toc/404.html b/docs/no_toc/404.html index 2c73718..ac6d572 100644 --- a/docs/no_toc/404.html +++ b/docs/no_toc/404.html @@ -152,7 +152,8 @@
  • 1.4 Google Colab Setup
  • 1.5 Grammar Structure 1: Evaluation of Expressions
  • 1.6 Grammar Structure 2: Storing data types in the Variable Environment
  • 1.8 Tips on writing your first code
  • +
  • 1.9 Exercises
  • + +
  • 2 Working with data structures +
  • +
  • 3 Data Wrangling, Part 1 +
  • +
  • 4 Data Wrangling, Part 2 +
  • +
  • 5 Data Visualization +
  • About the Authors
  • -
  • 2 References
  • +
  • 6 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/no_toc/About.md b/docs/no_toc/About.md index c307ec3..f902595 100644 --- a/docs/no_toc/About.md +++ b/docs/no_toc/About.md @@ -51,7 +51,7 @@ These credits are based on our [course contributors table guidelines](https://ww ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2024-08-07 +## date 2024-09-26 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── diff --git a/docs/no_toc/about-the-authors.html b/docs/no_toc/about-the-authors.html index 434bafb..1cb928a 100644 --- a/docs/no_toc/about-the-authors.html +++ b/docs/no_toc/about-the-authors.html @@ -28,7 +28,7 @@ - + @@ -152,7 +152,8 @@
  • 1.4 Google Colab Setup
  • 1.5 Grammar Structure 1: Evaluation of Expressions
  • 1.6 Grammar Structure 2: Storing data types in the Variable Environment
  • 1.8 Tips on writing your first code
  • +
  • 1.9 Exercises
  • + +
  • 2 Working with data structures +
  • +
  • 3 Data Wrangling, Part 1 +
  • +
  • 4 Data Wrangling, Part 2 +
  • +
  • 5 Data Visualization +
  • About the Authors
  • -
  • 2 References
  • +
  • 6 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -342,7 +386,7 @@

    About the Authors + diff --git a/docs/no_toc/data-visualization.html b/docs/no_toc/data-visualization.html new file mode 100644 index 0000000..671f810 --- /dev/null +++ b/docs/no_toc/data-visualization.html @@ -0,0 +1,443 @@ + + + + + + + Chapter 5 Data Visualization | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 5 Data Visualization

    +

In our second-to-last week together, we learn how to visualize our data.

    +

    There are several different data visualization modules in Python:

    +
      +
    • matplotlib is a general purpose plotting module that is commonly used.

    • +
    • seaborn is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course.

    • +
    • plotnine is a plotting module based on the grammar of graphics approach to making plots. This is very similar to the R package “ggplot2”.

    • +
    +

    To get started, we will consider the simplest and most common plots:

    +

    Distributions (one variable)

    +
      +
    • Histograms
    • +
    +

    Relational (between 2 continuous variables)

    +
      +
    • Scatterplots

    • +
    • Line plots

    • +
    +

    Categorical (between 1 categorical and 1 continuous variable)

    +
      +
    • Bar plots

    • +
    • Violin plots

    • +
    +

    Image source: Seaborn’s overview of plotting functions

    +

    Why do we focus on these common plots? Our eyes are better at distinguishing certain visual features than others. All of these plots use position to depict data, which is the most effective visual channel.

    +

    Image Source: Visualization Analysis and Design by Tamara Munzner

    +

    Let’s load in our genomics datasets and start making some plots from them.

    +
    import pandas as pd
    +import seaborn as sns
    +import matplotlib.pyplot as plt
    +
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +mutation = pd.read_csv("classroom_data/mutation.csv")
    +expression = pd.read_csv("classroom_data/expression.csv")
    +
    +

    5.1 Distributions (one variable)

    +

    To create a histogram, we use the function sns.displot() and we specify the input argument data as our dataframe, and the input argument x as the column name in a String.

    +
    plot = sns.displot(data=metadata, x="Age")
    +

    +

    (For rendering this webpage, we assign the plot to a variable plot. In practice, you don’t need to do that; you can just write sns.displot(data=metadata, x="Age").)

    +

    A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the binwidth argument, or the number of bins via the bins argument.

    +
    plot = sns.displot(data=metadata, x="Age", binwidth = 10)
    +

    +

    Our histogram also works for categorical variables, such as “Sex”.

    +
    plot = sns.displot(data=metadata, x="Sex")
    +

    +

    Conditioning on other variables

    +

    Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age when compared with sex? There are several ways of doing it. First, you could color the bars by group, using the hue input argument:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex")
    +

    +

    It is rather hard to tell the groups apart from the coloring. So, we separate each bar category via the multiple="dodge" input argument:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge")
    +

    +

    Lastly, as an alternative to using colors to display the conditional variable, we could make a subplot for each conditional variable’s value via col="Sex" or row="Sex":

    +
    plot = sns.displot(data=metadata, x="Age", col="Sex")
    +

    +

    You can find a lot more details about distributions and histograms in the Seaborn tutorial.

    +
    +
    +

    5.2 Relational (between 2 continuous variables)

    +

    To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function sns.relplot() and we specify the input argument data as our dataframe, and the input arguments x and y as the column names in a String:

    +
    plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
    +

    +

    To condition on other variables, plotting features are used to distinguish conditional variable values:

    +
      +
    • hue (similar to the histogram)

    • +
    • style

    • +
    • size

    • +
    +

    Let’s merge expression and metadata together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status. Here is the scatterplot with different colors:

    +
    expression_metadata = expression.merge(metadata)
    +
    +plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis")
    +

    +

    Here is the scatterplot with different shapes:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis")
    +

    +

    You can also try plotting with size="PrimaryOrMetastasis" if you like. None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis")
    +

    +

    You can also condition on multiple variables by assigning a different variable to the conditioning options:

    +
    plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory")
    +

    +

    You can find a lot more details about relational plots such as scatterplots and lineplots in the Seaborn tutorial.

    +
    +
    +

    5.3 Categorical (between 1 categorical and 1 continuous variable)

    +

    A very similar pattern follows for categorical plots. We start with sns.catplot() as our main plotting function, with the basic input arguments:

    +
      +
    • data

    • +
    • x

    • +
    • y

    • +
    +

    You can change the plot styles via the input arguments:

    +
      +
    • kind: “strip”, “box”, “swarm”, etc.
    • +
    +

    You can add additional conditional variables via the input arguments:

    +
      +
    • hue

    • +
    • col

    • +
    • row

    • +
    +

    See categorical plots in the Seaborn tutorial.

    +
    +
    +

    5.4 Basic plot customization

    +

    You can easily change the axis labels and title if you modify the plot object, using the method .set():

    +
    exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp")
    +exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship")
    +

    +

    You can change the color palette by adding the palette input argument to any of the plots. You can explore available color palettes here:

    +
    plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow')
    +)
    +
    ## <string>:1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended.
    +

    +
    +
    +

    5.5 Exercises

    +

    Exercise for week 5 can be found here.

    + +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/no_toc/data-wrangling-part-1.html b/docs/no_toc/data-wrangling-part-1.html new file mode 100644 index 0000000..0ece4cc --- /dev/null +++ b/docs/no_toc/data-wrangling-part-1.html @@ -0,0 +1,655 @@ + + + + + + + Chapter 3 Data Wrangling, Part 1 | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 3 Data Wrangling, Part 1

    +

    From our first two lessons, we are now equipped with enough fundamental programming skills to apply them to various steps in the data science workflow, which is a natural cycle that occurs in data analysis.

    +
    +Data science workflow. Image source: R for Data Science. +
    Data science workflow. Image source: R for Data Science.
    +
    +

    For the rest of the course, we focus on Transform and Visualize with the assumption that our data is in a nice, “Tidy format”. First, we need to understand what it means for data to be “Tidy”.

    +
    +

    3.1 Tidy Data

    +

    Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of tidy data, developed by Hadley Wickham:

    +
      +
    1. Each variable must have its own column.

    2. +
    3. Each observation must have its own row.

    4. +
    5. Each value must have its own cell.

    6. +
    +

    If you want to be technical about what variables and observations are, Hadley Wickham describes:

    +
    +

    A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

    +
    +
    +A tidy dataframe. Image source: R for Data Science. +
    A tidy dataframe. Image source: R for Data Science.
    +
    +
    +
    +

    3.2 Our working Tidy Data: DepMap Project

    +

    The Dependency Map project is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session.

    +
      +
    • Metadata

    • +
    • Somatic mutations

    • +
    • Gene expression

    • +
    • Drug sensitivity

    • +
    • CRISPR knockout

    • +
    • and more…

    • +
    +

    Let’s load these datasets in, and see how these datasets fit the definition of Tidy data:

    +
    import pandas as pd
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +mutation = pd.read_csv("classroom_data/mutation.csv")
    +expression = pd.read_csv("classroom_data/expression.csv")
    +
    metadata.head()
    +
    ##       ModelID  PatientID  ...     OncotreePrimaryDisease       OncotreeLineage
    +## 0  ACH-000001  PT-gj46wT  ...   Ovarian Epithelial Tumor  Ovary/Fallopian Tube
    +## 1  ACH-000002  PT-5qa3uk  ...     Acute Myeloid Leukemia               Myeloid
    +## 2  ACH-000003  PT-puKIyc  ...  Colorectal Adenocarcinoma                 Bowel
    +## 3  ACH-000004  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
    +## 4  ACH-000005  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
    +## 
    +## [5 rows x 30 columns]
    +
    mutation.head()
    +
    ##       ModelID  CACNA1D_Mut  CYP2D6_Mut  ...  CCDC28A_Mut  C1orf194_Mut  U2AF1_Mut
    +## 0  ACH-000001        False       False  ...        False         False      False
    +## 1  ACH-000002        False       False  ...        False         False      False
    +## 2  ACH-000004        False       False  ...        False         False      False
    +## 3  ACH-000005        False       False  ...        False         False      False
    +## 4  ACH-000006        False       False  ...        False         False      False
    +## 
    +## [5 rows x 540 columns]
    +
    expression.head()
    +
    ##       ModelID  ENPP4_Exp  CREBBP_Exp  ...  OR5D13_Exp  C2orf81_Exp  OR8S1_Exp
    +## 0  ACH-001113   2.280956    4.094236  ...         0.0     1.726831        0.0
    +## 1  ACH-001289   3.622930    3.606442  ...         0.0     0.790772        0.0
    +## 2  ACH-001339   0.790772    2.970854  ...         0.0     0.575312        0.0
    +## 3  ACH-001538   3.485427    2.801159  ...         0.0     1.077243        0.0
    +## 4  ACH-000242   0.879706    3.327687  ...         0.0     0.722466        0.0
    +## 
    +## [5 rows x 536 columns]
    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    DataframeThe observation isSome variables areSome values are
    metadataCell lineModelID, Age, OncotreeLineage“ACH-000001”, 60, “Myeloid”
    expressionCell lineKRAS_Exp2.4, .3
    mutationCell lineKRAS_MutTRUE, FALSE
    +
    +
    +

    3.3 Transform: “What do you want to do with this Dataframe”?

    +

    Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something.

    +

    Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s pivot to write our code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform our data that satisfies our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows.

    +

    Here’s a starting prompt:

    +
    +

    In the metadata dataframe, which rows would you subset for and columns would you subset for that relate to a scientific question?

    +
    +

    We have been using explicit subsetting with numerical indices, such as “I want to filter for rows 20-50 and select columns 2 and 8”. We are now going to switch to implicit subsetting, in which we describe the subsetting criteria via comparison operators and column names, such as:

    +

    “I want to subset for rows such that the OncotreeLineage is breast cancer and subset for columns Age and Sex.”

    +

    Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names.

    +
    +

    3.3.0.1 Let’s convert our implicit subsetting criteria into code!

    +

    To subset for rows implicitly, we will use the conditional operators on Dataframe columns you used in Exercise 2. To formulate a conditional operator expression that OncotreeLineage is breast cancer:

    +
    metadata['OncotreeLineage'] == "Lung"
    +
    ## 0       False
    +## 1       False
    +## 2       False
    +## 3       False
    +## 4       False
    +##         ...  
    +## 1859    False
    +## 1860    False
    +## 1861    False
    +## 1862    False
    +## 1863     True
    +## Name: OncotreeLineage, Length: 1864, dtype: bool
    +

    Then, we will use the .loc operation (which is different than .iloc operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time:

    +
    metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]]
    +
    ##        Age     Sex
    +## 10    39.0  Female
    +## 13    44.0    Male
    +## 19    55.0  Female
    +## 27    39.0  Female
    +## 28    45.0    Male
    +## ...    ...     ...
    +## 1745  52.0    Male
    +## 1819  84.0    Male
    +## 1820  57.0  Female
    +## 1822  53.0    Male
    +## 1863  62.0    Male
    +## 
    +## [241 rows x 2 columns]
    +

    What’s going on here? The first component of the subset, metadata['OncotreeLineage'] == "Lung", subsets for the rows. It gives us a column of True and False values, and we keep rows that correspond to True values. Then, we specify the column names we want to subset for via a list.

    +

    Here’s another example:

    +
    df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
    +                            'age_case': [25, 43, 21, 65, 7],
    +                            'age_control': [49, 20, 32, 25, 32]})
    +                            
    +df
    +
    ##        status  age_case  age_control
    +## 0     treated        25           49
    +## 1   untreated        43           20
    +## 2   untreated        21           32
    +## 3  discharged        65           25
    +## 4     treated         7           32
    +

    “I want to subset for rows such that the status is ‘treated’ and subset for columns status and age_case.”

    +
    df.loc[df.status == "treated", ["status", "age_case"]]
    +
    ##     status  age_case
    +## 0  treated        25
    +## 4  treated         7
    +

    +
    +
    +
    +

    3.4 Summary Statistics

    +

    Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarize all the values of a variable in a numeric summary, such as mean, median, or mode.

    +

    If we look at the data structure of a Dataframe’s column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let’s take a look at a few popular examples:

    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Function methodWhat it takes inWhat it doesReturns
    metadata.Age.mean()metadata.Age as a numeric SeriesComputes the mean value of the Age column.Float (NumPy)
    metadata['Age'].median()metadata['Age'] as a numeric SeriesComputes the median value of the Age column.Float (NumPy)
    metadata.Age.max()metadata.Age as a numeric SeriesComputes the max value of the Age column.Float (NumPy)
    metadata.OncotreeSubtype.value_counts()metadata.OncotreeSubtype as a string SeriesCreates a frequency table of all unique elements in OncotreeSubtype column.Series
    +

    Let’s try it out, with some nice print formatting:

```python
print("Mean value of Age column:", metadata['Age'].mean())
```

```
## Mean value of Age column: 47.45187165775401
```

```python
print("Frequency of column", metadata.OncotreeLineage.value_counts())
```

```
## Frequency of column OncotreeLineage
## Lung                         241
## Lymphoid                     209
## CNS/Brain                    123
## Skin                         118
## Esophagus/Stomach             95
## Breast                        92
## Bowel                         87
## Head and Neck                 81
## Myeloid                       77
## Bone                          75
## Ovary/Fallopian Tube          74
## Pancreas                      65
## Kidney                        64
## Peripheral Nervous System     55
## Soft Tissue                   54
## Uterus                        41
## Fibroblast                    41
## Biliary Tract                 40
## Bladder/Urinary Tract         39
## Normal                        39
## Pleura                        35
## Liver                         28
## Cervix                        25
## Eye                           19
## Thyroid                       18
## Prostate                      14
## Vulva/Vagina                   5
## Ampulla of Vater               4
## Testis                         4
## Adrenal Gland                  1
## Other                          1
## Name: count, dtype: int64
```

Notice that the output of some of these methods is Float (NumPy). This refers to a data type from NumPy, a Python module that is extremely popular for scientific computing, but we're not focused on NumPy in this course.
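To see these Series methods in isolation, here is a minimal, self-contained sketch using small made-up Series (the values are illustrative, not from the real `metadata`):

```python
import pandas as pd

# A small, made-up stand-in for a numeric column like metadata.Age
ages = pd.Series([25, 43, 21, 65, 7])

mean_age = ages.mean()      # (25 + 43 + 21 + 65 + 7) / 5 = 32.2
median_age = ages.median()  # middle value of the sorted ages: 25
max_age = ages.max()        # largest value: 65

# A made-up stand-in for a string column like metadata.OncotreeLineage
lineages = pd.Series(["Lung", "Skin", "Lung", "CNS/Brain"])
lineage_counts = lineages.value_counts()  # frequency table: Lung appears twice
```

Because these summary methods live on the Series itself, you can call them directly on any column you pull out of a Dataframe.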


    3.5 Simple data visualization


We will dedicate extensive time later in this course to data visualization, but the Dataframe's column, a Series, has a method called `.plot()` that can help us make simple plots for one variable. By default, `.plot()` makes a line plot, which is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We will use it for making a histogram or a bar plot.

| Plot style | Useful for | `kind =` | Code |
|------------|-----------|----------|------|
| Histogram | Numerics | `"hist"` | `metadata.Age.plot(kind = "hist")` |
| Bar plot | Strings | `"bar"` | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |

    Let’s look at a histogram:

```python
import matplotlib.pyplot as plt

plt.figure()
metadata.Age.plot(kind = "hist")
plt.show()
```


    Let’s look at a bar plot:

```python
plt.figure()
metadata.OncotreeLineage.value_counts().plot(kind = "bar")
plt.show()
```


(The `plt.figure()` and `plt.show()` functions are used to render the plots on the website, but you don't need to use them for your exercises. We will discuss this in more detail during our week of data visualization.)


    3.5.0.1 Chained function calls


Let's look at our bar plot syntax more carefully. We start with the column `metadata.OncotreeLineage`, then use the method `.value_counts()` to get a Series containing a frequency table. Then, we take that frequency table Series and use the `.plot()` method.


    It is quite common in Python to have multiple “chained” function calls, in which the output of .value_counts() is used for the input of .plot() all in one line of code. It takes a bit of time to get used to this!


    Here’s another example of a chained function call, which looks quite complex, but let’s break it down:

```python
plt.figure()

metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar")

plt.show()
```

1. We first take the entire `metadata` and do some subsetting, which outputs a Dataframe.
2. We access the `OncotreeLineage` column, which outputs a Series.
3. We use the method `.value_counts()`, which outputs a Series.
4. We make a plot out of it!

    We could have, alternatively, done this in several lines of code:

```python
plt.figure()

metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ]
metadata_subset_lineage = metadata_subset.OncotreeLineage
lineage_freq = metadata_subset_lineage.value_counts()
lineage_freq.plot(kind = "bar")

plt.show()
```


    These are two different styles of code, but they do the exact same thing. It’s up to you to decide what is easier for you to understand.
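To see concretely that the two styles agree, here is a small sketch on a made-up Dataframe (illustrative values, not the real `metadata`):

```python
import pandas as pd

df = pd.DataFrame({"AgeCategory": ["Adult", "Adult", "Pediatric", "Adult"],
                   "OncotreeLineage": ["Lung", "Skin", "Lung", "Lung"]})

# Chained style: each method's output feeds directly into the next call
chained = df.loc[df.AgeCategory == "Adult", ].OncotreeLineage.value_counts()

# Step-by-step style: same operations, one intermediate variable at a time
df_subset = df.loc[df.AgeCategory == "Adult", ]
subset_lineage = df_subset.OncotreeLineage
stepwise = subset_lineage.value_counts()

print(chained.equals(stepwise))  # True: both give Lung = 2, Skin = 1
```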


    3.6 Exercises


    Exercise for week 3 can be found here.


    Chapter 4 Data Wrangling, Part 2


We will continue to learn about data analysis with Dataframes. Let's load in our three Dataframes from the Depmap project again:

```python
import pandas as pd
import numpy as np

metadata = pd.read_csv("classroom_data/metadata.csv")
mutation = pd.read_csv("classroom_data/mutation.csv")
expression = pd.read_csv("classroom_data/expression.csv")
```

    4.1 Creating new columns


    Often, we want to perform some kind of transformation on our data’s columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column in a different scale.


    To create a new column, you simply modify it as if it exists using the bracket operation [ ], and the column will be created:

```python
metadata['AgePlusTen'] = metadata['Age'] + 10
expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp']
expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp'])
```

where `np.log(x)` is a function imported from the module NumPy that takes in a numeric value and returns its log-transformed value.


Note: you cannot create a new column via the Dataframe attribute syntax, such as: `expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)`. That assignment attaches a plain attribute to the object instead of creating a column.
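Here is a sketch of the difference on a made-up Dataframe (column names and values are hypothetical): bracket assignment creates the column, while attribute assignment quietly does not (pandas typically issues a warning instead).

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"Age": [20, 30], "Exp": [1.0, np.e]})

# Bracket assignment creates new columns
toy["AgePlusTen"] = toy["Age"] + 10   # [30, 40]
toy["log_Exp"] = np.log(toy["Exp"])   # log(1) = 0, log(e) = 1

# Attribute assignment does NOT create a column; it only attaches
# a plain Python attribute to the Dataframe object
toy.log_Exp2 = np.log(toy["Exp"])

print("log_Exp" in toy.columns)   # True
print("log_Exp2" in toy.columns)  # False
```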


    4.2 Merging two Dataframes together


    Suppose we have the following Dataframes:


    expression

| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
|---------|------------|----------------|
| "ACH-001113" | 5.138733 | 1.636806 |
| "ACH-001289" | 3.184280 | 1.158226 |
| "ACH-001339" | 3.165108 | 1.152187 |

    metadata

| ModelID | OncotreeLineage | Age |
|---------|-----------------|-----|
| "ACH-001113" | "Lung" | 69 |
| "ACH-001289" | "CNS/Brain" | NaN |
| "ACH-001339" | "Skin" | 14 |

    Suppose that I want to compare the relationship between OncotreeLineage and PIK3CA_Exp, but they are columns in different Dataframes. We want a new Dataframe that looks like this:

| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
|---------|------------|----------------|-----------------|-----|
| "ACH-001113" | 5.138733 | 1.636806 | "Lung" | 69 |
| "ACH-001289" | 3.184280 | 1.158226 | "CNS/Brain" | NaN |
| "ACH-001339" | 3.165108 | 1.152187 | "Skin" | 14 |

    We see that in both dataframes,

- the rows (observations) represent cell lines.
- there is a common column `ModelID`, with shared values between the two dataframes that can facilitate the merging process. We call this an index.

We will use the method `.merge()` for Dataframes. It takes the Dataframe to merge with as its required input argument. The method looks for a common index column between the two dataframes and merges based on that index.

```python
merged = metadata.merge(expression)
```

It's usually better to specify what that index column is, to avoid ambiguity, using the `on` optional argument:

```python
merged = metadata.merge(expression, on='ModelID')
```

If the index columns for the two Dataframes are named differently, you can specify the column name for each Dataframe:

```python
merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID')
```

One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging, to see whether the result makes sense:


    The number of rows and columns of metadata:

```python
metadata.shape
```

```
## (1864, 31)
```

    The number of rows and columns of expression:

```python
expression.shape
```

```
## (1450, 538)
```

    The number of rows and columns of merged:

```python
merged.shape
```

```
## (1450, 568)
```

We see that the number of columns in `merged` combines the number of columns in `metadata` and `expression`, while the number of rows in `merged` is the smaller of the number of rows in `metadata` and `expression`: it only keeps rows that are found in both Dataframes' index columns. This kind of join is called an "inner join": in the Venn diagram of elements common to both index columns, we keep the inner overlap.


You can specify the join style by changing the optional input argument `how`.

- `how = "outer"` keeps all observations - also known as a "full join".
- `how = "left"` keeps all observations in the left Dataframe.
- `how = "right"` keeps all observations in the right Dataframe.
- `how = "inner"` keeps observations common to both Dataframes. This is the default value of `how`.
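Here is a sketch of the four join styles on two tiny made-up Dataframes (the `ModelID` values "A" through "D" are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"ModelID": ["A", "B", "C"], "Age": [69, 23, 14]})
right = pd.DataFrame({"ModelID": ["B", "C", "D"], "KRAS_Exp": [5.1, 3.2, 3.1]})

inner = left.merge(right, on="ModelID", how="inner")       # keeps B, C
outer = left.merge(right, on="ModelID", how="outer")       # keeps A, B, C, D
left_join = left.merge(right, on="ModelID", how="left")    # keeps A, B, C
right_join = left.merge(right, on="ModelID", how="right")  # keeps B, C, D

print(inner.shape, outer.shape, left_join.shape, right_join.shape)
```

Checking `.shape` after each join, as above, is a quick way to confirm which rows survived the merge.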

    4.3 Grouping and summarizing Dataframes


    In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in metadata, the observation is cell lines, and perhaps we want to group cell lines into their respective cancer type, OncotreeLineage, and look at the mean age for each cancer type.


    We want to take metadata:

| ModelID | OncotreeLineage | Age |
|---------|-----------------|-----|
| "ACH-001113" | "Lung" | 69 |
| "ACH-001289" | "Lung" | 23 |
| "ACH-001339" | "Skin" | 14 |
| "ACH-002342" | "Brain" | 23 |
| "ACH-004854" | "Brain" | 56 |
| "ACH-002921" | "Brain" | 67 |

    into:

| OncotreeLineage | MeanAge |
|-----------------|---------|
| "Lung" | 46 |
| "Skin" | 14 |
| "Brain" | 48.67 |

    To get there, we need to:

- Group the data based on some criteria - here, the elements of `OncotreeLineage`.
- Summarize each group via a summary statistic performed on a column, such as `Age`.

We first subset the two columns we need, and then use the methods `.groupby(x)` and `.mean()`.

```python
metadata_grouped = metadata.groupby("OncotreeLineage")
metadata_grouped['Age'].mean()
```

```
## OncotreeLineage
## Adrenal Gland                55.000000
## Ampulla of Vater             65.500000
## Biliary Tract                58.450000
## Bladder/Urinary Tract        65.166667
## Bone                         20.854545
## Bowel                        58.611111
## Breast                       50.961039
## CNS/Brain                    43.849057
## Cervix                       47.136364
## Esophagus/Stomach            57.855556
## Eye                          51.100000
## Fibroblast                   38.194444
## Head and Neck                60.149254
## Kidney                       46.193548
## Liver                        43.928571
## Lung                         55.444444
## Lymphoid                     38.916667
## Myeloid                      38.810811
## Normal                       52.370370
## Other                        46.000000
## Ovary/Fallopian Tube         51.980769
## Pancreas                     60.226415
## Peripheral Nervous System     5.480000
## Pleura                       61.000000
## Prostate                     61.666667
## Skin                         49.033708
## Soft Tissue                  27.500000
## Testis                       25.000000
## Thyroid                      63.235294
## Uterus                       62.060606
## Vulva/Vagina                 75.400000
## Name: Age, dtype: float64
```

    Here’s what’s going on:

- We use the Dataframe method `.groupby(x)` and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the `metadata` Dataframe, but it makes a note that it has been grouped.
- We subset to the column `Age`. The grouping information still persists (this is a Grouped Series object).
- We use the method `.mean()` to calculate the mean value of `Age` within each group defined by `OncotreeLineage`.

    Alternatively, this could have been done in a chain of methods:

```python
metadata.groupby("OncotreeLineage")["Age"].mean()
```

```
## OncotreeLineage
## Adrenal Gland                55.000000
## Ampulla of Vater             65.500000
## Biliary Tract                58.450000
## Bladder/Urinary Tract        65.166667
## Bone                         20.854545
## Bowel                        58.611111
## Breast                       50.961039
## CNS/Brain                    43.849057
## Cervix                       47.136364
## Esophagus/Stomach            57.855556
## Eye                          51.100000
## Fibroblast                   38.194444
## Head and Neck                60.149254
## Kidney                       46.193548
## Liver                        43.928571
## Lung                         55.444444
## Lymphoid                     38.916667
## Myeloid                      38.810811
## Normal                       52.370370
## Other                        46.000000
## Ovary/Fallopian Tube         51.980769
## Pancreas                     60.226415
## Peripheral Nervous System     5.480000
## Pleura                       61.000000
## Prostate                     61.666667
## Skin                         49.033708
## Soft Tissue                  27.500000
## Testis                       25.000000
## Thyroid                      63.235294
## Uterus                       62.060606
## Vulva/Vagina                 75.400000
## Name: Age, dtype: float64
```

Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned last week, such as `.mean()`, `.median()`, and `.max()`, can be used. One new summary statistics method that is useful for this grouping-and-summarizing analysis is `.count()`, which tells you how many entries are counted within each group.
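Here is a compact sketch of grouping, `.mean()`, and `.count()` on a made-up Dataframe (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"OncotreeLineage": ["Lung", "Lung", "Skin", "Brain", "Brain", "Brain"],
                    "Age": [69, 23, 14, 23, 56, 67]})

grouped = toy.groupby("OncotreeLineage")
mean_age = grouped["Age"].mean()       # mean Age within each lineage
n_per_group = grouped["Age"].count()   # number of entries within each group

print(mean_age["Lung"])      # (69 + 23) / 2 = 46.0
print(n_per_group["Brain"])  # 3
```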


    4.3.1 Optional: Multiple grouping, Multiple columns, Multiple summary statistics


    Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously.


    For example, you may want to group by a combination of OncotreeLineage and AgeCategory, such as “Lung” and “Adult” as one grouping. You can do so like this:

```python
metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"])
metadata_grouped['Age'].mean()
```

```
## OncotreeLineage   AgeCategory
## Adrenal Gland     Adult          55.000000
## Ampulla of Vater  Adult          65.500000
##                   Unknown              NaN
## Biliary Tract     Adult          58.450000
##                   Unknown              NaN
##                                    ...    
## Thyroid           Unknown              NaN
## Uterus            Adult          62.060606
##                   Fetus                NaN
##                   Unknown              NaN
## Vulva/Vagina      Adult          75.400000
## Name: Age, Length: 72, dtype: float64
```
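As a smaller, self-contained version of this multi-column grouping, here is a sketch on a made-up Dataframe (hypothetical values):

```python
import pandas as pd

toy = pd.DataFrame({"OncotreeLineage": ["Lung", "Lung", "Lung", "Skin"],
                    "AgeCategory": ["Adult", "Adult", "Pediatric", "Adult"],
                    "Age": [60, 50, 10, 40]})

# Each group is now a (lineage, age category) combination
mean_age = toy.groupby(["OncotreeLineage", "AgeCategory"])["Age"].mean()

print(mean_age[("Lung", "Adult")])      # (60 + 50) / 2 = 55.0
print(mean_age[("Lung", "Pediatric")])  # 10.0
```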

    You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the .agg(x) method on a Grouped Dataframe.


    For example, coming back to our age case-control Dataframe,

```python
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

df
```

```
##        status  age_case  age_control
## 0     treated        25           49
## 1   untreated        43           20
## 2   untreated        21           32
## 3  discharged        65           25
## 4     treated         7           32
```

    We group by status and summarize age_case and age_control with a few summary statistics each:

```python
df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]})
```

```
##            age_case age_control          
##                mean         min max  mean
## status                                   
## discharged     65.0          25  25  25.0
## treated        16.0          32  49  40.5
## untreated      32.0          20  32  26.0
```

The input argument to the `.agg(x)` method is called a Dictionary, which lets you structure information in a paired relationship. You can learn more about dictionaries here.
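Since Dictionaries are new here, a brief sketch of the `{key: value}` structure (the same shape used in the `.agg()` call above):

```python
# A Dictionary pairs each key with a value using {key: value} syntax
summary_plan = {"age_case": "mean", "age_control": ["min", "max", "mean"]}

# Look up a value by its key
print(summary_plan["age_case"])     # mean

# Keys can be listed out; values can be any data type, including Lists
print(list(summary_plan.keys()))    # ['age_case', 'age_control']
print(summary_plan["age_control"])  # ['min', 'max', 'mean']
```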


    4.4 Exercises


    Exercise for week 4 can be found here.

  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -236,16 +280,17 @@

    1.3 A programming language has fo

    1.4 Google Colab Setup

    Google Colab is a Integrated Development Environment (IDE) on a web browser. Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles to using Python that is easier for the user.

    -

    Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, open up…

    +

    Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, you can view it here.

    +

    Today, we will pay close attention to:

      -
    • Python Console (Execution): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

    • +
    • Python Console (“Executions”): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you.

    • Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us understand better the code we are writing.

    • Variable Environment: Open it by clicking on the “{x}” button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout rest of your Python code.

    The first thing we will do is see the different ways we can run Python code. You can do the following:

      -
    1. Type something into the Python Console (Execution) and type enter, such as 2+2. The Python Console will run it and give you an output.
    2. +
    3. Type something into the Python Console (Execution) and click the arrow button, such as 2+2. The Python Console will run it and give you an output.
    4. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data.
    5. Run every single Python code chunk via Runtime -> Run all.
    @@ -257,13 +302,14 @@

    1.4 Google Colab Setupnotable differences.

    Now, we will get to the basics of programming grammar.

    1.5 Grammar Structure 1: Evaluation of Expressions

    • Expressions are be built out of operations or functions.

    • -
    • Functions and operations take in data types, do something with them, and return another data type.

    • +
    • Functions and operations take in data types as inputs, do something with them, and return another data type as ouput.

    • We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it.

    For instance, consider the following expressions entered to the Python Console:

    @@ -285,9 +331,22 @@

    1.5 Grammar Structure 1: Evaluati
    ## 39
    add(18, add(21, 65))
    ## 104
    -

    Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that is easier to read. (Because the add() function isn’t typically used, it is not automatically available, so we used the import statement to load it in.)

    -
    -

    1.5.1 Data types

    +

    Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that is easier to read. (Most functions in Python are stored in a collection of functions called modules that needs to be loaded. The import statement gives us permission to access the functions in the module “operator”.)

    +
    +

    1.5.1 Function machine schema

    +

    A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class:

    +
    +Function machine from algebra class. +
    Function machine from algebra class.
    +
    +

    Here are some aspects of this schema to pay attention to:

    +
      +
    • A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.

    • +
    • A function can have different kinds of inputs and outputs - it doesn’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs.

    • +
    +
    +
    +

    1.5.2 Data types

    Here are some common data types we will be using in this course.

    @@ -320,16 +379,6 @@

    1.5.1 Data types -Function machine from algebra class. -
    Function machine from algebra class.
    - -

    Here are some aspects of this schema to pay attention to:

    -
      -
    • A programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language.

    • -
    • A function can have different kinds of inputs and outputs - it doesn’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs.

    • -
    @@ -342,18 +391,18 @@

    1.6.1 Execution rule for variable

    Evaluate the expression to the right of =.

    Bind variable to the left of = to the resulting value.

    -

    The variable is stored in the Variable Environment.

    +

    The variable is stored in the Variable Environment.

    The Variable Environment is where all the variables are stored, and can be used for an expression anytime once it is defined. Only one unique variable name can be defined.

    -

    The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later.

    +

    The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM.

    Look, now x can be reused downstream:

    x - 2
    ## 37
    y = x * 2
    -

    It is quite common for programmers to not know what data type a variable is while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python:

    +

    It is quite common for programmers to have to look up the data type of a variable while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python:

    type(y)
    ## <class 'int'>
    -

    We should give useful variable names so that we know what to expect! Consider num_sales instead of y.

    +

    We should give useful variable names so that we know what to expect! If you are working with numerical sales data, consider num_sales instead of y.

    @@ -362,13 +411,15 @@

    1.7 Grammar Structure 3: Evaluati

    1.7.1 Execution rule for functions:

    -

    Evaluate the function by its arguments, and if the arguments are functions or contains operations, evaluate those functions or operations first.

    +

    Evaluate the function by its arguments if there’s any, and if the arguments are functions or contains operations, evaluate those functions or operations first.

    The output of functions is called the returned value.

    -

    Often, we will use multiple functions, in a nested way, or use parenthesis to change the order of operation. Being able to read nested operations, nested functions, and parenthesis is very important. Think about what the Python is going to do step-by–step in the line of code below:

    -
    (len("hello") + 4) * 2
    -
    ## 18
    -

    If we don’t know how to use a function, such as pow() we can ask for help:

    +

    Often, we will use multiple functions in a nested way, and it is important to understand how the Python console understand the order of operation. We can also use parenthesis to change the order of operation. Think about what the Python is going to do step-by–step in the lines of code below:

    +
    max(len("hello"), 4)
    +
    ## 5
    +
    (len("pumpkin") - 8) * 2
    +
    ## -2
    +

    If we don’t know how to use a function, such as pow(), we can ask for help:

    ?pow
     
     pow(base, exp, mod=None)
    @@ -376,33 +427,76 @@ 

    1.7.1 Execution rule for function Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form.

    -

    This shows the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value of mod=None, that means the input argument already has a value, and you don’t need to specify anything, unless you want to.

    +

    We can also find a similar help document, in a nicer rendered form online. We will practice looking at function documentation throughout the course, because that is a fundamental skill to learn more functions on your own.

    +

    The documentation shows the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value of mod=None, that means the input argument already has a value, and you don’t need to specify anything, unless you want to.

    The following ways are equivalent ways of using the pow() function:

    -
    pow(2, 3)
    +
    pow(2, 3)
    ## 8
    -
    pow(base=2, exp=3)
    +
    pow(base=2, exp=3)
    ## 8
    -
    pow(exp=3, base=2)
    +
    pow(exp=3, base=2)
    ## 8

    but this will give you something different:

    -
    pow(3, 2)
    +
    pow(3, 2)
    ## 9

    And there is an operational equivalent:

    -
    2 ** 3
    +
    2 ** 3
    ## 8
    +

    We will mostly look at functions with input arguments and return types in this course, but not all functions need to have input arguments and output return. Let’s look at some examples of functions that don’t always have an input or output:

    | Function call | What it takes in | What it does | Returns |
    |---------------|------------------|--------------|---------|
    | `pow(a, b)` | integer `a`, integer `b` | Raises `a` to the `b`th power. | Integer |
    | `time.sleep(x)` | integer `x` | Waits for `x` seconds. | `None` |
    | `dir()` | Nothing | Gives a list of all the variables defined in the environment. | List |
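    A quick sketch verifying the table's entries (`time` is a standard-library module we import first):

    ```python
    import time

    result = time.sleep(1)   # takes an input, waits 1 second, returns None
    print(result)
    ## None

    names = dir()            # takes no input; returns a list of defined names
    print(type(names))
    ## <class 'list'>
    ```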

    1.8 Tips on writing your first code

    Computer = powerful + stupid


    Even the smallest spelling and formatting changes will cause unexpected output and errors!


    Computers are excellent at doing something specific over and over again, but they are extremely rigid and lack flexibility. Here are some tips that are helpful for beginners:

    • Write incrementally, test often.

    • Don’t be afraid to break things: it is how we learn how things work in programming.

    • Check your assumptions, especially when using new functions, operations, and new data types.

    • Live environments are great for testing, but not great for reproducibility.

    • Ask for help!
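    One concrete way to check assumptions is to inspect a value's type before operating on it. This is a minimal sketch using a string that merely looks like a number:

    ```python
    x = "5"
    print(type(x))     # not an integer!
    ## <class 'str'>

    # x + 2 would raise a TypeError; convert the string first
    print(int(x) + 2)
    ## 7
    ```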

    To get more familiar with the errors Python gives you, take a look at this summary of Python error messages.
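    For instance, a simple misspelling produces a NameError; here it is caught just to show the message:

    ```python
    try:
        prnt("hello")        # misspelled print
    except NameError as err:
        print(err)
    ## name 'prnt' is not defined
    ```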


    1.9 Exercises


    Exercises for week 1 can be found here.

files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-16.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-17.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-17.png new file mode 100644 index 0000000..b9aafee Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-17.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-18.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-18.png new file mode 100644 index 0000000..f59f40d Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-18.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-25.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-25.png new file mode 100644 index 0000000..b9aafee Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-25.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-26.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-26.png new file mode 100644 index 0000000..f59f40d Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-26.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-27.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-27.png new file mode 100644 index 0000000..22f8179 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-27.png differ diff --git 
a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-9.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-9.png new file mode 100644 index 0000000..f59f40d Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-7-9.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-11.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-11.png new file mode 100644 index 0000000..c9ffa6f Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-11.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-19.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-19.png new file mode 100644 index 0000000..8ace120 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-19.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-20.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-20.png new file mode 100644 index 0000000..c9ffa6f Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-20.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-21.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-21.png new file mode 100644 index 0000000..8ace120 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-21.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-22.png 
b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-22.png new file mode 100644 index 0000000..c9ffa6f Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-22.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-31.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-31.png new file mode 100644 index 0000000..8ace120 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-31.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-32.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-32.png new file mode 100644 index 0000000..c9ffa6f Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-32.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-33.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-33.png new file mode 100644 index 0000000..22f8179 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-8-33.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-13.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-13.png new file mode 100644 index 0000000..b82f5a0 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-13.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-23.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-23.png new file mode 100644 index 0000000..0970063 Binary 
files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-23.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-24.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-24.png new file mode 100644 index 0000000..b82f5a0 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-24.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-25.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-25.png new file mode 100644 index 0000000..0970063 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-25.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-26.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-26.png new file mode 100644 index 0000000..b82f5a0 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-26.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-37.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-37.png new file mode 100644 index 0000000..0970063 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-37.png differ diff --git a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-38.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-38.png new file mode 100644 index 0000000..b82f5a0 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-38.png differ diff --git 
a/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-39.png b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-39.png new file mode 100644 index 0000000..22f8179 Binary files /dev/null and b/docs/no_toc/resources/images/05-data-visualization_files/figure-html/unnamed-chunk-9-39.png differ diff --git a/docs/no_toc/search_index.json b/docs/no_toc/search_index.json index aa8aa01..1264e16 100644 --- a/docs/no_toc/search_index.json +++ b/docs/no_toc/search_index.json @@ -1 +1 @@ -[["index.html", "Introduction to Python About this Course 0.1 Curriculum 0.2 Target Audience 0.3 Learning Objectives 0.4 Offerings", " Introduction to Python August, 2024 About this Course 0.1 Curriculum The course covers fundamentals of Python, a high-level programming language, and use it to wrangle data for analysis and visualization. 0.2 Target Audience The course is intended for researchers who want to learn coding for the first time with a data science application via the Python language. This course is also appropriate for folks who have explored data science or programming on their own and want to focus on some fundamentals. 0.3 Learning Objectives Analyze Tidy datasets in the Python programming language via data subsetting, joining, and transformations. Evaluate summary statistics and data visualization to understand scientific questions. Describe how the Python programming environment interpret complex expressions made out of functions, operations, and data structures, in a step-by-step way. Apply problem solving strategies to debug broken code. 0.4 Offerings This course is taught on a regular basis at Fred Hutch Cancer Center through the Data Science Lab. Announcements of course offering can be found here. "],["intro-to-computing.html", "Chapter 1 Intro to Computing 1.1 Goals of the course 1.2 What is a computer program? 
1.3 A programming language has following elements: 1.4 Google Colab Setup 1.5 Grammar Structure 1: Evaluation of Expressions 1.6 Grammar Structure 2: Storing data types in the Variable Environment 1.7 Grammar Structure 3: Evaluation of Functions 1.8 Tips on writing your first code", " Chapter 1 Intro to Computing Welcome to Introduction to Python! Each week, we cover a chapter, which consists of a lesson and exercise. In our first week together, we will look at big conceptual themes in programming, see how code is run, and learn some basic grammar structures of programming. 1.1 Goals of the course In the next 6 weeks, we will explore: Fundamental concepts in high-level programming languages (Python, R, Julia, etc.) that is transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow. Image source: R for Data Science. Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 1.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for Python Interpreter in this class. More importantly: How we organize ideas <-> Instructing a computer to do something. 1.3 A programming language has following elements: Grammar structure to construct expressions; combining expressions to create more complex expressions Encapsulate complex expressions via functions to create modular and reusable tasks Encapsulate complex data via data structures to allow efficient manipulation of data 1.4 Google Colab Setup Google Colab is a Integrated Development Environment (IDE) on a web browser. 
Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles to using Python that is easier for the user. Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, open up… Today, we will pay close attention to: Python Console (Execution): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you. Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us understand better the code we are writing. Variable Environment: Open it by clicking on the “{x}” button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout rest of your Python code. The first thing we will do is see the different ways we can run Python code. You can do the following: Type something into the Python Console (Execution) and type enter, such as 2+2. The Python Console will run it and give you an output. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data. Run every single Python code chunk via Runtime -> Run all. Remember that the order that you run your code matters in programming. 
Your final product would be the result of Option 3, in which you run every Python code chunk from start to finish. However, sometimes it is nice to try out smaller parts of your code via Options 1 or 2. But you will be at risk of running your code out of order! To create your own content in the notebook, click on a section you want to insert content, and then click on “+ Code” or “+ Text” to add Python code or text, respectively. Python Notebook is great for data science work, because: It encourages reproducible data analysis, when you run your analysis from start to finish. It encourages excellent documentation, as you can have code, output from code, and prose combined together. It is flexible to use other programming languages, such as R. Now, we will get to the basics of programming grammar. 1.5 Grammar Structure 1: Evaluation of Expressions Expressions are be built out of operations or functions. Functions and operations take in data types, do something with them, and return another data type. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. For instance, consider the following expressions entered to the Python Console: 18 + 21 ## 39 max(18, 21) ## 21 max(18 + 21, 65) ## 65 18 + (21 + 65) ## 104 len("ATCG") ## 4 Here, our input data types to the operation are integer in lines 1-4 and our input data type to the function is string in line 5. We will go over common data types shortly. Operations are just functions in hiding. We could have written: from operator import add add(18, 21) ## 39 add(18, add(21, 65)) ## 104 Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that is easier to read. (Because the add() function isn’t typically used, it is not automatically available, so we used the import statement to load it in.) 
1.5.1 Data types Here are some common data types we will be using in this course. Data type name Data type shorthand Examples Integer int 2, 4 Float float 3.5, -34.1009 String str “hello”, “234-234-8594” Boolean bool True, False A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class: Function machine from algebra class. Here are some aspects of this schema to pay attention to: A programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language. A function can have different kinds of inputs and outputs - it doesn’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs. 1.6 Grammar Structure 2: Storing data types in the Variable Environment To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Variable Environment, the variable x has a value of 39. 1.6.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind variable to the left of = to the resulting value. The variable is stored in the Variable Environment. The Variable Environment is where all the variables are stored, and can be used for an expression anytime once it is defined. Only one unique variable name can be defined. The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later. 
Look, now x can be reused downstream: x - 2 ## 37 y = x * 2 It is quite common for programmers to not know what data type a variable is while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python: type(y) ## <class 'int'> We should give useful variable names so that we know what to expect! Consider num_sales instead of y. 1.7 Grammar Structure 3: Evaluation of Functions Let’s look at functions a little bit more formally: A function has a function name, arguments, and returns a data type. 1.7.1 Execution rule for functions: Evaluate the function by its arguments, and if the arguments are functions or contains operations, evaluate those functions or operations first. The output of functions is called the returned value. Often, we will use multiple functions, in a nested way, or use parenthesis to change the order of operation. Being able to read nested operations, nested functions, and parenthesis is very important. Think about what the Python is going to do step-by–step in the line of code below: (len("hello") + 4) * 2 ## 18 If we don’t know how to use a function, such as pow() we can ask for help: ?pow pow(base, exp, mod=None) Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form. This shows the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value of mod=None, that means the input argument already has a value, and you don’t need to specify anything, unless you want to. 
The following ways are equivalent ways of using the pow() function: pow(2, 3) ## 8 pow(base=2, exp=3) ## 8 pow(exp=3, base=2) ## 8 but this will give you something different: pow(3, 2) ## 9 And there is an operational equivalent: 2 ** 3 ## 8 1.8 Tips on writing your first code Computer = powerful + stupid Even the smallest spelling and formatting changes will cause unexpected output and errors! Write incrementally, test often Check your assumptions, especially using new functions, operations, and new data types. Live environments are great for testing, but not great for reproducibility. Ask for help! To get more familiar with the errors Python gives you, take a look at this summary of Python error messages. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     Credits Names Pedagogy Lead Content Instructor(s) FirstName LastName Lecturer(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved Delivered the course in some way - video or audio Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) 
Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-08-07 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## askpass 1.2.0 2023-09-03 [1] RSPM (R 4.3.0) ## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fansi 1.0.6 2023-12-08 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## hms 1.1.3 2023-03-21 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 
2024-01-26 [1] RSPM (R 4.3.0) ## httr 1.4.7 2023-08-15 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## openssl 2.1.1 2023-09-25 [1] RSPM (R 4.3.0) ## ottrpal 1.2.1 2024-06-11 [1] Github (jhudsl/ottrpal@828539f) ## pillar 1.9.0 2023-03-22 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## readr 2.1.5 2024-01-10 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) ## tzdb 0.4.0 2023-05-12 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## utf8 1.2.4 2023-10-22 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) ## xml2 1.3.6 2023-12-04 [1] RSPM (R 4.3.0) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] 
/usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 2 References", " Chapter 2 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "Introduction to Python About this Course 0.1 Curriculum 0.2 Target Audience 0.3 Learning Objectives 0.4 Offerings", " Introduction to Python September, 2024 About this Course 0.1 Curriculum The course covers the fundamentals of Python, a high-level programming language, and uses it to wrangle data for analysis and visualization. 0.2 Target Audience The course is intended for researchers who want to learn coding for the first time with a data science application via the Python language. This course is also appropriate for folks who have explored data science or programming on their own and want to focus on some fundamentals. 0.3 Learning Objectives Analyze Tidy datasets in the Python programming language via data subsetting, joining, and transformations. Evaluate summary statistics and data visualization to understand scientific questions. Describe how the Python programming environment interprets complex expressions made out of functions, operations, and data structures, in a step-by-step way. Apply problem-solving strategies to debug broken code. 0.4 Offerings This course is taught on a regular basis at Fred Hutch Cancer Center through the Data Science Lab. Announcements of course offerings can be found here. "],["intro-to-computing.html", "Chapter 1 Intro to Computing 1.1 Goals of the course 1.2 What is a computer program?
1.3 A programming language has the following elements: 1.4 Google Colab Setup 1.5 Grammar Structure 1: Evaluation of Expressions 1.6 Grammar Structure 2: Storing data types in the Variable Environment 1.7 Grammar Structure 3: Evaluation of Functions 1.8 Tips on writing your first code 1.9 Exercises", " Chapter 1 Intro to Computing Welcome to Introduction to Python! Each week, we cover a chapter, which consists of a lesson and an exercise. In our first week together, we will look at big conceptual themes in programming, see how code is run, and learn some basic grammar structures of programming. 1.1 Goals of the course In the next 6 weeks, we will explore: Fundamental concepts in high-level programming languages (Python, R, Julia, etc.) that are transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow. Image source: R for Data Science. Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 1.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for Python Interpreter in this class. More importantly: How we organize ideas <-> Instructing a computer to do something. 1.3 A programming language has the following elements: Grammar structure to construct expressions; combining expressions to create more complex expressions Encapsulate complex expressions via functions to create modular and reusable tasks Encapsulate complex data via data structures to allow efficient manipulation of data 1.4 Google Colab Setup Google Colab is an Integrated Development Environment (IDE) that runs in a web browser.
Think of it as Microsoft Word compared to a plain text editor: it provides extra bells and whistles that make using Python easier for the user. Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, you can view it here. Today, we will pay close attention to: Python Console (“Executions”): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you. Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us better understand the code we are writing. Variable Environment: Open it by clicking on the “{x}” button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout the rest of your Python code. The first thing we will do is see the different ways we can run Python code. You can do the following: Type something into the Python Console (Execution), such as 2+2, and click the arrow button. The Python Console will run it and give you an output. Look through the Python Notebook, and when you see a chunk of Python code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data. Run every single Python code chunk via Runtime -> Run all. Remember that the order in which you run your code matters in programming. 
Your final product would be the result of Option 3, in which you run every Python code chunk from start to finish. However, sometimes it is nice to try out smaller parts of your code via Options 1 or 2. But you will be at risk of running your code out of order! To create your own content in the notebook, click on the section where you want to insert content, and then click on “+ Code” or “+ Text” to add Python code or text, respectively. Python Notebook is great for data science work, because: It encourages reproducible data analysis, when you run your analysis from start to finish. It encourages excellent documentation, as you can have code, output from code, and prose combined together. It is flexible to use other programming languages, such as R. The version of Python used in this course and in Google Colab is Python 3, which is the version of Python that is most supported. Some Python software is written in Python 2, which is very similar but has some notable differences. Now, we will get to the basics of programming grammar. 1.5 Grammar Structure 1: Evaluation of Expressions Expressions are built out of operations or functions. Functions and operations take in data types as inputs, do something with them, and return another data type as output. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. For instance, consider the following expressions entered into the Python Console: 18 + 21 ## 39 max(18, 21) ## 21 max(18 + 21, 65) ## 65 18 + (21 + 65) ## 104 len("ATCG") ## 4 Here, our input data types to the operations are integers in lines 1-4, and our input data type to the function is a string in line 5. We will go over common data types shortly. Operations are just functions in hiding. 
We could have written: from operator import add add(18, 21) ## 39 add(18, add(21, 65)) ## 104 Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that are easier to read. (Most functions in Python are stored in collections of functions called modules that need to be loaded. The import statement gives us permission to access the functions in the module “operator”.) 1.5.1 Function machine schema A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class: Function machine from algebra class. Here are some aspects of this schema to pay attention to: A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation of any programming language. A function can have different kinds of inputs and outputs - they don’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs. 1.5.2 Data types Here are some common data types we will be using in this course. Data type name Data type shorthand Examples Integer int 2, 4 Float float 3.5, -34.1009 String str “hello”, “234-234-8594” Boolean bool True, False 1.6 Grammar Structure 2: Storing data types in the Variable Environment To build up a computer program, we need to store the data type returned from an expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Variable Environment, the variable x has a value of 39. 1.6.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind the variable to the left of = to the resulting value. The variable is stored in the Variable Environment. 
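The two-step execution rule for variable assignment can be seen in a short sketch (the variable names here are ours, chosen for illustration):

```python
# Step 1: evaluate the expression on the right of =.
# Step 2: bind the name on the left of = to the resulting value.
x = 18 + 21
print(x)          # 39

# The stored variable can now be used inside other expressions.
y = max(x, 50)
print(y)          # 50
```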
The Variable Environment is where all the variables are stored, and any variable can be used in an expression once it is defined. Each variable name in the environment is unique. The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 Gigabytes of RAM. Look, now x can be reused downstream: x - 2 ## 37 y = x * 2 It is quite common for programmers to have to look up the data type of a variable while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python: type(y) ## <class 'int'> We should give useful variable names so that we know what to expect! If you are working with numerical sales data, consider num_sales instead of y. 1.7 Grammar Structure 3: Evaluation of Functions Let’s look at functions a little bit more formally: A function has a function name, arguments, and returns a data type. 1.7.1 Execution rule for functions: Evaluate the function by its arguments if there are any, and if the arguments are functions or contain operations, evaluate those functions or operations first. The output of a function is called the returned value. Often, we will use multiple functions in a nested way, and it is important to understand how the Python console understands the order of operations. We can also use parentheses to change the order of operations. Think about what Python is going to do step-by-step in the lines of code below: max(len("hello"), 4) ## 5 (len("pumpkin") - 8) * 2 ## -2 If we don’t know how to use a function, such as pow(), we can ask for help: ?pow pow(base, exp, mod=None) Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form. We can also find a similar help document, in a nicer rendered form, online. 
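The execution rule for functions, applied to the nested expressions above, can be traced by hand. A sketch (the intermediate variable names are ours, not part of the lesson):

```python
# Step 1: evaluate the innermost function call first.
inner = len("hello")              # 5
# Step 2: pass the result to the outer function.
result = max(inner, 4)            # max(5, 4) -> 5

# Parentheses change the order of operations:
step = (len("pumpkin") - 8) * 2   # (7 - 8) * 2 -> -2
print(result, step)
```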
We will practice looking at function documentation throughout the course, because that is a fundamental skill for learning more functions on your own. The documentation shows that the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value such as mod=None, that means the input argument has a default value, and you don’t need to specify anything, unless you want to. The following are equivalent ways of using the pow() function: pow(2, 3) ## 8 pow(base=2, exp=3) ## 8 pow(exp=3, base=2) ## 8 but this will give you something different: pow(3, 2) ## 9 And there is an operational equivalent: 2 ** 3 ## 8 We will mostly look at functions with input arguments and return types in this course, but not all functions need to have input arguments and return values. Let’s look at some examples of functions that don’t always have an input or output: Function call What it takes in What it does Returns pow(a, b) integer a, integer b Raises a to the bth power. Integer time.sleep(x) Integer x Waits for x seconds. None dir() Nothing Gives a list of all the variables defined in the environment. List 1.8 Tips on writing your first code Computer = powerful + stupid Computers are excellent at doing something specific over and over again, but they are extremely rigid and lack flexibility. Here are some tips that are helpful for beginners: Write incrementally, test often. Don’t be afraid to break things: it is how we learn how things work in programming. Check your assumptions, especially when using new functions, operations, and new data types. Live environments are great for testing, but not great for reproducibility. Ask for help! To get more familiar with the errors Python gives you, take a look at this summary of Python error messages. 1.9 Exercises Exercise for week 1 can be found here. 
"],["working-with-data-structures.html", "Chapter 2 Working with data structures 2.1 Lists 2.2 Objects in Python 2.3 Methods vs Functions 2.4 Dataframes 2.5 What does a Dataframe contain? 2.6 What can a Dataframe do? 2.7 Subsetting Dataframes 2.8 Exercises", " Chapter 2 Working with data structures In our second lesson, we start to look at two data structures, Lists and Dataframes, that can handle a large amount of data for analysis. 2.1 Lists In the first exercise, you started to explore data structures, which store information about data types. You explored lists, which is an ordered collection of data types or data structures. Each element of a list contains a data type or another data structure. We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive. We create a list via the bracket [ ] operation. staff = ["chris", "ted", "jeff"] chrNum = [2, 3, 1, 2, 2] mixedList = [False, False, False, "A", "B", 92] 2.1.1 Subsetting lists To access an element of a list, you can use the bracket notation [ ] to access the elements of the list. We simply access an element via the “index” number - the location of the data within the list. Here’s the tricky thing about the index number: it starts at 0! 1st element of chrNum: chrNum[0] 2nd element of chrNum: chrNum[1] … 5th element of chrNum: chrNum[4] With subsetting, you can modify elements of a list or use the element of a list as part of an expression. 2.1.2 Subsetting multiple elements of lists Suppose you want to access multiple elements of a list, such as accessing the first three elements of chrNum. You would use the slice operator :, which specifies: the index number to start the index number to stop, plus one. 
If you want to access the first three elements of chrNum: chrNum[0:3] ## [2, 3, 1] The first element’s index number is 0, and the third element’s index number is 2, plus 1, which is 3. If you want to access the second and third elements of chrNum: chrNum[1:3] ## [3, 1] Another way of accessing the first 3 elements of chrNum: chrNum[:3] ## [2, 3, 1] Here, the start index number was not specified. When the start or stop index is not specified, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively. Here’s another example, using negative indices to take the last 3 elements of the list: chrNum[-3:] ## [1, 2, 2] You can find more discussion of list slicing, using negative indices and incremental slicing, here. 2.2 Objects in Python The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined: What does it contain (in terms of data)? What can it do (in terms of functions)? And if it “makes sense” to us, then it is well-designed. The list data structure we have been working with is an example of an Object. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following: Value that holds the essential data for the object. Attributes that hold subset or additional data for the object. Functions called Methods that are for the object and have to take in the variable referenced as an input This organizing structure on an object applies to pretty much all Python data types and data structures. Let’s see how this applies to the list: Value: the contents of the list, such as [2, 3, 4]. Attributes that store additional values: Not relevant for lists. 
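The slicing rules can be collected into one runnable sketch. The step form in the last line (“incremental slicing”) goes beyond what the lesson shows and comes from the linked reference:

```python
chrNum = [2, 3, 1, 2, 2]

print(chrNum[0:3])   # start at index 0, stop before index 3 -> [2, 3, 1]
print(chrNum[:3])    # omitted start -> from the beginning of the list
print(chrNum[-3:])   # negative index -> the last 3 elements
print(chrNum[::2])   # optional third number is the step -> every other element
```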
Methods that can be used on the object: chrNum.count(2) counts the number of instances 2 appears as an element of chrNum. Object methods are functions that do something with the object you are using them on. You should think about chrNum.count(2) as a function that takes in chrNum and 2 as inputs. If you want to use the count function on the list mixedList, you would use mixedList.count(x). Here are some more examples of methods with lists: Function method What it takes in What it does Returns chrNum.count(x) list chrNum, data type x Counts the number of instances x appears as an element of chrNum. Integer chrNum.append(x) list chrNum, data type x Appends x to the end of chrNum. None (but chrNum is modified!) chrNum.sort() list chrNum Sorts chrNum in ascending order. None (but chrNum is modified!) chrNum.reverse() list chrNum Reverses the order of chrNum. None (but chrNum is modified!) 2.3 Methods vs Functions Methods have to take in the object of interest as an input: chrNum.count(2) automatically treats chrNum as an input. Methods are built for a specific Object type. Functions do not have an implied input: len(chrNum) requires specifying a list in the input. Otherwise, there is no strong distinction between the two. 2.4 Dataframes A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does. The Dataframe data structure is found within a Python module called “Pandas”. A Python module is an organized collection of functions and data structures. The import statement below gives us permission to access the “Pandas” module via the variable pd. To load in a Dataframe from an existing spreadsheet, we use the function pd.read_csv(): import pandas as pd metadata = pd.read_csv("classroom_data/metadata.csv") type(metadata) ## <class 'pandas.core.frame.DataFrame'> There is a similar function pd.read_excel() for loading in Excel spreadsheets. Let’s investigate the Dataframe as an object: What does a Dataframe contain (values, attributes)? 
What can a Dataframe do (methods)? 2.5 What does a Dataframe contain? We first take a look at the contents: metadata ## ModelID ... OncotreeLineage ## 0 ACH-000001 ... Ovary/Fallopian Tube ## 1 ACH-000002 ... Myeloid ## 2 ACH-000003 ... Bowel ## 3 ACH-000004 ... Myeloid ## 4 ACH-000005 ... Myeloid ## ... ... ... ... ## 1859 ACH-002968 ... Esophagus/Stomach ## 1860 ACH-002972 ... Esophagus/Stomach ## 1861 ACH-002979 ... Esophagus/Stomach ## 1862 ACH-002981 ... Esophagus/Stomach ## 1863 ACH-003071 ... Lung ## ## [1864 rows x 30 columns] It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, it shows some of the data. We can look at specific columns by looking at attributes via the dot operation. We can also look at the columns via the bracket operation. metadata.ModelID ## 0 ACH-000001 ## 1 ACH-000002 ## 2 ACH-000003 ## 3 ACH-000004 ## 4 ACH-000005 ## ... ## 1859 ACH-002968 ## 1860 ACH-002972 ## 1861 ACH-002979 ## 1862 ACH-002981 ## 1863 ACH-003071 ## Name: ModelID, Length: 1864, dtype: object metadata['ModelID'] ## 0 ACH-000001 ## 1 ACH-000002 ## 2 ACH-000003 ## 3 ACH-000004 ## 4 ACH-000005 ## ... ## 1859 ACH-002968 ## 1860 ACH-002972 ## 1861 ACH-002979 ## 1862 ACH-002981 ## 1863 ACH-003071 ## Name: ModelID, Length: 1864, dtype: object The names of all the columns are stored as an attribute, which can be accessed via the dot operation. 
metadata.columns ## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age', ## 'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory', ## 'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis', ## 'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype', ## 'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments', ## 'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus', ## 'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype', ## 'OncotreePrimaryDisease', 'OncotreeLineage'], ## dtype='object') The number of rows and columns are also stored as an attribute: metadata.shape ## (1864, 30) 2.6 What can a Dataframe do? We can use the .head() and .tail() methods to look at the first few rows and last few rows of metadata, respectively: metadata.head() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube ## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid ## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel ## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## ## [5 rows x 30 columns] metadata.tail() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 1859 ACH-002968 PT-pjhrsc ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1860 ACH-002972 PT-dkXZB1 ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1861 ACH-002979 PT-lyHTzo ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1862 ACH-002981 PT-Z9akXf ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1863 ACH-003071 PT-LAGmLq ... Lung Neuroendocrine Tumor Lung ## ## [5 rows x 30 columns] Both of these functions (without input arguments) are considered as methods: they are functions that does something with the Dataframe you are using it on. You should think about metadata.head() as a function that takes in metadata as an input. 
If we had another Dataframe called my_data and you want to use the same function, you will have to say my_data.head(). 2.7 Subsetting Dataframes Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists. You will use the iloc attribute and bracket operations, and you give two slices: one for the rows, and one for the columns. Let’s start with a small dataframe to see how it works before returning to metadata: df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 Here is what the dataframe looks like with the row and column index numbers: Subset the first four rows, and the first two columns: Now, back to the metadata dataframe: Subset the first 5 rows, and the first two columns: metadata.iloc[:5, :2] ## ModelID PatientID ## 0 ACH-000001 PT-gj46wT ## 1 ACH-000002 PT-5qa3uk ## 2 ACH-000003 PT-puKIyc ## 3 ACH-000004 PT-q4K2cp ## 4 ACH-000005 PT-q4K2cp If we want a custom slice that is not sequential, we can use an integer list. Subset everything except the first 5 rows, and the columns at index 1, 10, and 21: metadata.iloc[5:, [1, 10, 21]] ## PatientID GrowthPattern WTSIMasterCellID ## 5 PT-ej13Dz Suspension 2167.0 ## 6 PT-NOXwpH Adherent 569.0 ## 7 PT-fp8PeY Adherent 1806.0 ## 8 PT-puKIyc Adherent 2104.0 ## 9 PT-AR7W9o Adherent NaN ## ... ... ... ... ## 1859 PT-pjhrsc Organoid NaN ## 1860 PT-dkXZB1 Organoid NaN ## 1861 PT-lyHTzo Organoid NaN ## 1862 PT-Z9akXf Organoid NaN ## 1863 PT-LAGmLq Suspension NaN ## ## [1859 rows x 3 columns] When we subset via numerical indices, it’s called explicit subsetting. 
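Putting the explicit subsetting rules together on the small df above (a sketch; the variable names subset and picked are ours, chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

# .iloc takes two slices: one for the rows, one for the columns.
subset = df.iloc[:4, :2]          # first four rows, first two columns
print(subset.shape)               # (4, 2)

# Non-sequential subsetting via integer lists.
picked = df.iloc[[0, 4], [0, 2]]  # rows at index 0 and 4; columns at index 0 and 2
print(picked)
```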
This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the column. Then your code to subset the last 5 rows and the columns will get you a different answer once the spreadsheet is changed. The second way is to subset by the column name and comparison operators, also known as implicit subsetting. This is much more robust in data analysis practice. You will learn about it next week! 2.8 Exercises Exercise for week 2 can be found here. "],["data-wrangling-part-1.html", "Chapter 3 Data Wrangling, Part 1 3.1 Tidy Data 3.2 Our working Tidy Data: DepMap Project 3.3 Transform: “What do you want to do with this Dataframe”? 3.4 Summary Statistics 3.5 Simple data visualization 3.6 Exercises", " Chapter 3 Data Wrangling, Part 1 From our first two lessons, we are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow, which is a natural cycle that occurs in data analysis. Data science workflow. Image source: R for Data Science. For the rest of the course, we focus on Transform and Visualize with the assumption that our data is in a nice, “Tidy format”. First, we need to understand what it means for a data to be “Tidy”. 3.1 Tidy Data Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of tidy data, developed by Hadley Wickham: Each variable must have its own column. Each observation must have its own row. Each value must have its own cell. 
If you want to be technical about what variables and observations are, Hadley Wickham describes: A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes. A tidy dataframe. Image source: R for Data Science. 3.2 Our working Tidy Data: DepMap Project The Dependency Map project is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session. Metadata Somatic mutations Gene expression Drug sensitivity CRISPR knockout and more… Let’s load these datasets in, and see how these datasets fit the definition of Tidy data: import pandas as pd metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") metadata.head() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube ## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid ## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel ## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## ## [5 rows x 30 columns] mutation.head() ## ModelID CACNA1D_Mut CYP2D6_Mut ... CCDC28A_Mut C1orf194_Mut U2AF1_Mut ## 0 ACH-000001 False False ... False False False ## 1 ACH-000002 False False ... False False False ## 2 ACH-000004 False False ... False False False ## 3 ACH-000005 False False ... False False False ## 4 ACH-000006 False False ... False False False ## ## [5 rows x 540 columns] expression.head() ## ModelID ENPP4_Exp CREBBP_Exp ... 
OR5D13_Exp C2orf81_Exp OR8S1_Exp ## 0 ACH-001113 2.280956 4.094236 ... 0.0 1.726831 0.0 ## 1 ACH-001289 3.622930 3.606442 ... 0.0 0.790772 0.0 ## 2 ACH-001339 0.790772 2.970854 ... 0.0 0.575312 0.0 ## 3 ACH-001538 3.485427 2.801159 ... 0.0 1.077243 0.0 ## 4 ACH-000242 0.879706 3.327687 ... 0.0 0.722466 0.0 ## ## [5 rows x 536 columns] Dataframe The observation is Some variables are Some values are metadata Cell line ModelID, Age, OncotreeLineage “ACH-000001”, 60, “Myeloid” expression Cell line KRAS_Exp 2.4, .3 mutation Cell line KRAS_Mut True, False 3.3 Transform: “What do you want to do with this Dataframe”? Remember that a major theme of the course is: How we organize ideas <-> Instructing a computer to do something. Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s pivot to writing our code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform our data to answer our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows. Here’s a starting prompt: In the metadata dataframe, which rows and which columns would you subset for that relate to a scientific question? We have been using explicit subsetting with numerical indices, such as “I want to filter for rows 20-50 and select columns 2 and 8”. We are now going to switch to implicit subsetting, in which we describe the subsetting criteria via comparison operators and column names, such as: “I want to subset for rows such that the OncotreeLineage is lung cancer and subset for columns Age and Sex.” Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names. 3.3.0.1 Let’s convert our implicit subsetting criteria into code! 
To subset for rows implicitly, we will use the conditional operators on Dataframe columns that you used in Exercise 2. To formulate a conditional operator expression that OncotreeLineage is lung cancer: metadata['OncotreeLineage'] == "Lung" ## 0 False ## 1 False ## 2 False ## 3 False ## 4 False ## ... ## 1859 False ## 1860 False ## 1861 False ## 1862 False ## 1863 True ## Name: OncotreeLineage, Length: 1864, dtype: bool Then, we will use the .loc operation (which is different from the .iloc operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time: metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]] ## Age Sex ## 10 39.0 Female ## 13 44.0 Male ## 19 55.0 Female ## 27 39.0 Female ## 28 45.0 Male ## ... ... ... ## 1745 52.0 Male ## 1819 84.0 Male ## 1820 57.0 Female ## 1822 53.0 Male ## 1863 62.0 Male ## ## [241 rows x 2 columns] What’s going on here? The first component of the subset, metadata['OncotreeLineage'] == \"Lung\", subsets for the rows. It gives us a column of True and False values, and we keep rows that correspond to True values. Then, we specify the column names we want to subset for via a list. Here’s another example: df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 “I want to subset for rows such that the status is “treated” and subset for columns status and age_case.” df.loc[df.status == "treated", ["status", "age_case"]] ## status age_case ## 0 treated 25 ## 4 treated 7 3.4 Summary Statistics Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! 
A common data science task is to examine summary statistics of a dataset, which summarize all the values of a variable in a numeric summary, such as the mean, median, or mode. If we look at the data structure of a Dataframe’s column, it is actually not a List, but an object called a Series. It has methods that can compute summary statistics for us. Let’s take a look at a few popular examples: Function method What it takes in What it does Returns metadata.Age.mean() metadata.Age as a numeric Series Computes the mean value of the Age column. Float (NumPy) metadata['Age'].median() metadata['Age'] as a numeric Series Computes the median value of the Age column. Float (NumPy) metadata.Age.max() metadata.Age as a numeric Series Computes the max value of the Age column. Float (NumPy) metadata.OncotreeSubtype.value_counts() metadata.OncotreeSubtype as a string Series Creates a frequency table of all unique elements in the OncotreeSubtype column. Series Let’s try it out, with some nice print formatting: print("Mean value of Age column:", metadata['Age'].mean()) ## Mean value of Age column: 47.45187165775401 print("Frequency of column", metadata.OncotreeLineage.value_counts()) ## Frequency of column OncotreeLineage ## Lung 241 ## Lymphoid 209 ## CNS/Brain 123 ## Skin 118 ## Esophagus/Stomach 95 ## Breast 92 ## Bowel 87 ## Head and Neck 81 ## Myeloid 77 ## Bone 75 ## Ovary/Fallopian Tube 74 ## Pancreas 65 ## Kidney 64 ## Peripheral Nervous System 55 ## Soft Tissue 54 ## Uterus 41 ## Fibroblast 41 ## Biliary Tract 40 ## Bladder/Urinary Tract 39 ## Normal 39 ## Pleura 35 ## Liver 28 ## Cervix 25 ## Eye 19 ## Thyroid 18 ## Prostate 14 ## Vulva/Vagina 5 ## Ampulla of Vater 4 ## Testis 4 ## Adrenal Gland 1 ## Other 1 ## Name: count, dtype: int64 Notice that the outputs of some of these methods are Float (NumPy). This refers to a Python module called NumPy that is extremely popular for scientific computing, but we’re not focused on that in this course. 
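The Series methods in the table can also be tried on the small toy dataframe from earlier, which makes the results easy to check by hand (a sketch; the numbers come from the toy df, not from metadata):

```python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7]})

# Each Dataframe column is a Series with its own summary-statistic methods.
print("Mean:", df['age_case'].mean())     # (25 + 43 + 21 + 65 + 7) / 5 = 32.2
print("Max:", df.age_case.max())          # 65
print(df.status.value_counts())           # frequency table, returned as a Series
```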
3.5 Simple data visualization We will dedicate extensive time later in this course to talk about data visualization, but the Dataframe’s column, a Series, has a method called .plot() that can help us make simple plots for one variable. The .plot() method will by default make a line plot, but that is not necessarily the plot style we want, so we can give the optional argument kind a String value to specify the plot style. We use it for making a histogram or bar plot. Plot style Useful for kind = Code Histogram Numerics “hist” metadata.Age.plot(kind = \"hist\") Bar plot Strings “bar” metadata.OncotreeSubtype.value_counts().plot(kind = \"bar\") Let’s look at a histogram: import matplotlib.pyplot as plt plt.figure() metadata.Age.plot(kind = "hist") plt.show() Let’s look at a bar plot: plt.figure() metadata.OncotreeLineage.value_counts().plot(kind = "bar") plt.show() (The plt.figure() and plt.show() functions are used to render the plots on the website, but you don’t need to use them for your exercises. We will discuss this in more detail during our week on data visualization.) 3.5.0.1 Chained function calls Let’s look at our bar plot syntax more carefully. We start with the column metadata.OncotreeLineage, and then we first use the method .value_counts() to get a Series of a frequency table. Then, we take the frequency table Series and use the .plot() method. It is quite common in Python to have multiple “chained” function calls, in which the output of .value_counts() is used as the input of .plot(), all in one line of code. It takes a bit of time to get used to this! Here’s another example of a chained function call, which looks quite complex, but let’s break it down: plt.figure() metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar") plt.show() We first take the entire metadata and do some subsetting, which outputs a Dataframe. We access the OncotreeLineage column, which outputs a Series. 
We use the method .value_counts(), which outputs a Series. We make a plot out of it! We could have, alternatively, done this in several lines of code: plt.figure() metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ] metadata_subset_lineage = metadata_subset.OncotreeLineage lineage_freq = metadata_subset_lineage.value_counts() lineage_freq.plot(kind = "bar") plt.show() These are two different styles of code, but they do the exact same thing. It’s up to you to decide which is easier for you to understand. 3.6 Exercises Exercise for week 3 can be found here. "],["data-wrangling-part-2.html", "Chapter 4 Data Wrangling, Part 2 4.1 Creating new columns 4.2 Merging two Dataframes together 4.3 Grouping and summarizing Dataframes 4.4 Exercises", " Chapter 4 Data Wrangling, Part 2 We will continue to learn about data analysis with Dataframes. Let’s load our three Dataframes from the Depmap project in again: import pandas as pd import numpy as np metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") 4.1 Creating new columns Often, we want to perform some kind of transformation on our data’s columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column on a different scale. To create a new column, you simply assign to it as if it already exists using the bracket operation [ ], and the column will be created: metadata['AgePlusTen'] = metadata['Age'] + 10 expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp'] expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp']) where np.log(x) is a function imported from the module NumPy that takes in a numeric value and returns its log-transformed value. Note: you cannot create a new column by referring to an attribute of the Dataframe, such as: expression.KRAS_Exp_log = np.log(expression.KRAS_Exp).
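Bracket-assignment column creation can be sketched on a toy DataFrame; the column names echo the course's expression data, but the values below are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical expression values standing in for the course's expression DataFrame
expr = pd.DataFrame({
    "KRAS_Exp": [2.0, 4.0],
    "NRAS_Exp": [1.0, 1.0],
})

# Bracket assignment creates the column in place
expr["KRAS_NRAS_exp"] = expr["KRAS_Exp"] + expr["NRAS_Exp"]
expr["log_KRAS_Exp"] = np.log(expr["KRAS_Exp"])

print(expr)
```

After running this, expr has two new columns alongside the originals; assigning via the dot attribute (expr.new_col = ...) would not create a column.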
4.2 Merging two Dataframes together Suppose we have the following Dataframes: expression ModelID PIK3CA_Exp log_PIK3CA_Exp “ACH-001113” 5.138733 1.636806 “ACH-001289” 3.184280 1.158226 “ACH-001339” 3.165108 1.152187 metadata ModelID OncotreeLineage Age “ACH-001113” “Lung” 69 “ACH-001289” “CNS/Brain” NaN “ACH-001339” “Skin” 14 Suppose that I want to compare the relationship between OncotreeLineage and PIK3CA_Exp, but they are columns in different Dataframes. We want a new Dataframe that looks like this: ModelID PIK3CA_Exp log_PIK3CA_Exp OncotreeLineage Age “ACH-001113” 5.138733 1.636806 “Lung” 69 “ACH-001289” 3.184280 1.158226 “CNS/Brain” NaN “ACH-001339” 3.165108 1.152187 “Skin” 14 We see that in both Dataframes, the rows (observations) represent cell lines, and there is a common column, ModelID, with shared values between the two Dataframes that can facilitate the merging process. We call this an index. We will use the method .merge() for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two Dataframes and merges based on that index.
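The merge behavior can be sketched with two toy Dataframes whose index columns only partially overlap (the ModelID values below are made up); comparing shapes before and after merging shows which rows survive under each join style:

```python
import pandas as pd

# Two toy DataFrames with partially overlapping ModelID values (made-up IDs)
left = pd.DataFrame({"ModelID": ["A", "B", "C"], "Age": [10, 20, 30]})
right = pd.DataFrame({"ModelID": ["B", "C", "D"], "Exp": [1.0, 2.0, 3.0]})

inner = left.merge(right, on="ModelID")                  # default: rows in both (B, C)
outer = left.merge(right, on="ModelID", how="outer")     # rows in either (A, B, C, D)
left_join = left.merge(right, on="ModelID", how="left")  # all rows of left (A, B, C)

print(inner.shape, outer.shape, left_join.shape)
```

Checking .shape like this before and after merging is the quick sanity check described in this section.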
merged = metadata.merge(expression) It’s usually better to specify the index column to avoid ambiguity, using the on optional argument: merged = metadata.merge(expression, on='ModelID') If the index columns of the two Dataframes are named differently, you can specify the column name for each Dataframe: merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID') One of the most important checks you should do when merging Dataframes is to look at the number of rows and columns before and after merging to see whether it makes sense or not: The number of rows and columns of metadata: metadata.shape ## (1864, 31) The number of rows and columns of expression: expression.shape ## (1450, 538) The number of rows and columns of merged: merged.shape ## (1450, 568) We see that the number of columns in merged combines the number of columns in metadata and expression, while the number of rows in merged is the smaller of the number of rows in metadata and expression: it only keeps rows that are found in both Dataframes’ index columns. This kind of join is called an “inner join”, because in the Venn Diagram of elements common to both index columns, we keep the inner overlap: You can specify the join style by changing the optional input argument how. how = \"outer\" keeps all observations - also known as a “full join” how = \"left\" keeps all observations in the left Dataframe. how = \"right\" keeps all observations in the right Dataframe. how = \"inner\" keeps observations common to both Dataframes. This is the default value of how. 4.3 Grouping and summarizing Dataframes In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in metadata, the observation is cell lines, and perhaps we want to group cell lines into their respective cancer type, OncotreeLineage, and look at the mean age for each cancer type.
We want to take metadata: ModelID OncotreeLineage Age “ACH-001113” “Lung” 69 “ACH-001289” “Lung” 23 “ACH-001339” “Skin” 14 “ACH-002342” “Brain” 23 “ACH-004854” “Brain” 56 “ACH-002921” “Brain” 67 into: OncotreeLineage MeanAge “Lung” 46 “Skin” 14 “Brain” 48.67 To get there, we need to: Group the data based on some criteria, such as the elements of OncotreeLineage Summarize each group via a summary statistic performed on a column, such as Age. We first subset the two columns we need, and then use the methods .groupby(x) and .mean(). metadata_grouped = metadata.groupby("OncotreeLineage") metadata_grouped['Age'].mean() ## OncotreeLineage ## Adrenal Gland 55.000000 ## Ampulla of Vater 65.500000 ## Biliary Tract 58.450000 ## Bladder/Urinary Tract 65.166667 ## Bone 20.854545 ## Bowel 58.611111 ## Breast 50.961039 ## CNS/Brain 43.849057 ## Cervix 47.136364 ## Esophagus/Stomach 57.855556 ## Eye 51.100000 ## Fibroblast 38.194444 ## Head and Neck 60.149254 ## Kidney 46.193548 ## Liver 43.928571 ## Lung 55.444444 ## Lymphoid 38.916667 ## Myeloid 38.810811 ## Normal 52.370370 ## Other 46.000000 ## Ovary/Fallopian Tube 51.980769 ## Pancreas 60.226415 ## Peripheral Nervous System 5.480000 ## Pleura 61.000000 ## Prostate 61.666667 ## Skin 49.033708 ## Soft Tissue 27.500000 ## Testis 25.000000 ## Thyroid 63.235294 ## Uterus 62.060606 ## Vulva/Vagina 75.400000 ## Name: Age, dtype: float64 Here’s what’s going on: We use the Dataframe method .groupby(x) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the metadata Dataframe, but it makes a note that it’s been grouped. We subset to the column Age. The grouping information still persists (this is a Grouped Series object). We use the method .mean() to calculate the mean value of Age within each group defined by OncotreeLineage.
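The group-then-summarize pattern can be sketched on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy stand-in for metadata (hypothetical values)
toy = pd.DataFrame({
    "OncotreeLineage": ["Lung", "Lung", "Skin"],
    "Age": [69, 23, 14],
})

# Group by lineage, then summarize the Age column within each group
mean_age_by_lineage = toy.groupby("OncotreeLineage")["Age"].mean()
count_by_lineage = toy.groupby("OncotreeLineage")["Age"].count()

print(mean_age_by_lineage)
print(count_by_lineage)
```

Here "Lung" has two cell lines averaging (69 + 23) / 2 = 46, and "Skin" has one cell line of age 14.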
Alternatively, this could have been done in a chain of methods: metadata.groupby("OncotreeLineage")["Age"].mean() ## OncotreeLineage ## Adrenal Gland 55.000000 ## Ampulla of Vater 65.500000 ## Biliary Tract 58.450000 ## Bladder/Urinary Tract 65.166667 ## Bone 20.854545 ## Bowel 58.611111 ## Breast 50.961039 ## CNS/Brain 43.849057 ## Cervix 47.136364 ## Esophagus/Stomach 57.855556 ## Eye 51.100000 ## Fibroblast 38.194444 ## Head and Neck 60.149254 ## Kidney 46.193548 ## Liver 43.928571 ## Lung 55.444444 ## Lymphoid 38.916667 ## Myeloid 38.810811 ## Normal 52.370370 ## Other 46.000000 ## Ovary/Fallopian Tube 51.980769 ## Pancreas 60.226415 ## Peripheral Nervous System 5.480000 ## Pleura 61.000000 ## Prostate 61.666667 ## Skin 49.033708 ## Soft Tissue 27.500000 ## Testis 25.000000 ## Thyroid 63.235294 ## Uterus 62.060606 ## Vulva/Vagina 75.400000 ## Name: Age, dtype: float64 Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned from last week, such as .mean(), .median(), .max(), can be used. One new summary statistics method that is useful for this grouping and summarizing analysis is .count() which tells you how many entries are counted within each group. 4.3.1 Optional: Multiple grouping, Multiple columns, Multiple summary statistics Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously. For example, you may want to group by a combination of OncotreeLineage and AgeCategory, such as “Lung” and “Adult” as one grouping. You can do so like this: metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"]) metadata_grouped['Age'].mean() ## OncotreeLineage AgeCategory ## Adrenal Gland Adult 55.000000 ## Ampulla of Vater Adult 65.500000 ## Unknown NaN ## Biliary Tract Adult 58.450000 ## Unknown NaN ## ... 
## Thyroid Unknown NaN ## Uterus Adult 62.060606 ## Fetus NaN ## Unknown NaN ## Vulva/Vagina Adult 75.400000 ## Name: Age, Length: 72, dtype: float64 You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the .agg(x) method on a Grouped Dataframe. For example, coming back to our age case-control Dataframe, df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 We group by status and summarize age_case and age_control with a few summary statistics each: df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]}) ## age_case age_control ## mean min max mean ## status ## discharged 65.0 25 25 25.0 ## treated 16.0 32 49 40.5 ## untreated 32.0 20 32 26.0 The input argument to the .agg(x) method is called a Dictionary, which lets you structure information in a paired relationship. You can learn more about dictionaries here. 4.4 Exercises Exercise for week 4 can be found here. "],["data-visualization.html", "Chapter 5 Data Visualization 5.1 Distributions (one variable) 5.2 Relational (between 2 continuous variables) 5.3 Categorical (between 1 categorical and 1 continuous variable) 5.4 Basic plot customization 5.5 Exercises", " Chapter 5 Data Visualization In our second-to-last week together, we learn about how to visualize our data. There are several different data visualization modules in Python: matplotlib is a general purpose plotting module that is commonly used. seaborn is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course.
plotnine is a plotting module based on the grammar of graphics organization of making plots. This is very similar to the R package “ggplot”. To get started, we will consider these most simple and common plots: Distributions (one variable) Histograms Relational (between 2 continuous variables) Scatterplots Line plots Categorical (between 1 categorical and 1 continuous variable) Bar plots Violin plots Why do we focus on these common plots? Our eyes are better at distinguishing certain visual features than others. All of these plots use position to depict data, which gives us the most effective visual encoding. Let’s load in our genomics datasets and start making some plots from them. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") 5.1 Distributions (one variable) To create a histogram, we use the function sns.displot() and we specify the input argument data as our dataframe, and the input argument x as the column name in a String. plot = sns.displot(data=metadata, x="Age") (For the webpage’s purposes, we assign the plot to a variable plot. In practice, you don’t need to do that. You can just write sns.displot(data=metadata, x=\"Age\").) A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the binwidth argument, or the number of bins via the bins argument. plot = sns.displot(data=metadata, x="Age", binwidth = 10) Our histogram also works for categorical variables, such as “Sex”. plot = sns.displot(data=metadata, x="Sex") Conditioning on other variables Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age when compared with sex? There are several ways of doing it.
First, you could color the bars by group, using the hue input argument: plot = sns.displot(data=metadata, x="Age", hue="Sex") It is rather hard to tell the groups apart from the coloring. So, we add a new option to separate each bar category via the multiple=\"dodge\" input argument: plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge") Lastly, as an alternative to using colors to display the conditional variable, we could make a subplot for each of the conditional variable’s values via col=\"Sex\" or row=\"Sex\": plot = sns.displot(data=metadata, x="Age", col="Sex") You can find a lot more details about distributions and histograms in the Seaborn tutorial. 5.2 Relational (between 2 continuous variables) To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function sns.relplot() and we specify the input argument data as our dataframe, and the input arguments x and y as the column names in a String: plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") To condition on other variables, plotting features are used to distinguish conditional variable values: hue (similar to the histogram) style size Let’s merge expression and metadata together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status. Here is the scatterplot with different colors: expression_metadata = expression.merge(metadata) plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis") Here is the scatterplot with different shapes: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis") You can also try plotting with size=\"PrimaryOrMetastasis\" if you like.
None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis") You can also condition on multiple variables by assigning a different variable to each conditioning option: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory") You can find a lot more details about relational plots such as scatterplots and lineplots in the Seaborn tutorial. 5.3 Categorical (between 1 categorical and 1 continuous variable) A very similar pattern follows for categorical plots. We start with sns.catplot() as our main plotting function, with the basic input arguments: data x y You can change the plot styles via the input arguments: kind: “strip”, “box”, “swarm”, etc. You can add additional conditional variables via the input arguments: hue col row See categorical plots in the Seaborn tutorial. 5.4 Basic plot customization You can easily change the axis labels and title if you modify the plot object, using the method .set(): exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship") You can change the color palette by adding the palette input argument to any of the plots. You can explore available color palettes here: plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow') ) ## <string>:1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended. 5.5 Exercises Exercise for week 5 can be found here. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.
Credits Names Pedagogy Lead Content Instructor(s) FirstName LastName Lecturer(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved Delivered the course in some way - video or audio Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R 
version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-09-26 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## askpass 1.2.0 2023-09-03 [1] RSPM (R 4.3.0) ## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fansi 1.0.6 2023-12-08 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## hms 1.1.3 2023-03-21 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## httr 1.4.7 2023-08-15 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## openssl 2.1.1 2023-09-25 [1] RSPM (R 4.3.0) ## ottrpal 1.2.1 2024-06-11 [1] Github (jhudsl/ottrpal@828539f) ## pillar 1.9.0 2023-03-22 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) 
## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## readr 2.1.5 2024-01-10 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) ## tzdb 0.4.0 2023-05-12 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## utf8 1.2.4 2023-10-22 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) ## xml2 1.3.6 2023-12-04 [1] RSPM (R 4.3.0) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 6 References", " Chapter 6 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. 
"]] diff --git a/docs/no_toc/working-with-data-structures.html b/docs/no_toc/working-with-data-structures.html new file mode 100644 index 0000000..9cb49e5 --- /dev/null +++ b/docs/no_toc/working-with-data-structures.html @@ -0,0 +1,603 @@ + + + + + + + Chapter 2 Working with data structures | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 2 Working with data structures

    +

    In our second lesson, we start to look at two data structures, Lists and Dataframes, that can handle a large amount of data for analysis.

    +
    +

    2.1 Lists

    +

    In the first exercise, you started to explore data structures, which store information about data types. You explored lists, which are an ordered collection of data types or data structures. Each element of a list contains a data type or another data structure.

    +

    We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive.

    +

    We create a list via the bracket [ ] operation.

    +
    staff = ["chris", "ted", "jeff"]
    +chrNum = [2, 3, 1, 2, 2]
    +mixedList = [False, False, False, "A", "B", 92]
    +
    +

    2.1.1 Subsetting lists

    +

    To access an element of a list, you use the bracket notation [ ] with the element’s “index” number - the location of the data within the list.

    +

    Here’s the tricky thing about the index number: it starts at 0!

    +

    1st element of chrNum: chrNum[0]

    +

    2nd element of chrNum: chrNum[1]

    +

    +

    5th element of chrNum: chrNum[4]

    +

    With subsetting, you can modify elements of a list or use the element of a list as part of an expression.

    +
    +
    +

    2.1.2 Subsetting multiple elements of lists

    +

    Suppose you want to access multiple elements of a list, such as accessing the first three elements of chrNum. You would use the slice operator :, which specifies:

    +
      +
    • the index number to start

    • +
    • the index number to stop, plus one.

    • +
    +

    If you want to access the first three elements of chrNum:

    +
    chrNum[0:3]
    +
    ## [2, 3, 1]
    +

    The first element’s index number is 0, the third element’s index number is 2, plus 1, which is 3.

    +

    If you want to access the second and third elements of chrNum:

    +
    chrNum[1:3]
    +
    ## [3, 1]
    +

    Another way of accessing the first 3 elements of chrNum:

    +
    chrNum[:3]
    +
    ## [2, 3, 1]
    +

    Here, the start index number was not specified. When the start or stop index is not specified, it implies that you are subsetting from the beginning of the list or to the end of the list, respectively. Here’s another example, using negative indices to take the last 3 elements of the list:

    +
    chrNum[-3:]
    +
    ## [1, 2, 2]
    +

    You can find more discussion of list slicing, including negative indices and incremental slicing, here.
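    The slicing rules above can be sketched in a few lines, reusing the chrNum list defined earlier in this chapter:

```python
chrNum = [2, 3, 1, 2, 2]

first_three = chrNum[0:3]  # start at index 0, stop before index 3
last_three = chrNum[-3:]   # negative index counts from the end
every_other = chrNum[::2]  # an optional third "step" value skips elements
```

The step form chrNum[::2] (incremental slicing) is a natural extension of the start:stop pattern described in this section.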

    +
    +
    +
    +

    2.2 Objects in Python

    +

    The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined:

    +
      +
    • What does it contain (in terms of data)?

    • +
    • What can it do (in terms of functions)?

    • +
    +

    And if it “makes sense” to us, then it is well-designed.

    +

    The list data structure we have been working with is an example of an Object. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:

    +
      +
    • Value that holds the essential data for the object.

    • +
    • Attributes that hold subset or additional data for the object.

    • +
    • Functions called Methods that operate on the object and automatically take the referenced variable as an input.

    • +
    +

    This organizing structure on an object applies to pretty much all Python data types and data structures.

    +

    Let’s see how this applies to the list:

    +
      +
    • Value: the contents of the list, such as [2, 3, 4].

    • +
    • Attributes that store additional values: Not relevant for lists.

    • +
    • Methods that can be used on the object: chrNum.count(2) counts the number of instances 2 appears as an element of chrNum.

    • +
    +

    Object methods are functions that do something with the object you are using them on. You should think of chrNum.count(2) as a function that takes in chrNum and 2 as inputs. If you want to use the count function on the list mixedList, you would use mixedList.count(x).

    +

    Here are some more examples of methods with lists:

    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Function methodWhat it takes inWhat it doesReturns
    chrNum.count(x)list chrNum, data type xCounts the number of instances x appears as an element of chrNum.Integer
    chrNum.append(x)list chrNum, data type xAppends x to the end of the chrNum.None (but chrNum is modified!)
    chrNum.sort()list chrNumSorts chrNum by ascending order.None (but chrNum is modified!)
    chrNum.reverse()list chrNumReverses the order of chrNum.None (but chrNum is modified!)
    +
    +
    +
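    A short sketch of the methods in the table above; note that .append() and .sort() return None but modify the list in place:

```python
chrNum = [2, 3, 1, 2, 2]

n_twos = chrNum.count(2)   # returns the Integer 3
result = chrNum.append(5)  # returns None, but chrNum is now [2, 3, 1, 2, 2, 5]
chrNum.sort()              # also returns None; chrNum is sorted in place
```

This "returns None but modifies the list" behavior is why you should not write chrNum = chrNum.sort() - that would replace your list with None.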

    2.3 Methods vs Functions

    +

    Methods have to take in the object of interest as an input: chrNum.count(2) automatically treats chrNum as an input. Methods are built for a specific Object type.

    +

    Functions do not have an implied input: len(chrNum) requires specifying a list as an explicit input.

    +

    Otherwise, there is no strong distinction between the two.

    +
    +
    +

    2.4 Dataframes

    +

    A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does.

    +

    The Dataframe data structure is found within a Python module called “Pandas”. A Python module is an organized collection of functions and data structures. The import statement below gives us access to the “Pandas” module via the variable pd.

    +

    To load in a Dataframe from existing spreadsheet data, we use the function pd.read_csv():

    +
    import pandas as pd
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +type(metadata)
    +
    ## <class 'pandas.core.frame.DataFrame'>
    +

    There is a similar function pd.read_excel() for loading in Excel spreadsheets.

    +

    Let’s investigate the Dataframe as an object:

    +
      +
    • What does a Dataframe contain (values, attributes)?

    • +
    • What can a Dataframe do (methods)?

    • +
    +
    +
    +

    2.5 What does a Dataframe contain?

    +

    We first take a look at the contents:

    +
    metadata
    +
    ##          ModelID  ...       OncotreeLineage
    +## 0     ACH-000001  ...  Ovary/Fallopian Tube
    +## 1     ACH-000002  ...               Myeloid
    +## 2     ACH-000003  ...                 Bowel
    +## 3     ACH-000004  ...               Myeloid
    +## 4     ACH-000005  ...               Myeloid
    +## ...          ...  ...                   ...
    +## 1859  ACH-002968  ...     Esophagus/Stomach
    +## 1860  ACH-002972  ...     Esophagus/Stomach
    +## 1861  ACH-002979  ...     Esophagus/Stomach
    +## 1862  ACH-002981  ...     Esophagus/Stomach
    +## 1863  ACH-003071  ...                  Lung
    +## 
    +## [1864 rows x 30 columns]
    +

    It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, only some of the data is shown.

    +
    metadata
    +
    ##          ModelID  ...       OncotreeLineage
    +## 0     ACH-000001  ...  Ovary/Fallopian Tube
    +## 1     ACH-000002  ...               Myeloid
    +## 2     ACH-000003  ...                 Bowel
    +## 3     ACH-000004  ...               Myeloid
    +## 4     ACH-000005  ...               Myeloid
    +## ...          ...  ...                   ...
    +## 1859  ACH-002968  ...     Esophagus/Stomach
    +## 1860  ACH-002972  ...     Esophagus/Stomach
    +## 1861  ACH-002979  ...     Esophagus/Stomach
    +## 1862  ACH-002981  ...     Esophagus/Stomach
    +## 1863  ACH-003071  ...                  Lung
    +## 
    +## [1864 rows x 30 columns]
    +

    We can look at specific columns by looking at attributes via the dot operation. We can also look at the columns via the bracket operation.

    +
    metadata.ModelID
    +
    ## 0       ACH-000001
    +## 1       ACH-000002
    +## 2       ACH-000003
    +## 3       ACH-000004
    +## 4       ACH-000005
    +##            ...    
    +## 1859    ACH-002968
    +## 1860    ACH-002972
    +## 1861    ACH-002979
    +## 1862    ACH-002981
    +## 1863    ACH-003071
    +## Name: ModelID, Length: 1864, dtype: object
    +
    metadata['ModelID']
    +
    ## 0       ACH-000001
    +## 1       ACH-000002
    +## 2       ACH-000003
    +## 3       ACH-000004
    +## 4       ACH-000005
    +##            ...    
    +## 1859    ACH-002968
    +## 1860    ACH-002972
    +## 1861    ACH-002979
    +## 1862    ACH-002981
    +## 1863    ACH-003071
    +## Name: ModelID, Length: 1864, dtype: object
    +

    The names of all columns are stored as an attribute, which can be accessed via the dot operation.

```python
metadata.columns
```

```
## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age',
##        'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory',
##        'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis',
##        'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype',
##        'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments',
##        'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus',
##        'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype',
##        'OncotreePrimaryDisease', 'OncotreeLineage'],
##       dtype='object')
```

The number of rows and columns is also stored as an attribute:

```python
metadata.shape
```

```
## (1864, 30)
```

    2.6 What can a Dataframe do?


    We can use the .head() and .tail() methods to look at the first few rows and last few rows of metadata, respectively:

```python
metadata.head()
```

```
##       ModelID  PatientID  ...     OncotreePrimaryDisease       OncotreeLineage
## 0  ACH-000001  PT-gj46wT  ...   Ovarian Epithelial Tumor  Ovary/Fallopian Tube
## 1  ACH-000002  PT-5qa3uk  ...     Acute Myeloid Leukemia               Myeloid
## 2  ACH-000003  PT-puKIyc  ...  Colorectal Adenocarcinoma                 Bowel
## 3  ACH-000004  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
## 4  ACH-000005  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
## 
## [5 rows x 30 columns]
```

```python
metadata.tail()
```

```
##          ModelID  PatientID  ...          OncotreePrimaryDisease    OncotreeLineage
## 1859  ACH-002968  PT-pjhrsc  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1860  ACH-002972  PT-dkXZB1  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1861  ACH-002979  PT-lyHTzo  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1862  ACH-002981  PT-Z9akXf  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1863  ACH-003071  PT-LAGmLq  ...       Lung Neuroendocrine Tumor               Lung
## 
## [5 rows x 30 columns]
```

Both of these functions (without input arguments) are considered methods: they are functions that do something with the Dataframe they are called on. You should think of metadata.head() as a function that takes in metadata as an input. If we had another Dataframe called my_data and wanted to use the same function, we would write my_data.head().
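One way to see this equivalence is that a method call can also be written as an ordinary function call with the Dataframe passed in as the first argument; a sketch using a toy my_data dataframe:

```python
import pandas as pd

# A toy dataframe standing in for metadata
my_data = pd.DataFrame({'x': range(10)})

via_method = my_data.head()                # the method call...
via_function = pd.DataFrame.head(my_data)  # ...written as a plain function call
print(via_method.equals(via_function))     # True
```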


    2.7 Subsetting Dataframes


Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists.


You will use the iloc attribute and bracket operation, and you give it two slices: one for the rows, and one for the columns.


    Let’s start with a small dataframe to see how it works before returning to metadata:

    +
    df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
    +                            'age_case': [25, 43, 21, 65, 7],
    +                            'age_control': [49, 20, 32, 25, 32]})
    +df
    +
    ##        status  age_case  age_control
    +## 0     treated        25           49
    +## 1   untreated        43           20
    +## 2   untreated        21           32
    +## 3  discharged        65           25
    +## 4     treated         7           32
    +

Here is what the dataframe looks like with the row and column index numbers:

Subset the first four rows, and the first two columns:
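The code chunk for this subset did not survive extraction; a sketch of what it presumably looks like, recreating the df from above so the chunk is self-contained:

```python
import pandas as pd

# Recreating df from above so this chunk is self-contained
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

subset = df.iloc[:4, :2]  # rows 0-3, columns 'status' and 'age_case'
print(subset)
```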

Now, back to the metadata dataframe:

    Subset the first 5 rows, and first two columns:

```python
metadata.iloc[:5, :2]
```

```
##       ModelID  PatientID
## 0  ACH-000001  PT-gj46wT
## 1  ACH-000002  PT-5qa3uk
## 2  ACH-000003  PT-puKIyc
## 3  ACH-000004  PT-q4K2cp
## 4  ACH-000005  PT-q4K2cp
```

If we want a custom slice that is not sequential, we can use an integer list. Subset everything except the first 5 rows, and the columns at index 1, 10, and 21:

```python
metadata.iloc[5:, [1, 10, 21]]
```

```
##       PatientID GrowthPattern  WTSIMasterCellID
## 5     PT-ej13Dz    Suspension            2167.0
## 6     PT-NOXwpH      Adherent             569.0
## 7     PT-fp8PeY      Adherent            1806.0
## 8     PT-puKIyc      Adherent            2104.0
## 9     PT-AR7W9o      Adherent               NaN
## ...         ...           ...               ...
## 1859  PT-pjhrsc      Organoid               NaN
## 1860  PT-dkXZB1      Organoid               NaN
## 1861  PT-lyHTzo      Organoid               NaN
## 1862  PT-Z9akXf      Organoid               NaN
## 1863  PT-LAGmLq    Suspension               NaN
## 
## [1859 rows x 3 columns]
```

When we subset via numerical indices, it's called explicit subsetting. This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the columns. Then your code that subsets rows and columns by position will give you a different answer once the spreadsheet is changed.
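A small sketch of this fragility: the same positional code gives a different answer after the columns are reordered.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
before = df.iloc[:, 0].name        # 'a'

reordered = df[['b', 'a']]         # a collaborator reorders the columns
after = reordered.iloc[:, 0].name  # 'b': same code, different column
print(before, after)
```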


    The second way is to subset by the column name and comparison operators, also known as implicit subsetting. This is much more robust in data analysis practice. You will learn about it next week!
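As a brief preview of what's to come (a sketch, reusing the toy df from above, recreated here so it runs on its own), implicit subsetting selects rows by a condition on a named column rather than by position:

```python
import pandas as pd

# Recreating the toy df from above so this chunk is self-contained
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

treated = df[df['status'] == "treated"]  # rows where status is "treated"
print(treated)
```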


    2.8 Exercises


The exercise for week 2 can be found here.

Describe how the Python programming environment interpret complex expressions made out of functions, operations, and data structures, in a step-by-step way. Apply problem solving strategies to debug broken code. 0.4 Offerings This course is taught on a regular basis at Fred Hutch Cancer Center through the Data Science Lab. Announcements of course offering can be found here. "],["intro-to-computing.html", "Chapter 1 Intro to Computing 1.1 Goals of the course 1.2 What is a computer program? 1.3 A programming language has following elements: 1.4 Google Colab Setup 1.5 Grammar Structure 1: Evaluation of Expressions 1.6 Grammar Structure 2: Storing data types in the Variable Environment 1.7 Grammar Structure 3: Evaluation of Functions 1.8 Tips on writing your first code", " Chapter 1 Intro to Computing Welcome to Introduction to Python! Each week, we cover a chapter, which consists of a lesson and exercise. In our first week together, we will look at big conceptual themes in programming, see how code is run, and learn some basic grammar structures of programming. 1.1 Goals of the course In the next 6 weeks, we will explore: Fundamental concepts in high-level programming languages (Python, R, Julia, etc.) that is transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow. Image source: R for Data Science. Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 1.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for Python Interpreter in this class. 
More importantly: How we organize ideas <-> Instructing a computer to do something. 1.3 A programming language has following elements: Grammar structure to construct expressions; combining expressions to create more complex expressions Encapsulate complex expressions via functions to create modular and reusable tasks Encapsulate complex data via data structures to allow efficient manipulation of data 1.4 Google Colab Setup Google Colab is a Integrated Development Environment (IDE) on a web browser. Think about it as Microsoft Word to a plain text editor. It provides extra bells and whistles to using Python that is easier for the user. Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, open up… Today, we will pay close attention to: Python Console (Execution): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you. Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us understand better the code we are writing. Variable Environment: Open it by clicking on the “{x}” button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout rest of your Python code. The first thing we will do is see the different ways we can run Python code. You can do the following: Type something into the Python Console (Execution) and type enter, such as 2+2. 
The Python Console will run it and give you an output. Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data. Run every single Python code chunk via Runtime -> Run all. Remember that the order that you run your code matters in programming. Your final product would be the result of Option 3, in which you run every Python code chunk from start to finish. However, sometimes it is nice to try out smaller parts of your code via Options 1 or 2. But you will be at risk of running your code out of order! To create your own content in the notebook, click on a section you want to insert content, and then click on “+ Code” or “+ Text” to add Python code or text, respectively. Python Notebook is great for data science work, because: It encourages reproducible data analysis, when you run your analysis from start to finish. It encourages excellent documentation, as you can have code, output from code, and prose combined together. It is flexible to use other programming languages, such as R. Now, we will get to the basics of programming grammar. 1.5 Grammar Structure 1: Evaluation of Expressions Expressions are be built out of operations or functions. Functions and operations take in data types, do something with them, and return another data type. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. For instance, consider the following expressions entered to the Python Console: 18 + 21 ## 39 max(18, 21) ## 21 max(18 + 21, 65) ## 65 18 + (21 + 65) ## 104 len("ATCG") ## 4 Here, our input data types to the operation are integer in lines 1-4 and our input data type to the function is string in line 5. We will go over common data types shortly. Operations are just functions in hiding. 
We could have written: from operator import add add(18, 21) ## 39 add(18, add(21, 65)) ## 104 Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to come up with operations that is easier to read. (Because the add() function isn’t typically used, it is not automatically available, so we used the import statement to load it in.) 1.5.1 Data types Here are some common data types we will be using in this course. Data type name Data type shorthand Examples Integer int 2, 4 Float float 3.5, -34.1009 String str “hello”, “234-234-8594” Boolean bool True, False A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class: Function machine from algebra class. Here are some aspects of this schema to pay attention to: A programmer should not need to know how the function is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language. A function can have different kinds of inputs and outputs - it doesn’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs. 1.6 Grammar Structure 2: Storing data types in the Variable Environment To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Variable Environment, the variable x has a value of 39. 1.6.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind variable to the left of = to the resulting value. The variable is stored in the Variable Environment. 
The Variable Environment is where all the variables are stored, and can be used for an expression anytime once it is defined. Only one unique variable name can be defined. The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, 32 Gigabytes of RAM. When we work with large datasets, if you assign a variable to a data type larger than the available RAM, it will not work. More on this later. Look, now x can be reused downstream: x - 2 ## 37 y = x * 2 It is quite common for programmers to not know what data type a variable is while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python: type(y) ## <class 'int'> We should give useful variable names so that we know what to expect! Consider num_sales instead of y. 1.7 Grammar Structure 3: Evaluation of Functions Let’s look at functions a little bit more formally: A function has a function name, arguments, and returns a data type. 1.7.1 Execution rule for functions: Evaluate the function by its arguments, and if the arguments are functions or contains operations, evaluate those functions or operations first. The output of functions is called the returned value. Often, we will use multiple functions, in a nested way, or use parenthesis to change the order of operation. Being able to read nested operations, nested functions, and parenthesis is very important. Think about what the Python is going to do step-by–step in the line of code below: (len("hello") + 4) * 2 ## 18 If we don’t know how to use a function, such as pow() we can ask for help: ?pow pow(base, exp, mod=None) Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form. 
This shows the function takes in three input arguments: base, exp, and mod=None. When an argument has an assigned value of mod=None, that means the input argument already has a value, and you don’t need to specify anything, unless you want to. The following ways are equivalent ways of using the pow() function: pow(2, 3) ## 8 pow(base=2, exp=3) ## 8 pow(exp=3, base=2) ## 8 but this will give you something different: pow(3, 2) ## 9 And there is an operational equivalent: 2 ** 3 ## 8 1.8 Tips on writing your first code Computer = powerful + stupid Even the smallest spelling and formatting changes will cause unexpected output and errors! Write incrementally, test often Check your assumptions, especially using new functions, operations, and new data types. Live environments are great for testing, but not great for reproducibility. Ask for help! To get more familiar with the errors Python gives you, take a look at this summary of Python error messages. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     
Credits Names Pedagogy Lead Content Instructor(s) FirstName LastName Lecturer(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved Delivered the course in some way - video or audio Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R 
version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-08-07 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) ## sass 0.4.8 2023-12-06 [1] RSPM (R 
4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 2 References", " Chapter 2 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "Introduction to Python About this Course 0.1 Curriculum 0.2 Target Audience 0.3 Learning Objectives 0.4 Offerings", " Introduction to Python September, 2024 About this Course 0.1 Curriculum The course covers fundamentals of Python, a high-level programming language, and use it to wrangle data for analysis and visualization. 0.2 Target Audience The course is intended for researchers who want to learn coding for the first time with a data science application via the Python language. This course is also appropriate for folks who have explored data science or programming on their own and want to focus on some fundamentals. 0.3 Learning Objectives Analyze Tidy datasets in the Python programming language via data subsetting, joining, and transformations. Evaluate summary statistics and data visualization to understand scientific questions. Describe how the Python programming environment interpret complex expressions made out of functions, operations, and data structures, in a step-by-step way. 
Apply problem solving strategies to debug broken code. 0.4 Offerings This course is taught on a regular basis at Fred Hutch Cancer Center through the Data Science Lab. Announcements of course offerings can be found here. "],["intro-to-computing.html", "Chapter 1 Intro to Computing 1.1 Goals of the course 1.2 What is a computer program? 1.3 A programming language has the following elements: 1.4 Google Colab Setup 1.5 Grammar Structure 1: Evaluation of Expressions 1.6 Grammar Structure 2: Storing data types in the Variable Environment 1.7 Grammar Structure 3: Evaluation of Functions 1.8 Tips on writing your first code 1.9 Exercises", " Chapter 1 Intro to Computing Welcome to Introduction to Python! Each week, we cover a chapter, which consists of a lesson and exercise. In our first week together, we will look at big conceptual themes in programming, see how code is run, and learn some basic grammar structures of programming. 1.1 Goals of the course In the next 6 weeks, we will explore: Fundamental concepts in high-level programming languages (Python, R, Julia, etc.) that are transferable: How do programs run, and how do we solve problems using functions and data structures? Beginning of data science fundamentals: How do you translate your scientific question to a data wrangling problem and answer it? Data science workflow. Image source: R for Data Science. Find a nice balance between the two throughout the course: we will try to reproduce a figure from a scientific publication using new data. 1.2 What is a computer program? A sequence of instructions to manipulate data for the computer to execute. A series of translations: English <-> Programming Code for Interpreter <-> Machine Code for Central Processing Unit (CPU) We will focus on English <-> Programming Code for Python Interpreter in this class. More importantly: How we organize ideas <-> Instructing a computer to do something. 
1.3 A programming language has the following elements: Grammar structure to construct expressions; combining expressions to create more complex expressions Encapsulate complex expressions via functions to create modular and reusable tasks Encapsulate complex data via data structures to allow efficient manipulation of data 1.4 Google Colab Setup Google Colab is an Integrated Development Environment (IDE) that runs in a web browser. Think of it as Microsoft Word compared to a plain text editor. It provides extra bells and whistles that make using Python easier. Let’s open up the KRAS analysis in Google Colab. If you are taking this course while it is in session, the project name is probably named “KRAS Demo” in your Google Classroom workspace. If you are taking this course on your own time, you can view it here. Today, we will pay close attention to: Python Console (“Executions”): Open it via View -> Executed code history. You give it one line of Python code, and the console executes that single line of code; you give it a single piece of instruction, and it executes it for you. Notebook: in the central panel of the website, you will see Python code interspersed with word document text. This is called a Python Notebook (other similar services include Jupyter Notebook, iPython Notebook), which has chunks of plain text and Python code, and it helps us better understand the code we are writing. Variable Environment: Open it by clicking on the “{x}” button on the left-hand panel. Often, your code will store information in the Variable Environment, so that information can be reused. For instance, we often load in data and store it in the Variable Environment, and use it throughout the rest of your Python code. The first thing we will do is see the different ways we can run Python code. You can do the following: Type something, such as 2+2, into the Python Console (Execution) and click the arrow button. The Python Console will run it and give you an output. 
Look through the Python Notebook, and when you see a chunk of Python Code, click the arrow button. It will copy the Python code chunk to the Python Console and run all of it. You will likely see variables created in the Variables panel as you load in and manipulate data. Run every single Python code chunk via Runtime -> Run all. Remember that the order in which you run your code matters in programming. Your final product would be the result of Option 3, in which you run every Python code chunk from start to finish. However, sometimes it is nice to try out smaller parts of your code via Options 1 or 2. But you will be at risk of running your code out of order! To create your own content in the notebook, click on the section where you want to insert content, and then click on “+ Code” or “+ Text” to add Python code or text, respectively. Python Notebook is great for data science work, because: It encourages reproducible data analysis, when you run your analysis from start to finish. It encourages excellent documentation, as you can have code, output from code, and prose combined together. It is flexible enough to use other programming languages, such as R. The version of Python used in this course and in Google Colab is Python 3, which is the version of Python that is most supported. Some Python software is written in Python 2, which is very similar but has some notable differences. Now, we will get to the basics of programming grammar. 1.5 Grammar Structure 1: Evaluation of Expressions Expressions are built out of operations or functions. Functions and operations take in data types as inputs, do something with them, and return another data type as output. We can combine multiple expressions together to form more complex expressions: an expression can have other expressions nested inside it. 
For instance, consider the following expressions entered into the Python Console: 18 + 21 ## 39 max(18, 21) ## 21 max(18 + 21, 65) ## 65 18 + (21 + 65) ## 104 len("ATCG") ## 4 Here, our input data types to the operations are integers in lines 1-4 and our input data type to the function is a string in line 5. We will go over common data types shortly. Operations are just functions in hiding. We could have written: from operator import add add(18, 21) ## 39 add(18, add(21, 65)) ## 104 Remember that the Python language is supposed to help us understand what we are writing in code easily, lending to readable code. Therefore, it is sometimes useful to use operations that are easier to read. (Most functions in Python are stored in a collection of functions called modules that need to be loaded. The import statement gives us permission to access the functions in the module “operator”.) 1.5.1 Function machine schema A nice way to summarize this first grammar structure is using the function machine schema, way back from algebra class: Function machine from algebra class. Here are some aspects of this schema to pay attention to: A programmer should not need to know how the function or operation is implemented in order to use it - this emphasizes abstraction and modular thinking, a foundation in any programming language. A function can have different kinds of inputs and outputs - they don’t need to be numbers. In the len() function, the input is a String, and the output is an Integer. We will see increasingly complex functions with all sorts of different inputs and outputs. 1.5.2 Data types Here are some common data types we will be using in this course. 
Data type name Data type shorthand Examples Integer int 2, 4 Float float 3.5, -34.1009 String str “hello”, “234-234-8594” Boolean bool True, False 1.6 Grammar Structure 2: Storing data types in the Variable Environment To build up a computer program, we need to store our returned data type from our expression somewhere for downstream use. We can assign a variable to it as follows: x = 18 + 21 If you enter this in the Console, you will see that in the Variable Environment, the variable x has a value of 39. 1.6.1 Execution rule for variable assignment Evaluate the expression to the right of =. Bind the variable to the left of = to the resulting value. The variable is stored in the Variable Environment. The Variable Environment is where all the variables are stored, and any variable can be used in an expression once it is defined. Each variable name must be unique. The variable is stored in the working memory of your computer, Random Access Memory (RAM). This is temporary memory storage on the computer that can be accessed quickly. Typically a personal computer has 8, 16, or 32 Gigabytes of RAM. Look, now x can be reused downstream: x - 2 ## 37 y = x * 2 It is quite common for programmers to have to look up the data type of a variable while they are coding. To learn about the data type of a variable, use the type() function on any variable in Python: type(y) ## <class 'int'> We should give useful variable names so that we know what to expect! If you are working with numerical sales data, consider num_sales instead of y. 1.7 Grammar Structure 3: Evaluation of Functions Let’s look at functions a little bit more formally: A function has a function name, arguments, and returns a data type. 1.7.1 Execution rule for functions: Evaluate the function’s arguments, if there are any; if the arguments are themselves functions or contain operations, evaluate those functions or operations first. The output of a function is called the returned value. 
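As a small sketch of this execution rule (this example is ours, using only Python built-ins, not functions from the lesson):

```python
# The arguments are evaluated first, then the function runs on the results.
# round(3.14159, 1 + 1): the operation 1 + 1 evaluates to 2,
# so this is the same as round(3.14159, 2).
print(round(3.14159, 1 + 1))  # 3.14

# abs(-7) evaluates to 7 first; that returned value becomes max()'s argument.
print(max(abs(-7), 5))  # 7
```

In both cases the inner expression is fully evaluated before the outer function ever runs.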
Often, we will use multiple functions in a nested way, and it is important to understand how the Python console understands the order of operations. We can also use parentheses to change the order of operations. Think about what Python is going to do step-by-step in the lines of code below: max(len("hello"), 4) ## 5 (len("pumpkin") - 8) * 2 ## -2 If we don’t know how to use a function, such as pow(), we can ask for help: ?pow pow(base, exp, mod=None) Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form. We can also find a similar help document, in a nicer rendered form online. We will practice looking at function documentation throughout the course, because that is a fundamental skill for learning more functions on your own. The documentation shows the function takes in three input arguments: base, exp, and mod=None. When an argument is written as mod=None, that means the input argument already has a default value, and you don’t need to specify anything, unless you want to. The following are equivalent ways of using the pow() function: pow(2, 3) ## 8 pow(base=2, exp=3) ## 8 pow(exp=3, base=2) ## 8 but this will give you something different: pow(3, 2) ## 9 And there is an operational equivalent: 2 ** 3 ## 8 We will mostly look at functions with input arguments and return types in this course, but not all functions need to have input arguments and return values. Let’s look at some examples of functions that don’t always have an input or output: Function call What it takes in What it does Returns pow(a, b) integer a, integer b Raises a to the bth power. Integer time.sleep(x) Integer x Waits for x seconds. None dir() Nothing Gives a list of all the variables defined in the environment. 
List 1.8 Tips on writing your first code Computer = powerful + stupid Computers are excellent at doing something specific over and over again, but are extremely rigid and lack flexibility. Here are some tips that are helpful for beginners: Write incrementally, test often. Don’t be afraid to break things: it is how we learn how things work in programming. Check your assumptions, especially when using new functions, operations, and new data types. Live environments are great for testing, but not great for reproducibility. Ask for help! To get more familiar with the errors Python gives you, take a look at this summary of Python error messages. 1.9 Exercises Exercise for week 1 can be found here. "],["working-with-data-structures.html", "Chapter 2 Working with data structures 2.1 Lists 2.2 Objects in Python 2.3 Methods vs Functions 2.4 Dataframes 2.5 What does a Dataframe contain? 2.6 What can a Dataframe do? 2.7 Subsetting Dataframes 2.8 Exercises", " Chapter 2 Working with data structures In our second lesson, we start to look at two data structures, Lists and Dataframes, that can handle a large amount of data for analysis. 2.1 Lists In the first exercise, you started to explore data structures, which store information about data types. You explored lists, which are ordered collections of data types or data structures. Each element of a list contains a data type or another data structure. We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive. We create a list via the bracket [ ] operation. staff = ["chris", "ted", "jeff"] chrNum = [2, 3, 1, 2, 2] mixedList = [False, False, False, "A", "B", 92] 2.1.1 Subsetting lists To access an element of a list, use the bracket notation [ ]. 
We simply access an element via the “index” number - the location of the data within the list. Here’s the tricky thing about the index number: it starts at 0! 1st element of chrNum: chrNum[0] 2nd element of chrNum: chrNum[1] … 5th element of chrNum: chrNum[4] With subsetting, you can modify elements of a list or use the element of a list as part of an expression. 2.1.2 Subsetting multiple elements of lists Suppose you want to access multiple elements of a list, such as accessing the first three elements of chrNum. You would use the slice operator :, which specifies: the index number to start the index number to stop, plus one. If you want to access the first three elements of chrNum: chrNum[0:3] ## [2, 3, 1] The first element’s index number is 0, the third element’s index number is 2, plus 1, which is 3. If you want to access the second and third elements of chrNum: chrNum[1:3] ## [3, 1] Another way of accessing the first 3 elements of chrNum: chrNum[:3] ## [2, 3, 1] Here, the start index number was not specified. When the start or stop index is not specified, it implies that you are subsetting starting from the beginning of the list or subsetting to the end of the list, respectively. Here’s another example, using negative indices to count 3 elements from the end of the list: chrNum[-3:] ## [1, 2, 2] You can find more discussion of list slicing, using negative indices and incremental slicing, here. 2.2 Objects in Python The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined: What does it contain (in terms of data)? What can it do (in terms of functions)? And if it “makes sense” to us, then it is well-designed. The list data structure we have been working with is an example of an Object. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? 
It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following: Value that holds the essential data for the object. Attributes that hold subset or additional data for the object. Functions, called Methods, that belong to the object and take the object itself as an input This organizing structure on an object applies to pretty much all Python data types and data structures. Let’s see how this applies to the list: Value: the contents of the list, such as [2, 3, 4]. Attributes that store additional values: Not relevant for lists. Methods that can be used on the object: chrNum.count(2) counts the number of instances 2 appears as an element of chrNum. Object methods are functions that do something with the object you are using them on. You should think about chrNum.count(2) as a function that takes in chrNum and 2 as inputs. If you want to use the count function on list mixedList, you would use mixedList.count(x). Here are some more examples of methods with lists: Function method What it takes in What it does Returns chrNum.count(x) list chrNum, data type x Counts the number of instances x appears as an element of chrNum. Integer chrNum.append(x) list chrNum, data type x Appends x to the end of chrNum. None (but chrNum is modified!) chrNum.sort() list chrNum Sorts chrNum in ascending order. None (but chrNum is modified!) chrNum.reverse() list chrNum Reverses the order of chrNum. None (but chrNum is modified!) 2.3 Methods vs Functions Methods have to take in the object of interest as an input: chrNum.count(2) automatically treats chrNum as an input. Methods are built for a specific Object type. Functions do not have an implied input: len(chrNum) requires specifying a list in the input. Otherwise, there is no strong distinction between the two. 2.4 Dataframes A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does. 
The Dataframe data structure is found within a Python module called “Pandas”. A Python module is an organized collection of functions and data structures. The import statement below gives us permission to access the “Pandas” module via the variable pd. To load in a Dataframe from existing spreadsheet data, we use the function pd.read_csv(): import pandas as pd metadata = pd.read_csv("classroom_data/metadata.csv") type(metadata) ## <class 'pandas.core.frame.DataFrame'> There is a similar function pd.read_excel() for loading in Excel spreadsheets. Let’s investigate the Dataframe as an object: What does a Dataframe contain (values, attributes)? What can a Dataframe do (methods)? 2.5 What does a Dataframe contain? We first take a look at the contents: metadata ## ModelID ... OncotreeLineage ## 0 ACH-000001 ... Ovary/Fallopian Tube ## 1 ACH-000002 ... Myeloid ## 2 ACH-000003 ... Bowel ## 3 ACH-000004 ... Myeloid ## 4 ACH-000005 ... Myeloid ## ... ... ... ... ## 1859 ACH-002968 ... Esophagus/Stomach ## 1860 ACH-002972 ... Esophagus/Stomach ## 1861 ACH-002979 ... Esophagus/Stomach ## 1862 ACH-002981 ... Esophagus/Stomach ## 1863 ACH-003071 ... Lung ## ## [1864 rows x 30 columns] It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, it shows some of the data. metadata ## ModelID ... OncotreeLineage ## 0 ACH-000001 ... Ovary/Fallopian Tube ## 1 ACH-000002 ... Myeloid ## 2 ACH-000003 ... Bowel ## 3 ACH-000004 ... Myeloid ## 4 ACH-000005 ... Myeloid ## ... ... ... ... ## 1859 ACH-002968 ... Esophagus/Stomach ## 1860 ACH-002972 ... Esophagus/Stomach ## 1861 ACH-002979 ... Esophagus/Stomach ## 1862 ACH-002981 ... Esophagus/Stomach ## 1863 ACH-003071 ... Lung ## ## [1864 rows x 30 columns] We can look at specific columns by looking at attributes via the dot operation. We can also look at the columns via the bracket operation. metadata.ModelID ## 0 ACH-000001 ## 1 ACH-000002 ## 2 ACH-000003 ## 3 ACH-000004 ## 4 ACH-000005 ## ... 
## 1859 ACH-002968 ## 1860 ACH-002972 ## 1861 ACH-002979 ## 1862 ACH-002981 ## 1863 ACH-003071 ## Name: ModelID, Length: 1864, dtype: object metadata['ModelID'] ## 0 ACH-000001 ## 1 ACH-000002 ## 2 ACH-000003 ## 3 ACH-000004 ## 4 ACH-000005 ## ... ## 1859 ACH-002968 ## 1860 ACH-002972 ## 1861 ACH-002979 ## 1862 ACH-002981 ## 1863 ACH-003071 ## Name: ModelID, Length: 1864, dtype: object The names of all columns are stored as an attribute, which can be accessed via the dot operation. metadata.columns ## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age', ## 'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory', ## 'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis', ## 'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype', ## 'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments', ## 'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus', ## 'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype', ## 'OncotreePrimaryDisease', 'OncotreeLineage'], ## dtype='object') The number of rows and columns are also stored as an attribute: metadata.shape ## (1864, 30) 2.6 What can a Dataframe do? We can use the .head() and .tail() methods to look at the first few rows and last few rows of metadata, respectively: metadata.head() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube ## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid ## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel ## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## ## [5 rows x 30 columns] metadata.tail() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 1859 ACH-002968 PT-pjhrsc ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1860 ACH-002972 PT-dkXZB1 ... 
Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1861 ACH-002979 PT-lyHTzo ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1862 ACH-002981 PT-Z9akXf ... Esophagogastric Adenocarcinoma Esophagus/Stomach ## 1863 ACH-003071 PT-LAGmLq ... Lung Neuroendocrine Tumor Lung ## ## [5 rows x 30 columns] Both of these functions (without input arguments) are considered methods: they are functions that do something with the Dataframe you are using them on. You should think about metadata.head() as a function that takes in metadata as an input. If we had another Dataframe called my_data and wanted to use the same function, you would have to say my_data.head(). 2.7 Subsetting Dataframes Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists. You will use the iloc attribute and bracket operations, and you give two slices: one for the row, and one for the column. Let’s start with a small dataframe to see how it works before returning to metadata: df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 Here is what the dataframe looks like with the row and column index numbers: Subset the first four rows, and the first two columns: Now, back to the metadata dataframe: Subset the first 5 rows, and the first two columns: metadata.iloc[:5, :2] ## ModelID PatientID ## 0 ACH-000001 PT-gj46wT ## 1 ACH-000002 PT-5qa3uk ## 2 ACH-000003 PT-puKIyc ## 3 ACH-000004 PT-q4K2cp ## 4 ACH-000005 PT-q4K2cp If we want a custom slice that is not sequential, we can use an integer list. 
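For instance, here is a small sketch on the toy df from above (our own example, redefining df so it runs on its own):

```python
import pandas as pd

# The toy dataframe from earlier in this section:
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

# A non-sequential subset: rows 0 and 3, and the columns at index
# positions 0 and 2 (status and age_control).
print(df.iloc[[0, 3], [0, 2]])
```

The integer lists can name any rows and columns in any order, unlike a slice.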
Subset from the 6th row onward, and the columns at index numbers 1, 10, and 21: metadata.iloc[5:, [1, 10, 21]] ## PatientID GrowthPattern WTSIMasterCellID ## 5 PT-ej13Dz Suspension 2167.0 ## 6 PT-NOXwpH Adherent 569.0 ## 7 PT-fp8PeY Adherent 1806.0 ## 8 PT-puKIyc Adherent 2104.0 ## 9 PT-AR7W9o Adherent NaN ## ... ... ... ... ## 1859 PT-pjhrsc Organoid NaN ## 1860 PT-dkXZB1 Organoid NaN ## 1861 PT-lyHTzo Organoid NaN ## 1862 PT-Z9akXf Organoid NaN ## 1863 PT-LAGmLq Suspension NaN ## ## [1859 rows x 3 columns] When we subset via numerical indices, it’s called explicit subsetting. This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to some inconsistencies in the long run. For instance, suppose your collaborator added a new cell line to the metadata and changed the order of the columns. Then your code to subset those rows and columns will get you a different answer once the spreadsheet is changed. The second way is to subset by the column name and comparison operators, also known as implicit subsetting. This is much more robust in data analysis practice. You will learn about it next week! 2.8 Exercises Exercise for week 2 can be found here. "],["data-wrangling-part-1.html", "Chapter 3 Data Wrangling, Part 1 3.1 Tidy Data 3.2 Our working Tidy Data: DepMap Project 3.3 Transform: “What do you want to do with this Dataframe”? 3.4 Summary Statistics 3.5 Simple data visualization 3.6 Exercises", " Chapter 3 Data Wrangling, Part 1 From our first two lessons, we are now equipped with enough fundamental programming skills to apply it to various steps in the data science workflow, which is a natural cycle that occurs in data analysis. Data science workflow. Image source: R for Data Science. For the rest of the course, we focus on Transform and Visualize with the assumption that our data is in a nice, “Tidy format”. First, we need to understand what it means for data to be “Tidy”. 
3.1 Tidy Data Here, we describe a standard of organizing data. It is important to have standards, as it facilitates a consistent way of thinking about data organization and building tools (functions) that make use of that standard. The principles of tidy data, developed by Hadley Wickham: Each variable must have its own column. Each observation must have its own row. Each value must have its own cell. If you want to be technical about what variables and observations are, Hadley Wickham describes: A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes. A tidy dataframe. Image source: R for Data Science. 3.2 Our working Tidy Data: DepMap Project The Dependency Map project is a multi-omics profiling of cancer cell lines combined with functional assays such as CRISPR and drug sensitivity to help identify cancer vulnerabilities and drug targets. Here are some of the data that we have public access to. We have been looking at the metadata since last session. Metadata Somatic mutations Gene expression Drug sensitivity CRISPR knockout and more… Let’s load these datasets in, and see how these datasets fit the definition of Tidy data: import pandas as pd metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") metadata.head() ## ModelID PatientID ... OncotreePrimaryDisease OncotreeLineage ## 0 ACH-000001 PT-gj46wT ... Ovarian Epithelial Tumor Ovary/Fallopian Tube ## 1 ACH-000002 PT-5qa3uk ... Acute Myeloid Leukemia Myeloid ## 2 ACH-000003 PT-puKIyc ... Colorectal Adenocarcinoma Bowel ## 3 ACH-000004 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## 4 ACH-000005 PT-q4K2cp ... Acute Myeloid Leukemia Myeloid ## ## [5 rows x 30 columns] mutation.head() ## ModelID CACNA1D_Mut CYP2D6_Mut ... 
CCDC28A_Mut C1orf194_Mut U2AF1_Mut ## 0 ACH-000001 False False ... False False False ## 1 ACH-000002 False False ... False False False ## 2 ACH-000004 False False ... False False False ## 3 ACH-000005 False False ... False False False ## 4 ACH-000006 False False ... False False False ## ## [5 rows x 540 columns] expression.head() ## ModelID ENPP4_Exp CREBBP_Exp ... OR5D13_Exp C2orf81_Exp OR8S1_Exp ## 0 ACH-001113 2.280956 4.094236 ... 0.0 1.726831 0.0 ## 1 ACH-001289 3.622930 3.606442 ... 0.0 0.790772 0.0 ## 2 ACH-001339 0.790772 2.970854 ... 0.0 0.575312 0.0 ## 3 ACH-001538 3.485427 2.801159 ... 0.0 1.077243 0.0 ## 4 ACH-000242 0.879706 3.327687 ... 0.0 0.722466 0.0 ## ## [5 rows x 536 columns] Dataframe The observation is Some variables are Some values are metadata Cell line ModelID, Age, OncotreeLineage “ACH-000001”, 60, “Myeloid” expression Cell line KRAS_Exp 2.4, .3 mutation Cell line KRAS_Mut True, False 3.3 Transform: “What do you want to do with this Dataframe”? Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something. Until now, we haven’t focused too much on how we organize our scientific ideas to interact with what we can do with code. Let’s pivot to write our code driven by our scientific curiosity. After we are sure that we are working with Tidy data, we can ponder how we want to transform our data in a way that satisfies our scientific question. We will look at several ways we can transform Tidy data, starting with subsetting columns and rows. Here’s a starting prompt: In the metadata dataframe, which rows and columns would you subset for that relate to a scientific question? We have been using explicit subsetting with numerical indices, such as “I want to filter for rows 20-50 and select columns 2 and 8”. 
We are now going to switch to implicit subsetting in which we describe the subsetting criteria via comparison operators and column names, such as: “I want to subset for rows such that the OncotreeLineage is lung cancer and subset for columns Age and Sex.” Notice that when we subset for rows in an implicit way, we formulate our criteria in terms of the columns. This is because we are guaranteed to have column names in Dataframes, but not row names. 3.3.0.1 Let’s convert our implicit subsetting criteria into code! To subset for rows implicitly, we will use the conditional operators on Dataframe columns you used in Exercise 2. To formulate a conditional operator expression that OncotreeLineage is lung cancer: metadata['OncotreeLineage'] == "Lung" ## 0 False ## 1 False ## 2 False ## 3 False ## 4 False ## ... ## 1859 False ## 1860 False ## 1861 False ## 1862 False ## 1863 True ## Name: OncotreeLineage, Length: 1864, dtype: bool Then, we will use the .loc operation (which is different from the .iloc operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time: metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]] ## Age Sex ## 10 39.0 Female ## 13 44.0 Male ## 19 55.0 Female ## 27 39.0 Female ## 28 45.0 Male ## ... ... ... ## 1745 52.0 Male ## 1819 84.0 Male ## 1820 57.0 Female ## 1822 53.0 Male ## 1863 62.0 Male ## ## [241 rows x 2 columns] What’s going on here? The first component of the subset, metadata['OncotreeLineage'] == "Lung", subsets for the rows. It gives us a column of True and False values, and we keep rows that correspond to True values. Then, we specify the column names we want to subset for via a list. 
Here’s another example: df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 “I want to subset for rows such that the status is “treated” and subset for columns status and age_case.” df.loc[df.status == "treated", ["status", "age_case"]] ## status age_case ## 0 treated 25 ## 4 treated 7 3.4 Summary Statistics Now that your Dataframe has been transformed based on your scientific question, you can start doing some analysis on it! A common data science task is to examine summary statistics of a dataset, which summarizes all the values from a variable in a numeric summary, such as mean, median, or mode. If we look at the data structure of a Dataframe’s column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let’s take a look at a few popular examples: Function method What it takes in What it does Returns metadata.Age.mean() metadata.Age as a numeric Series Computes the mean value of the Age column. Float (NumPy) metadata['Age'].median() metadata['Age'] as a numeric Series Computes the median value of the Age column. Float (NumPy) metadata.Age.max() metadata.Age as a numeric Series Computes the max value of the Age column. Float (NumPy) metadata.OncotreeSubtype.value_counts() metadata.OncotreeSubtype as a string Series Creates a frequency table of all unique elements in OncotreeSubtype column. 
Series Let’s try it out, with some nice print formatting: print("Mean value of Age column:", metadata['Age'].mean()) ## Mean value of Age column: 47.45187165775401 print("Frequency of column", metadata.OncotreeLineage.value_counts()) ## Frequency of column OncotreeLineage ## Lung 241 ## Lymphoid 209 ## CNS/Brain 123 ## Skin 118 ## Esophagus/Stomach 95 ## Breast 92 ## Bowel 87 ## Head and Neck 81 ## Myeloid 77 ## Bone 75 ## Ovary/Fallopian Tube 74 ## Pancreas 65 ## Kidney 64 ## Peripheral Nervous System 55 ## Soft Tissue 54 ## Uterus 41 ## Fibroblast 41 ## Biliary Tract 40 ## Bladder/Urinary Tract 39 ## Normal 39 ## Pleura 35 ## Liver 28 ## Cervix 25 ## Eye 19 ## Thyroid 18 ## Prostate 14 ## Vulva/Vagina 5 ## Ampulla of Vater 4 ## Testis 4 ## Adrenal Gland 1 ## Other 1 ## Name: count, dtype: int64 Notice that the outputs of some of these methods are Float (NumPy). This refers to a Python module called NumPy that is extremely popular for scientific computing, but we’re not focused on that in this course. 3.5 Simple data visualization We will dedicate extensive time later this course to talk about data visualization, but the Dataframe’s column, Series, has a method called .plot() that can help us make simple plots for one variable. The .plot() method will by default make a line plot, but it is not necessarily the plot style we want, so we can give the optional argument kind a String value to specify the plot style. We use it for making a histogram or bar plot. 
Plot style Useful for kind = Code Histogram Numerics “hist” metadata.Age.plot(kind = "hist") Bar plot Strings “bar” metadata.OncotreeSubtype.value_counts().plot(kind = "bar") Let’s look at a histogram: import matplotlib.pyplot as plt plt.figure() metadata.Age.plot(kind = "hist") plt.show() Let’s look at a bar plot: plt.figure() metadata.OncotreeLineage.value_counts().plot(kind = "bar") plt.show() (The plt.figure() and plt.show() functions are used to render the plots on the website, but you don’t need to use them for your exercises. We will discuss this in more detail during our week of data visualization.) 3.5.0.1 Chained function calls Let’s look at our bar plot syntax more carefully. We start with the column metadata.OncotreeLineage, and then we first use the method .value_counts() to get a Series of a frequency table. Then, we take the frequency table Series and use the .plot() method. It is quite common in Python to have multiple “chained” function calls, in which the output of .value_counts() is used as the input of .plot(), all in one line of code. It takes a bit of time to get used to this! Here’s another example of a chained function call, which looks quite complex, but let’s break it down: plt.figure() metadata.loc[metadata.AgeCategory == "Adult", ].OncotreeLineage.value_counts().plot(kind="bar") plt.show() We first take the entire metadata and do some subsetting, which outputs a Dataframe. We access the OncotreeLineage column, which outputs a Series. We use the method .value_counts(), which outputs a Series. We make a plot out of it! We could have, alternatively, done this in several lines of code: plt.figure() metadata_subset = metadata.loc[metadata.AgeCategory == "Adult", ] metadata_subset_lineage = metadata_subset.OncotreeLineage lineage_freq = metadata_subset_lineage.value_counts() lineage_freq.plot(kind = "bar") plt.show() These are two different styles of code, but they do the exact same thing. 
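The chained style also shows up outside of plotting. Here is a minimal sketch on a small hand-made Dataframe (hypothetical data; assumes pandas is available), comparing the chained and step-by-step versions:

```python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7]})

# Chained: subset rows, access a column, and compute a frequency table in one line.
chained = df.loc[df.age_case > 20, ].status.value_counts()

# Step-by-step: the same computation, one method call per line.
subset = df.loc[df.age_case > 20, ]
status_column = subset.status
stepwise = status_column.value_counts()

print(chained.equals(stepwise))  # True: both styles give the same result
```

Both versions build the same Series; the chained form just skips naming the intermediate results.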
It’s up to you to decide what is easier for you to understand. 3.6 Exercises Exercise for week 3 can be found here. "],["data-wrangling-part-2.html", "Chapter 4 Data Wrangling, Part 2 4.1 Creating new columns 4.2 Merging two Dataframes together 4.3 Grouping and summarizing Dataframes 4.4 Exercises", " Chapter 4 Data Wrangling, Part 2 We will continue to learn about data analysis with Dataframes. Let’s load our three Dataframes from the Depmap project in again: import pandas as pd import numpy as np metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") 4.1 Creating new columns Often, we want to perform some kind of transformation on our data’s columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column in a different scale. To create a new column, you simply assign to it as if it already exists, using the bracket operation [ ], and the column will be created: metadata['AgePlusTen'] = metadata['Age'] + 10 expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp'] expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp']) where np.log(x) is a function imported from the module NumPy that takes in a numeric and returns the log-transformed value. Note: you cannot create a new column by referring to an attribute of the Dataframe, such as: expression.KRAS_Exp_log = np.log(expression.KRAS_Exp). 4.2 Merging two Dataframes together Suppose we have the following Dataframes: expression ModelID PIK3CA_Exp log_PIK3CA_Exp “ACH-001113” 5.138733 1.636806 “ACH-001289” 3.184280 1.158226 “ACH-001339” 3.165108 1.152187 metadata ModelID OncotreeLineage Age “ACH-001113” “Lung” 69 “ACH-001289” “CNS/Brain” NaN “ACH-001339” “Skin” 14 Suppose that I want to compare the relationship between OncotreeLineage and PIK3CA_Exp, but they are columns in different Dataframes. 
We want a new Dataframe that looks like this: ModelID PIK3CA_Exp log_PIK3CA_Exp OncotreeLineage Age “ACH-001113” 5.138733 1.636806 “Lung” 69 “ACH-001289” 3.184280 1.158226 “CNS/Brain” NaN “ACH-001339” 3.165108 1.152187 “Skin” 14 We see that in both dataframes, the rows (observations) represent cell lines. There is a common column, ModelID, with shared values between the two dataframes that can facilitate the merging process. We call this an index. We will use the method .merge() for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index. merged = metadata.merge(expression) It’s usually better to specify which column is the index to avoid ambiguity, using the on optional argument: merged = metadata.merge(expression, on='ModelID') If the index columns for the two Dataframes are named differently, you can specify the column name for each Dataframe: merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID') One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging to see whether it makes sense or not: The number of rows and columns of metadata: metadata.shape ## (1864, 31) The number of rows and columns of expression: expression.shape ## (1450, 538) The number of rows and columns of merged: merged.shape ## (1450, 568) We see that the number of columns in merged combines the number of columns in metadata and expression (counting the shared index column once), while the number of rows in merged is the smaller of the number of rows in metadata and expression: it only keeps rows that are found in both Dataframes’ index columns. This kind of join is called an “inner join”, because in the Venn Diagram of elements in both index columns, we keep the inner overlap: You can specify the join style by changing the optional input argument how. 
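As a quick sketch of the shape check and the how argument, here is the default inner join next to an outer join on two tiny Dataframes (hypothetical ModelID values; assumes pandas is available):

```python
import pandas as pd

left = pd.DataFrame(data={'ModelID': ["ACH-01", "ACH-02", "ACH-03"],
                          'Age': [69, 23, 14]})
right = pd.DataFrame(data={'ModelID': ["ACH-02", "ACH-03", "ACH-04"],
                           'PIK3CA_Exp': [5.1, 3.2, 3.1]})

inner = left.merge(right, on='ModelID')               # default: how='inner'
outer = left.merge(right, on='ModelID', how='outer')  # keeps all four ModelIDs

print(inner.shape)  # (2, 3): only ACH-02 and ACH-03 appear in both
print(outer.shape)  # (4, 3): unmatched rows are kept, with NaN filled in
```

Comparing the row counts before and after the merge is the quickest way to catch a join that dropped (or duplicated) observations unexpectedly.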
how = "outer" keeps all observations, also known as a “full join”. how = "left" keeps all observations in the left Dataframe. how = "right" keeps all observations in the right Dataframe. how = "inner" keeps observations common to both Dataframes. This is the default value of how. 4.3 Grouping and summarizing Dataframes In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or comparing different cancer subtypes. For example, in metadata, the observations are cell lines, and perhaps we want to group cell lines into their respective cancer type, OncotreeLineage, and look at the mean age for each cancer type. We want to take metadata: ModelID OncotreeLineage Age “ACH-001113” “Lung” 69 “ACH-001289” “Lung” 23 “ACH-001339” “Skin” 14 “ACH-002342” “Brain” 23 “ACH-004854” “Brain” 56 “ACH-002921” “Brain” 67 into: OncotreeLineage MeanAge “Lung” 46 “Skin” 14 “Brain” 48.67 To get there, we need to: Group the data based on some criteria, such as the elements of OncotreeLineage Summarize each group via a summary statistic performed on a column, such as Age. We first subset the two columns we need, and then use the methods .groupby(x) and .mean(). 
metadata_grouped = metadata.groupby("OncotreeLineage") metadata_grouped['Age'].mean() ## OncotreeLineage ## Adrenal Gland 55.000000 ## Ampulla of Vater 65.500000 ## Biliary Tract 58.450000 ## Bladder/Urinary Tract 65.166667 ## Bone 20.854545 ## Bowel 58.611111 ## Breast 50.961039 ## CNS/Brain 43.849057 ## Cervix 47.136364 ## Esophagus/Stomach 57.855556 ## Eye 51.100000 ## Fibroblast 38.194444 ## Head and Neck 60.149254 ## Kidney 46.193548 ## Liver 43.928571 ## Lung 55.444444 ## Lymphoid 38.916667 ## Myeloid 38.810811 ## Normal 52.370370 ## Other 46.000000 ## Ovary/Fallopian Tube 51.980769 ## Pancreas 60.226415 ## Peripheral Nervous System 5.480000 ## Pleura 61.000000 ## Prostate 61.666667 ## Skin 49.033708 ## Soft Tissue 27.500000 ## Testis 25.000000 ## Thyroid 63.235294 ## Uterus 62.060606 ## Vulva/Vagina 75.400000 ## Name: Age, dtype: float64 Here’s what’s going on: We use the Dataframe method .groupby(x) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the metadata Dataframe, but it makes a note that it’s been grouped. We subset to the column Age. The grouping information still persists (this is a Grouped Series object). We use the method .mean() to calculate the mean value of Age within each group defined by OncotreeLineage. 
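The same grouping-and-summarizing pattern can be sketched on a small hand-made Dataframe (hypothetical data; assumes pandas is available):

```python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7]})

grouped = df.groupby("status")       # a Grouped Dataframe object
print(grouped['age_case'].mean())    # mean age_case within each status group
print(grouped['age_case'].count())   # how many rows fall into each group
```

Here the "treated" group averages (25 + 7) / 2 = 16, the "untreated" group averages (43 + 21) / 2 = 32, and "discharged" has a single row with value 65.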
Alternatively, this could have been done in a chain of methods: metadata.groupby("OncotreeLineage")["Age"].mean() ## OncotreeLineage ## Adrenal Gland 55.000000 ## Ampulla of Vater 65.500000 ## Biliary Tract 58.450000 ## Bladder/Urinary Tract 65.166667 ## Bone 20.854545 ## Bowel 58.611111 ## Breast 50.961039 ## CNS/Brain 43.849057 ## Cervix 47.136364 ## Esophagus/Stomach 57.855556 ## Eye 51.100000 ## Fibroblast 38.194444 ## Head and Neck 60.149254 ## Kidney 46.193548 ## Liver 43.928571 ## Lung 55.444444 ## Lymphoid 38.916667 ## Myeloid 38.810811 ## Normal 52.370370 ## Other 46.000000 ## Ovary/Fallopian Tube 51.980769 ## Pancreas 60.226415 ## Peripheral Nervous System 5.480000 ## Pleura 61.000000 ## Prostate 61.666667 ## Skin 49.033708 ## Soft Tissue 27.500000 ## Testis 25.000000 ## Thyroid 63.235294 ## Uterus 62.060606 ## Vulva/Vagina 75.400000 ## Name: Age, dtype: float64 Once a Dataframe has been grouped and a column is selected, all the summary statistics methods you learned from last week, such as .mean(), .median(), .max(), can be used. One new summary statistics method that is useful for this grouping and summarizing analysis is .count() which tells you how many entries are counted within each group. 4.3.1 Optional: Multiple grouping, Multiple columns, Multiple summary statistics Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously. For example, you may want to group by a combination of OncotreeLineage and AgeCategory, such as “Lung” and “Adult” as one grouping. You can do so like this: metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"]) metadata_grouped['Age'].mean() ## OncotreeLineage AgeCategory ## Adrenal Gland Adult 55.000000 ## Ampulla of Vater Adult 65.500000 ## Unknown NaN ## Biliary Tract Adult 58.450000 ## Unknown NaN ## ... 
## Thyroid Unknown NaN ## Uterus Adult 62.060606 ## Fetus NaN ## Unknown NaN ## Vulva/Vagina Adult 75.400000 ## Name: Age, Length: 72, dtype: float64 You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the .agg(x) method on a Grouped Dataframe. For example, coming back to our age case-control Dataframe, df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"], 'age_case': [25, 43, 21, 65, 7], 'age_control': [49, 20, 32, 25, 32]}) df ## status age_case age_control ## 0 treated 25 49 ## 1 untreated 43 20 ## 2 untreated 21 32 ## 3 discharged 65 25 ## 4 treated 7 32 We group by status and summarize age_case and age_control with a few summary statistics each: df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]}) ## age_case age_control ## mean min max mean ## status ## discharged 65.0 25 25 25.0 ## treated 16.0 32 49 40.5 ## untreated 32.0 20 32 26.0 The input argument to the .agg(x) method is called a Dictionary, which lets you structure information in a paired relationship. You can learn more about dictionaries here. 4.4 Exercises Exercise for week 4 can be found here. "],["data-visualization.html", "Chapter 5 Data Visualization 5.1 Distributions (one variable) 5.2 Relational (between 2 continuous variables) 5.3 Categorical (between 1 categorical and 1 continuous variable) 5.4 Basic plot customization 5.5 Exercises", " Chapter 5 Data Visualization In our second-to-last week together, we learn how to visualize our data. There are several different data visualization modules in Python: matplotlib is a general-purpose plotting module that is commonly used. seaborn is a plotting module built on top of matplotlib focused on data science and statistical visualization. We will focus on this module for this course. 
plotnine is a plotting module based on the grammar of graphics approach to making plots. This is very similar to the R package “ggplot2”. To get started, we will consider these most simple and common plots: Distributions (one variable) Histograms Relational (between 2 continuous variables) Scatterplots Line plots Categorical (between 1 categorical and 1 continuous variable) Bar plots Violin plots Why do we focus on these common plots? Our eyes distinguish some visual features better than others. All of these plots use position to depict data, which is the visual feature our eyes judge most accurately. Let’s load in our genomics datasets and start making some plots from them. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt metadata = pd.read_csv("classroom_data/metadata.csv") mutation = pd.read_csv("classroom_data/mutation.csv") expression = pd.read_csv("classroom_data/expression.csv") 5.1 Distributions (one variable) To create a histogram, we use the function sns.displot() and we specify the input argument data as our dataframe, and the input argument x as the column name in a String. plot = sns.displot(data=metadata, x="Age") (For rendering on this webpage, we assign the plot to a variable plot. In practice, you don’t need to do that; you can just write sns.displot(data=metadata, x="Age").) A common parameter to consider when making a histogram is how big the bins are. You can specify the bin width via the binwidth argument, or the number of bins via the bins argument. plot = sns.displot(data=metadata, x="Age", binwidth = 10) Our histogram also works for categorical variables, such as “Sex”. plot = sns.displot(data=metadata, x="Sex") Conditioning on other variables Sometimes, you want to examine a distribution, such as Age, conditional on other variables, such as Age for Female, Age for Male, and Age for Unknown: what is the distribution of age when compared with sex? There are several ways of doing it. 
First, you could distinguish the groups by color, using the hue input argument: plot = sns.displot(data=metadata, x="Age", hue="Sex") It is rather hard to tell the groups apart from the coloring. So, we separate the bars of each category via the multiple="dodge" input argument: plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge") Lastly, as an alternative to using colors to display the conditional variable, we can make a subplot for each conditional variable’s value via col="Sex" or row="Sex": plot = sns.displot(data=metadata, x="Age", col="Sex") You can find a lot more details about distributions and histograms in the Seaborn tutorial. 5.2 Relational (between 2 continuous variables) To visualize two continuous variables, it is common to use a scatterplot or a lineplot. We use the function sns.relplot() and we specify the input argument data as our dataframe, and the input arguments x and y as the column names in a String: plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") To condition on other variables, additional plotting features are used to distinguish conditional variable values: hue (similar to the histogram) style size Let’s merge expression and metadata together, so that we can examine KRAS and EGFR relationships conditional on primary vs. metastatic cancer status. Here is the scatterplot with different colors: expression_metadata = expression.merge(metadata) plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis") Here is the scatterplot with different shapes: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", style="PrimaryOrMetastasis") You can also try plotting with size="PrimaryOrMetastasis" if you like. 
None of these seem particularly effective at distinguishing the two groups, so we will try subplot faceting as we did for the histogram: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", col="PrimaryOrMetastasis") You can also condition on multiple variables by assigning a different variable to each conditioning option: plot = sns.relplot(data=expression_metadata, x="KRAS_Exp", y="EGFR_Exp", hue="PrimaryOrMetastasis", col="AgeCategory") You can find a lot more details about relational plots such as scatterplots and lineplots in the Seaborn tutorial. 5.3 Categorical (between 1 categorical and 1 continuous variable) A very similar pattern follows for categorical plots. We start with sns.catplot() as our main plotting function, with the basic input arguments: data x y You can change the plot styles via the input arguments: kind: “strip”, “box”, “swarm”, etc. You can add additional conditional variables via the input arguments: hue col row See categorical plots in the Seaborn tutorial. 5.4 Basic plot customization You can easily change the axis labels and title if you modify the plot object, using the method .set(): exp_plot = sns.relplot(data=expression, x="KRAS_Exp", y="EGFR_Exp") exp_plot.set(xlabel="KRAS Expression", ylabel="EGFR Expression", title="Gene expression relationship") You can change the color palette by adding the palette input argument to any of the plots. You can explore available color palettes here: plot = sns.displot(data=metadata, x="Age", hue="Sex", multiple="dodge", palette=sns.color_palette(palette='rainbow') ) ## <string>:1: UserWarning: The palette list has more values (6) than needed (3), which may not be intended. 5.5 Exercises Exercise for week 5 can be found here. "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.     
Credits Names Pedagogy Lead Content Instructor(s) FirstName LastName Lecturer(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved Delivered the course in some way - video or audio Content Author(s) (include chapter name/link in parentheses if only for specific chapters) - make new line if more than one chapter involved If any other authors besides lead instructor Content Contributor(s) (include section name/link in parentheses) - make new line if more than one section involved Wrote less than a chapter Content Editor(s)/Reviewer(s) Checked your content Content Director(s) Helped guide the content direction Content Consultants (include chapter name/link in parentheses or word “General”) - make new line if more than one chapter involved Gave high level advice on content Acknowledgments Gave small assistance to content but not to the level of consulting Production Content Publisher(s) Helped with publishing platform Content Publishing Reviewer(s) Reviewed overall content and aesthetics on publishing platform Technical Course Publishing Engineer(s) Helped with the code for the technical aspects related to the specific course generation Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Ava Hoffman, Candace Savonen Package Developers (ottrpal) Candace Savonen, John Muschelli, Carrie Wright Art and Design Illustrator(s) Created graphics for the course Figure Artist(s) Created figures/plots for course Videographer(s) Filmed videos Videography Editor(s) Edited film Audiographer(s) Recorded audio Audiography Editor(s) Edited audio recordings Funding Funder(s) Institution/individual who funded course including grant number Funding Staff Staff members who help with funding   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R 
version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-09-26 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) ## sass 0.4.8 2023-12-06 [1] RSPM (R 
4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 6 References", " Chapter 6 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] diff --git a/docs/working-with-data-structures.html b/docs/working-with-data-structures.html new file mode 100644 index 0000000..9cb49e5 --- /dev/null +++ b/docs/working-with-data-structures.html @@ -0,0 +1,603 @@ + + + + + + + Chapter 2 Working with data structures | Introduction to Python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + +
    + +
    + +
    +
    + + +
    +
    + +
    + + + + + + + + + +
    + +
    +
    +

    Chapter 2 Working with data structures

    +

    In our second lesson, we start to look at two data structures, Lists and Dataframes, that can handle a large amount of data for analysis.

    +
    +

    2.1 Lists

    +

    In the first exercise, you started to explore data structures, which store information about data types. You explored lists, which are ordered collections of data types or data structures. Each element of a list contains a data type or another data structure.

    +

    We can now store a vast amount of information in a list, and assign it to a single variable. Even more, we can use operations and functions on a list, modifying many elements within the list at once! This makes analyzing data much more scalable and less repetitive.

    +

    We create a list via the bracket [ ] operation.

    +
    staff = ["chris", "ted", "jeff"]
    +chrNum = [2, 3, 1, 2, 2]
    +mixedList = [False, False, False, "A", "B", 92]
    +
    +

    2.1.1 Subsetting lists

    +

    To access an element of a list, use the bracket notation [ ]. We simply access an element via its “index” number - the location of the data within the list.

    +

    Here’s the tricky thing about the index number: it starts at 0!

    +

    1st element of chrNum: chrNum[0]

    +

    2nd element of chrNum: chrNum[1]

    +

    +

    5th element of chrNum: chrNum[4]

    +

    With subsetting, you can modify elements of a list or use the element of a list as part of an expression.

    +
    +
    +

    2.1.2 Subsetting multiple elements of lists

    +

    Suppose you want to access multiple elements of a list, such as accessing the first three elements of chrNum. You would use the slice operator :, which specifies:

    +
      +
    • the index number to start

    • +
    • the index number to stop, plus one.

    • +
    +

    If you want to access the first three elements of chrNum:

    +
    chrNum[0:3]
    +
    ## [2, 3, 1]
    +

    The first element’s index number is 0, the third element’s index number is 2, plus 1, which is 3.

    +

    If you want to access the second and third elements of chrNum:

    +
    chrNum[1:3]
    +
    ## [3, 1]
    +

    Another way of accessing the first 3 elements of chrNum:

    +
    chrNum[:3]
    +
    ## [2, 3, 1]
    +

    Here, the start index number was not specified. When the start or stop index is not specified, it implies that you are subsetting from the beginning of the list or to the end of the list, respectively. Here’s another example, using negative indices to take the last 3 elements of the list:

    +
    chrNum[-3:]
    +
    ## [1, 2, 2]
    +

    You can find more discussion of list slicing, using negative indices and incremental slicing, here.
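A few more slices on chrNum, including the incremental slicing mentioned above (a third number in the slice sets the step size):

```python
chrNum = [2, 3, 1, 2, 2]

print(chrNum[0:3])   # [2, 3, 1]  - elements at index 0, 1, 2
print(chrNum[:3])    # [2, 3, 1]  - same, with the start index implied
print(chrNum[-3:])   # [1, 2, 2]  - the last three elements
print(chrNum[::2])   # [2, 1, 2]  - every other element, step size 2
```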

    +
    +
    +
    +

    2.2 Objects in Python

    +

    The list data structure has an organization and functionality that metaphorically represents a pen-and-paper list in our physical world. Like a physical object, we have examined:

    +
      +
    • What does it contain (in terms of data)?

    • +
    • What can it do (in terms of functions)?

    • +
    +

    And if it “makes sense” to us, then it is well-designed.

    +

    The list data structure we have been working with is an example of an Object. The definition of an object allows us to ask the questions above: what does it contain, and what can it do? It is an organizational tool for a collection of data and functions that we can relate to, like a physical object. Formally, an object contains the following:

    +
      +
    • Value that holds the essential data for the object.

    • +
    • Attributes that hold subset or additional data for the object.

    • +
    • Functions, called Methods, that belong to the object and implicitly take in the object itself as an input

    • +
    +

    This organizing structure on an object applies to pretty much all Python data types and data structures.

    +

    Let’s see how this applies to the list:

    +
      +
    • Value: the contents of the list, such as [2, 3, 4].

    • +
    • Attributes that store additional values: Not relevant for lists.

    • +
    • Methods that can be used on the object: chrNum.count(2) counts the number of instances 2 appears as an element of chrNum.

    • +
    +

    Object methods are functions that do something with the object they are called on. You should think about chrNum.count(2) as a function that takes in chrNum and 2 as inputs. If you want to use the count function on the list mixedList, you would use mixedList.count(x).

    +

    Here are some more examples of methods with lists:

    + ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Function methodWhat it takes inWhat it doesReturns
    chrNum.count(x)list chrNum, data type xCounts the number of instances x appears as an element of chrNum.Integer
    chrNum.append(x)list chrNum, data type xAppends x to the end of the chrNum.None (but chrNum is modified!)
    chrNum.sort()list chrNumSorts chrNum by ascending order.None (but chrNum is modified!)
    chrNum.reverse()list chrNumReverses the order of chrNum.None (but chrNum is modified!)
    +
    +
    +
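Note the “Returns” column in the table above: .append(x), .sort(), and .reverse() return None and instead modify the list in place. A small sketch of that behavior:

```python
chrNum = [2, 3, 1, 2, 2]

result = chrNum.append(5)
print(result)   # None - the method returns nothing...
print(chrNum)   # [2, 3, 1, 2, 2, 5] - ...but the list itself was modified

chrNum.sort()
print(chrNum)   # [1, 2, 2, 2, 3, 5]
```

A common mistake is to write chrNum = chrNum.sort(), which throws away the list and stores None.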

    2.3 Methods vs Functions

    +

    Methods have to take in the object of interest as an input: chrNum.count(2) automatically treats chrNum as an input. Methods are built for a specific Object type.

    +

    Functions do not have an implied input: len(chrNum) requires specifying a list in the input.

    +

    Otherwise, there is no strong distinction between the two.
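A side-by-side sketch of the distinction:

```python
chrNum = [2, 3, 1, 2, 2]

# Method: called via the dot operation; chrNum is the implied input.
print(chrNum.count(2))   # 3

# Function: the list must be passed in explicitly as an argument.
print(len(chrNum))       # 5
```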

    +
    +
    +

    2.4 Dataframes

    +

    A Dataframe is a two-dimensional data structure that stores data like a spreadsheet does.

    +

    The Dataframe data structure is found within a Python module called “Pandas”. A Python module is an organized collection of functions and data structures. The import statement below makes the “Pandas” module available to us via the variable pd.

    +

    To load in a Dataframe from existing spreadsheet data, we use the function pd.read_csv():

    +
    import pandas as pd
    +
    +metadata = pd.read_csv("classroom_data/metadata.csv")
    +type(metadata)
    +
    ## <class 'pandas.core.frame.DataFrame'>
    +

    There is a similar function pd.read_excel() for loading in Excel spreadsheets.

    +

    Let’s investigate the Dataframe as an object:

    +
      +
    • What does a Dataframe contain (values, attributes)?

    • +
    • What can a Dataframe do (methods)?

    • +
    +
    +
    +

    2.5 What does a Dataframe contain?

    +

    We first take a look at the contents:

    +
    metadata
    +
    ##          ModelID  ...       OncotreeLineage
    +## 0     ACH-000001  ...  Ovary/Fallopian Tube
    +## 1     ACH-000002  ...               Myeloid
    +## 2     ACH-000003  ...                 Bowel
    +## 3     ACH-000004  ...               Myeloid
    +## 4     ACH-000005  ...               Myeloid
    +## ...          ...  ...                   ...
    +## 1859  ACH-002968  ...     Esophagus/Stomach
    +## 1860  ACH-002972  ...     Esophagus/Stomach
    +## 1861  ACH-002979  ...     Esophagus/Stomach
    +## 1862  ACH-002981  ...     Esophagus/Stomach
    +## 1863  ACH-003071  ...                  Lung
    +## 
    +## [1864 rows x 30 columns]
    +

    It looks like there are 1864 rows and 30 columns in this Dataframe, and when we display it, it shows some of the data.

    +
    metadata
    +
    ##          ModelID  ...       OncotreeLineage
    +## 0     ACH-000001  ...  Ovary/Fallopian Tube
    +## 1     ACH-000002  ...               Myeloid
    +## 2     ACH-000003  ...                 Bowel
    +## 3     ACH-000004  ...               Myeloid
    +## 4     ACH-000005  ...               Myeloid
    +## ...          ...  ...                   ...
    +## 1859  ACH-002968  ...     Esophagus/Stomach
    +## 1860  ACH-002972  ...     Esophagus/Stomach
    +## 1861  ACH-002979  ...     Esophagus/Stomach
    +## 1862  ACH-002981  ...     Esophagus/Stomach
    +## 1863  ACH-003071  ...                  Lung
    +## 
    +## [1864 rows x 30 columns]
    +

    We can look at specific columns by looking at attributes via the dot operation. We can also look at the columns via the bracket operation.

```python
metadata.ModelID
## 0       ACH-000001
## 1       ACH-000002
## 2       ACH-000003
## 3       ACH-000004
## 4       ACH-000005
##            ...    
## 1859    ACH-002968
## 1860    ACH-002972
## 1861    ACH-002979
## 1862    ACH-002981
## 1863    ACH-003071
## Name: ModelID, Length: 1864, dtype: object
```

```python
metadata['ModelID']
## 0       ACH-000001
## 1       ACH-000002
## 2       ACH-000003
## 3       ACH-000004
## 4       ACH-000005
##            ...    
## 1859    ACH-002968
## 1860    ACH-002972
## 1861    ACH-002979
## 1862    ACH-002981
## 1863    ACH-003071
## Name: ModelID, Length: 1864, dtype: object
```

The names of all columns are stored as an attribute, which can be accessed via the dot operation.

```python
metadata.columns
## Index(['ModelID', 'PatientID', 'CellLineName', 'StrippedCellLineName', 'Age',
##        'SourceType', 'SangerModelID', 'RRID', 'DepmapModelType', 'AgeCategory',
##        'GrowthPattern', 'LegacyMolecularSubtype', 'PrimaryOrMetastasis',
##        'SampleCollectionSite', 'Sex', 'SourceDetail', 'LegacySubSubtype',
##        'CatalogNumber', 'CCLEName', 'COSMICID', 'PublicComments',
##        'WTSIMasterCellID', 'EngineeredModel', 'TreatmentStatus',
##        'OnboardedMedia', 'PlateCoating', 'OncotreeCode', 'OncotreeSubtype',
##        'OncotreePrimaryDisease', 'OncotreeLineage'],
##       dtype='object')
```

The number of rows and columns is also stored as an attribute:

```python
metadata.shape
## (1864, 30)
```

## 2.6 What can a Dataframe do?

We can use the `.head()` and `.tail()` methods to look at the first few rows and last few rows of `metadata`, respectively:

```python
metadata.head()
##       ModelID  PatientID  ...     OncotreePrimaryDisease       OncotreeLineage
## 0  ACH-000001  PT-gj46wT  ...   Ovarian Epithelial Tumor  Ovary/Fallopian Tube
## 1  ACH-000002  PT-5qa3uk  ...     Acute Myeloid Leukemia               Myeloid
## 2  ACH-000003  PT-puKIyc  ...  Colorectal Adenocarcinoma                 Bowel
## 3  ACH-000004  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
## 4  ACH-000005  PT-q4K2cp  ...     Acute Myeloid Leukemia               Myeloid
## 
## [5 rows x 30 columns]
```

```python
metadata.tail()
##          ModelID  PatientID  ...          OncotreePrimaryDisease    OncotreeLineage
## 1859  ACH-002968  PT-pjhrsc  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1860  ACH-002972  PT-dkXZB1  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1861  ACH-002979  PT-lyHTzo  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1862  ACH-002981  PT-Z9akXf  ...  Esophagogastric Adenocarcinoma  Esophagus/Stomach
## 1863  ACH-003071  PT-LAGmLq  ...       Lung Neuroendocrine Tumor               Lung
## 
## [5 rows x 30 columns]
```

Both of these functions, called without input arguments, are methods: functions that do something with the Dataframe they are called on. You should think of `metadata.head()` as a function that takes in `metadata` as an input. If we had another Dataframe called `my_data` and wanted to use the same function, we would write `my_data.head()`.
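Because a method is just a function stored on the Dataframe's class, calling it through the class with the Dataframe as the input gives the same result. A minimal sketch, using a made-up Dataframe `my_data` (not the lesson's `metadata`):

```python
import pandas as pd

# A hypothetical small Dataframe for illustration
my_data = pd.DataFrame({'x': range(10)})

# Calling the method on the instance...
a = my_data.head()
# ...is equivalent to calling the class's function with the instance as input.
b = pd.DataFrame.head(my_data)

print(a.equals(b))  # True
```

This is why the "method" notation and the "function that takes the Dataframe as input" mental model describe the same thing.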

## 2.7 Subsetting Dataframes

Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists.


You will use the `iloc` attribute and bracket operations, and you give two slices: one for the rows, and one for the columns.


Let’s start with a small dataframe to see how it works before returning to `metadata`:

```python
df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})
df
##        status  age_case  age_control
## 0     treated        25           49
## 1   untreated        43           20
## 2   untreated        21           32
## 3  discharged        65           25
## 4     treated         7           32
```

Here is what the dataframe looks like with the row and column index numbers:


Subset the first four rows, and the first two columns:

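One way to write this with `iloc` (a sketch; the small `df` is re-created here so the chunk runs on its own):

```python
import pandas as pd

df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
                        'age_case': [25, 43, 21, 65, 7],
                        'age_control': [49, 20, 32, 25, 32]})

# Rows 0 through 3 (the end of a slice is exclusive), columns 0 through 1
subset = df.iloc[:4, :2]
print(subset)
##        status  age_case
## 0     treated        25
## 1   untreated        43
## 2   untreated        21
## 3  discharged        65
```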

Now, back to the `metadata` Dataframe:


Subset the first 5 rows, and the first two columns:

```python
metadata.iloc[:5, :2]
##       ModelID  PatientID
## 0  ACH-000001  PT-gj46wT
## 1  ACH-000002  PT-5qa3uk
## 2  ACH-000003  PT-puKIyc
## 3  ACH-000004  PT-q4K2cp
## 4  ACH-000005  PT-q4K2cp
```

If we want a custom selection that is not sequential, we can use an integer list. Subset all rows except the first five, and the columns at index 1, 10, and 21:

```python
metadata.iloc[5:, [1, 10, 21]]
##       PatientID GrowthPattern  WTSIMasterCellID
## 5     PT-ej13Dz    Suspension            2167.0
## 6     PT-NOXwpH      Adherent             569.0
## 7     PT-fp8PeY      Adherent            1806.0
## 8     PT-puKIyc      Adherent            2104.0
## 9     PT-AR7W9o      Adherent               NaN
## ...         ...           ...               ...
## 1859  PT-pjhrsc      Organoid               NaN
## 1860  PT-dkXZB1      Organoid               NaN
## 1861  PT-lyHTzo      Organoid               NaN
## 1862  PT-Z9akXf      Organoid               NaN
## 1863  PT-LAGmLq    Suspension               NaN
## 
## [1859 rows x 3 columns]
```

When we subset via numerical indices, it’s called explicit subsetting. This is a great way to start thinking about subsetting your dataframes for analysis, but explicit subsetting can lead to inconsistencies in the long run. For instance, suppose your collaborator adds a new cell line to the metadata and changes the order of the columns. Then your code that picks rows and columns by position will give you a different answer once the spreadsheet is changed.
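To make that fragility concrete, here is a hypothetical sketch with a toy two-row table: the same positional `iloc` call returns a different column once the column order changes:

```python
import pandas as pd

# A toy metadata table (hypothetical, for illustration)
meta_v1 = pd.DataFrame({'ModelID': ['ACH-1', 'ACH-2'],
                        'PatientID': ['PT-a', 'PT-b'],
                        'Age': [34, 60]})

# The "same" spreadsheet after a collaborator reorders the columns
meta_v2 = meta_v1[['PatientID', 'Age', 'ModelID']]

# Explicit subsetting by position: column 1 is no longer PatientID
print(meta_v1.iloc[:, 1].tolist())  # ['PT-a', 'PT-b']
print(meta_v2.iloc[:, 1].tolist())  # [34, 60]
```

The code did not change, but its meaning did, which is why position-based subsetting is brittle for shared, evolving data.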


    The second way is to subset by the column name and comparison operators, also known as implicit subsetting. This is much more robust in data analysis practice. You will learn about it next week!
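As a brief preview (the details come next week), implicit subsetting compares a column's values to produce a True/False value per row, and the bracket operation keeps the True rows. A toy sketch, not using the lesson's `metadata`:

```python
import pandas as pd

df = pd.DataFrame({'status': ['treated', 'untreated', 'treated'],
                   'age': [25, 43, 21]})

# The comparison produces a True/False value per row...
mask = df['status'] == 'treated'
# ...and the bracket operation keeps only the True rows.
treated = df[mask]
print(treated['age'].tolist())  # [25, 21]
```

Because rows are selected by their content, this keeps working even if rows or columns are reordered.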


## 2.8 Exercises

The exercise for week 2 can be found here.
