from wordcloud import WordCloud
+import matplotlib.pyplot as plt
-
Categories
-Large Data Work: Intro to parquet files in R
+Working with ggplot2: A Short Guide
ggplot2
to create visualizations using the cars dataset. This dataset contains the speeds of cars and the distances they took to stop.
Visualize text frequency with {wordcloud}
Show your data
We download US Presidential State of the Union speeches as a demo dataset - from Washington to Obama.
Demonstrate wordcloud
# Generate a word cloud image
+wordcloud = WordCloud(max_font_size=40).generate(text)
-
Your turn!
What happens when you change the max_font_size?
wordcloud2 = WordCloud(max_font_size=______).generate(text)
+# Display the generated image:
+plt.imshow(wordcloud2, interpolation='bilinear')
+plt.axis("off")
+plt.show()
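A word cloud draws frequent words larger, so it can help to look at the raw counts behind the picture. The following is a minimal sketch of that counting step using only the Python standard library. The `word_frequencies` helper, the stopword set, and the sample sentence are all hypothetical stand-ins for illustration; this is not how {wordcloud} itself tokenizes text.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase word occurrences, skipping any stopwords."""
    # Split on anything that is not a letter or apostrophe
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# Hypothetical sample text standing in for the speeches
sample = "The state of the union is strong. The union endures."
freqs = word_frequencies(sample, stopwords={"the", "of", "is"})

# The most frequent words are the ones a word cloud would draw largest
print(freqs.most_common(3))
```

Comparing these counts with the rendered image is a quick sanity check that the largest words really are the most frequent ones.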
Learning Objectives
Our Cohort: Penguins
We’re going to use the palmerpenguins dataset as our example cohort. As a reminder, here are the first few rows of this dataset.
library(palmerpenguins)
+library(gtsummary)
+library(dplyr)
-
Summary Table of
{gtsummary} lets you build up a summary demographics table with dplyr commands and special summarization commands.
Here, we’re summarizing species, island, and bill_length_mm:
penguins |>
+ select(species, island, bill_length_mm) |>
+ tbl_summary()
Comparing Groups
penguins |>
+ tbl_summary(include=c(island, bill_length_mm),
+ by=species,
+ missing="no")
+
We can also add N’s and P-values:
penguins |>
+ tbl_summary(include=c(island, bill_length_mm),
+ by=species,
+ missing="no") |>
+ add_n() |>
+ add_p()
+
Here you can see we ran a chi-squared test to look at combinations of island and species, and a Kruskal-Wallis rank sum test to compare bill_length_mm across species.
This is just the tip of the iceberg for {gtsummary}. You can also output to Microsoft Word for further tweaks.
Comparing Groups
Packages Used
sessionInfo()
Citation
@online{laderas2024,
@@ -279,17 +240,6 @@ Packages Used
Laderas, Ted. 2024. “Make Your Table 1 with {Gtsummary}.”
September 4, 2024.
Our Dataset
data(penguins)
+library(gt)
-
Visualizi
My favorite way to look for these patterns is a package called {naniar}, written by my friend Nick Tierney. naniar visualizes rows of data as lines in a rectangle. Columns are represented by line sections.
Let’s take a look at the missing values in the penguins data.
library(naniar)
+vis_miss(penguins)
What I like about this visual representation is that it lets you see the association of missing values as holes in the visualization, as well as the percentage of missing values in each variable. In this example, you can see that some penguins are missing information such as sex.
gg_miss_upset(penguins)
In this example, reading the combinations from left to right, we can see:
-
@@ -220,26 +200,28 @@
+ggplot(airquality,
+ aes(x = Ozone,
+ y = Solar.R)) +
+ geom_miss_point() +
+
+ ##everything past this point is just
+ #to explain the visualization
+ theme_minimal() +
+ geom_vline(xintercept=0) +
+ geom_hline(yintercept = 0) +
+ annotate("text",x=-5 ,y=150, label= "missing ozone", angle=90) +
+ annotate("text", y=-15, x=75, label="missing Solar.R") +
+ annotate("text", y=-20, x=-20, label="missing\nboth") +
+ annotate("text", y=150, x=75, label="no missing data")
In this plot, the missing values are represented by red points that fall below the zero line on the axis of the variable that is missing (they are jittered so they don’t all occupy the same line). Specifically, the points on the left side have values for Solar.R but are missing values for Ozone. In this case, the points are distributed across the entire range of Solar.R. Note that this isn’t the case for missing values of Solar.R, which are represented in the lower right of the plot. These missing values are not distributed evenly across Ozone, showing a bias towards lower values of Ozone.
This plot is also useful when you facet on a categorical variable, to look for missingness that depends on another variable (MAR/MNAR).
ggplot(airquality,
+ aes(x = Ozone,
+ y = Solar.R)) +
+ geom_miss_point() + facet_wrap(~Month)
Here we can see a possible bias in missing values by the month (compare month=6 to month=9).
@@ -248,23 +230,6 @@I
I’ve barely scratched the surface of all you can do with {naniar}. Nick has come up with all sorts of visualizations to address issues with missing values. I especially like the visualizations he’s added around imputations, which is one way to address missing values. Check his package out!
Citation
@online{laderas2024,
@@ -277,17 +242,6 @@ I
Laderas, Ted. 2024. “What’s Missing with `{Naniar}`.”
September 22, 2024.
What is {patchwork}
Penguins Data
Just a quick reminder of the penguins data:
+#| edit: false
+data(penguins)
+library(gt)
+gt(head(penguins))
Let’s start with two plots
Let’s make two different views of the palmerpenguins data. The first is a bar plot of the penguin species:
+#| autorun: false
+#| warning: false
+library(palmerpenguins)
+library(ggplot2)
+penguin_species <- ggplot(penguins, aes(y=species, fill=species)) +
+ geom_bar()
+
+penguin_species
Let’s do a histogram of penguin bill_length_mm, colored by species:
+#| autorun: false
+#| warning: false
+penguin_bill_length <- ggplot(penguins, aes(y=bill_length_mm, fill=species)) +
+ geom_histogram(bins=20)
+penguin_bill_length
@@ -232,99 +225,67 @@ Composing Plots together
The {patchwork} package has two basic operations: + composes plots side by side, and / stacks one plot on top of another.
Let’s try out a side by side composition:
+#| autorun: false
+#| warning: false
+library(patchwork)
+penguin_species + penguin_bill_length
Let’s try stacking the plots on top of each other:
+#| autorun: false
+#| warning: false
+penguin_species / penguin_bill_length
We can remove the legends from both:
+#| autorun: false
+#| warning: false
+(penguin_species + theme(legend.position="none")) /
+ (penguin_bill_length + theme(legend.position="none"))
Side by side and Stacked
How about three figures? We can compose them with a combination of + and /:
+#| autorun: false
+#| warning: false
+penguin_island <- ggplot(penguins, aes(y=island)) +
+ geom_bar()
+(penguin_species + penguin_island) / penguin_bill_length
There is an equivalent syntax using | (the pipe character), which does the same thing as +:
+#| autorun: false
+#| warning: false
+(penguin_species | penguin_island) / penguin_bill_length
Plot Labeling
You can automatically label plots in your figure using plot_annotation():
+#| autorun: false
+#| warning: false
+(penguin_species + penguin_island) / penguin_bill_length +
+ plot_annotation(tag_levels="A")
Finally, let’s add a title for our figure:
+#| autorun: false
+#| warning: false
+(penguin_species + penguin_island) / penguin_bill_length +
+ plot_annotation(tag_levels="A") +
+ plot_annotation(title="Penguins are Very Surprising")
Try it out!
Try out a different combination of plots, such as one plot on top and another on the bottom. Or make your own penguins plot and compose them.
+#| autorun: false
+#| warning: false
+
@@ -332,23 +293,6 @@ Go Further
This is just the tip of the iceberg. You can learn way more about {patchwork} at Thomas Lin Pedersen’s website: https://patchwork.data-imaginist.com/index.html
@@ -362,17 +306,6 @@ Go Further
Laderas, Ted. 2024. “Compose Plots with {Patchwork}.”
September 13, 2024.