Commit 894bda2 (parent 76b7a4f), committed by jkrumbiegel on Jan 15, 2025: add back full penguins
1 changed file: docs/src/generated/penguins.jl (206 additions, 0 deletions)
plt = penguin_bill * mapping(color = :species)
draw(plt; axis = axis)

# Ha! Within each species, penguins with a longer bill also have a deeper bill.
# We can confirm that with a linear regression:

plt = penguin_bill * linear() * mapping(color = :species)
draw(plt; axis = axis)

# This unfortunately no longer shows our data!
# We can use `+` to plot both things on top of each other:

plt = penguin_bill * linear() * mapping(color = :species) + penguin_bill * mapping(color = :species)
draw(plt; axis = axis)

# Note that the above expression seems a bit redundant, as we wrote the same thing twice.
# We can "factor it out" as follows:

plt = penguin_bill * (linear() + mapping()) * mapping(color = :species)
draw(plt; axis = axis)

# where `mapping()` is a neutral multiplicative element.
# Of course, the above could be refactored as follows:

layers = linear() + mapping()
plt = penguin_bill * layers * mapping(color = :species)
draw(plt; axis = axis)

# We could actually take advantage of the spare `mapping()` and use it to pass some
# extra information to the scatter plot, while still using all penguins of each species
# to compute the linear fit.

layers = linear() + mapping(marker = :sex)
plt = penguin_bill * layers * mapping(color = :species)
draw(plt; axis = axis)

# This plot is getting a little bit crowded. We could instead show female and
# male penguins in separate subplots.

layers = linear() + mapping(col = :sex)
plt = penguin_bill * layers * mapping(color = :species)
draw(plt; axis = axis)

# See how both plots show the same fit, because the `sex` mapping is not applied
# to `linear()`. The following on the other hand produces a separate fit for
# males and females:

layers = linear() + mapping()
plt = penguin_bill * layers * mapping(color = :species, col = :sex)
draw(plt; axis = axis)

# ## Smooth density plots
#
# An alternative approach to understanding how two variables interact is to consider
# their joint probability density function (pdf).

using AlgebraOfGraphics: density
plt = penguin_bill * density(npoints=50) * mapping(col = :species)
draw(plt; axis = axis)

# The default colormap is multi-hue, but it is possible to pass single-hue colormaps as well.
# The color range is inferred from the data by default, but it can also be set manually.
# Both settings are passed to `draw` via `scales`: because multiple plots can share the same
# colormap, `visual` is not the appropriate place for these settings.

draw(plt, scales(Color = (; colormap = :grayC, colorrange = (0, 6))); axis = axis)

# A `Heatmap` (the default visualization for a 2D density) is a bit unfortunate if
# we want to mark species by color. In that case, one can use `visual` to change
# the default visualization and, optionally, fine-tune some of its arguments.
# Here, a `Wireframe` with thin lines looks quite nice. (Note that, for the
# time being, we must specify explicitly that we require a 3D axis.)

axis = (type = Axis3, width = 300, height = 300)
layer = density() * visual(Wireframe, linewidth=0.05)
plt = penguin_bill * layer * mapping(color = :species)
draw(plt; axis = axis)

# Of course, a more traditional approach would be to use a `Contour` plot instead:

axis = (width = 225, height = 225)
layer = density() * visual(Contour)
plt = penguin_bill * layer * mapping(color = :species)
draw(plt; axis = axis)

# The data and the linear fit can also be added back to the plot:

layers = density() * visual(Contour) + linear() + mapping()
plt = penguin_bill * layers * mapping(color = :species)
draw(plt; axis = axis)

# With many layers (contour, linear fit, and scatter) it is important to think
# about balance. In the above plot, the markers are quite heavy and can obscure the linear
# fit and the contour lines.
# We can lighten the markers using alpha transparency.

layers = density() * visual(Contour) + linear() + visual(alpha = 0.5)
plt = penguin_bill * layers * mapping(color = :species)
draw(plt; axis = axis)

# ## Correlating three variables
#
# We are now mostly up to speed with `bill` size, but we have not considered how
# it relates to other penguin features, such as their weight.
# For that, a possible approach is to use a continuous color
# on a gradient to denote weight and different marker shapes to denote species.
# Here we use `group` to split the data for the linear regression without adding
# any additional style.

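## `column => transformation => label`: convert body mass from grams to kilograms and relabel the axis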
body_mass = :body_mass_g => (t -> t / 1000) => "body mass (kg)"
layers = linear() * mapping(group = :species) + mapping(color = body_mass, marker = :species)
plt = penguin_bill * layers
draw(plt; axis = axis)

# Naturally, within each species, heavier penguins have bigger bills, but, perhaps
# counter-intuitively, the species with the shallowest bills features the heaviest penguins.
# We could also try to see the interplay of these three variables in a 3D plot.

axis = (type = Axis3, width = 300, height = 300)
plt = penguin_bill * mapping(body_mass, color = :species)
draw(plt; axis = axis)

#

plt = penguin_bill * mapping(body_mass, color = :species, layout = :sex)
draw(plt; axis = axis)

# Note that static 3D plots can be misleading, as they only show one projection
# of 3D data. They are mostly useful when shown interactively.
#
# ## Machine Learning
#
# Finally, let us use Machine Learning techniques to build an automated penguin classifier!
#
# We would like to investigate whether it is possible to predict the species of a penguin
# based on its bill size. To do so, we will use a standard classifier technique
# called [Support-Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine).
#
# The strategy is quite simple. We split the data into training and testing
# subsets. We then train our classifier on the training subset and use it to
# make predictions on the whole dataset. We then add the resulting columns
# to the dataset and visually inspect how well the classifier performed on both
# the training and the testing data.

using LIBSVM, Random

## use approximately 80% of penguins for training
Random.seed!(1234) # for reproducibility
N = nrow(penguins)
train = fill(false, N)
perm = randperm(N)
train_idxs = perm[1:floor(Int, 0.8N)]
train[train_idxs] .= true
nothing # hide

## fit model on training data and make predictions on the whole dataset
X = hcat(penguins.bill_length_mm, penguins.bill_depth_mm)
y = penguins.species
model = SVC() # Support-Vector Machine Classifier
fit!(model, X[train, :], y[train])
ŷ = predict(model, X)
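
# As a quick numeric sanity check (a small addition to the tutorial flow, using only the
# variables defined above), we can compute the overall fraction of correct predictions
# before breaking the results down per group:

count(ŷ .== y) / length(y)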

## incorporate relevant information in the dataset
penguins.train = train
penguins.predicted_species = ŷ
nothing # hide

# Now, we have all the columns we need to evaluate how well our classifier performed.

axis = (width = 225, height = 225)
dataset = :train => renamer(true => "training", false => "testing") => "Dataset"
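## `(:species, :predicted_species) => isequal` applies `isequal` row by row to the two
## columns, yielding `true` for correctly classified penguins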
accuracy = (:species, :predicted_species) => isequal => "accuracy"
plt = data(penguins) *
expectation() *
mapping(:species, accuracy) *
mapping(col = dataset)
draw(plt; axis = axis)

# That is a bit hard to read, as all values are very close to `1`.
# Let us visualize the error rate instead.

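## `!isequal` is the negated predicate: `true` exactly when the prediction differs from the species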
error_rate = (:species, :predicted_species) => !isequal => "error rate"
plt = data(penguins) *
expectation() *
mapping(:species, error_rate) *
mapping(col = dataset)
draw(plt; axis = axis)

# So, our classifier is mostly doing quite well, but there are some mistakes,
# especially among `Chinstrap` penguins. By mapping `species` and `predicted_species`
# to different attributes *at the same time*, we can see which penguins
# are problematic.

prediction = :predicted_species => "predicted species"
datalayer = mapping(color = prediction, row = :species, col = dataset)
plt = penguin_bill * datalayer
draw(plt; axis = axis)

# Um, some of the penguins are indeed being misclassified... Let us try to understand why
# by adding an extra layer, which shows the estimated density distribution of each of
# the three species.

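## as before, `group = :species` splits the density estimate per species without adding a new style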
pdflayer = density() * visual(Contour, colormap=Reverse(:grays)) * mapping(group = :species)
layers = pdflayer + datalayer
plt = penguin_bill * layers
draw(plt; axis = axis)

# We can conclude that the classifier is doing a reasonable job:
# it is mostly making mistakes on outlier penguins.
