delete machine learning
jkrumbiegel committed Jan 15, 2025
1 parent 894bda2 commit fa65c1a
Showing 1 changed file with 0 additions and 79 deletions.
79 changes: 0 additions & 79 deletions docs/src/generated/penguins.jl
@@ -226,82 +226,3 @@ draw(plt; axis = axis)
# Note that static 3D plots can be misleading, as they only show one projection
# of 3D data. They are mostly useful when shown interactively.
#
# ## Machine Learning
#
# Finally, let us use Machine Learning techniques to build an automated penguin classifier!
#
# We would like to investigate whether it is possible to predict the species of a penguin
# based on its bill size. To do so, we will use a standard classifier technique
# called [Support-Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine).
#
# The strategy is quite simple. We split the data into training and testing
# subsets. We then train our classifier on the training subset and use it to
# make predictions on the whole dataset. We then add the resulting columns
# to the dataset and visually inspect how well the classifier performed on both
# training and testing data.

using LIBSVM, Random

## use approximately 80% of penguins for training
Random.seed!(1234) # for reproducibility
N = nrow(penguins)
train = fill(false, N)
perm = randperm(N)
train_idxs = perm[1:floor(Int, 0.8N)]
train[train_idxs] .= true
nothing # hide
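# As a standalone sanity check of the masking pattern above (using a
# hypothetical `N = 100` rather than the actual number of penguin rows),
# exactly `floor(Int, 0.8N)` rows end up marked for training:

```julia
using Random

Random.seed!(1234)  # hypothetical seed, for reproducibility of this sketch only
N = 100             # hypothetical row count, not the real dataset size
train = fill(false, N)
perm = randperm(N)
train[perm[1:floor(Int, 0.8N)]] .= true
sum(train)          # 80 rows selected for training, 20 left for testing
```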

## fit model on training data and make predictions on the whole dataset
X = hcat(penguins.bill_length_mm, penguins.bill_depth_mm)
y = penguins.species
model = SVC() # Support-Vector Machine Classifier
fit!(model, X[train, :], y[train])
ŷ = predict(model, X)

## incorporate relevant information in the dataset
penguins.train = train
penguins.predicted_species = ŷ
nothing # hide

# Now, we have all the columns we need to evaluate how well our classifier performed.

axis = (width = 225, height = 225)
dataset = :train => renamer(true => "training", false => "testing") => "Dataset"
accuracy = (:species, :predicted_species) => isequal => "accuracy"
plt = data(penguins) *
expectation() *
mapping(:species, accuracy) *
mapping(col = dataset)
draw(plt; axis = axis)
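# What `expectation()` is averaging here is the elementwise `isequal`
# comparison, whose mean is exactly the classification accuracy. A minimal
# illustration in plain Julia, with made-up labels (not the real penguin data):

```julia
using Statistics

# hypothetical true and predicted labels, for illustration only
species           = ["Adelie", "Adelie", "Gentoo", "Chinstrap"]
predicted_species = ["Adelie", "Gentoo", "Gentoo", "Chinstrap"]

# the mean of the rowwise comparisons is the accuracy: 3 of 4 correct
accuracy = mean(isequal.(species, predicted_species))  # 0.75
```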

# That is a bit hard to read, as all values are very close to `1`.
# Let us visualize the error rate instead.

error_rate = (:species, :predicted_species) => !isequal => "error rate"
plt = data(penguins) *
expectation() *
mapping(:species, error_rate) *
mapping(col = dataset)
draw(plt; axis = axis)

# So our classifier is mostly doing quite well, but there are some mistakes,
# especially among `Chinstrap` penguins. Using the `species` and
# `predicted_species` mappings *at the same time* on different attributes, we
# can see which penguins are problematic.

prediction = :predicted_species => "predicted species"
datalayer = mapping(color = prediction, row = :species, col = dataset)
plt = penguin_bill * datalayer
draw(plt; axis = axis)

# Um, some of the penguins are indeed being misclassified... Let us try to understand why
# by adding an extra layer, which describes the density of the distributions of the three
# species.

pdflayer = density() * visual(Contour, colormap=Reverse(:grays)) * mapping(group = :species)
layers = pdflayer + datalayer
plt = penguin_bill * layers
draw(plt; axis = axis)

# We can conclude that the classifier is doing a reasonable job:
# it is mostly making mistakes on outlier penguins.
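# Beyond the visual inspection above, a confusion matrix gives a numeric
# summary of which species get mistaken for which. A minimal sketch in plain
# Julia (the labels below are made up for illustration, and `confusion` is a
# hypothetical helper, not part of this tutorial or any library used here):

```julia
# count occurrences of each (actual, predicted) label pair
function confusion(actual, predicted)
    counts = Dict{Tuple{String, String}, Int}()
    for (a, p) in zip(actual, predicted)
        counts[(a, p)] = get(counts, (a, p), 0) + 1
    end
    return counts
end

cm = confusion(["Adelie", "Chinstrap", "Chinstrap", "Gentoo"],
               ["Adelie", "Gentoo",    "Chinstrap", "Gentoo"])
cm[("Chinstrap", "Gentoo")]  # 1, the misclassified Chinstrap
```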
