TA tutorial, Machine Learning (2019 Spring)
- Package Requirements
- NumPy Array Manipulation
- PyTorch
- Start building a model
Note: This tutorial is written for PyTorch 1.0.1
- PyTorch == 1.0.1
- NumPy >= 1.14
- SciPy == 1.2.1
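You can quickly verify the installed versions before starting (a minimal check; the printed values should match the requirements above):
import numpy as np
import scipy
import torch

print(torch.__version__)   # expected: 1.0.1
print(np.__version__)      # expected: >= 1.14
print(scipy.__version__)   # expected: 1.2.1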
Below are some useful functions for managing your training data. We must carefully check that our data dimensions are logically correct.
-
np.concatenate((arr_1, arr_2, ...), axis=0)
Note that the shapes of the arrays in the sequence must be the same, except along the dimension corresponding to the given axis.
import numpy as np

# concatenate two arrays
a1 = np.array([[1, 2], [3, 4], [5, 6]])  # shape: (3, 2)
a2 = np.array([[3, 4], [5, 6], [7, 8]])  # shape: (3, 2)
# along axis = 0
a3 = np.concatenate((a1, a2), axis=0)    # shape: (6, 2)
# along axis = 1
a4 = np.concatenate((a1, a2), axis=1)    # shape: (3, 4)
-
np.transpose(arr, axes)
Mostly we use it to align the dimension of our data.
# transpose a 2D array
a5 = np.array([[1, 2], [3, 4], [5, 6]])  # shape: (3, 2)
np.transpose(a5)                         # shape: (2, 3)
We can also permute multiple axes of the array, as shown below.
a6 = np.array([[[1, 2], [3, 4], [5, 6]]])  # shape: (1, 3, 2)
np.transpose(a6, axes=(2, 1, 0))           # shape: (2, 3, 1)
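For example (a small sketch with dummy data): image data is often stored channel-last with shape (height, width, channels), while PyTorch convolution layers expect channel-first input, so a transpose is used to align the two.
img_hwc = np.random.rand(28, 28, 3)          # dummy image, shape: (H, W, C) = (28, 28, 3)
img_chw = np.transpose(img_hwc, (2, 0, 1))   # shape: (C, H, W) = (3, 28, 28)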
A torch.tensor is conceptually identical to a NumPy array, but with GPU support and additional attributes that enable PyTorch operations.
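For instance, a tensor can be created from a NumPy array and moved onto the GPU when one is available (a minimal sketch; the CUDA check is only an illustration):
import numpy as np
import torch

a = np.array([[1., 2.], [3., 4.]])
t = torch.from_numpy(a)        # tensor sharing memory with the NumPy array
back = t.numpy()               # convert back to a NumPy array
if torch.cuda.is_available():  # move the tensor to the GPU if CUDA is available
    t = t.to('cuda')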
-
Create a tensor
b1 = torch.tensor([[[1, 2, 3], [4, 5, 6]]])
-
Some frequently used functions
b1.size()           # check the size of the tensor
# torch.Size([1, 2, 3])
b1.view((1, 3, 2))  # same as reshape in numpy (same underlying data, different interpretation)
# tensor([[[1, 2],
#          [3, 4],
#          [5, 6]]])
b1.squeeze()        # removes all the dimensions of size 1
# tensor([[1, 2, 3],
#         [4, 5, 6]])
b1.unsqueeze(0)     # inserts a new dimension of size one at the given position
# tensor([[[[1, 2, 3],
#           [4, 5, 6]]]])
-
Other manipulation functions are similar to those in NumPy; we omit them here for brevity. For more information, please check the PyTorch documentation: https://pytorch.org/docs/stable/tensors.html
-
Some important attributes of torch.tensor
-
b1.grad           # gradient of the tensor
b1.grad_fn        # the gradient function that created the tensor
b1.is_leaf        # whether the tensor is a leaf node of the computation graph
b1.requires_grad  # if set to True, starts tracking all operations performed on it
torch.autograd is a package that provides automatic differentiation, computing the gradients of a scalar output with respect to the tensors that produced it.
For example:
-
Create a tensor and set requires_grad=True to track the computations performed on it.
x1 = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
# x1.grad           None
# x1.grad_fn        None
# x1.is_leaf        True
# x1.requires_grad  True
x2 = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
# x2.grad           None
# x2.grad_fn        None
# x2.is_leaf        True
# x2.requires_grad  True
It also enables the tensor to do gradient computations later on.
Note: Only tensors of floating-point dtype can require gradients.
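As a quick illustration of this note (a sketch): creating an integer tensor with requires_grad=True raises an error, while a floating-point tensor works.
# torch.tensor([1, 2, 3], requires_grad=True)   # raises a RuntimeError: integer dtypes cannot require gradients
torch.tensor([1., 2., 3.], requires_grad=True)  # fine, the values are floats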
-
Do some simple operation
z = (0.5 * x1 + x2).sum()
# z.grad           None
# z.grad_fn        <SumBackward0>
# z.is_leaf        False
# z.requires_grad  True
Note: If we view x1 as a matrix X1 and x2 as a matrix X2, then z = sum_{i,j} (0.5 * X1[i][j] + X2[i][j]).
-
Call the backward() function to compute the gradients automatically.
z.backward()  # this is identical to calling z.backward(torch.tensor(1.))
z.backward() is actually just the derivative of z with respect to the inputs (tensors whose is_leaf and requires_grad are both True). For example, the derivative of z with respect to x1 is dz/dx1[i][j] = 0.5 for every entry i, j (and similarly dz/dx2[i][j] = 1).
-
Check the gradients using .grad
x1.grad
x2.grad
The output will look like this:
tensor([[0.5000, 0.5000],   # x1.grad
        [0.5000, 0.5000]])
tensor([[1., 1.],           # x2.grad
        [1., 1.]])
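A side note (a sketch reusing x1 from above): .grad accumulates across calls to backward(), so in training loops the gradients are usually reset before the next backward pass, e.g. with .grad.zero_() or optimizer.zero_grad().
w = x1.sum()      # build a new scalar from x1
w.backward()      # adds dw/dx1 = 1 to the existing x1.grad
# x1.grad is now 0.5 + 1.0 = 1.5 in every entry
x1.grad.zero_()   # reset the accumulated gradient in place
Another side note on the torch.tensor(1.) argument shown earlier: when the output is not a scalar, backward() needs an explicit gradient tensor with the same shape as the output (a small sketch with fresh tensors):
v = torch.tensor([1., 2., 3.], requires_grad=True)
u = v * 2                        # u is not a scalar, so u.backward() alone would raise an error
u.backward(torch.ones_like(u))   # equivalent to u.sum().backward()
# v.grad is tensor([2., 2., 2.])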
A more in-depth explanation of autograd can be found in this awesome YouTube video: Link
PyTorch provides a convenient way of interacting with datasets through torch.utils.data.Dataset, an abstract class representing a dataset. When a dataset is large, the RAM on our machine may not be large enough to hold all the data at once. Instead, we load only the portion of the data we need at the moment and free it when we are finished with it.
A simple dataset is created as follows:
import csv
from torch.utils.data import Dataset
class MyDataset(Dataset):
def __init__(self, label_path):
"""
let's assume the csv is as follows:
================================
image_path label
imgs/001.png 1
imgs/002.png 0
imgs/003.png 2
imgs/004.png 1
.
.
.
================================
And we define a function parse_csv() that parses the csv into a list of tuples
[('imgs/001.png', 1), ('imgs/002.png', 0)...]
"""
        self.labels = parse_csv(label_path)
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
        img_path, label = self.labels[idx]
# imread: a function that reads an image from path
img = imread(img_path)
# some operations/transformations
return torch.tensor(img), torch.tensor(label)
Note that MyDataset inherits from Dataset. If we look at the source code, we can see that the default behavior of __len__ and __getitem__ is to raise a NotImplementedError, meaning that we should override them every time we create a custom dataset.
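Once both methods are overridden, the dataset can already be used directly: len() calls __len__ and indexing calls __getitem__ (a sketch; 'labels.csv' is a hypothetical path to the csv described above).
dataset = MyDataset('labels.csv')  # hypothetical path to the label csv
print(len(dataset))                # number of samples, via __len__
img, label = dataset[0]            # one (image, label) pair, via __getitem__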
We can iterate through the dataset with a for loop, but we cannot shuffle, batch, or load the data in parallel. torch.utils.data.DataLoader is an iterator that provides all of those features. We can specify the batch size, whether to shuffle the data, and the number of workers used to load the data.
from torch.utils.data import DataLoader
dataset = MyDataset('labels.csv')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
for batch_id, batch in enumerate(dataloader):
imgs, labels = batch
"""
do something for each batch
ex:
output = model(imgs)
loss = cross_entropy(output, labels)
"""
PyTorch provides nn.Module for easy definition of a model. A simple fully-connected model is defined as follows:
import torch
import torch.nn as nn
class MyNet(nn.Module):
def __init__(self):
super(MyNet, self).__init__() # call parent __init__ function
self.fc = nn.Sequential(
nn.Linear(784, 128),
nn.ReLU(inplace=True),
nn.Linear(128, 64),
nn.ReLU(inplace=True),
nn.Linear(64, 10),
)
self.output = nn.Softmax(dim=1)
def forward(self, x):
out = self.fc(x.view(-1, 28*28))
out = self.output(out)
return out
We let our model inherit from the nn.Module class. But why do we need to call super in the __init__ function, whereas in the Dataset case we don't? If we look at the source code of nn.Module, we can see that there are certain attributes needed in order for the model to work, and they are set up in nn.Module's __init__. In the case of Dataset, there is no __init__ function, so no super call is needed.
In addition, forward is not implemented by default, so we need to override it with our own forward propagation function.
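To see how the pieces fit together before looking at the full example, here is a minimal sketch that instantiates the model and runs one training step on a dummy batch (the data is random and the hyperparameters are arbitrary); note that calling model(x) invokes forward through nn.Module.__call__.
model = MyNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 1, 28, 28)   # a fake batch of four MNIST-sized images
labels = torch.randint(0, 10, (4,))  # fake labels
probs = model(images)                # calls forward(), output shape: (4, 10)

# the model already applies Softmax, so take the log and use the negative log-likelihood loss
loss = torch.nn.functional.nll_loss(torch.log(probs), labels)
optimizer.zero_grad()  # clear any accumulated gradients
loss.backward()        # compute gradients with autograd
optimizer.step()       # update the parameters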
A full example of a MNIST classifier: Link