D2L Torch
Various Deep Learning Architectures and Paradigms
Learning PyTorch through the D2L book. A series of notebooks for the same I created this repo, as a code along space for the textbook. The book’s original Github can be found here. I have made a few modifications to the code in the book and added comments and references from the text in the notebooks wherever I thought would be helpful. I have also debugged some of the issues I ran into while running original code.
D2LTorch
Learning PyTorch through the D2L book. A series of notebooks for the same
@article{zhang2021dive,
title={Dive into Deep Learning},
author={Zhang, Aston and Lipton, Zachary C. and Li, Mu and Smola, Alexander J.},
journal={arXiv preprint arXiv:2106.11342},
year={2021}
}
Hi there! I created this repo, as a code along space for the textbook. The book's original Github can be found here. I have made a few modifications to the code in the book and added comments and references from the text in the notebooks wherever I thought would be helpful. I have also debugged some of the issues I ran into while running original code.
Since I have some experience in Deep Learning (with Tensorflow), my aim here was to learn PyTorch through the book and I'd recommend this repo to those who want to learn torch and already know deep learning in other frameworks. For absolute beginners I'd strongly recommend reading the book
Here, is what each notebook contains and the PyTorch constructs you will find in each of them:
Basics [↩
]

 A simple linear regression model
 Learn about
torch.utils.data.TensorDataset
,torch.utils.data.DataLoader
,nn.Sequential
,nn.Linear
,nn.MSELoss
,torch.optim.SGD

 A simple softmax regression model on Fashion MNIST
 Learn about
torchvision.Transforms
,torchvision.datasets
,nn.CrossEntropyLoss
,nn.init.normal_

 A simple Multilayer Perceptron model on Fashion MNIST
 Learn about
nn.Flatten
,nn.ReLU

 A Multilayer Perceptron model with Dropout on Fashion MNIST
 Learn about
nn.Dropout
,nn.ReLU
and how to create a Dropout Layer from scratch

 Some important utilities and creating custom layers
 Learn about
nn.init.xavier_
, Extracting network parameters in different ways, initializing shared parameters, Xavier initialization, subclassing nn.Module,saving and loading tensors/whole models and GPUs  The runtime error here is intentional and serves to show why tensors must be on the same device (same CPU or same GPU) during computations.
Deep Learning for Vision Basics [↩
]

 Creating convolutions from scratch, padding, pooling and LeNet

Learn about
nn.Conv.2d
,nn.AvgPool2d

 Creating AlexNet, VGG11, NiN (Network in Network), Google LeNet (Inception),Batch Normalization layer, ResNet and DenseNet from scratch
 Learn about
nn.BatchNorm2d
,nn.AdaptiveAvgPool2d
(Global Average Pooling)
Deep Learning for Text Basics [↩
]
 D2L_Text_Basics.ipynb:
 Creating a tokenizer and vocabulary, random & sequential sampling and Sequence Data Loader
 Learn about corpus statistics (unigrams,bigrams,trigrams)

 character level RNNs,GRUs,LSTMs,stacked models, bidirectional models
 Learn about
torch.nn.functional.one_hot
,nn.RNN
,nn.GRU
,nn.LSTM
and when NOT to use bidirectional networks

 GRU and LSTM based encoderdecoder machine translation
 Learn about generating dataloaders for machine translation, padding , truncating, endtoend neural machine translation and BLEU evaluation

 Nadaraya Watson Kernel Regression, (nonparametric and parametric attention pooling)
 Learn about an early form of attention for regression problems, both parametric and nonparametric.
 D2L_Attention_Mechanisms.ipynb:
 Different Attention Mechanisms used in Seq2Seq Models
 Learn about Attention used in NMT, Bahdnau Attention (Additive Attention form) in GRU based Seq2Seq Models, Self Attention , MultiHeaded Attention, Positional Encodings (Traditional sinusoidal) and the Transformer.
Optimization and Distributed Training [↩
]

 Different Optimization scenarios and algorithms.
 Learn about GD, SGD, Variable Learning rates, Minibatch GD ,Preconditioning , Momentum , AdaGrad , RMSProp, Adadelta, Adam , Yogi , Schedulers and Policies
 Note that the book goes into much more mathematical detail and includes proofs (That I have left out).

D2L_Distributed_Training.ipynb:
 Speeding up training using distibuted GPU training, synchronizing computation, PyTorch's auto parallelism
 Learn about Hybrid Programming, Asynchronous Computation, Automatic Parallelism, MultiGPU training,
nn.parallel.scatter
,nn.DataParallel
,torch.cuda.synchronize
 Note that a few sections in the notebook won't run on colab due to the requirement of multiple GPUs (atleast 2)
Deep Learning Applications for Computer Vision [↩
]

 Augment Datasets using for robust training and learn how to finetune for imageclassification.
 Learn augmenting through
torchvision.transforms.RandomHorizontalFlip
,torchvision.transforms.RandomVerticalFlip
,torchvision.transforms.RandomResizedCrop
,torchvision.transforms.ColorJitter
and setting multiple learning rates for training different parameters of the same network and a small trick to further improve finetuning image classes found in ImageNet.
 D2L_Object_Detection.ipynb:
 Learn about multiscale object detection using a Single Shot Detector (from scratch). Note this is one of the tougher and more mathematical notebooks.
 Learn about generating anchor boxes at different scales and ratios, sampling pixels uniformly from images as centres to generate anchor boxes, an algorithm to map the generated anchor boxes to ground truth bounding boxes,Nonmaximal supression, the SSD architecture from scratch, and the object detection loss function.
 This is one of the best chapters in the book since its very detailed and delves into the nittygritties of object detection. I spent a lot of time on this notebook due to its size. One tip I'd like to give here is keep track of the dimensions of tensors at all times since there are so many functions and transformations. Some functions in here are gold mines, they are scalable and blazing fast.

D2L_Semantic_Segmentation.ipynb:
 Semantic Segmentation using a CNN on the PASCAL VOC 2012 dataset.
 Learn about preprocessing data for semantic segmentation and transposed convolutions.
 May require high RAM (unless you drastically reduce batch_size)

D2L_Neural_Style_Transfer.ipynb:
 Neural Style Transfer using a VGG19.
 Learn about the features needed to retain the content style and those to inject styles from the style image. Learn about the content loss and the gram matrix.

 Efficient Net B0 on CIFAR10.
 Focus isn't on accuracy but organizing the project,splitting the dataset properly, calculating mean and standard deviation for a dataset (to normalize the data) and using pretrained models.

 Deit tiny(FAIR) on Dog Breed Identification Dataset.
 Learn how to use HuggingFace and DEIT (a better performing variant of the vision transformer) on a custom dataset. Learn how to convert a custom image dataset to a HuggingFace image dataset and use the HuggingFace Trainer to train the model.
 I have added the Deit model, the book uses a traditional pretrained CNN.
Deep Learning Applications for NLP [↩
]

 Train a Skipgram model from scratch.
 Learn about sampling techniques, negative sampling for efficient training, word vectors, dealing with highfrequency words.

 Use GloVe and fasttext.
 Learn about using word embeddings to calculate similarities and analogies between words.

 Write BERT from scratch and pretrain it on WikiText2 using the MLM and NSP objective.
 Learn how to write BERT from scratch, how to use
torch.einsum
and coding the Masked Language Modelling objective and the Next Sentence Prediction Objective  One of my favourite chapters in the book!
 I refactored the code in the book to incorporate
Einops
which makes it more readable, easier to understand, faster to code tensor operations and rearrangements and reduces the need to write a few extra classes and functions. I also decided to refactor theMasked Softmax
class in the book and make it a part of theMultiHeadAttention
Class  Note that since I had access to a Tesla V100, I decided to use a full scale BERT architecture but for the free colab tier, stick to a BERT with 2 layers, 128 hidden dims and 2 attention heads.

 Train a Bidirectional LSTM and a 1D CNN (TextCNN) on the IMDB dataset and tune hyperparameters with Optuna.
 Learn how to use pretrained embeddings in your models, using a BiLSTM, multiple variations of 1D CNNs including those with positional embeddings (sinusoidal and learnable). Also see how we can use Optuna to tune the hyperparameters of a model.
 I have added the code to tune the hyperparameters (it is not a part of the book) and added a few variations based on the Exercise in the book
 Note that tuning hyperparameters requires a powerful GPU so feel free to skip that section.

 Train a decomposable attention model on the Stanford Natural Language Inference dataset.
 Learn how to use a MLP and attention to train an efficient model.

 Tuned BORT) from the Hugging Face hub on the Stanford Natural Language Inference dataset.
 Learn how to use a HuggingFace model, create a custom dataset compatabile with HuggingFace and use dynamic padding
 You can use any model of your choice, I just wanted to experiment with BORT
 I was very lucky to gain access to a NVIDIA A100 40 GB GPU and would recommend that you significantly reduce the batch size in the notebook on regular colab instances.
 Also note how always using a transformer model does not help. In the previous notebook, decomposable attention reached a much better accuracy score and more importantly took 45 minutes lesser to train.
Generative Deep Learning [↩
]
 D2L_GANs.ipynb:
 An introduction to Generative Computer Vision, trained a GAN to sample out of a particular Gaussian, and a DCGAN to generate Pokemon
 Learn about Generators, Discriminators, an efficient way to write the training loop for GANs
 Modified some of the code in the book to adopt some techniques from Sharon Zhou's CS236G (Stanford)
Recommender Systems [↩
]
 Work In Progress

Porting MXNet code to vanilla PyTorch

D2L_Matrix_Factorization.ipynb:
 Matrix Factorization on MovieLens100k
 Learn about the Matrix Factorization algorithm, creating the dataset and dataloaders for Recommender System

 Collaborative Filtering through an AutoEncoder on MovieLens100k
 Learn about the nonlinear neural network collaborative filtering model, AutoRec [Sedhain et al., 2015]. It identifies collaborative filtering (CF) with an autoencoder architecture and aims to integrate nonlinear transformations into CF on the basis of explicit feedback