Deep Learning Part 2

Automated Machine Learning

The idea is to, given some task and its dataset, find a ML algorithm (including the five pieces shown below) that solves a problem


where the training data/task/solution could come from Kaggle, as a label composed of the following 5 pieces:


  • pre-processing: which imputation method if needed?
  • feature extraction: which encoding to use? backbone?
  • feature selection: should we do dimensionality reduction?
  • model/predictor: which classifier?
  • post processing

Such an idea of automatically finding best tools could be applied to a range of similar tasks such as:

  • automated activation function finding:


    where essentially we given in building blocks to compose some activation function, and ask it to find a best one

  • automated optimizer finding, where again we feed in building blocks for an optimizer


    and under some tasks, the best solution could be:

    \[\text{PowerSign} = \alpha^{f(t)*\text{sign}(g)*\text{sign}(m)}*g\]

    for $g$ being gradient and $m$ being momentum.

  • automated data augmentation finding. Again, consider a space of possible actions:


  • automated algorithm selection

  • automated hyperparameters finding: Grid Search v.s. Random Search

  • automated architecture finding

  • and etc.


The idea is to have:

  • input some task + data of a problem
  • output a ML system that you can use to do predictions


Some nowadays implemented ones look like:

  • autosklearn

    import autosklearn.classification # classification task
    cls = autosklearn.classification.AutoSklearnClassifier(), Y_train)
    predictions = cls.predict(X_test)
  • flaml

  • h20

  • etc.

Note that this of course does not really find the best tuning of the model, it only finds some pipeline/architecture. To find the best hyperparameters, checkout the section on Automated Hyperparameter Finding

Meta Data

How do we represent the task/dataset/solution as a feature in the AutoML system?

Instead of putting the entire task/dataset as it, consider computing some meta data about the task/dataset/solution.

Hence we consider:


for instance, metadata features for dataset include:

  • number of data points
  • number of features
  • number of missing values
  • class entropy
  • etc.

Hyperparameters Finding

Certain hyperparameters we commonly care about are:

  • learning rate $\alpha$
  • momentum $\beta$
  • minibatch size
  • learning rate decay
  • etc.

The easiest approach is to consider grid search. However, this is often inefficient as it assumes each parameter being equally important.

Grid Search Random Search
image-20220322153943284 image-20220322153946013

consider random search

  • not all hyperparameters are equally important. For example, lr might be more important than decay coefficient $\gamma$.

  • if you look at the range of values covered by important parameter, grid search only searched $3$ possible values, whereas random search searched $6\sim 7$ different values

Several other ways include:

  • adaptive search in certain regions where we found good candidates:


    i.e. make the search more dense in that area

  • converting the problem to a RL task. Consider this as a multi-arm bandit: we pull an arm (sample in the hyperparameter space), the reward is the performance/training time, etc

  • Bayesian optimization. see section Bayesian Optimization.

Algorithm and Hyperparameter Selection

Instead of finding the algorithm and the hyperparameters, consider finding them jointly.

Given a dataset $D$, find the algorithm $A$ (e.g. linear regression, decision tree, etc) and a hyperparameter such that we minimize the loss of $A$ on $D_{test}$ when trained on $D_{train}$:

\[A^*_{\theta^*} = \arg\min_{A,\theta} \sum \mathcal{L}(A_\theta, D_{train}, D_{valid})\]

The algorithms and hyperparameters you can search through could be:


Ensemble of Models

Idea: When you have multiple potential models for a single task, instead of choosing one single model, we can keep all models.

The simplest ensembling is the following: given a test sample point $x$:

  • for each model $m$: perform prediction for probability of $x$ in the $j$-th class, i.e. $p^{(j)}$
  • for each class $j$: do averages of probability for each model
  • take $\arg\max_j$ to split out the prediction
\[y = \arg\max_j \frac{1}{m} \left( \sum_{i=1}^m p_i^{(j)} \right)\]

This would be the simplest form of assembling. Other ways would be stacking

Model Stacking

One problem with performing the simple ensemble above is that, for a new given test data $x$, you will need a central place to store all the models. This might not be efficient and desirable.

Goal: is there a way to collaborate without sharing a central repo for code/models?

Idea: consider building a super model that trains on outputs of each model.

This means that, for each model $m$:

  • define the same CV folds for all models
  • ask each model to keep the predictions for each fold $i$.
CV Fold 1 CV Fold 2
image-20220324154011972 image-20220324154344375

Then the idea is to take the predictions from each model as a meta-feature that you can feed in to a super-model:


The key idea is so that:

  • we only need each team for model $m$ to work on their own
  • building a super model then is just build a model, we don’t need to look into codes of other teams

Developing novel neural architectures manually is time consuming, error prone. Now you can use automatic methods for searching for neural network architectures: AutoKeras (CNN, RNN), AutoGAN, AutoGNN.

For deep learning models, we might consider combining many different algorithms in a pipeline such as CNN as backbone and then Transformers.

Most packages cover two possible architecture shape:

Chain DAG
image-20220324155629798 image-20220324155651998

However, another approach would be a RL like task:

  • ask NN controller to generate an architecture (take action)
  • evaluate performance (get reward)
  • get feed back to NN controller (update)



DARPA Data Driven Discovery of Models (D3M): solve any well defined ML task on any dataset specified by a user.

  1. Broad set of computational primitives as machine learning building blocks.
    • e.g. implementing PCA, Linear layers, etc.
  2. Automatic systems for machine learning, synthesize pipeline and hyperparameters to solve a previously unknown data and problem.
    • given a test dataset and task in a Docker image, ask your code to generate a model and predict
  3. Human in the loop user interface that enables users to interact with and improve the automatically generated results.
    • maybe user only wants a subset of models to choose from, etc.

The outcome from such a project are two methods where you either:

  • Gradient-based: have differentiable primitives, build a DAG, and optimize end-to-end using backprop:


  • Reinforcement-based: since essentially it is a search in high dimensional space,

    Idea Parallel in RL
    image-20220324160954905 image-20220324161009596

    in this approach, the entire pipeline is a single state. Therefore we need some kind of encoding for the pipeline, which can be done by:


A brief comparison on performance


where the horizontal line is the gradient-based method, we see that RL-based in doing better.

RL-based Pipeline Synthesis

Recall that

Deep RL Pipeline Synthesis
image-20220324161652671 image-20220324161712502

where essentially:

  • DNN now receives an entire pipeline, meta features and task as input, and estimates action probabilities as well as rewards, i.e. performance
  • MCTS then receives the data from DNN and search for next action
  • an action here would be modification/insertion/deletion of a certain primitive in the pipeline.

However, this tree search by default could also hit on certain meaningless pipelines such as estimator and then preprocessing. Hence you can enforce a grammar to only accept valid ones, i.e. giving some prior knowledge.

The performance becomes better


In the end, an ablation study is done on using the DNN or not. The result is similar but time performance is drastic


basically just MCTS would be very slow.

Matrix Completion and AutoML

It turns out the idea behind AutoML can be extended to/related to many applications in real life.

Another task would be a recommendation system, where given some past user’s ratings/history, fill in how likely users will like a certain film.

The problem can be setup as a matrix:

  • each user being $j=1,..,p$
  • each item, e.g. movie, being $i=1,…,n$
  • each cell represent rating $Y_{ij}$ a user $j$ gave to item $i$


where this would be a sparse matrix in general as many users won’t see all contents.

  • the task is then to complete the matrix, so that we can perhaps recommend to users

In terms of AutoML, we are basically doing:


where if fully filled, we can find the best algorithms given a dataset.

Naive Solution of Filling Matrix:

Let each item have a known feature vector $\vec{x}^{(i)}$. We want to learn some parameters $\theta^{(j)}$ of each user and assume that $\theta^{(j)^T}x^{i}$ would model their preference score. Then:

\[\min_{\theta^{(j)}} \frac{1}{2} \sum_{i:M_{ij}=1} \left(\theta^{(j)^T}x^{i} - Y_{ij}\right)^2 + \text{Regularize}(\theta^{(j)})\]

note that since we only know part of the matrix:

  • $i:M_{ij}=1$ means we are only fitting on the known parts
  • then we can predict once we learnt $\theta^{(j)}$ for each user $j$

This model is actually the simple content-based recommendation system


Problem: In tasks AutoML needs to solve, we are usually given a unknown/new dataset and task. I.e. we are having a new item with unknown $\vec{x}^{i}$.

Then, the idea is to do a EM-like approach:

  • assume you know $x^{(i)}$, fit $\theta^{(j)}$
  • then once you get $\theta^{(j)}$ fit $x^{(i)}$ back

So the entire task can be seen as

\[\min_{\theta^{(j)},x^{(i)}} \frac{1}{2} \sum_{i:M_{ij}=1} \left(\theta^{(j)^T}x^{i} - Y_{ij}\right)^2 + \text{Regularize}(x^{(i)}) + \text{Regularize}(\theta^{(j)})\]

which then we can fill in all items in the matrix. Note that:

  • this model only using dot products can be rephrased into a lower rank decomposition and imputation


The upshot is that for AutoML, we can keep past experience with different dsets and architecture as elements in the matrix, and treat the matrix as a warm start for MCTS search. (i.e. use that as initialization)

Using Meta Embeddings

When approaching an ML or DS problem humans read the documentation

  • Description of data and task
  • Description of machine learning functions available for solution
  • How can we allow an AutoML method to “read the descriptions or manual”? (in addition of the actual datasets)

Use large scale transformer models to represent descriptions of the dataset, task, etc, in addition to the dataset itself.

It turns out that just using the metadata could do equally well even if you haven’t seen the actual dataset


One potential representation would be computing a graph embedding: embedding of dataset that are similar should be close.

Dataset Embedding AutoML Architecture with Meta data
image-20220324180536406 image-20220324140500517


  • nodes in a graph would represent a dataset
  • edge would be based on embedding of dataset descriptions and meta-features
  • this is bascially the approach now: using GNN combined into AutoML systems

Then you use assembling to those AutoML systems as well


Bayesian Optimization

Here we cover another popular technique other than random search and grid search for hyperparameter selection.

Goal: Given some model and a loss function, select model hyperparameters such that the loss would be small.

We would want:

\[z^* = \arg\min \mathcal{L}(M_z)\]

and we want to approximate best hyperparameter $z^*$ that minimizes the loss within only a few steps $z_1,…,z_n$.

The idea is:

  1. have some prior on the objective function, say $f$, which could be the loss/some performance metric

  2. each time we evaluate $f$ at a new point $z_i\equiv x_i$, we update our model for $f(x)$

  3. use posterior to derive/update acquisition function $\alpha(x)$

    \[\alpha(x) = \mu(x) - k \sigma(x)\]

    where $\mu(x),\sigma(x)$ are mean and std. of posterior at point $x$. Here $k$ controls the trade of between exploration and exploitation:

    • if small $k$, then the $\arg\min$ in the next step favors small $\mu(x)$, o.e. exploitation
    • if large $k$, then exploration
  4. use the acquisition function to determine the next point

    \[x_{i+1} = \arg\min \alpha(x)\]

    and continue from step 2

Graphically, suppose we have already sampled three points $x_1,x_2,x_3$, e.g. learning rate. We are interested in which point to sample next:



  • the blue part would be the confidence intervals
  • the black line would be the true function which we don’t know

Now, depending how we choose $k$, the acquisition function could look differently:


For instance:

Exploitation In between Upper Confidence Bound
image-20220324182818062 image-20220324182751888 image-20220324182745786

where the

  • exploitation would result in $x_{i+1}$ in lower confidence bound

The algorithm hence looks like


And in practice, sklearn already implemented such a search. It has a function which uses a flag to either do Grid Search, Random Search, or Bayesian Optimizations.

Multi-Task Learning

In the real word, solving a problem = solving many tasks at the same time.

For instance, consider self-driving cars


where we need to do:

  • object detection for pedestrians, other cars, etc
  • RL learning driving
  • dehazing if needed
  • etc.

Some questions we want to answer here include:

  • do we want to build predictors doing independent work for each task?


  • independent predictors but using the same backbone?


  • or maybe one model for all tasks? i.e. one neural network for learning multiple tasks: all-in-one

Independent Network

The simplest approach for solving $N$ different task would be to have the following architecture


Some advantages

  • simple to build
  • do not need to

some disadvantages

  • no sharing hence no positive transfers
  • not resource efficient

Positive Transfer refers to the facilitation, in learning or performance, of a new/second task based on what has been learned during a previous one.

Negative transfer refers to any decline in learning or performance of a second task due to learning a previous one.

Multi-Task Learning Questions

In reality, we have multiple heterogeneous tasks with different importance, difficulty, number of samples, noise level

Some questions you might think about to develop an architecture:

  • How do tasks influence one another? Does each task help the other tasks? or is there negative transfer?
    • e.g. if we train backbone + detection head 1 on pedestrian detection, will it hurt the performance for same backbone + detection head 2 on car detection?
    • Are the tasks similar? Heterogeneous?
  • How to share weights between different tasks?
  • How does network size influence MTL?
  • How does dataset size and distribution of number of samples per task influence MTL?

Zero/One/Few Shot Learning

This is relevant in transfer learning. I.e. suppose you have fitted your model on some data from domain $D_A$. You want to see how well it can perform on a different domain $D_B$:

Few-shot learning aims for ML models to predict the correct class of instances when a small number of examples are available in the $D_B$.

$N$-shot learning aims to predict the correct class when $N$ examples for each class in the $D_B$.

Zero-shot learning aims to predict the correct class without being exposed to any instances belonging to that class in $D_B$.

In general you have $k$-shot $n$-ways, for $n$-ways being the number of classes.

Why is this relevant? One way to see it is that if our model can perform well on few/one-shot, then:


all we need is a single model with a few-shot examples to solve multiple tasks.

  • Advantage: models trained on multiple tasks at once are more robust to adversarial attacks on individual tasks
  • Disadvantage: hard to get it to work

Shared Backbone

Consider the setup of:

  • input $x$
  • need to solve $T$ tasks, $t=1,…,T$
  • output of each task is $y_t$

Suppose we are given $N$ training data, being ${x^{(i)}, y_1^{(i)},…, y_T^{(i)}}_{i=1}^T$. We consider the architecture of



  • again the backbone would take the role of feature extractor
  • we can also see this as the backbone being “encoder”, and the three tasks being three different “decoders”

  • Advantages: efficient runtime
  • Disadvantages: could be over-sharing, negative transfer

Then specifically, we have:

  • shared backbone network $f$ with parameter $\theta_s$

  • task-specific decoder network would be $g_t$ for each task $t$, with parameter $\theta_t$

  • task-specific loss would be defined as

    \[\mathcal{L}_t(\theta_t, \theta_s)=\frac{1}{N}\sum_i \mathcal{L}_t(g_t(f(x;\theta_s);\theta_t),y_t)\]

Note that though for each task you can define your own loss, what happens when we update $\theta_s$? There will be $T$ different losses for a single $\theta_s$. How do we update that?

Linear Scalarization of MTL

Aim: we want to solve for the best combination of all parameters $\theta_s$ and $\theta_{t_1},…,\theta_{t_T}$.

The problem we are considering is the following and we want to solve for the best $\theta$ set:

\[\min_\theta \mathcal{L}(\theta)\equiv\min_{\theta_i} \mathcal{L}(\theta_s,\theta_{t_1},...,\theta_{t_T})\]

But the key problem here is that solutions may not be comparable to begin with:

  • consider $\theta,\theta’$ being two sets of solution such that


    so that this the $\theta$ set performs better in task 1, but


    in task 2, the set $\theta’$ performs better.

Therefore, how do we define an ordering for the parameters to begin with?

One simple idea is to consider a linear combination

\[\mathcal{L}(\theta)\equiv \mathcal{L}(\theta_s,\theta_1,...,\theta_T) \to \sum_t \alpha_t \mathcal{L}_t(\theta_t,\theta_s)\]

which gives a well-defined ordering. Then we can update/solve by:

\[\min_{\theta}\mathcal{L}(\theta) =\min_{\theta} \sum_t \alpha_t \mathcal{L}_t(\theta_t,\theta_s)\]

but how do we find an $\alpha$?

Multi-Objective Optimization

Another formulation/view point of the task is to consider

\[\min_{\theta_i} \mathcal{L}(\theta_s,\theta_{t_1},...,\theta_{t_T}) = \min_{\theta_i}(\mathcal{L}_{t_1}(\theta_s,\theta_{t_1}),...,\mathcal{L}_{t_T}(\theta_s,\theta_{t_T}))\]

This can be generalized into the question that we have a system of $T$ functions $f_t(x) = \mathcal{L}(\theta)$, then we can define a partial ordering for $y \in \mathbb{R}^T = (f_1(x),…,f_T(x))$ such that:

\[y \prec y' \iff f_i(x) < f_i(x')\,\,\forall i\]

in other words, $y$ wins/is smaller if it is smaller in all dimension. Then applied in our case, we essentially have “many optimal solutions” forming a Pareto frontier.


where notice that:

  • $A,B$ are pareto frontiers since no other point can perform better than them in all dimension, $f_1, f_2$ in this case.
  • $C$ is not because it could be out-performed by $A,B$ in both dimensions.

Theorem: All pareto optimal points are Pareto stationary.

Theorem: A point $x$is pareto stationary if there exists $\alpha \in \mathbb{R}^T$ such that

\[\sum_t \alpha_t \nabla f_t(x)=0,\quad \sum_t \alpha_t=1,\forall\alpha_t \ge 0\]

Therefore, this gives us a way to define how to find a meaningful $\alpha$, which comes down to:

\[\min_\alpha \left\| \sum_t \alpha_t \nabla f_t(x) \right\| \to \min_\alpha \left\| \sum_t \alpha_t \nabla \mathcal{L}_t(\theta_s, \theta_t) \right\|\]

under the constraint that $\sum_t \alpha_t=1,\forall\alpha_t \ge 0$.

Negative Transfer

Up to today, individual networks are performing better than shared networks.

The cause is mainly due to negative transfer happening:

  • One task may dominate training
  • Tasks may learn at different rates
  • Gradients may conflict

So in general it is important to determine the relationships between tasks, to tell if a shared architecture works

Architectures in Shared Networks

In general we have:

  • Hard parameter sharing, which we have seen


  • Soft parameter sharing

    Instead we can imitate individual networks but allow sharing:

    Hard Sharing Soft Sharing
    image-20220329160407600 image-20220329160450920

    Does not scale well with number of tasks

  • Ad-hoc sharing


    so we compute task relatedness, then iteratively group network

    This tend to have better performance than soft or hard sharing

  • learning to route


    where we learn separate execution paths for different tasks

Combinatorial Optimization Problem

Which Tasks Should Be Learned Together in Multi-task Learning? If we want one model doing more than one thing?

Consider this problem as a bipartite matching

Problem Approximate Solution
image-20220329135502879 image-20220329135538244

Then the question can be formulated as:

  • Tasks being $t_1,…,t_T$ having $T$ nodes
  • NN Network having $n$ nodes, each would have a cost to use $c_n$ (e.g. inference time)
  • we have a budget $b$, which we do not want to exceed (e.g. should not take longer than $b$ seconds)
  • want to have a solution $S$ being a set of network using the $n$ nodes that solves all $T$ tasks within the budget

Then this problem is:


where of course this is still a NP-hard problem, so in reality we would use approximations. Some results look like


where $S,D,N,K,E$ would be the tasks, and the circular nodes are the networks.

Example: Raven’s Progressive Matrices

The task of Raven’s Progressive Matrices is to fill in the missing patterns.

Question Answers
image-20220329140327911 image-20220329140336981

notice that this is a multi-task question. We need to

  • infer hidden attributes and pattern

  • building blocks and transformations
  • infer Alphabet and composition rule

In fact, how we solve this is:

  1. find the building blocks


  2. pattern matching amongst the examples in the questoin


Hence to solve such problems, we need multi-task learning models.


Meta-learning is about learning to learn or learning something that you usually don’t directly learn (e.g. the hyperparameters), here learning is roughly a synonym for optimization.

  • hence, you can see this as normal learning algorithm, but the output is not directly the labels

An example would be that, suppose you have some task $T_1,T_2,T_3$ and $T_4$. You have some architecture beforehand with hyperparameters such as number of layers configured for $T_1, T_2,T_3$. You have also trained your network with data to solve $T_1,T_2,T_3$.

Then, the question is, what is the suitable hyperparameter if we want to do $T_4$ given all your experience so far? This is answered by meta-learning algorithms.

  • in this sense it does have an analogy with Transfer Learning, as they also involve fine-tuning hyperparameters for another task
  • however, meta-learning can be broad, as you could also ask it to output some grammar/rules (see example below)

In fact, in this paper it mentioned that:

The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples

Example: Learning Rule System

Consider the task of learning a language from a few examples

| image-20220331131451815 | image-20220331131458719 | image-20220331131505428 | | :———————————————————-: | :———————————————————-: | :———————————————————-: |

As a human, you are expected to find from the above:

  • primitives


  • rules/function

    • fep - three dots
    • blicket - surrounding
    • kiki - after

Then we can do inference:

Query Inference Result
image-20220331131825975 image-20220331131840229

note that this is meta-learning:

  • input data are not really “datasets”, as they have no input-label pairs. They are like data pairs chosen to reflect the task/language, hence meta-data about the language.

  • the output/we are learning the primitives and rules in this language. There is no directly “label” learnt.

The following section talks about how we solve this by:

  • Meta learning
  • Self-supervised meta learning
  • Self-supervised meta learning with a meta-grammar

Meta Learning Algorithm

Recall that in supervised learning, we have:



  • input data $X$ and labels $Y$ being $(X,Y)\sim D$

  • the learning algorithm specifies the architecture we use, e.g. SVM/k-NN/CNN/etc

  • during training, we find parameter for the predictor

    \[\theta^* = \arg\max_\theta p(\theta |D) = \arg\max_\theta p(D|\theta)p(\theta)\]
  • during testing/inference, we use $\theta^*$

Suppose we have some model $f_\theta$ we want to use, for example a CNN.

The goal of few-shot meta-learning is to train a model $f_\theta$ that can quickly adapt to a new task using only a few datapoints and training update iterations to $\theta$.

  • in other words, we want some good initialization of $\theta_0$ (learnt from other tasks as a “warm start”) such that in a new task it learns fast with only few samples

So how do we get such a handy $\theta_0$? In effect, the meta-learning problem treats entire tasks as training examples, i.e. we consider

  • a distribution over tasks $p(T)$ that we want our model to adapt to
  • for each task $T_i$ could have its own loss $\mathcal{L}_{T_i}$ and its training data distribution $q_i(x)$

Then, during meta-training (in a $K$-shot learning setting), the idea is, given a model $f_\theta$:

  1. consider some random initialization of $\theta_0 \equiv \theta \equiv \phi$

  2. while not done:

    1. sample a batch of tasks $T_{i}$s to train from $p(T_i)$:

      1. sample a task $T_i$, e.g. compare “cat v.s. dog”
      2. draw $K$ samples from $q_i(x)$ for training, known as support set
      3. compute its loss $L_{T_i}(f_\theta)$ using the $\theta$ initialization
      4. update/see how well this initialization works by doing a few descents:
      \[\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(f_\theta)\]

      for that specific $T_i$.

    2. update the initialization $\theta\equiv \phi$ as how the total loss over all tasks can be decreased if I have a better $\theta$ to begin with

      \[\theta \leftarrow \theta - \beta \nabla_\theta\sum_{T_i} L_{T_i}(f_{\theta_i'})\]


      • $f_{\theta_i’}$ is like the same model “fine-tuned” on task $T_i$.

      • $L_{T_i}(f_{\theta_i’})$ will now be evaluated on the new samples from $q_i(x)$, known as the query set to test its generalization ability as well

      • since we are taking derivative w.r.t $\theta$, and we know (if only doing a single iteration)

        \[\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(f_\theta)\]

        essentially this term include loss of losses.

  3. output the learnt $\theta$ being the better "initialization" parameters for your model $f_\theta$ which is ready for other tasks

Hence the algorithm is


which we know is agonistic of what model $f$ we picked.

  • resource



where the aim is to provide some good $\theta$ for your network $f_\theta$ such that it can get good performance on task $T_i$ with only a few updates on $\theta$.

Other related meta-learning algorithms only vary in how they descent the $\theta_i’$ for each task:


Example: K-shot learning for animal classifications:


where essentially:

  • each task $T_i$ is classifying between two animals, i.e. a binary classification

  • we want to generalize well to $3$-shot $2$-way, so that our model performs well on


    with only a few updates from $\theta$, i.e. it is almost like fine-tuning it.

Matching Network

This is essentially another approach at solving a new task $T_i$. Instead of finding a good $\theta$ for our network $f_\theta$ to start with, consider

Learning training the network $g_\theta$ to learn a good embedding network of the input such that, at test time with a new 1-shot learning task $T_i$ a simple Nearest Neighbor would work.

Specifically, consider:

  • learning input: all training data for all tasks at once
  • learning goal: learn embedding network $f_\theta, g_\theta$



  • The key point is that when trained, Matching Networks are able to produce sensible test labels for unobserved classes without any changes to the network

  • reference:

For a simplified version, let us just say that $g_\theta = f_\theta$.

Here the idea is that we consider

\[\hat{y} = \sum_{i=1}^k a(x_q,x_i)y_i\]

which is basically a weighted sum

  • for $a(x_q,x_i)$ being an attention kernel. The simplest you can use is

    \[a(x_q,x_i) = \text{Softmax}(c(x_q,x_i))\]

    for $c(x_q,x_i)$ being a dot product

  • for simplicity, we assume we have $Q=1$ for training.

Then the implementations look like

  • BaseNetwork a backbone doing $f_\theta(x_i)$, which will be also used in Prototypical Network

  • the predictor then does

    \[\hat{y} = \sum_{i=1}^k a(x_q,x_i)y_i\]

    as specified above

Prototypical Network

Here the idea is to consider predicting $\hat{y}$ of $x_q$ by nearest neighbor to prototypes $c_k$ of each class (learnt from the few-shot training):

\[c_ k = \frac{1}{|S_k|} \sum_{(x_i,y_i)\in S_k} f_\theta(x_i)\]

basically is saying each class $K=k$ has a prototype being its centroid. Then, for a new query $x_q$ we can simply do:

\[p(y=k|x) = \text{Softmax}(-d(f_\theta(x_q), c_k))= \frac{\exp(-d(f_\theta(x_q), c_k))}{\sum_{k'}\exp(-d(f_\theta(x_q), c_{k'}))}\]

for $-d(f_\theta(x_q), c_k)$ basically is the distance between $f_\theta(x_q)$ and $c_k$.

Relation Network

Instead of using Nearest Neighbor for classification, use a Neural network


Meta Learning with Meta Grammar

Instead of finding a $\theta$ as a “warm start” for the neural network $f_\theta$ on a new task, consider learning grammars.

So the general idea is:


The detailed architecture looks like:


notice that essentially the entire neural model is on $p_\theta(G\vert X)$ is a distribution over grammar $G$ given the support set $X$. this works by

  • an encoder $\text{Enc}(\cdot)$, which encodes each support example $(x_i,y_i)$ into a vector $h_i$

    \[\text{Enc}(X) = \{h_i\}^n_{i=1}\]
  • a decoder $\text{Dec}(\cdot)$ which decodes the grammar while attending to the support example

    \[p_\theta(\cdot |X) = \text{Dec}(\{h_i\}_{i=1}^n)\]
  • decoder outputs a tokenized program, which is then parsed into an interpretation grammar

  • this is essentially the solution to the example in section Example: Learning Rule System.

Applications of DL

Applications we will dicuss include:

  • Deep Learning for Protein Structure Prediction
  • Deep Learning for Combinatorial Optimization

DL for Protein Structure Prediction

We know that proteins (a chain of amino acids) are the major building blocks of life. Structure of protein determines functions, and they are engines for your body.

There are millions of possible protein structures in nature, but only a few thousands was discovered. Hence we have the task:

  • given a sequence of protein, predict the structure (i.e. average of a the moving system)
  • e.g. predict the structure of SARS-COV-2, so that we can develop antibodies

It is very applicable since in real word, the number of sequence discovered grows faster than structures:

  • 350M sequences
  • but only had recorded 170K structures


we want to fill the gap using Deep Learning.

Subtasks involved in structure prediction:

  • secondary structure: is this part a $\alpha$-helix, or $\beta$-sheet, or a loop, which is a labelling task


  • 3D structure prediction: 3D position of each atom in the protein

  • folding: essentially the quaternary structure, how they fold when placed into a solution (liquid)

  • function prediction: once you have a structure, is this an enzyme? What does it do?

For Example: SARS-COV-2 protein:

SARS-CoV-2 proteins include

  • the Spikes, Envelope, and Membrane proteins that form the viral envelope
  • Nucleocapsid proteins that hold RNA (to infect other cells in your body)



  • the role of spike is to be able to
    • use RBD (receptor binding domain) to bind to the receptor of your lung’s cell (key-lock mechanism)
    • pierce through the defense and inject RNA
    • hence we want to develop a drug/vaccine that can bind stronger to this than lung cells
  • once it gets in, the RNA gets into your cell and you are infected

Then, to predict the 3D structure of the RBD, we needed DL to generate the 3D shape.

Competitions related in this field

  • CASP: Critical Assessment of Techniques for Protein Structure Prediction
    • Protein (3D) structure prediction
      • AlphaFold2 team won 2020 CASP14 competition
    • Quality assessment, evaluation of model accuracy (EMA)
  • CAPRI: Critical Assessment of PRediction of Interactions
    • Protein-protein interaction, protein docking
    • COVID-19 Open Science Initiative
  • CAFA: Critical Assessment of Functional Annotation
    • Protein function prediction

Secondary Structure Prediction

Here, essentially it is a Seq-2-Seq task, whether if each is a $\alpha$-helix, or $\beta$-sheet, or a loop:


Once we have this structure, it can be served as “constraints/guidelines” for 3D and quatery structure:

3D Structure Prediction

First, how do we represent the 3D structure in numbers?

One way to do it is to consider sequence of amino acids to distance matrix and torsion angles

So that essentially you want to find out:

  • pair-wise distance between each atom. Then we get $(n^2-n)/2$, hence $O(n^2)$
  • torsion angles of each particle, $O(2n)=O(n)$ since you have two angles

Then you can solve for the structure from the two above:


However, you could have also

  • output directly 3D position of each particle, $O(3n)=O(n)$

So why was the distance matrix useful? One use was to enable the use of Resnet to process those matrix as images:


where the final atom graph is solved by:

  • nodes being the 3D coordinates
  • edges being the distance, and type of bond


Protein Structure Prediction System

The final goal is to predict the 3D structure. From the discussion above, we essentially:

  • find/infer the secondary structures along with other features
  • find/infer the torsion angles/distances
  • solve for 3D structure under some biological constrains

Hence the pipeline looks like


which is trained end-to-end. Notice that, alike this, it is often the case you will get many data modalities for your task (e.g. amino acids, secondary structure, etc):

  • then fusion network is important: allows the feature to interact in a nonlinear fashion
  • then the final fused representation is fed into decoders for some prediction task, e.g. predict pair-wise distance.

Predicted Structure Assessment

How do we measure the score of each prediction? In general it can be done in two ways

  • Model mapping from features of previous predictions to evaluation based on ground truth (RMSD, GDT_TS)
  • Pairwise similarity between features of different predictions


Average RMSD under 1 Angstrom


DL for Combinatorial Optimization

Combinatorial optimization is the process of searching for maxima (or minima) of an objective function $F$ whose domain is a discrete but large configuration space (as opposed to an N-dimensional continuous space). i.e. finding an optimal object from a finite/discrete set of objects.

Some simple examples of typical combinatorial optimization problems are:

  • Traveling Salesman Problem: given the $(x, y)$ positions of $N$ different cities, find the shortest possible path that visits each city exactly once.
  • Minimum Spanning Tree: find a tree in a graph that contains the least weight (from the edges)

From a computer science perspective, combinatorial optimization seeks to improve an algorithm by using mathematical methods either to reduce the size of the set of possible solutions or to make the search itself faster.

First, some recap of typical problems:

  • Minimum Spanning Tree: Prim, Kruskal algorithm etc. Can be solved in a polynomial time.

    • Sort the graph edges with respect to their weights.
    • Start adding edges to the MST from the edge with the smallest weight until the edge of the largest weight.
    • Only add edges which doesn’t form a cycle , edges which connect only disconnected components (using disjoint sets)
  • Single Source Shorted Path: classical Dijkstra if non-negative weights. Can be solved in a polynomial time.

    • Dijkstra for non-negative weights
    • general SSP include Bellman-Ford
  • Travelling Salesman Problem: NP hard

    • if we randomly distribute notes, the TSP have a strong locality, hence there are approximations that can be done in polynomial time


    • a one vehicle version of VRP
  • Vehicle Routing Problem: What is the optimal set of routes for a fleet of $N$ vehicles to traverse in order to deliver to a given set of $M$ customers? NP hard problems

Solving MST with DL

The idea is ot represent the problem as a search tree. We already know that this problem can be solved by


But visualizing the process:

Kruskal’s Algorithm To RL
image-20220405173917326 image-20220405174021385

where we can represent the problem as:

  • state is the current graph
  • action is selecting one of the edges
  • reward: the value of the solution

Note that this action of “adding node” is general, as you can switch the representation of node/edge if your task needs to find the edges instead.


where in the above

  • both edges $E$ and nodes $V$ can be represented by nodes $V^*$ and $V$, respectively.
  • the final tree formed $T$ is highlighted in red (i.e. output of your algo)

DL Formulation of Combinatorial Optimization

Essentially you have


where here we separated into two types because:

  • the P-problems are essentially graph problems
  • the NP-problems are essentially becoming RL + MCTS search

DL for Autonomous Vehicles

General pippeline



  • for prediction, it will be a set of trajectories with probabilities
  • planning: is a hierarchical process depends on resolution (e.g. per second? per milisecond?)

Localization and Data Collection

The ability of self-driving is unrelated to current location

  • though it is required for route planning, it is not relevant for driving ability
  • however, by just collecting what each car see, we can gain multiple perspectives and collect tons of data which we can use to reconstruct the world

Once we reconstructed the 3D model, we can obtain semantic information such as where are the lanes, traffic lights, etc:


and finally this gives


Perception and Detection

Now, we have all the sensory data as our perception. How do we extract entities (e.g. bboxes of pedestrain, cars, etc)?

For instance, from video input, we might want to know where is the traffic light, other cars, etc


where the bottom is the input video, and the top portion is what we want.

  • in terms of deep learning outputs, typically we need segmentations:


  • And again, as today we have tons of data from all kinds of sensors, it becomes learning from multi-modal data (e.g. from Radar, Lidar, etc, as oposed to only vision form humans)


Self-Driving Car: Prediction

Now, given the surroundings, and before I plan what I do, I also need to predict what other dynamical objects will do. i.e. I need trajectories from other dynamical objects as part of the environment as well!

  • therefore, we need to predict each dynamic objects we detected.

  • this can be estimated from current movement direction, speed, acceleration, etc

Our aim is to be able to see future trajectories of other objects, which is necessary for planning our own movement.


Then from all the above information, we would be able to make good plannings:


Self-Driving: Vehicle Trajectory Prediction

Essentially you need to:

  • given what you observe on car $A$
  • predict trajectories of the car (e.g. from first predicting steering angle and speed)
Predict Speed and Steering Predict Trajectory
image-20220407154813207 image-20220407154800380

though the trajectory and steering angle + speed is probably a one-to-one mapping, in reality we would use the trajectories more often as it would be detached from the make of cars, and etc.

Self-Driving Car: Planning

Now, we want to plan our high level and low level actions based on our destination:



  • the high level planning basically include knowing where to go, and low level includes when to use the breaks.

  • Finally, this can be send to simulation where we get feedbacks without harm.

What is the “requirement” for allowing autonomous cars on road? One reasonable threshold would be having 1000 times lower probability of fatality per hour vs real world cars, i.e. we would be on average 1000 times safer!

Self-Driving: Backup

However, there are many edge cases in real life. For example, sometimes we need to follow the person guiding the traffic instead of traffic lights. Or there are other emergencies cases.

Therefore, we need a backup plan if the car does not know what to do. Currently it is done by having a tele-operator sitting in control centers, so that if something goes wrong they can opt-in.

Self-Driving: Neural Network

Our end-goal is to have a car that

  • stay on the roard
  • avoid collision
  • head to destination

Then essentially it is a multi-task training:


There are some paradigms:

  • Imitation Learning: a RL algorithm used when training data is limited

    • uses environmental loss: each of the map is turned into a loss equation


    • with multiple losses, you can fuse those losses and get a set of trajectories each with some given score/probability

    • but again, if anything deviates from training data (i.e. action/state not seen before), then high chance it won’t work. For instance, if some car suddenly stopped and it was not experienced in training data, then you will crash

  • Data Augmentation: given some training trajectory (other cars and ground truth trajectory), augment the data by perturbing the trajectory so you end up having.

    Some augmentation methods experimented and ablation studies:

    • $M_1$: Imitation with past dropout
    • $M_2$: $M_1$ with trajectory pertubation
    • $M_3$: $M_2$ with environmental losses
    • $M_4$: $M_3$ with imitation dropout (for better OOV generalization)

    And the results