0. Matching Networks for One Shot Learning
1. Summary
This paper proposes a metric-based method for the few-shot learning task. It uses an attention mechanism and embeds the whole support set into the feature space, rather than embedding each support sample in isolation. Given a query image, the predicted output is the attention-weighted sum of the labels of the samples in the support set.
2. Research Object
Learning from only a few examples (one-shot learning)
3. Problem Statement
Train a classifier, then test it on queries (images) from unseen classes, given only a few labeled examples per class (e.g. 1-shot, 5-shot).
4. Methods
attention, memory
- overall architecture
- the simplest form to compute \(\hat{y}\) for the test example \(\hat x\): \(\hat y = \sum\limits_{i=1}^k a(\hat{x},x_i)y_i\), where \(x_i\) and \(y_i\) are the samples and labels from the support set and \(a\) is an attention mechanism (see the prediction sketch after this section)
- the choice of attention kernel
- the simplest form is used in this paper: \(a(\hat{x},x_i)=e^{c(f(\hat{x}),g(x_i))}\big/\sum_{j=1}^{k}e^{c(f(\hat{x}),g(x_j))}\)
- \(f\) and \(g\) are embedding functions, e.g. a CNN for image classification tasks or a word embedding for language tasks. \(c\) is the cosine similarity.
- full context embeddings (FCE)
- consider all the images of the support set instead of embedding each image in isolation: a bidirectional Long Short-Term Memory (LSTM) encodes \(x_i\) in the context of the whole support set (see the FCE sketch after this section)
- the bidirectional LSTM plays the role of the support-set embedding function \(g\); the query embedding \(f\) is conditioned on \(S\) in a similar way, via an LSTM with read-attention over the support set
- training strategy
A task \(T\) defines a distribution over label sets \(L\). First sample a label set \(L\) from \(T\), then use \(L\) to sample a support set \(S\) and a batch \(B\) ("batch" here means the queries for the given support set); see the training-loop sketch after this section. The training objective is
\(\theta = \mathop{\arg\max}_\theta E_{L\sim T}\left[E_{S\sim L, B\sim L}\left[\sum\limits_{(x,y)\in B}\mathop{\log}P_\theta(y|x,S)\right]\right]\)
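A minimal sketch of the attention-based prediction in PyTorch, assuming pre-computed embeddings and one-hot support labels (the embedding networks \(f\) and \(g\) are omitted; shapes and names are illustrative, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def matching_predict(query_emb, support_embs, support_labels_onehot):
    """query_emb: (d,), support_embs: (k, d), support_labels_onehot: (k, n_classes)."""
    # cosine similarity c(f(x_hat), g(x_i)) between the query and each support embedding
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=1)  # (k,)
    # softmax over the support set gives the attention kernel a(x_hat, x_i)
    attn = F.softmax(sims, dim=0)                                            # (k,)
    # y_hat = sum_i a(x_hat, x_i) * y_i, a distribution over the classes in L
    return attn @ support_labels_onehot                                      # (n_classes,)

# toy usage: 5-way 1-shot with 64-dim embeddings
support_embs = torch.randn(5, 64)
support_labels = F.one_hot(torch.arange(5), num_classes=5).float()
query_emb = torch.randn(64)
probs = matching_predict(query_emb, support_embs, support_labels)  # sums to 1
```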
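A rough sketch of the full context embedding of the support set: a bidirectional LSTM runs over the per-image CNN embeddings with a skip connection, so each embedding depends on the rest of the set. The hidden size and the way forward/backward states are combined are assumptions, not the paper's exact equations:

```python
import torch
import torch.nn as nn

class FullyConditionalG(nn.Module):
    """Embeds each support example in the context of the whole support set."""
    def __init__(self, emb_dim=64):
        super().__init__()
        # hidden size emb_dim // 2 per direction so forward+backward outputs total emb_dim
        self.bilstm = nn.LSTM(emb_dim, emb_dim // 2, bidirectional=True, batch_first=True)

    def forward(self, support_embs):                      # (k, emb_dim) CNN embeddings g'(x_i)
        out, _ = self.bilstm(support_embs.unsqueeze(0))   # (1, k, emb_dim) set-context features
        return support_embs + out.squeeze(0)              # skip connection: g(x_i, S)
```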
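A sketch of the episodic training loop corresponding to the objective above. The helpers `sample_label_set`, `sample_examples`, and the `model(x, support)` interface are hypothetical placeholders; only the structure (sample \(L\), then \(S\) and \(B\), maximize \(\sum\log P_\theta(y\mid x,S)\)) follows the paper:

```python
import torch

def train_episode(model, optimizer, dataset, n_way=5, k_shot=1, n_queries=15):
    label_set = sample_label_set(dataset, n_way)                # L ~ T (hypothetical helper)
    support = sample_examples(dataset, label_set, k_shot)       # S ~ L
    queries = sample_examples(dataset, label_set, n_queries)    # B ~ L (queries for S)
    loss = torch.zeros(())
    for x, y in queries:                                        # (image, class index) pairs
        log_probs = torch.log(model(x, support))                # log P_theta(. | x, S)
        loss = loss - log_probs[y]                              # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()                                             # maximize the objective via SGD
    optimizer.step()
    return loss.item()
```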
5. Evaluation
test and train conditions must match
- experiments were run on three datasets: two image classification benchmarks (Omniglot, ImageNet) and one language modelling benchmark (Penn Treebank)
- Omniglot:
- The dataset was augmented (with random rotations of the characters).
- Pick N unseen character classes, independent of alphabet, as \(L\); provide the model with one drawing of each of the N characters as \(S\sim L\) and a batch \(B\sim L\).
- ImageNet:
- Three subsets of ImageNet are used: randImageNet (118 labels removed at random from the training set), dogImageNet (dog classes removed from the training set), and miniImageNet. The model is tested on each of the three.
6. Conclusion
- Full context embeddings (FCE) did not seem to help much on simple datasets like Omniglot, but on the harder task (miniImageNet) FCE brings a clear improvement.
- one-shot learning is much easier if you train the network to do one-shot learning
- Non-parametric structures in a neural network make it easier for the network to remember and adapt to new training sets in the same task.
7. Notes
- FCE considers the whole support set \(S\) instead of making pair-wise comparisons.
- They also tested the model trained on Omniglot on the MNIST dataset; accuracy was about 72%.
- What's the loss function of this model?
- Need to know more about FCE's details.
8. Reference
- O Vinyals, S Bengio, and M Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.