Commit
Add missing references
Signed-off-by: Simone Rossi <[email protected]>
srossi93 committed Dec 12, 2023
1 parent 0d4a858 commit 83e690a
Showing 2 changed files with 11 additions and 2 deletions.
4 changes: 2 additions & 2 deletions _posts/2024-05-07-understanding-icl.md
@@ -370,7 +370,7 @@
<!-- The common conception of ICL is "let's give an LLM examples in input of what we want to achieve, and it will give us some similar output". -->
<!-- The common knowledge is that ICL is a form of learning, where the supervision is given by the input examples. -->

-In-Context Learning (ICL) is the behavior first observed in Large Language Models (LLMs), whereby learning occurs from prompted data without modification of the weights of the model [ref]. It is a simple technique used daily and throughout the world by AI practitioners of all backgrounds, to improve generation quality and alignment of LLMs <d-cite key="Brown2020"></d-cite>.
+In-Context Learning (ICL) is the behavior first observed in Large Language Models (LLMs), whereby learning occurs from prompted data without modification of the weights of the model <d-cite key="dong2023survey"></d-cite>. It is a simple technique used daily and throughout the world by AI practitioners of all backgrounds, to improve generation quality and alignment of LLMs <d-cite key="Brown2020"></d-cite>.
ICL is important because it addresses full-on the once widespread criticism that for all their impressive performance, modern deep learning models are rigid systems that lack the ability to adapt quickly to novel tasks in dynamic settings - a hallmark of biological intelligence.
By this new form of "learning during inference", Large Language Models have shown that they can be, in some specific sense (once pretrained), surprisingly versatile few-shot learners.
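
As a toy illustration of this "learning from the prompt" behavior, the sketch below builds a few-shot prompt for a hypothetical translation task; the task, examples, and formatting are made up for illustration and are not taken from the post. The point is that the only "training signal" is the examples placed in the context window, with no weight update.

```python
# A toy few-shot prompt: the "learning" signal is carried entirely by the
# examples in the context; the pretrained model's weights are never modified.
examples = [("cheval", "horse"), ("chien", "dog"), ("chat", "cat")]
query = "oiseau"

prompt = "\n".join(f"French: {fr} -> English: {en}" for fr, en in examples)
prompt += f"\nFrench: {query} -> English:"

print(prompt)  # a pretrained LLM is expected to continue this text with "bird"
```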

@@ -512,7 +512,7 @@ f(\mbw, P_C) = m\left(\mbw - \eta \nabla_{\mbw} \sum_{i=0}^{C-1}\ell\left(m(\mbw
$$

where $$\eta$$ is the learning rate of the meta-learning algorithm.
-Equation \eqref{eq:meta-learning-model} represents the inner optimization loop in a simplified version of the MAML algorithm (REF), where the model is updated with a single gradient step.
+Equation \eqref{eq:meta-learning-model} represents the inner optimization loop in a simplified version of the MAML algorithm <d-cite key="finn_model-agnostic_2017"></d-cite>, where the model is updated with a single gradient step.
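
To make the single inner-loop step concrete, here is a minimal sketch assuming a linear model for $$m$$ and a squared loss for $$\ell$$; both choices, and every name in the snippet, are illustrative assumptions rather than details from the post.

```python
import numpy as np

# Minimal sketch of the inner loop: one gradient step on the C context points,
# then a prediction with the adapted weights (a MAML-style single-step update).
def model(w, x):
    return w @ x  # assumed linear model m(w, x)

def adapted_prediction(w, context_x, context_y, query_x, eta=0.1):
    # Gradient of sum_i (m(w, x_i) - y_i)^2 with respect to w (squared loss assumed).
    grad = sum(2.0 * (model(w, x) - y) * x for x, y in zip(context_x, context_y))
    w_adapted = w - eta * grad           # single inner gradient step
    return model(w_adapted, query_x)     # prediction on the query input

# Toy usage: three context points drawn from y = 2*x[0] - x[1].
rng = np.random.default_rng(0)
w0 = rng.normal(size=2)
xs = [rng.normal(size=2) for _ in range(3)]
ys = [2 * x[0] - x[1] for x in xs]
print(adapted_prediction(w0, xs, ys, np.array([1.0, 1.0])))
```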

Putting all together, we can define the meta-learning loss as:

9 changes: 9 additions & 0 deletions assets/bibliography/2024-05-07-understanding-icl.bib
@@ -210,3 +210,12 @@ @InProceedings{oswald23a
url = {https://proceedings.mlr.press/v202/von-oswald23a.html},
abstract = {At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of in-context learning in optimized Transformers. Building on this insight, we furthermore identify how Transformers surpass the performance of plain gradient descent by learning an iterative curvature correction and learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified to be crucial for in-context learning termed induction-head (Olsson et al., 2022) and show how it could be understood as a specific case of in-context learning by gradient descent learning within Transformers.}
}

+@misc{dong2023survey,
+  title={A Survey on In-context Learning},
+  author={Qingxiu Dong and Lei Li and Damai Dai and Ce Zheng and Zhiyong Wu and Baobao Chang and Xu Sun and Jingjing Xu and Lei Li and Zhifang Sui},
+  year={2023},
+  eprint={2301.00234},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
