Additional Research
Decoding Strategies in auto-regressive models
[1] decoding methods for language generation
While reading about the decoding strategies supported by HuggingFace … (details on .generate() and the available decoding strategies)
So apparently, the decoding strategy is passed to .generate() in the form of **kwargs of type GenerationConfig (transformers/generation/configuration_utils.py); an explicit generation_config object can also be passed directly.
no_repeat_ngram_size (arg) -> this prevents the model from repeating any n-gram of the specified size. For example, if set to 3, once a three-token sequence appears, the model is forbidden from generating that same three-token sequence again (see the sketch below).
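A minimal sketch of how this looks in practice, assuming GPT-2 and the standard transformers API (the prompt text and generation lengths are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I enjoy walking with my cute dog", return_tensors="pt")

# Option 1: decoding options passed directly as **kwargs to .generate()
out = model.generate(**inputs, max_new_tokens=40, no_repeat_ngram_size=3)

# Option 2: the same options bundled into an explicit GenerationConfig
gen_config = GenerationConfig(max_new_tokens=40, no_repeat_ngram_size=3)
out = model.generate(**inputs, generation_config=gen_config)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```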
Problems with Greedy Decoding
- Greedy decoding misses high-probability words hidden behind a single low-probability word, and its output quickly becomes repetitive
Problems with Beam Search Decoding
- Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization. But this is not the case for open-ended generation, where the desired output length can vary greatly and the highest-probability continuation tends to be generic and dull; high-quality human text does not simply follow the most likely next words
- We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, where legitimately repeated names and phrases must still be allowed (see the sketch below)
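A sketch contrasting greedy decoding and beam search through .generate(), again assuming GPT-2; with do_sample=False and num_beams > 1 the call runs beam search, and no_repeat_ngram_size is the crude repetition fix mentioned above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I enjoy walking with my cute dog", return_tensors="pt")

# Greedy decoding: always pick the single most likely next token
greedy_out = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Beam search: keep the 5 most likely partial sequences at each step,
# with an n-gram penalty to suppress repeated two-token sequences
beam_out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)

print(tokenizer.decode(greedy_out[0], skip_special_tokens=True))
print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
```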
Temperature (LLM parameter)
- Exact definition: the temperature T rescales the logits before the softmax, P(x_i) = exp(u_i / T) / Σ_j exp(u_j / T), where u_i are the model's logits
- by lowering the temperature of the softmax, the distribution becomes sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words). By doing this the distribution becomes less random
- setting temp → 0, temperature-scaled sampling becomes equal to greedy decoding (and suffers from the same problems as greedy)
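A toy sketch of the temperature-scaled softmax itself (the logit values here are made up), showing how lowering T sharpens the distribution and T → 0 approaches the greedy argmax; in transformers this corresponds to model.generate(..., do_sample=True, temperature=T):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits

for T in (1.0, 0.7, 0.1):
    probs = torch.softmax(logits / T, dim=-1)  # P(x_i) = exp(u_i / T) / sum_j exp(u_j / T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# As T shrinks, almost all probability mass collapses onto the highest logit.
```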
Top-K sampling
- the language model limits the pool of possible next tokens to the top-k most likely candidates and then randomly samples from them. By truncating the vocabulary this way, the probability mass is redistributed over only those top-k tokens before sampling
- There are, however, examples where limiting the sample pool this way backfires: a fixed k can cut off perfectly reasonable candidates when the distribution is flat, or still admit low-quality tokens when it is sharp, pushing the model toward gibberish
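A toy sketch of the top-k filtering step itself (logit values made up): keep the k highest-scoring tokens, renormalize, and sample only from that pool. The equivalent high-level call would be model.generate(..., do_sample=True, top_k=k):

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -0.5, -2.0])  # hypothetical next-token logits
k = 3

topk_logits, topk_ids = torch.topk(logits, k)      # keep only the k most likely tokens
probs = torch.softmax(topk_logits, dim=-1)         # redistribute probability mass over the pool
next_id = topk_ids[torch.multinomial(probs, 1)]    # randomly sample one token id from the pool
print(next_id.item(), probs.tolist())
```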