Additional Research
Decoding Strategies in auto-regressive models
[1] decoding methods for language generation
While reading about the decoding strategies supported by HuggingFace … (details on .generate() and the available decoding strategies)
So apparently, the decoding strategy is passed to .generate() in the form of **kwargs of type GenerationConfig (transformers/generation/configuration_utils.py); an explicit generation_config object can also be passed directly.
no_repeat_ngram_size (arg) -> this prevents the model from repeating any n-gram of the specified size. For example, if set to 3, once a three-token sequence appears, the model is forbidden from generating that same three-token sequence again (see the sketch below).
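A minimal sketch of how this looks in practice, assuming GPT-2 and the standard transformers API (the prompt text and generation lengths are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I enjoy walking with my cute dog", return_tensors="pt")

# Option 1: decoding options passed directly as **kwargs to .generate()
out = model.generate(**inputs, max_new_tokens=40, no_repeat_ngram_size=3)

# Option 2: the same options bundled into an explicit GenerationConfig
gen_config = GenerationConfig(max_new_tokens=40, no_repeat_ngram_size=3)
out = model.generate(**inputs, generation_config=gen_config)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```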
Problems with Greedy Decoding
- Greedy decoding misses high-probability words hidden behind a single low-probability word, and its output quickly becomes repetitive
Problems with Beam Search Decoding
- Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization. But this is not the case for open-ended generation, where the desired output length can vary greatly and the highest-probability continuation tends to be generic and dull; high-quality human text does not simply follow the most likely next words
- We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, where legitimately repeated names and phrases must still be allowed (see the sketch below)
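A sketch contrasting greedy decoding and beam search through .generate(), again assuming GPT-2; with do_sample=False and num_beams > 1 the call runs beam search, and no_repeat_ngram_size is the crude repetition fix mentioned above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I enjoy walking with my cute dog", return_tensors="pt")

# Greedy decoding: always pick the single most likely next token
greedy_out = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Beam search: keep the 5 most likely partial sequences at each step,
# with an n-gram penalty to suppress repeated two-token sequences
beam_out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)

print(tokenizer.decode(greedy_out[0], skip_special_tokens=True))
print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
```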
Temperature (LLM parameter)
- Exact definition: the temperature T rescales the logits before the softmax, P(x_i) = exp(u_i / T) / Σ_j exp(u_j / T), where u_i are the model's logits
- by lowering the temperature of the softmax, the distribution becomes sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words). By doing this the distribution becomes less random
- setting temp → 0, temperature-scaled sampling becomes equal to greedy decoding (and suffers from the same problems as greedy)
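A toy sketch of the temperature-scaled softmax itself (the logit values here are made up), showing how lowering T sharpens the distribution and T → 0 approaches the greedy argmax; in transformers this corresponds to model.generate(..., do_sample=True, temperature=T):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits

for T in (1.0, 0.7, 0.1):
    probs = torch.softmax(logits / T, dim=-1)  # P(x_i) = exp(u_i / T) / sum_j exp(u_j / T)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# As T shrinks, almost all probability mass collapses onto the highest logit.
```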
Top-K sampling
- the language model limits the pool of possible next tokens to the top-k most likely candidates and then randomly samples from them. By truncating the vocabulary this way, the probability mass is redistributed over only those top-k tokens before sampling
- There are, however, examples where limiting the sample pool this way backfires: a fixed k can cut off perfectly reasonable candidates when the distribution is flat, or still admit low-quality tokens when it is sharp, pushing the model toward gibberish
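A toy sketch of the top-k filtering step itself (logit values made up): keep the k highest-scoring tokens, renormalize, and sample only from that pool. The equivalent high-level call would be model.generate(..., do_sample=True, top_k=k):

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -0.5, -2.0])  # hypothetical next-token logits
k = 3

topk_logits, topk_ids = torch.topk(logits, k)      # keep only the k most likely tokens
probs = torch.softmax(topk_logits, dim=-1)         # redistribute probability mass over the pool
next_id = topk_ids[torch.multinomial(probs, 1)]    # randomly sample one token id from the pool
print(next_id.item(), probs.tolist())
```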