Transformers

Across
  2. _____ encoding: Adds positional information to input embeddings
  4. _____ heads: Multiple attention mechanisms in parallel
  7. _____ output sequence
  9. _____ input sequence
  10. _____ layer: Maps input tokens to dense vectors
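
The Across clues point at two concrete pieces of a transformer's input pipeline: an embedding layer that maps token ids to dense vectors, and positional encoding added on top. A minimal NumPy sketch of those two steps, assuming the sinusoidal encoding from the original transformer paper and a toy random lookup table; the vocabulary size, model width, and token ids below are illustrative, not from the puzzle:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding (one common choice):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy embedding layer: a lookup table mapping token ids to dense vectors.
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([5, 42, 7, 99])   # a made-up input sequence
x = embedding_table[token_ids]         # embedding lookup: ids -> vectors
x = x + positional_encoding(len(token_ids), d_model)  # add position info
print(x.shape)  # (4, 16)
```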
Down
  1. _____ for weighting input tokens
  3. _____ network: Neural network layer applied to each token independently
  5. _____ layers: Stacked layers of self-attention and feed-forward networks
  6. _____ dot-product attention: Common attention mechanism in transformers
  8. _____, key, value: Components of self-attention mechanism
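
The Down clues name the core of a transformer layer: scaled dot-product attention over query, key, and value projections, followed by a position-wise feed-forward network applied to each token independently. A minimal NumPy sketch under those standard definitions; the projection matrices and dimensions here are made up for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # token-to-token attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row softmax
    return weights @ V               # weighted sum of value vectors

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: the same two-layer MLP is
    # applied to each token's vector independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 16, 32
x = rng.normal(size=(seq_len, d_model))  # token representations

# Hypothetical projections producing query, key, and value from x.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(out, W1, b1, W2, b2)
print(out.shape)  # (4, 16)
```

Multi-head attention (4 Across) simply runs several such attention computations in parallel on lower-dimensional projections and concatenates the results.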