Crossword

Across
  3. Problems in benchmark
  7. Challenging benchmark
  8. New evaluation metric
  10. Performance needed
  12. LLM struggle
  13. LLMs need to be this
  14. Showed LLM weakness
Down
  1. LLMs solved these
  2. Increased problem this
  4. Small ones affected LLMs
  5. Large Language Models
  6. Dropped significantly
  9. Artificial Intelligence
  11. Created G-Pass@k