Across
- 3. Problems in benchmark
- 7. Challenging benchmark
- 8. New evaluation metric
- 10. Performance needed
- 12. LLM struggle
- 13. LLMs need to be this
- 14. Showed LLM weakness
Down
- 1. LLMs solved these
- 2. Property of problems that was increased
- 4. Small ones affected LLMs
- 5. Large Language Models
- 6. Dropped significantly
- 9. Artificial Intelligence
- 11. Created G-Pass@k
