A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics, and Benchmark Datasets (Zeng et al., 2020).


This paper provides a comprehensive survey on MRC tasks, evaluation metrics, and existing benchmark datasets. I find the Tasks section and the Open Issues section most helpful.


  1. A definition of typical MRC tasks is given; MRC can be seen as a supervised learning problem: (context, question) -> answer.
  2. Concept clarification about MRC
    • multi-modal MRC vs. textual MRC: multi-modal MRC also involves images and videos, such as RecipeQA and MovieQA.
    • MRC vs. QA:
      • These two tasks are not subsets of one another.
      • Some MRC tasks may be seen as a special case of QA, in that QA can also be open-domain and can also be solved by rule-based, information retrieval, and knowledge-based methods.
      • On the other hand, just as for humans, reading comprehension can be about giving correct answers to questions, and can also be about asking the right or sensible questions given the context. In multi-modal MRC, QA is just one part of the task, and we also need CV.
    • MRC vs. NLP. Syntax information can help with MT, and some MRC models can be used in NLI as well (ex. SAN). (Need to figure out the definitions of NLP and NLI.)
  3. Classification of MRC Tasks (clear and well-defined)
    • type of corpus: multi-modal, textual
    • type of questions: cloze style, natural, synthetic
    • type of answers: natural, multiple choice
    • source of answers: span, free-form
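
The definition and the four classification dimensions above can be sketched as a small data structure. This is only a minimal illustration, not code from the survey: the class names (`MRCExample`, `TaskProfile`, and the enums), the helper `is_span_answer`, and the toy passage are all hypothetical and chosen for this note.

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(Enum):
    MULTI_MODAL = "multi-modal"
    TEXTUAL = "textual"


class QuestionType(Enum):
    CLOZE = "cloze style"
    NATURAL = "natural"
    SYNTHETIC = "synthetic"


class AnswerType(Enum):
    NATURAL = "natural"
    MULTIPLE_CHOICE = "multiple choice"


class AnswerSource(Enum):
    SPAN = "span"
    FREE_FORM = "free-form"


@dataclass(frozen=True)
class TaskProfile:
    """One value per classification dimension listed above."""
    corpus: Corpus
    question: QuestionType
    answer: AnswerType
    source: AnswerSource


@dataclass(frozen=True)
class MRCExample:
    """A supervised instance: (context, question) -> answer."""
    context: str
    question: str
    answer: str


def is_span_answer(example: MRCExample) -> bool:
    """Return True if the answer appears verbatim in the context."""
    return example.answer in example.context


# Toy, made-up instance (not taken from any real dataset).
toy_example = MRCExample(
    context="The cat sat on the mat.",
    question="Where did the cat sit?",
    answer="on the mat",
)
toy_profile = TaskProfile(
    corpus=Corpus.TEXTUAL,
    question=QuestionType.NATURAL,
    answer=AnswerType.NATURAL,
    source=AnswerSource.SPAN,
)
```

Under this sketch, a dataset is described by one `TaskProfile`, and the verbatim-substring check is a rough way to tell span answers from free-form ones.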

Benchmark Datasets

In the Benchmark Datasets section, the authors list almost all available datasets and kindly provide a website summarizing them. The features I like most are the prerequisite skills (Table 8) and the overview of the characteristics of each dataset (Table 10). The prerequisite skills may provide some ideas for building new models and for interpretation. Among the characteristics, I am most interested in Complex Reasoning.

I checked most of them and found that some have not been active in the past two years. Below I list the ones that I find active and interesting and that also have a leaderboard. I care about leaderboards because I want to check the gap between the state of the art and human performance to see the potential for further improvement.

  • ARC: commonsense knowledge and complex reasoning
  • OpenBookQA: commonsense knowledge
  • ReCoRD (part of SuperGLUE now): commonsense knowledge
  • HotpotQA
  • SciTail
  • DROP: complex reasoning
  • RACE: passage reading comprehension from middle- and high-school English exams. Involves complex reasoning. I am interested in this dataset since exam questions are usually written by experts and should be of higher quality.
  • TriviaQA
  • SQuAD
  • CoQA: conversational QA
  • SuperGLUE

In these leaderboards, UnifiedQA (Khashabi et al., 2020), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2019), and DeBERTa (He et al., 2020) are models that achieve good results.
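
The gap between the state of the art and human performance mentioned above can be computed with a short sketch like the following. The function name `headroom` and the dataset names are hypothetical, and the scores are made-up placeholders, not real leaderboard numbers.

```python
def headroom(model_score: float, human_score: float) -> float:
    """Human score minus model score; positive means room for improvement."""
    return human_score - model_score


# Placeholder (model, human) scores, NOT real leaderboard values.
scores = {
    "dataset_a": (80.0, 90.0),
    "dataset_b": (92.5, 91.0),
}

# Rank datasets by remaining headroom, largest gap first.
ranked = sorted(scores, key=lambda name: headroom(*scores[name]), reverse=True)
```

A negative value means the best model already exceeds the reported human score on that dataset, so it may be a less interesting target.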

Open Issues

In the Open Issues section, the authors argue that multi-modal MRC, commonsense knowledge, complex reasoning, robustness, and interpretability are worth investigating.


  1. Zeng, C., Li, S., Li, Q., Hu, J., & Hu, J. (2020). A Survey on Machine Reading Comprehension—Tasks, Evaluation Metrics and Benchmark Datasets. Applied Sciences, 10(21), 7640.
  2. Khashabi, D., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020). UnifiedQA: Crossing format boundaries with a single QA system. ArXiv Preprint ArXiv:2005.00700.
  3. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 5753–5763.
  4. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv Preprint ArXiv:1909.11942.
  5. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. ArXiv Preprint ArXiv:1907.11692.
  6. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv Preprint ArXiv:1910.10683.
  7. He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. ArXiv Preprint ArXiv:2006.03654.