Recent advances in natural language inference: A survey of benchmarks, resources, and approaches (Storks et al., 2019).


This is a really good review paper on NLI. It mainly covers language understanding tasks and benchmarks that require external knowledge or advanced reasoning beyond the linguistic context. The idea that smarter benchmark design can guide researchers toward models that truly perform reasoning is inspiring. The paper gives an overview of existing benchmarks and the problems they try to solve, as well as existing knowledge resources and inference approaches. It also provides examples from the benchmark datasets, which give beginners a basic idea of each task. It can serve as a pretty good reference for looking up resources. Several issues raised in the paper are worth attention, such as the lack of explainability of recent approaches and the statistical biases found in benchmark datasets.

Benchmarks and Tasks

Five major tasks require external knowledge and complex reasoning: reference resolution, question answering, textual entailment, plausible inference, and intuitive psychology. It seems to me that the difference between textual entailment and plausible inference is that textual entailment judges whether a hypothesis is correct given a premise and focuses on reasoning, while plausible inference picks the event that is most likely to happen according to commonsense knowledge.

The authors also call attention to superficial correlation biases in the datasets, for example, gender bias. The mutual information method (Gururangan et al., 2018) may help detect such biases, and the adversarial filtering process (Zellers et al., 2018) may help reduce them.
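To make the mutual information idea concrete, here is a minimal sketch that scores how strongly each hypothesis word is associated with each label using pointwise mutual information (PMI). It is only an illustration of the general technique: the function name `pmi_by_class`, the whitespace tokenization, the add-k smoothing default, and the toy examples are all my own assumptions, not code or data from the cited papers.

```python
import math
from collections import Counter


def pmi_by_class(examples, smoothing=1.0):
    """Return {(word, label): PMI} for hypothesis words and labels.

    examples:  iterable of (hypothesis_text, label) pairs.
    smoothing: add-k constant added to every (word, label) count
               (an illustrative default, not a value from any paper).
    """
    pair_counts = Counter()
    vocab, labels = set(), set()
    for text, label in examples:
        labels.add(label)
        # Count each word at most once per hypothesis.
        for word in set(text.lower().split()):
            pair_counts[(word, label)] += 1
            vocab.add(word)

    def count(word, label):
        # Smoothed co-occurrence count; unseen pairs get `smoothing`.
        return pair_counts[(word, label)] + smoothing

    total = sum(count(w, l) for w in vocab for l in labels)
    word_total = {w: sum(count(w, l) for l in labels) for w in vocab}
    label_total = {l: sum(count(w, l) for w in vocab) for l in labels}

    scores = {}
    for w in vocab:
        for l in labels:
            p_joint = count(w, l) / total
            p_word = word_total[w] / total
            p_label = label_total[l] / total
            scores[(w, l)] = math.log(p_joint / (p_word * p_label))
    return scores


# Toy, made-up hypotheses: "nobody" only appears with "contradiction".
toy = [
    ("nobody is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a man is outside", "entailment"),
    ("a person is sleeping", "neutral"),
]
scores = pmi_by_class(toy)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Words with a high PMI for one label are candidates for superficial cues: a classifier could exploit them without reading the premise at all.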

Knowledge Resources

Linguistic knowledge includes annotated corpora, frame semantics resources, lexical resources, and pre-trained semantic vectors.

Common and commonsense knowledge resources are mostly in the form of knowledge bases and knowledge graphs. To clarify, common knowledge refers to well-known facts about the world that are often explicitly stated, while commonsense knowledge is considered obvious to most humans and is unlikely to be explicitly stated (Cambria et al., 2011).

Learning and Inference Approaches

Three main neural approaches are brought up: attention mechanism, memory augmentation, and contextual models and representations.

The paper points out that attention mechanisms work well mainly for capturing the alignment between an input and an output and for capturing long-term dependencies. One thing to note is that some RNN models perform worse with attention when there is no such alignment. This is a reminder to keep in mind what each component is actually learning before stacking them all together.
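As a reminder of what "alignment" means here, the following is a minimal sketch of dot-product attention in plain Python. Dot-product scoring is just one common variant that I picked for illustration; the survey does not prescribe it, and the function names and toy vectors are my own assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, context) for one query vector.

    weights[i] says how strongly the query aligns with keys[i];
    context is the weighted average of the value vectors.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return weights, context


# Toy example: the query points in the same direction as the first key,
# so almost all of the attention weight lands on the first value.
weights, context = attend(
    query=[1.0, 0.0],
    keys=[[5.0, 0.0], [0.0, 5.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights, context)
```

If no key aligns with the query, the weights come out close to uniform and the context vector is just an average of the values, which gives some intuition for why attention adds little when the task has no input-output alignment.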

Memory augmentation methods, such as memory networks, are new to me and require further reading.

One interesting point about using external knowledge: Mihaylov et al. (2018) find that adding facts from ConceptNet introduces distraction that reduces performance, suggesting that the technique for selecting appropriate relations is important.
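To illustrate why selection matters, here is a minimal sketch of one naive way to filter external facts before handing them to a model: keep only the facts that share content words with the question. This is not the method of Mihaylov et al.; the function `select_facts`, the stopword list, and the toy facts (written by hand, not taken from ConceptNet) are all hypothetical.

```python
STOPWORDS = {"a", "an", "the", "of", "is", "in", "can"}


def content_words(text):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    words = (token.strip("?.,!") for token in text.lower().split())
    return {word for word in words if word and word not in STOPWORDS}


def select_facts(question, facts, top_k=2, min_overlap=1):
    """Return up to top_k facts ranked by word overlap with the question.

    Facts sharing fewer than min_overlap content words are dropped,
    so unrelated facts never reach the model.
    """
    question_words = content_words(question)
    scored = []
    for fact in facts:
        overlap = len(question_words & content_words(fact))
        if overlap >= min_overlap:
            scored.append((overlap, fact))
    # Stable sort: ties keep their original order in `facts`.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]


question = "Can a suit of armor conduct electricity?"
facts = [
    "armor is made of metal",
    "metal can conduct electricity",
    "a banana is a fruit",
    "paris is in france",
]
print(select_facts(question, facts))
```

Word overlap is a crude relevance signal, but it shows the trade-off: a looser filter lets in distracting facts, while a stricter one may drop a fact needed for a multi-step answer.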

Future Directions

The directions are mostly about designing datasets; still, I get some motivation from them.

Despite the good performance of current models, we don’t know whether they are actually performing reasoning. The authors think benchmarks should differentiate between types of reasoning and take them into account in evaluation.

A competence-centric evaluation, while important for pushing the state of the art, can also lead to a less productive path if not treated carefully.

The authors suggest that we pay more attention to understanding model behavior (anything insightful? what is the model actually learning?), computational efficiency, and generalization ability (inference on new tasks with minimal training).


  1. How can common or commonsense knowledge be used to create a benchmark dataset?
  2. What are the existing types of reasoning?


  1. Storks, S., Gao, Q., & Chai, J. Y. (2019). Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172.
  2. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., & Smith, N. A. (2018). Annotation Artifacts in Natural Language Inference Data. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 107–112.
  3. Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 93–104.
  4. Cambria, E., Song, Y., Wang, H., & Hussain, A. (2011). Isanette: A common and common sense knowledge base for opinion mining. 2011 IEEE 11th International Conference on Data Mining Workshops, 315–322.
  5. Mihaylov, T., Clark, P., Khot, T., & Sabharwal, A. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2381–2391.
  6. Niven, T., & Kao, H.-Y. (2019). Probing Neural Network Comprehension of Natural Language Arguments. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4658–4664.
  7. Ferrone, L., & Zanzotto, F. M. (2020). Symbolic, distributed, and distributional representations for natural language processing in the era of deep learning: A survey. Frontiers in Robotics and AI, 6, 153.
  8. Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9), 92–103.