The papers below were on our reading list but did not make it into the schedule. Feel free to move any of them back to the list of suggestions if they are (or become) of interest.

  • Author. Year. Paper.

  • J. H. Lee, M. Kerzel, K. Ahrens, C. Weber, and S. Wermter. What is right for me is not yet right for you: A dataset for grounding relative directions via multi-task learning. arXiv, arXiv:2205.02671 [cs.CV], 2022. Dataset

  • Y. Liu and G. Emerson. Learning functional distributional semantics with visual data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3976–3988, Dublin, Ireland, May 2022. Association for Computational Linguistics.

  • F. Liu, G. Emerson, and N. Collier. Visual spatial reasoning. arXiv, arXiv:2205.00363 [cs.CL], 2022.

  • E. Bugliarello, R. Cotterell, N. Okazaki, and D. Elliott. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics, 9:978–994, 2021.

  • L. W. Barsalou. Grounded cognition. Annual Review of Psychology, 59:617–645, 2008.

  • R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research, 55:409–442, 2016.

  • R. Bernardi and S. Pezzelle. Linguistic issues behind visual question answering. Language and Linguistics Compass, 15(6):elnc3.12417, 2021.

  • S. Buch, L. Fei-Fei, and N. D. Goodman. Neural Event Semantics for Grounded Language Understanding. Transactions of the Association for Computational Linguistics, 9:875–890, 08 2021.

  • J. Cho, J. Lei, H. Tan, and M. Bansal. Unifying vision-and-language tasks via text generation. arXiv, arXiv:2102.02779 [cs.CL], 2021.

  • G. Collell and M.-F. Moens. Learning representations specialized in spatial knowledge: Leveraging language and vision. Transactions of the Association for Computational Linguistics, 6:133–144, 2018.

  • I. Dasgupta, C. Kaeser-Chen, K. Marino, A. Ahuja, S. Babayan, F. Hill, and R. Fergus. Collaborating with language models for embodied reasoning. arXiv, arXiv:2302.00763 [cs.LG], 2023.

  • R. Dessì, E. Gualdoni, F. Franzon, G. Boleda, and M. Baroni. Communication breakdown: On the low mutual intelligibility between human and neural captioning. arXiv, arXiv:2210.11512 [cs.CL], 2022.

  • T. Dong, A. Testoni, L. Benotti, and R. Bernardi. Visually grounded follow-up questions: a dataset of spatial questions which require dialogue history. In Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics, pages 22–31, Online, Aug. 2021. Association for Computational Linguistics.

  • C. Greco, B. Plank, R. Fernández, and R. Bernardi. Psycholinguistics meets continual learning: Measuring catastrophic forgetting in visual question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3601–3605, Florence, Italy, July 2019. Association for Computational Linguistics.

  • E. Gualdoni, T. Brochhagen, A. Mädebach, and G. Boleda. What’s in a name? a large-scale computational study on how competition between names affects naming variation. Submitted, 2023.

  • J. Lu, D. Batra, D. Parikh, and S. Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv, arXiv:1908.02265 [cs.CV], 2019.

  • J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–10, June 2020.

  • I. Parfenova, D. Elliott, R. Fernández, and S. Pezzelle. Probing cross-modal representations in multi-step relational reasoning. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 152–162, Online, Aug. 2021. Association for Computational Linguistics.

  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.

  • T. Ramalho, T. Kociský, F. Besse, S. M. A. Eslami, G. Melis, F. Viola, P. Blunsom, and K. M. Hermann. Encoding spatial relations from natural language. arXiv, arXiv:1807.01670 [cs.CL]:16, July 5 2018.

  • F. Sadeghi, S. K. Kumar Divvala, and A. Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1456–1464, 2015.

  • S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. Chang, Z. Yao, and K. Keutzer. How much can CLIP benefit vision-and-language tasks? arXiv, arXiv:2107.06383 [cs.CV], 2021.

  • C. Silberer, S. Zarrieß, M. Westera, and G. Boleda. Humans meet models on object naming: A new dataset and analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1893–1905, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics.

  • A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15638–15650, June 2022.

  • G. Skantze and B. Willemsen. CoLLIE: Continual learning of language grounding from language-image embeddings. Journal of Artificial Intelligence Research, 74:1201–1223, jul 2022.

  • E. Sood, F. Kögel, F. Strohm, P. Dhar, and A. Bulling. VQA-MHUG: A gaze dataset to study multimodal neural attention in visual question answering. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 27–43, Online, Nov. 2021. Association for Computational Linguistics.

  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv, arXiv:1908.08530 [cs.CV], 2019.

  • H. Tan and M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.

  • M. Tsimpoukelli, J. L. Menick, S. Cabi, S. M. A. Eslami, O. Vinyals, and F. Hill. Multimodal few-shot learning with frozen language models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 200–212. Curran Associates, Inc., 2021.

  • P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho. Will we run out of data? an analysis of the limits of scaling datasets in machine learning. arXiv, arXiv:2211.04325 [cs.LG], 2022.

  • V. Wang-Mascianica and B. Coecke. Talking space: inference from spatial linguistic meanings. arXiv, arXiv:2109.06554 [cs.CL]:1–33, September 16 2021.

  • M. Zare, A. Ayub, A. Liu, S. Sudhakara, A. Wagner, and R. Passonneau. Dialogue policies for learning board games through multimodal communication. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 339–351, 1st virtual meeting, July 2020. Association for Computational Linguistics.

  • Z. Zhang, Y. Wang, Q. Wu, and F. Chen. Visual relationship attention for image captioning. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2019.

  • C. Zheng, Q. Guo, and P. Kordjamshidi. Cross-modality relevance for reasoning on language and vision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7642–7651, Online, July 2020. Association for Computational Linguistics.

  • Pay Attention to MLPs. Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. (2021). paper (from Aram)

  • Visually Grounded Follow-up Questions: a Dataset of Spatial Questions Which Require Dialogue History. Dong, T., Testoni, A., Benotti, L., & Bernardi, R. (2021). In Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (pp. 22–31). Association for Computational Linguistics. paper (from Nikolai)

  • PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World. Zellers, A., Peters, M., Mottaghi, R., Kembhavi, A., Farhadi, A., & Choi, Y. (2021). In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 2040–2050). Association for Computational Linguistics. paper (from Nikolai)

  • A Cognitive Regularizer for Language Modeling. Jason Wei, Clara Meister, Ryan Cotterell. paper

  • NeuralLog: Natural Language Inference with Joint Neural and Logical Reasoning. Zeming Chen, Qiyue Gao, Lawrence S. Moss paper (from Adam)

  • Tifrea, A., Bécigneul, G., & Ganea, O.-E. (2018). Poincaré GloVe: Hyperbolic Word Embeddings. ArXiv:1810.06546 [Cs]. http://arxiv.org/abs/1810.06546 (from Bill, would like to read: Adam)

  • Mittelman, R., Sun, M., Kuipers, B., & Savarese, S. (2014). A Bayesian generative model for learning semantic hierarchies. Frontiers in Psychology, 5. https://doi.org/10.3389/fpsyg.2014.00417 (from Bill, would like to read: Adam)

  • Emerson, G. (2020). What are the Goals of Distributional Semantics? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7436–7453. https://doi.org/10.18653/v1/2020.acl-main.663 (from Bill, would like to read: Adam)

  • Nguyen, D., Rosseel, L., & Grieve, J. (2021). On learning and representing social meaning in NLP: A sociolinguistic perspective. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 603–612. https://www.aclweb.org/anthology/2021.naacl-main.50 (from Bill)

  • Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words. Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze paper (from Adam)

  • A Mutual Information Maximization Perspective of Language Representation Learning. Lingpeng Kong, Cyprien de Masson d’Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama. paper (from Adam, seems reaaaally cool!)

  • Self-Supervised Dialogue Learning. Jiawei Wu, Xin Wang, William Yang Wang. paper (from Adam)

  • Norm-Based Curriculum Learning for Neural Machine Translation. Xuebo Liu, Houtim Lai, Derek F. Wong, Lidia S. Chao paper (from Adam)

  • Curriculum learning. Bengio, Yoshua and Louradour, Jérôme and Collobert, Ronan and Weston, Jason. paper (from Adam)

  • Automated Curriculum Learning for Neural Networks. Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, Koray Kavukcuoglu paper (from Adam)

  • Unsupervised Bilingual Lexicon Induction via Latent Variable Models. Zi-Yi Dou, Zhi-Hao Zhou, Shujian Huang paper (from Adam)

  • Generating Sentences from Disentangled Syntactic and Semantic Spaces. Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, Jiajun Chen paper (from Adam)

  • Residual Energy-Based Models for Text Generation. Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, Marc’Aurelio Ranzato paper (from Adam)

  • Generating Sentences from a Continuous Space. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio paper (from Adam)

  • Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information. Li, J., Tan, H., & Bansal, M. (2021). In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1041–1050). Association for Computational Linguistics. paper (Recommended by Nikolai, would like to read: Nikolai)

  • Measuring Social Biases in Grounded Vision and Language Embeddings. Ross, C., Katz, B., & Barbu, A. (2021). In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 998–1008). Association for Computational Linguistics. paper (Recommended by Nikolai, would like to read: Nikolai)

  • KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, & Marcus Rohrbach. (2020). CVPR 2021. paper (Recommended by Nikolai, would like to read: Nikolai)

  • Implicit Representations of Meaning in Neural Language Models. Belinda Z. Li, Maxwell Nye, & Jacob Andreas. (2021). ACL 2021. paper. (Recommended by Nikolai, would like to read: Nikolai)

  • Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering. Greco, C., Plank, B., Fernández, R., & Bernardi, R. (2019). paper In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3601–3605). Association for Computational Linguistics. (Recommended by Nikolai, would like to read: Nikolai)

  • Incorporating Structural Alignment Biases into an Attentional Neural Translation Model (2021) Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, Gholamreza Haffari paper (Recommended by Adam, would like to read: Adam)

  • How (Non-)Optimal is the Lexicon? (2021) Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, Damián Blasi paper (Recommended by Adam, would like to read: Adam)

  • Benotti, L., & Blackburn, P. (2021). Grounding as a Collaborative Process. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 515–531). Association for Computational Linguistics. [paper] (https://www.aclweb.org/anthology/2021.eacl-main.41.pdf) (Recommended by Nikolai, would like to read: Nikolai)

  • Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, & Kai-Wei Chang. (2021). Weakly-Supervised VisualBERT: Pre-training Vision-and-Language Representations Without Parallel Images and Captions. paper NAACL 2021 (Recommended by Nikolai, would like to read: Nikolai)

  • Annette Rios, Chantal Amrhein, Noëmi Aepli, & Rico Sennrich. (2021). On Biasing Transformer Attention Towards Monotonicity. paper NAACL 2021 (Recommended by Nikolai, would like to read: Nikolai)

  • Spliethöver, M., & Wachsmuth, H. (2020). Argument from Old Man’s View: Assessing Social Bias in Argumentation. (https://www.aclweb.org/anthology/2020.argmining-1.9.pdf) (Recommended by Anna, would like to read: Anna)

  • Kevin Lu, Aditya Grover, Pieter Abbeel, & Igor Mordatch. (2021). Pretrained Transformers as Universal Computation Engines. paper (Recommended by Nikolai, would like to read: Nikolai)

  • M. Artetxe, G. Labaka, and E. Agirre. (2019). Bilingual Lexicon Induction through Unsupervised Machine Translation ACL 2019 (recommended by Adam, would like to read: Simon, Adam, Anna, Nikolai)

  • Goodman, N. D., & Stuhlmüller, A. (2013). Knowledge and Implicature: Modeling Language Understanding as Social Cognition. Topics in Cognitive Science. paper (recommended by Bill, would like to read: Robin, Simon, Anna)

  • K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, Aug. 2019. Association for Computational Linguistics. (recommended by Felix, would like to read: Adam, Anna, Nikolai)

  • J. Bastings and K. Filippova. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 149–155, Online, Nov. 2020. Association for Computational Linguistics. (recommended by Felix, would like to read: Adam, Anna, Nikolai)

  • Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, & Dhruv Batra. (2019). Multi-Target Embodied Question Answering (recommended by Simon and Nikolai, would like to read: Simon, Nikolai)

  • Moro, D., Black, S., & Kennington, C. (2019). Composing and Embedding the Words-as-Classifiers Model of Grounded Semantics. arXiv preprint arXiv:1911.03283. (https://arxiv.org/pdf/1911.03283.pdf) (recommended by Staffan, would like to read: Robin, Nikolai)

  • Thomason, J., Padmakumar, A., Sinapov, J., Walker, N., Jiang, Y., Yedidsion, H., … & Mooney, R. J. (2020). Jointly improving parsing and perception for natural language commands through human-robot dialog. Journal of Artificial Intelligence Research, 67, 1-48. paper (recommended by Mehdi, would like to read: Robin, Simon)

  • J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer. Vision-and-dialog navigation. In Conference on Robot Learning (CoRL), 2019. paper (recommended by Simon, would like to read: Simon, Nikolai)

  • M. Janner, K. Narasimhan, and R. Barzilay. Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics, 6:49–61, 2018. (recommended by Mehdi, would like to read: Simon, Nikolai) link

  • Caglayan, O., Ive, J., Haralampieva, V., Madhyastha, P., Barrault, L., & Specia, L. (2020). Simultaneous Machine Translation with Visual Context. EMNLP 2020. (recommended by Nikolai, would like to read: Simon)

  • Malt, B. C., Sloman, S. A., Gennari, S., Shi, M., & Wang, Y. (1999). Knowing versus naming: Similarity and the linguistic categorization of artifacts. Journal of Memory and Language, 40(2), 230-262. paper (recommended by Staffan, would like to read: Robin)

  • Mollica, F. et al. (2019). Composition is the core driver of the language-selective network Paper (recommended by Mehdi, would like to read: Robin)

  • Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. https://arxiv.org/abs/2004.04696 (recommended by Nikolai, would like to read: Simon)

  • Tan, H., & Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. https://arxiv.org/abs/1908.07490 (recommended by Simon, would like to read: Simon)

  • Tan, H., Dernoncourt, F., Lin, Z., Bui, T., & Bansal, M. (2019). Expressing Visual Relationships via Language. paper (recommended by Nikolai, would like to read: Simon)

  • J. Zwarts and Y. Winter. Vector space semantics: A model-theoretic analysis of locative prepositions. Journal of Logic, Language and Information, 9:169–211, 2000. (recommended by all, would like to read: Robin)

  • Magnus Sahlgren, Fredrik Carlsson (2020) The Singleton Fallacy: Why Current Critiques of Language Models Miss the Point

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Image, 2, T2. (recommended by Mehdi) https://arxiv.org/abs/1609.02116

  • Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring (Recommended by Adam)

  • Ben Bogin, Sanjay Subramanian, Matt Gardner, Jonathan Berant (2020) Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering (Recommended by Adam)

  • Paula Czarnowska, Sebastian Ruder, Ryan Cotterell, Ann Copestake (2020) Morphologically Aware Word-Level Translation (Recommended by Adam)

  • Wu, J., & Mooney, R. J. (2019). Self-Critical Reasoning for Robust Visual Question Answering. http://arxiv.org/abs/1905.09998 (recommended by Simon)

  • Akbik, Alan & Blythe, Duncan & Vollgraf, Roland. (2018). Contextual String Embeddings for Sequence Labeling. paper (recommended by Axel)

  • Wang, Bin & Chen, Fenxiao & Wang, Yuncheng & Kuo, C.. (2020). Efficient Sentence Embedding via Semantic Subspace Analysis. paper (recommended by Axel)

  • Marcus, G. (2018). Deep learning: A critical appraisal. paper; video comments; (recommended by Mehdi)

  • J. A. Bateman, M. Pomarlan, and G. Kazhoyan. Embodied contextualization: Towards a multistratal ontological treatment. Applied Ontology, Pre-press:1–35, 2 October 2019. paper

  • What are the differences between neural networks and the brain? panel discussion from Center for Brains, Minds and Machines (CBMM) (recommended by Mehdi)

  • W. N. Havard, J.-P. Chevrot, and L. Besacier. Models of visually grounded speech signal pay attention to nouns: a bilingual experiment on English and Japanese. arXiv, arXiv:1902.03052 [cs.CL]:1–5, 2019. paper (recommended by Sylvie)

  • L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. “What is relevant in a text document?”: An interpretable machine learning approach. PLOS ONE, 12(8):1–23, 08 2017. paper (recommended by Felix)

  • Yatskar, M., Zettlemoyer, L., & Farhadi, A. (2016). Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5534-5542). link (recommended by Mehdi)

  • Mei, H., Bansal, M., & Walter, M. R. (2016, February). Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. In AAAI (pp. 2772-2778). link (recommended by Mehdi)

  • one of the papers on this page (Oxford robotics & vision group): link (recommended by Staffan) Link broken

  • Ben-Yosef, G., Assif, L., & Ullman, S. (2018). Full interpretation of minimal images. Cognition, 171, 65-84. link video (recommended by Mehdi)