The course focuses on machine learning and deep learning models and techniques, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Convolutional Neural Networks (ConvNets), neural auto-encoders, and Memory Networks, applied to the computational modeling of natural language, images, and other sensory information.

Theoretically, it examines how machine learning approaches address topics such as multi-modal grounded representations of meaning, representing and resolving semantic ambiguity, attention and salience, perception and dialogue interaction, natural language interpretation, natural language generation, natural language reasoning and inference, and collection of perceptual and linguistic data.

Practically, the course surveys contemporary computer vision and natural language processing tasks such as generating image and video descriptions, visual question answering, image retrieval using text queries, aligning images and text in large data collections, and image generation from textual descriptions.

The course webpage can be accessed here.

The course syllabus can be found here.

ML Methods for Vision and Language