Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text (work in collaboration with Michael Hahn)
- Event: Seminar
- Lecturer: Marco Baroni
- Date: 22 October 2018
- Duration: 2 hours
- Venue: Gothenburg
As recurrent neural networks (RNNs) have recently reached striking performance levels in a variety of natural language processing tasks, there has been a revival of interest in whether these generic sequence processing devices are effectively capturing linguistic knowledge. Nearly all studies of this sort, however, initialize the RNNs with a vocabulary of known words, and feed them tokenized input during training. We are instead running an extensive, multilingual (English/German/Italian) study of the linguistic knowledge induced by RNNs trained at the character level on input data with whitespace removed. Our networks thus face a tougher and more cognitively realistic task, having to discover all the levels of the linguistic hierarchy from scratch. Our current results show that these “near tabula rasa” RNNs are implicitly encoding a surprising amount of phonological, lexical, morphological, syntactic and semantic information, opening the door to intriguing speculations about the degree of prior knowledge that is necessary for successful language learning.
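To make the setup concrete, the sketch below shows a minimal character-level LSTM language model trained on whitespace-stripped text, in the spirit of the study described above. It is an illustrative assumption, not the speakers' actual implementation; the toy corpus, model sizes, and training loop are all placeholders.

```python
# Minimal sketch (assumed, not the authors' code): a character-level LSTM LM
# trained on text with whitespace removed, so word boundaries must be discovered.
import torch
import torch.nn as nn

# Toy corpus; stripping spaces removes explicit segmentation cues.
corpus = "the cat sat on the mat the dog sat on the log"
text = corpus.replace(" ", "")

chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
data = torch.tensor([char2idx[c] for c in text], dtype=torch.long)

class CharLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = CharLM(len(chars))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Next-character prediction: input is text[:-1], target is text[1:].
inputs = data[:-1].unsqueeze(0)   # shape (1, T-1)
targets = data[1:].unsqueeze(0)

for step in range(200):
    optimizer.zero_grad()
    logits, _ = model(inputs)
    loss = loss_fn(logits.reshape(-1, len(chars)), targets.reshape(-1))
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.3f}")
```

In a probing study of the kind described in the abstract, the trained network's hidden states would then be fed to diagnostic classifiers testing for phonological, lexical, morphological, syntactic and semantic information.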