ESRC Research Project ES/J022969/1

October 1, 2012 - January 31, 2016

Department of Philosophy, King's College London

SMOG is exploring the construction of an enriched stochastic model that represents the syntactic knowledge that native speakers of English have of their language.

We hope that this kind of model will provide a straightforward explanation for the fact that individual native speakers generally judge the well-formedness of sentences along a continuum, rather than by imposing a sharp boundary between acceptable and unacceptable sentences.

We are experimenting with different sorts of language models that contain a variety of parameters encoding properties of sentences and probability distributions over corpora.

We are training these models on subsets of the British National Corpus (BNC), and we are testing them on additional subsets of the BNC into which we have introduced grammatical deformations and infelicities of varying degrees of severity and subtlety.
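The train-and-test setup described above can be illustrated with a deliberately small sketch. It is not the project's model: it assumes a plain bigram model with add-one smoothing, a three-sentence toy corpus standing in for the BNC, and a hand-made word-order deformation. The function names `train_bigram` and `mean_logprob` are hypothetical. The point is only that a stochastic model assigns graded scores, rather than a binary verdict, to an intact sentence and its deformed counterpart.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def mean_logprob(tokens, unigrams, bigrams):
    """Return the length-normalized log probability under add-one smoothing."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Counter returns 0 for unseen bigrams, so smoothing keeps the log finite.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(padded) - 1)


# Toy training data (an assumption; the project trains on subsets of the BNC).
toy_corpus = [
    "the dog chased the cat".split(),
    "the cat saw the dog".split(),
    "a dog saw a cat".split(),
]
unigrams, bigrams = train_bigram(toy_corpus)

original = "the dog saw the cat".split()
deformed = "dog the saw cat the".split()  # hand-made word-order deformation

score_original = mean_logprob(original, unigrams, bigrams)
score_deformed = mean_logprob(deformed, unigrams, bigrams)
print(f"original: {score_original:.3f}  deformed: {score_deformed:.3f}")
```

Normalizing by length keeps longer sentences from being penalized simply for containing more words, which is one simple way to make scores comparable across test items.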

We hope to show that a sufficiently complex enriched language model can encode a fair amount of what native speakers know about the syntax of their language.

This research holds out the prospect of important impact in two areas. 

  1. It can shed light on the relationship between the representation and acquisition of linguistic knowledge, on the one hand, and learning and the encoding of knowledge in other cognitive domains, on the other. This can, in turn, help to clarify the respective roles of biologically conditioned learning biases and data-driven learning in human cognition.
  2. This work can contribute to the development of more effective language technology by providing insight into the way in which humans represent the syntactic properties of sentences in their language. To the extent that natural language processing systems take account of this class of representations they will provide more