I’m taking a friend’s questions on Vector Space Semantics models as an occasion to write about this arXival by Kim, Jernite, Sontag and Rush, which touches on an idea I’ve had in the back of my mind while fighting my own semantics model: that maybe it is easier to model characters directly instead of words, as there are only so many characters but hundreds of thousands of words. Modeling characters directly might also make an algorithm robust to neologisms and spelling mistakes, which indeed seems to be one of the conclusions of the paper.
The basic idea is to use character-level ConvNets (with a pooling operation over time), extract features using the recently proposed highway networks (see my post), and bind things together over time using LSTMs. The big advantage is that one has far fewer parameters than when processing at the word level; the downside is that I’m not completely sure how easily one might extract word embeddings. While that might be no problem in the long run (as I suppose models could be developed that work on characters only), it might be a problem for quickly combining this paper’s approach with the mainstream tools currently used in industry.
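To give a feeling for the difference in parameter count, here is a back-of-the-envelope comparison in Python. The vocabulary sizes, embedding dimensions and filter counts are my own illustrative choices, not the numbers from the paper.

```python
# Rough parameter count: word-level embedding table vs. character-level
# embedding plus convolution filters. All sizes below are illustrative
# assumptions, not the paper's hyperparameters.

char_vocab = 70          # letters, digits, punctuation, special symbols
word_vocab = 200_000     # a large word vocabulary
char_dim = 15            # character embedding dimension
word_dim = 650           # word embedding dimension
filter_widths = [1, 2, 3, 4, 5, 6]
filters_per_width = 100  # 600 filters in total

word_level = word_vocab * word_dim
char_level = (char_vocab * char_dim
              + sum(filters_per_width * char_dim * w for w in filter_widths))

print(f"word-level embedding parameters: {word_level:,}")   # 130,000,000
print(f"char-level embedding + filters:  {char_level:,}")   # 32,550
```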
In the convolutional layer they use between 100 and 1000 filters of varying width, each of which essentially recognizes a different character n-gram. After max pooling (over time), highway layers give a way of extracting hierarchical features. An LSTM is then used to predict the next word, the loss/“likelihood” being the cross entropy with respect to the actually observed next word. Unsurprisingly, in their discussion they attribute a strong increase in performance to the highway layers, which again makes me think that we have to work on enabling deeper hierarchies for Bayesian models.
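To make the pipeline concrete, here is a minimal sketch in PyTorch (the paper’s own code is Torch/Lua) of character convolutions with max-over-time pooling, a highway layer and an LSTM trained with cross entropy against the observed next word. All names and sizes are illustrative assumptions, not taken from the paper’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Highway(nn.Module):
    """y = t * g(W_H x) + (1 - t) * x, with transform gate t = sigmoid(W_T x)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * F.relu(self.transform(x)) + (1 - t) * x


class CharWordModel(nn.Module):
    def __init__(self, char_vocab=70, char_dim=15, word_vocab=10_000,
                 filter_widths=(1, 2, 3, 4, 5, 6), filters_per_width=100,
                 hidden=300):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # one 1-d convolution per filter width; each filter reacts to a
        # particular character n-gram pattern
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, filters_per_width, kernel_size=w)
            for w in filter_widths)
        feat = filters_per_width * len(filter_widths)
        self.highway = Highway(feat)
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, word_vocab)  # prediction is still over words

    def forward(self, chars):
        # chars: (batch, seq_len, max_word_len) character ids for each word
        b, s, l = chars.shape
        x = self.char_emb(chars.view(b * s, l)).transpose(1, 2)  # (b*s, char_dim, l)
        # max over time collapses each feature map to a single scalar
        feats = torch.cat([conv(x).max(dim=2).values for conv in self.convs], dim=1)
        feats = self.highway(feats).view(b, s, -1)
        h, _ = self.lstm(feats)
        return self.out(h)  # (batch, seq_len, word_vocab) logits


# one training step: cross entropy against the actually observed next word
model = CharWordModel()
chars = torch.randint(0, 70, (8, 20, 12))        # fake batch of character ids
next_words = torch.randint(0, 10_000, (8, 20))   # fake target word ids
logits = model(chars)
loss = F.cross_entropy(logits.reshape(-1, 10_000), next_words.reshape(-1))
loss.backward()
```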
Overall a super interesting case study, though as a linguist I would really like to know how these models could be changed to give us some insight into the structure of the language. As a plus for my friend, the code is available on GitHub, written using the ubiquitous Torch library.