Lei Luo Machine Learning Engineer

Learning Part of Speech Using A Character-Word BiLSTM Model


Go straight to the code!

What is PoS?

In the English language, words can be considered the smallest elements that carry distinctive meaning. Based on their use and function, words are categorized into several types, or parts of speech. Some common parts of speech are noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.

What are RNNs and LSTMs?

A recurrent neural network (RNN) is a type of artificial neural network in which nodes not only connect with each other but also feed their output back to themselves. RNNs are good at sequence problems, such as speech recognition, language modeling, translation, and image captioning; the list goes on. Traditional feed-forward networks do not handle sequence problems very well. A simple RNN node can be pictured as a node whose output is also fed back in as input.

This special structure enables the network to remember what it has seen in the past and connect previous information to the present task; for example, previous video frames might inform the understanding of the present frame. Unfortunately, when we need to look very far back to understand the current frame, RNNs become unable to learn the connection. In theory, RNNs are capable of handling such “long-term dependencies,” and a human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. In cases like this, we can turn to LSTMs.
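To make the recurrence concrete, here is a minimal sketch of a single recurrent step in PyTorch, with hypothetical sizes chosen for illustration: the hidden state at each step depends on both the current input and the previous hidden state, which is what lets the network carry information across a sequence.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3          # hypothetical toy sizes
W_x = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_h = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b): the output feeds back in
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):  # a toy sequence of 5 steps
    h = rnn_step(x_t, h)
print(h.shape)  # torch.Size([3])
```

The same weights are reused at every step, so the sequence can be arbitrarily long while the parameter count stays fixed.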

LSTM is short for long short-term memory network, a special kind of RNN. LSTMs are explicitly designed to avoid the long-term dependency problem. The basic structure of an LSTM is shown below:

Bidirectional LSTMs have two sets of LSTM cells: the forward LSTM flows in the direction of the sequence, and the backward LSTM flows in the opposite direction. Often, the outputs of the forward and backward LSTMs are connected to the same output.
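In PyTorch this wiring comes for free: here is a small sketch (sizes are hypothetical) showing that a bidirectional LSTM concatenates the forward and backward features, so the output dimension is twice the hidden size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes: 8-dim inputs, 16 hidden units per direction
bilstm = nn.LSTM(input_size=8, hidden_size=16,
                 bidirectional=True, batch_first=True)

x = torch.randn(1, 10, 8)   # (batch, seq_len, input_size)
out, _ = bilstm(x)
# Forward and backward features are concatenated per time step:
print(out.shape)            # torch.Size([1, 10, 32])
```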

Learning Character-level Representation

The importance of learning a character-level representation lies in the fact that prefixes and suffixes are often informative for inferring the PoS. For example, words ending in ‘ty’ are often nouns, and words starting with ‘un’ are often negative adjectives. Some early research used a fixed character window that only considers a fixed number of characters, such as only the first 5 characters of a word. This is not enough to capture the suffix of longer words. The model in this study uses a bidirectional LSTM that concatenates the outputs of the forward and backward LSTMs, so it considers the entire character sequence rather than a fixed number of characters. The structure of the model is shown below:
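A sketch of this idea, with hypothetical vocabulary and layer sizes: each character is embedded, a bidirectional LSTM reads the full character sequence of a word, and the final forward and backward hidden states are concatenated into one fixed-size vector per word.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes: 60 characters in the alphabet, 10-dim char
# embeddings, 25 hidden units per LSTM direction
char_vocab, char_emb_dim, char_hidden = 60, 10, 25
char_embed = nn.Embedding(char_vocab, char_emb_dim)
char_lstm = nn.LSTM(char_emb_dim, char_hidden,
                    bidirectional=True, batch_first=True)

word_as_chars = torch.tensor([[3, 17, 5, 22, 8]])  # one word, 5 char ids
emb = char_embed(word_as_chars)                    # (1, 5, 10)
_, (h_n, _) = char_lstm(emb)                       # h_n: (2, 1, 25)

# Concatenate the final forward and backward states; the backward
# state has read the word from its last character (the suffix) inward.
char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)    # (1, 50)
print(char_repr.shape)
```

Because the LSTM runs over the whole character sequence, the representation is the same size regardless of word length.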

Learning Word-level Representation

In this study, the word embeddings are obtained from a pretrained model such as word2vec. Depending on the nature of the text we are working with, different word embeddings should be used; for example, when working with biochemical texts, it would be better to train the embeddings on a large corpus of biochemical text. The word embeddings are then concatenated with the character-level representation into a combined feature. Another bidirectional LSTM is used to learn from this combined representation, and a softmax layer on top predicts the PoS tag.
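The word-level stage can be sketched like so (all sizes hypothetical, and random tensors stand in for the pretrained embeddings and the character BiLSTM output): the two feature vectors are concatenated per word, a second bidirectional LSTM reads the word sequence, and a linear layer followed by softmax scores each word over the tag set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes: 6-word sentence, 50-dim word2vec vectors,
# 50-dim char representations, 100 hidden units, 45 PoS tags
seq_len, word_dim, char_dim, hidden, n_tags = 6, 50, 50, 100, 45
word_emb = torch.randn(1, seq_len, word_dim)    # stand-in for word2vec
char_repr = torch.randn(1, seq_len, char_dim)   # stand-in for char BiLSTM
combined = torch.cat([word_emb, char_repr], dim=-1)  # (1, 6, 100)

word_lstm = nn.LSTM(word_dim + char_dim, hidden,
                    bidirectional=True, batch_first=True)
tag_layer = nn.Linear(2 * hidden, n_tags)       # 2x for bidirectional

out, _ = word_lstm(combined)                    # (1, 6, 200)
probs = torch.softmax(tag_layer(out), dim=-1)   # (1, 6, 45)
print(probs.shape)
```

Each row of `probs` is a distribution over the tag set for one word; training would minimize cross-entropy against the gold tags.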

Results on CoNLL2000

The best F1 score of this model after 30 epochs is 94.21%, which is comparable to state-of-the-art performance.

The implementation of the model using PyTorch is provided in my GitHub repo.