Improving short text classification by learning vector representations of both words and hidden topics

KEYWORDS
Feature vector; Text corpus
DOI: 10.1016/j.knosys.2016.03.027
Publication Date: 2016-03-31
HIGHLIGHTS
- We exploit knowledge from a corpus that is topic-consistent with the short texts, build a topic model on it, and use the topics to enrich both the corpus and the short texts.
- We learn vector representations of words and topics jointly on the enriched corpus.
- We use the word and topic vectors to build feature representations of short texts for training and classification.
- Our method outperforms many baselines.

ABSTRACT
This paper presents a general framework for short text classification that learns vector representations of words and hidden topics together. We collect a large-scale external data set, referred to as the "corpus", that is topic-consistent with the short texts to be classified, and build a topic model on it with Latent Dirichlet Allocation (LDA). In every text of the corpus and in every short text, the topics assigned to words are treated as new words and integrated into the text, enriching the data. On the enriched corpus we learn vector representations of both words and topics, so that feature representations of short texts can be built from word and topic vectors for training and classification. On an open short text classification data set, learning vectors of both words and topics significantly reduces the classification error compared with learning word vectors alone. We also compared the proposed method with various baselines, and the experimental results confirm the effectiveness of our word/topic vector representations.
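The enrichment-and-representation pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `word2topic` mapping stands in for the word-to-topic assignments an LDA model would produce, and the toy 2-d embeddings stand in for the jointly learned word/topic vectors.

```python
# Sketch of the enriching step: topic IDs are injected as pseudo-words
# ("TOPIC_k") so that a word-embedding model trained on the enriched
# corpus learns vectors for words and topics jointly.

word2topic = {  # hypothetical LDA topic assignments for a toy vocabulary
    "stock": 0, "market": 0, "game": 1, "team": 1,
}

def enrich(text):
    """Append each word's topic token so topics co-occur with words."""
    tokens = text.lower().split()
    topics = [f"TOPIC_{word2topic[w]}" for w in tokens if w in word2topic]
    return tokens + topics

def features(text, vectors, dim):
    """Represent a short text as the average of the vectors of its
    words and topic tokens in the enriched form."""
    toks = [t for t in enrich(text) if t in vectors]
    if not toks:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in toks) / len(toks) for i in range(dim)]

# toy 2-d embeddings standing in for the jointly learned vectors
vecs = {"stock": [1.0, 0.0], "market": [0.8, 0.2], "TOPIC_0": [0.9, 0.1],
        "game": [0.0, 1.0], "TOPIC_1": [0.1, 0.9]}

print(enrich("stock market"))  # ['stock', 'market', 'TOPIC_0', 'TOPIC_0']
print(features("stock market", vecs, 2))
```

In practice the topic assignments would come from an LDA model fit on the external corpus, and the embeddings from a word-vector model (e.g. word2vec-style training) run over the enriched texts; the averaged feature vector then feeds a standard classifier.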