Random forests and the data sparseness problem in language modeling
Keywords: Perplexity, Smoothing, Cache language model, Word error rate, n-gram
DOI: 10.1016/j.csl.2006.01.003
Publication Date: 2006-02-21
ABSTRACT
Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation, both of which are hindered by the data sparseness problem. Applying random forests (RFs) to language modeling addresses the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing techniques for dealing with data sparseness. We study our RF approach in the context of n-gram language modeling, in which a history consists of n − 1 words. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser–Ney smoothing, in reducing both perplexity (PPL) and word error rate (WER) in large-vocabulary state-of-the-art speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.
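The core idea described in the abstract, averaging the smoothed predictions of many randomly grown decision trees over n-gram histories and scoring the result with perplexity, can be illustrated with a small sketch. The snippet below is not the authors' method: each randomly grown tree is approximated by a random hashing of (n − 1)-word histories into equivalence classes, add-one smoothing stands in for the Kneser–Ney-style smoothing of tree leaves, and the corpus, vocabulary, forest size, and class count are all invented for illustration.

```python
import math
from collections import defaultdict

# Toy setup: all data and parameters below are invented for illustration.
N = 3             # trigram-style model: a history is the previous n - 1 = 2 words
NUM_TREES = 10    # size of the random forest
NUM_CLASSES = 8   # history equivalence classes per randomly built "tree"

train = "the cat sat on the mat the dog sat on the rug".split()
heldout = "the cat sat on the rug".split()
vocab = sorted(set(train))

def ngrams(tokens):
    """Yield ((n-1)-word history, next word) pairs, padding the sentence start."""
    padded = ["<s>"] * (N - 1) + tokens
    for i in range(N - 1, len(padded)):
        yield tuple(padded[i - N + 1:i]), padded[i]

class RandomHistoryTree:
    """One randomly built history classifier with add-one smoothed leaves.

    A stand-in for a randomly grown decision tree: instead of asking
    questions about the history, it hashes the history (together with a
    per-tree seed) into one of NUM_CLASSES equivalence classes.
    """
    def __init__(self, seed):
        self.seed = seed
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.totals = defaultdict(int)                       # class -> total count

    def leaf(self, history):
        return hash((self.seed,) + history) % NUM_CLASSES

    def train(self, tokens):
        for hist, word in ngrams(tokens):
            c = self.leaf(hist)
            self.counts[c][word] += 1
            self.totals[c] += 1

    def prob(self, word, history):
        # Add-one smoothing over the vocabulary; the paper instead applies a
        # Kneser-Ney-style smoothing to the decision-tree leaves.
        c = self.leaf(history)
        return (self.counts[c][word] + 1) / (self.totals[c] + len(vocab))

# Build the forest and estimate each tree from the training data.
forest = [RandomHistoryTree(seed=s) for s in range(NUM_TREES)]
for tree in forest:
    tree.train(train)

def forest_prob(word, history):
    """The forest estimate is the average of the individual tree estimates."""
    return sum(t.prob(word, history) for t in forest) / NUM_TREES

# Perplexity on held-out text: PPL = exp(-(1/T) * sum_t log p(w_t | history_t)).
log_sum, count = 0.0, 0
for hist, word in ngrams(heldout):
    log_sum += math.log(forest_prob(word, hist))
    count += 1
print("held-out perplexity:", round(math.exp(-log_sum / count), 2))
```

Averaging over many random history classifications is what lets the forest back off gracefully on histories never seen in training, which is the generalization behavior the abstract contrasts with regular n-gram models.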