
Please use this identifier to cite or link to this item: http://nchuir.lib.nchu.edu.tw/handle/309270000/154559

Title: 在大量中文語料中語言模型關於平滑問題特性之分析
Analyzing Properties of Smoothing Issues for Language Models in Large Mandarin Corpus
作者: 黃健祐
Hwang, Chien-Yo
Contributors: 余明興 (Ming-Shing Yu); 資訊網路多媒體研究所 (Institute of Networking and Multimedia)
Keywords: language models; smoothing methods; perplexity; cross entropy
Date: 2012
Issue Date: 2013-11-21 10:56:31 (UTC+8)
Publisher: Institute of Networking and Multimedia (資訊網路多媒體研究所)
Abstract: Smoothing is a fundamental and important topic in natural language processing. Many applications, such as speech recognition, machine translation, input methods, and even traditional-simplified Chinese conversion, rely on it. Smoothing is chiefly used to overcome the data-sparseness problem that statistical language models encounter in practice, assigning a probability estimate to every event.
In this thesis, we first discuss the cross entropy and perplexity of smoothing methods. Because of data sparseness, smoothing methods are employed to estimate the probability of each event in a language model. We review several well-known smoothing methods: the Additive Discount method, the Good-Turing method, and the Witten-Bell method.
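For reference, standard textbook formulations of these three estimators are given below. The notation is assumed here rather than quoted from the thesis: c(w) is the count of event w, N the total number of tokens, V the vocabulary size, T the number of distinct observed types, and n_r the number of events occurring exactly r times.

\[
\begin{aligned}
&\text{Additive discounting:} && P_{\mathrm{add}}(w) = \frac{c(w) + \delta}{N + \delta V}, \quad 0 < \delta \le 1, \\
&\text{Good-Turing:} && r^{*} = (r + 1)\,\frac{n_{r+1}}{n_{r}}, \qquad P_{\mathrm{GT}}(w) = \frac{r^{*}}{N} \ \text{ with } r = c(w), \\
&\text{Witten-Bell:} && P_{\mathrm{WB}}(w) = \frac{c(w)}{N + T} \ \ (c(w) > 0), \qquad \sum_{w:\, c(w) = 0} P_{\mathrm{WB}}(w) = \frac{T}{N + T}.
\end{aligned}
\]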
Existing smoothing techniques solve the data-sparseness problem effectively, but they do not analyze whether the resulting frequency distribution of observed events is reasonable. We therefore analyze smoothing from a statistical point of view and propose a set of properties for characterizing the statistical behavior of these smoothing methods. We then present two new smoothing methods that satisfy the proposed properties.
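The abstract does not enumerate the proposed properties, so the following is only an illustrative sketch of this kind of statistical check: it verifies that a smoothing method (additive smoothing here) still yields a proper probability distribution, with every event receiving positive mass and all probabilities summing to one.

from collections import Counter

def additive_probs(tokens, vocab, delta=0.5):
    """Additive (Lidstone) smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    return {w: (counts[w] + delta) / (n + delta * v) for w in vocab}

vocab = ["我", "們", "的", "語", "料"]
tokens = ["我", "們", "的", "我", "的"]
probs = additive_probs(tokens, vocab)

# Two basic statistical properties a smoothed model should satisfy:
assert all(p > 0 for p in probs.values())       # every event gets positive mass
assert abs(sum(probs.values()) - 1.0) < 1e-12   # the distribution sums to 1

Analogous checks can be written for the Good-Turing and Witten-Bell estimates.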
Finally, we build language models from a large Mandarin corpus, discuss how to evaluate them with cross entropy and perplexity, and examine the count cut-off issues raised by Katz.
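As a minimal, self-contained sketch of this evaluation step (a toy unigram model with made-up probabilities, not the thesis's actual models): cross entropy is H = -(1/N) Σ log2 P(w_i), perplexity is 2^H, and a Katz-style cut-off would discard n-grams whose counts fall below a threshold before estimation.

import math

def cross_entropy(test_tokens, probs):
    """Per-token cross entropy in bits: H = -(1/N) * sum(log2 P(w))."""
    return -sum(math.log2(probs[w]) for w in test_tokens) / len(test_tokens)

# Toy smoothed unigram model; any smoothing method can supply these values.
probs = {"我": 0.3, "們": 0.2, "的": 0.3, "語": 0.1, "料": 0.1}
held_out = ["我", "的", "料"]

h = cross_entropy(held_out, probs)
pp = 2 ** h  # perplexity is 2 raised to the cross entropy
print(f"cross entropy = {h:.3f} bits, perplexity = {pp:.3f}")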
Appears in Collections: [Classified by Resource Type] Master's and Doctoral Theses

Files in This Item:

File                        Size     Format
nchu-101-7099083009-1.pdf   1202 KB  PDF
index.html                  0 KB     HTML