English  |  正體中文  |  简体中文  |  Items with full text/Total items : 43312/67235
Visitors : 2021000      Online Users : 2
RC Version 5.0 © Powered By DSPACE, MIT. Enhanced by NTU/NCHU Library IR team.
National Chung Hsing University Institutional Repository - NCHUIR > 工學院 > 通訊工程研究所 > 依資料類型分類 > 碩博士論文 >  於雲端Hadoop平台上運用PST進行序列型樣探勘之研究

Please use this identifier to cite or link to this item: http://nchuir.lib.nchu.edu.tw/handle/309270000/153919

標題: 於雲端Hadoop平台上運用PST進行序列型樣探勘之研究
Mining Sequence Patterns by Using Probabilistic Suffix Tree on Hadoop Platform
作者: 王偉丞
Wang, Wei-Cheng
Contributors: 蔡曉萍
Hsiao-Ping Tsai
通訊工程研究所
關鍵字: 序列型樣探勘;雲端計算;機率後置樹
Sequence Pattern Mining;Cloud Computing;Hadoop/MapReduce;Probabilistic Suffix Tree (PST)
日期: 2012
Issue Date: 2013-11-19 12:31:34 (UTC+8)
Publisher: 通訊工程研究所
摘要: 『序列型樣探勘(sequence pattern mining)』主要是挖掘隱藏在序列資料中特殊、重要、具代表性的特徵(feature)。序列型樣探勘吸引了大量的關注,尤其是在生物資訊領域與具有時空軌跡探勘領域中。許多後置樹(Suffix Tree),特別是Probabilistic Suffix Tree (PST),常被用於序列型樣探勘,主要是因為其具有擷取序列資料之結構特徵能力,且其計算複雜度較低,因此常被用於計算能力或記憶體容量有限的環境下。
近來,隨著數據蒐集技術的進步,大量資訊迅速累積且無所不在,然而傳統集中式suffix tree演算法的可擴充性較差,因此無法應付大量資料的型樣探勘。有介於此,本論文提出了三種適用於雲端Hadoop平台的平行分散式的PST建構演算法,分別為CloudPST_Naïve、CloudPST、CloudPST_OneScan,具體來說,此三種演算法是使用Hadoop/MapReduce的程式模型來建構PST。CloudPST_Naïve是一種較直覺的演算法,由WordCount範例衍生而來。為解決Naïve方法的缺點,我們提出CloudPST演算法,其以漸進、一次多層、迭代的方式建構PST,因此可以避免過度探勘型樣,同時能平衡分散式計算的overhead。為避免多次掃描整個序列資料而拖累整體效能,我們進一步提出CloudPST_OneScan演算法,其利用一個新設計的資料結構以暫存中間統計的結果,因此每次迭代只需要掃描整個序列一次,即可建構該次迭代所需建構的PST層數。
為效能比較,我們進行許多實驗,試驗結果顯示CloudPST_OneScan各方面的表現都比CloudPST_Naïve和CloudPST好。而且CloudPST_OneScan擁有良好的執行效能,且具有良好的可擴充性與穩定性。
Sequence pattern mining is to discover special, important, and representative features hidden in sequence data. It attracts a lot of attention especially in the domains of bioinformatics and spatio-temporal trajectory data mining. To discover features in sequence data, many suffix trees, especially the Probabilistic Suffix Tree (PST), are frequently used due to their capability in capturing the structural characteristics in sequence data.
Recently, with the advance of data collection techniques, huge amounts of sequence data are accumulated ubiquitously. However, traditional centralized suffix tree learning algorithms do not scale well in learning huge sequence data. In view of this, we propose three distributed and parallel PST building algorithm, named CloudPST_Naive, CloudPST and CloudPST_OneScan on the Hadoop platform to speed up the learning process. Specifically, the three algorithms are Map/Reduce frameworks. CloudPST_Naive is an intuitive approach derived by the WordCount example. To overcome the drawbacks of the Naïve approach, we propose the CloudPST algorithm which builds a PST in an iterative, levels by levels manner to avoid learning excessive patterns and trade off the overhead of distributed computing. Furthermore, to avoid multiple scanning of the entire sequence data, we propose CloudPST_OneScan algorithm which involves a new data structure to store the intermediate statistics so that the One-Scan algorithm only scans the entire sequence data once for each iteration.
To evaluate the performance of our proposed algorithms, we conduct exhaustive experiments and the experimental results show that the CloudPST_OneScan outperforms CloudPST_Naive and CloudPST algorithms. In addition, CloudPST_OneScan algorithm shows good efficiency and possesses great scalability and stability.
Appears in Collections:[依資料類型分類] 碩博士論文

Files in This Item:

File SizeFormat
index.html0KbHTML142View/Open


 


學術資源

著作權聲明

本網站為收錄中興大學學術著作及學術產出,已積極向著作權人取得全文授權,並盡力防止侵害著作權人之權益。如仍發現本網站之數位內容有侵害著作權人權益情事者,請權利人通知本網站維護人員,將盡速為您處理。

本網站之數位內容為國立中興大學所收錄之機構典藏,無償提供學術研究與公眾教育等公益性使用。

聯絡網站維護人員:wyhuang@nchu.edu.tw,04-22840290 # 412。

DSpace Software Copyright © 2002-2004  MIT &  Hewlett-Packard  /   Enhanced by   NTU/NCHU Library IR team Copyright ©   - Feedback