Data Warehousing and Data Mining - Chapter 2: Data Preprocessing

Data understanding and data preparation

The role of data preprocessing

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

77 pages | Shared by: Mr Hưng


[...]ary)
- Each transform has two functions: smoothing and difference
- They apply to pairs of data, resulting in two sets of data of length L/2
- The two functions are applied recursively until the desired length is reached
[Figure: Haar-2 and Daubechies-4 wavelets]

DWT for image compression
[Figure: an image passed repeatedly through low-pass and high-pass filters]

Principal Component Analysis
- Given N data vectors of k dimensions, find c (<= k) orthogonal vectors that best represent the data.
- The original data set is reduced to N data vectors of c dimensions: the c principal components (the reduced dimensions).
- Each data vector is a linear combination of the principal component vectors.
- Applies to numeric data only.
- Used when the number of dimensions is large.

Principal Component Analysis (PCA)
[Figure: axes X1, X2 and Y1, Y2]

Numerosity reduction
- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression and log-linear models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression analysis and log-linear models
- Linear regression: $Y = \alpha + \beta X$
  - Two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the data at hand
  - They are found by applying the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  - Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$

Histograms
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems.

Clustering
- Partition the data set into clusters and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is “smeared”
- Clustering can be hierarchical and stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms

Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (data is read a page at a time).

Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Sampling
[Figure: raw data and the resulting cluster/stratified sample]

Hierarchical reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed, but it tends to define partitions of data sets rather than “clusters”
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by the value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes.
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and concept hierarchies
- Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

Discretization and concept hierarchy generation for numeric data
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-based discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  $E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., when the information gain $\mathrm{Ent}(S) - E(S, T)$ falls below a small threshold $\delta$.
- Experiments show that it may reduce data size and improve classification accuracy.
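
The entropy-based procedure can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the function names `entropy`, `best_boundary` and `discretize`, the `min_gain` threshold, and the choice of midpoints between adjacent distinct values as candidate boundaries are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_boundary(values, labels):
    """Return (T, E): the boundary T that minimizes the weighted entropy E
    of the two intervals S1 = {v <= T} and S2 = {v > T}."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a boundary
        t = (lo + hi) / 2  # assumption: candidates are midpoints
        s1 = [lab for _, lab in pairs[:i]]
        s2 = [lab for _, lab in pairs[i:]]
        e = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e


def discretize(values, labels, min_gain=0.1):
    """Recursively split; stop when the information gain is below min_gain.
    Returns the sorted list of chosen boundaries."""
    t, e = best_boundary(values, labels)
    if t is None or entropy(labels) - e < min_gain:
        return []
    left = [(v, lab) for v, lab in zip(values, labels) if v <= t]
    right = [(v, lab) for v, lab in zip(values, labels) if v > t]
    return (
        discretize([v for v, _ in left], [lab for _, lab in left], min_gain)
        + [t]
        + discretize([v for v, _ in right], [lab for _, lab in right], min_gain)
    )


# Hypothetical toy data: two well-separated classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_boundary(values, labels))
print(discretize(values, labels))
```

With this toy data a single boundary between 3 and 10 separates the classes, so recursion stops after one split. The stopping threshold is the part most open to choice; other criteria (such as a maximum number of intervals) would fit the same structure.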
Segmentation by natural partitioning
- The simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
- The rule looks at the number of distinct values an interval covers at the most significant digit:
  - If it covers 3, 6, 7 or 9 distinct values, partition the range into 3 equal-width intervals.
  - If it covers 2, 4, or 8 distinct values, partition the range into 4 intervals.
  - If it covers 1, 5, or 10 distinct values, partition the range into 5 intervals.

Example of the 3-4-5 rule
[Figure: distribution of profit (x-axis: profit, y-axis: count)]
Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into
  (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: (-$400 - $5,000) is partitioned into
  (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Concept hierarchy generation for categorical data
- Specification of a partial ordering of attributes at the schema level by users or experts
  - street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of a set of attributes; part of the ordering is generated automatically by analyzing the number of distinct values
  - E.g., street < city < state < country
- Specification of only a partial set of attributes
  - E.g., only street < city, without the others

Automatic concept hierarchy generation
- Some concept hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the given data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Note: exceptions include weekday, month, quarter, year
[Figure: country (15 distinct values) < province_or_state (65 distinct values) < city (3,567 distinct values) < street (674,339 distinct values), listed from the top level to the lowest level]
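
The distinct-value heuristic for automatic concept hierarchy generation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `order_by_distinct_values` and the four sample rows are hypothetical and are not taken from the slides.

```python
def order_by_distinct_values(rows, attributes):
    """Order attributes from the top of the hierarchy (fewest distinct
    values) to the bottom (most distinct values).

    Returns (ordering, counts), where counts maps each attribute to its
    number of distinct values in rows.
    """
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordering = sorted(attributes, key=lambda a: counts[a])
    return ordering, counts


# Hypothetical sample data, for illustration only.
rows = [
    {"street": "1 Main St", "city": "Urbana", "province_or_state": "Illinois", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "province_or_state": "Illinois", "country": "USA"},
    {"street": "3 Lake Rd", "city": "Chicago", "province_or_state": "Illinois", "country": "USA"},
    {"street": "4 Pine St", "city": "Madison", "province_or_state": "Wisconsin", "country": "USA"},
]

ordering, counts = order_by_distinct_values(
    rows, ["street", "city", "country", "province_or_state"]
)
print(" > ".join(ordering))
print(counts)
```

The heuristic relies only on counts, so it can order attributes wrongly when a higher-level concept has more distinct values than a lower-level one, which is the case for the weekday, month, quarter, year exception; such hierarchies are better specified by hand.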
