Advanced Data Mining Techniques. David L. Olson and Dursun Delen. Heidelberg: Springer.


Shortening data processing time can reduce much of the total computation time in data mining.


The simple and standard data format resulting from data preprocessing can provide an environment of information sharing across different computer systems, which creates the flexibility to implement various data mining algorithms or tools. As an important component of data preparation, data transformation uses simple mathematical formulations or learning curves to convert the different measurements of the selected and cleaned data into a unified numerical scale for the purpose of data analysis.

Many available statistical measures, such as the mean, median, mode, and variance, can readily be used to transform the data. In terms of the representation of data, data transformation may be used to (1) transform from numerical to numerical scales, and (2) recode categorical data to numerical scales. One reason for transformation is to eliminate differences in variable scales. Transforming data from the metric system e.
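As an illustration of numerical-to-numerical transformation, the sketch below puts two variables with very different ranges onto a common scale. The function names and the income and age figures are hypothetical, and min-max scaling and z-scores are only two of several possible choices.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on their mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]  # hypothetical annual incomes
ages = [25, 40, 55, 70]                      # hypothetical ages in years

scaled_incomes = min_max_scale(incomes)  # now comparable with scaled_ages
scaled_ages = min_max_scale(ages)
standardized_ages = z_score(ages)
```

After scaling, both variables lie between 0 and 1, so neither dominates a distance calculation simply because of its units.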

For categorical to numerical scales, we have to assign an appropriate numerical value to each categorical value according to needs. Categorical variables can be ordinal (such as less, moderate, and strong) or nominal (such as red, yellow, blue, and green).

We need to be careful not to introduce more precision than is present in the original data. For instance, Likert scales often represent ordinal information with coded numbers (1 to 7, 1 to 5, and so on).

An object rated as 4 may not be meant to be twice as strong on some measure as an object rated as 2. Sometimes, we can apply values to represent a block of numbers or a range of categorical variables. There is no unique procedure and the only criterion is to transform the data for convenience of use during the data mining stage.
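A minimal sketch of recoding categorical data follows. The code values, the indicator layout for nominal data, and the age cutoffs are illustrative assumptions, not a prescribed procedure.

```python
# Ordinal categories: the codes preserve order only, not distance.
ORDINAL_CODES = {"less": 1, "moderate": 2, "strong": 3}

# Nominal categories: no order, so use one 0/1 indicator per category.
COLORS = ["red", "yellow", "blue", "green"]

def encode_ordinal(value):
    """Map an ordinal category to its rank code."""
    return ORDINAL_CODES[value]

def encode_nominal(value):
    """Return a list of 0/1 indicators, one per color."""
    return [1 if value == color else 0 for color in COLORS]

def bin_age(age):
    """Represent a block of numbers by one code (hypothetical cutoffs)."""
    if age < 30:
        return 0
    if age < 60:
        return 1
    return 2

coded_strength = encode_ordinal("moderate")
coded_color = encode_nominal("blue")
coded_age = bin_age(45)
```

Coding "strong" as 3 and "less" as 1 does not mean that strong is three times less; the codes only record the ordering.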

Modeling

Data modeling is where the data mining software is used to generate results for various situations. A cluster analysis and visual exploration of the data are usually applied first. Depending upon the type of data, various models might then be applied. If the task is to group data, and the groups are given, discriminant analysis might be appropriate.

If the purpose is estimation, regression is appropriate if the data is continuous and logistic regression if not. Neural networks could be applied for both tasks. Decision trees are yet another tool to classify data.

Other modeling tools are available as well. The point of data mining software is to allow the user to work with the data to gain understanding. This is often fostered by the iterative use of multiple models. In some applications a third split of the data (a validation set) is used to estimate parameters from the data.

The principle is that if you build a model on a particular set of data, it will of course test quite well. By dividing the data and using part of it for model development, and testing it on a separate set of data, a more convincing test of model accuracy is obtained. This idea of splitting the data into components is often carried to additional levels in the practice of data mining. Further portions of the data can be used to refine the model.
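The idea of dividing the data can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def split_data(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle the records and cut them into training, validation, and test sets."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return train, valid, test

records = list(range(100))   # stand-in for 100 data rows
train, valid, test = split_data(records)
```

The model is built on `train`, refined on `valid`, and judged only on `test`, which it has never seen.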

This pattern occurs in 5. In Classification, the methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power.
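One common way to evaluate misclassification is to tabulate actual against predicted classes (a coincidence matrix). The sketch below uses made-up labels and is not tied to any particular classifier.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical true classes and model output for six records.
actual = ["good", "good", "bad", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "good", "bad"]

matrix = coincidence_matrix(actual, predicted)

# Pairs where actual == predicted lie on the diagonal of the matrix.
correct = sum(count for (a, p), count in matrix.items() if a == p)
accuracy = correct / len(actual)
```

The off-diagonal cells show which kind of error the model makes, which matters when the two kinds of error have different costs.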

Mathematical techniques that are often used to construct classification methods are binary decision trees, neural networks, linear programming, and statistics. By using binary decision trees, a tree induction model can be built. Models fit to data can be measured by either statistical estimation or information entropy. However, the classification obtained from tree induction may not produce an optimal solution where prediction power is limited.
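To make the information entropy measure concrete, the sketch below computes the entropy of a set of class labels and the reduction in entropy (information gain) from a candidate split. The labels and splits are hypothetical, and this is one standard way of scoring splits, not necessarily the one a given tool uses.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, split_groups):
    """Entropy of the parent set minus the size-weighted entropy of its subsets."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in split_groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "yes", "no", "no", "no"]
perfect_split = [["yes", "yes", "yes"], ["no", "no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"], ["yes", "no"]]

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
```

A split that separates the classes cleanly recovers the full entropy of the parent set, while a split that leaves every subset as mixed as the parent gains nothing.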

By using neural networks, a neural induction model can be built. In this approach, the attributes become the input layer of the neural network while the classes associated with the data become the output layer. Between the input layer and the output layer, there are a number of hidden layers that process the inputs to improve the accuracy of the classification.

In linear programming approaches, the classification problem is viewed as a special form of linear program. Given a set of classes and a set of attribute variables, one can define a cutoff limit or boundary separating the classes.

Then each class is represented by a group of constraints with respect to a boundary in the linear program. The objective function in the linear programming model can minimize the overlapping rate across classes and maximize the distance between classes. The linear programming approach results in an optimal classification.
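One simple two-class version of such a linear program, minimizing only the overlap, could be written as below. All symbols are assumptions introduced for illustration: $A_i$ is the attribute vector of record $i$, $X$ the attribute weights, $b$ the boundary, $\alpha_i$ the amount by which record $i$ falls on the wrong side of the boundary, and $G_1$, $G_2$ the two classes.

```latex
\begin{aligned}
\min_{X,\,b,\,\alpha}\quad & \sum_{i} \alpha_i \\
\text{subject to}\quad & A_i X \le b + \alpha_i, && A_i \in G_1, \\
& A_i X \ge b - \alpha_i, && A_i \in G_2, \\
& \alpha_i \ge 0 && \text{for all } i.
\end{aligned}
```

In practice a normalization constraint is also needed to rule out the trivial solution $X = 0$, $b = 0$, and a term rewarding distance from the boundary can be added to the objective.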

However, the computation time required may exceed that of statistical approaches. Various statistical methods, such as linear discriminant regression, quadratic discriminant regression, and logistic discriminant regression are very popular and are commonly used in real business classifications.

Even though statistical software has been developed to handle a large amount of data, statistical approaches have a disadvantage in efficiently separating multiclass problems in which a pair-wise comparison i.

Cluster analysis takes ungrouped data and uses automatic techniques to put this data into groups. Clustering is unsupervised, and does not require a learning set. It shares a common methodological ground with Classification.

In other words, most of the mathematical models mentioned earlier in regard to Classification can be applied to Cluster Analysis as well. Prediction analysis is related to regression techniques. The key idea of prediction analysis is to discover the relationship between the dependent and independent variables, and the relationship between the independent variables (one versus another, one versus the rest, and so on).

For example, if sales is an independent variable, then profit may be a dependent variable. By using historical data from both sales and profit, either linear or nonlinear regression techniques can produce a fitted regression curve that can be used for profit prediction in the future.
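The sales and profit example can be sketched with an ordinary least squares line. The figures are invented and deliberately lie on a straight line so the fit is easy to check by hand.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical historical data: profit is exactly 0.1 * sales + 2.
sales = [100, 200, 300, 400]
profit = [12, 22, 32, 42]

slope, intercept = fit_line(sales, profit)
predicted_profit = slope * 500 + intercept   # forecast for sales of 500
```

Real data would scatter around the line, and the quality of the fit would need to be checked before the curve is used for prediction.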

These patterns can be used by business analysts to identify relationships among data. The mathematical models behind Sequential Patterns are logic rules, fuzzy logic, and so on. As an extension of Sequential Patterns, Similar Time Sequences are applied to discover sequences similar to a known sequence over both past and current business periods.

In the data mining stage, several similar sequences can be studied to identify future trends in transaction development. This approach is useful in dealing with databases that have time-series characteristics.

Evaluation

The data interpretation stage is critical. It assimilates knowledge from mined data. Two issues are essential. One is how to recognize the business value from knowledge patterns discovered in the data mining stage. Another issue is which visualization tool should be used to show the data mining results.

This operation depends on the interaction between data analysts, business analysts, and decision makers such as managers or CEOs. Because data analysts may not be fully aware of the purpose of the data mining goal or objective, and business analysts may not understand the results of sophisticated mathematical solutions, interaction between them is necessary.

Many visualization packages and tools are available, including pie charts, histograms, box plots, scatter plots, and distributions.


Good interpretation leads to productive business decisions, while poor interpretation may miss useful information. Normally, the simpler the graphical interpretation, the easier it is for end users to understand.

Deployment

The results of the data mining study need to be reported back to project sponsors. The data mining study has uncovered new knowledge, which needs to be tied to the original data mining project goals.

Management will then be in a position to apply this new understanding of their business environment. It is important that the knowledge gained from a particular data mining study be monitored for change. Customer behavior changes over time, and what was true during the period when the data was collected may have already changed.

SEMMA

In order to be applied successfully, the data mining solution must be viewed as a process rather than a set of tools or techniques.

By assessing the outcome of each stage in the SEMMA process, one can determine how to model new questions raised by the previous results, and thus proceed back to the exploration phase for additional refinement of the data. For optimal cost and computational performance, some (including the SAS Institute) advocate a sampling strategy, which applies a reliable, statistically representative sample of the full detail data. In the case of very large datasets, mining a representative sample instead of the whole volume may drastically reduce the processing time required to get crucial business information.

If general patterns appear in the data as a whole, these will be traceable in a representative sample. If a niche (a rare pattern) is so tiny that it is not represented in a sample and yet so important that it influences the big picture, it should be discovered using exploratory data description methods. It is also advised to create partitioned data sets for better accuracy assessment.
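One way to keep a sample representative is to sample each known group in proportion to its size (stratified sampling). The sketch below assumes hypothetical customer records with a `segment` field; the 10% fraction and the function name are likewise illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group so small groups are not lost."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))   # keep at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical file: 900 retail customers and 100 corporate customers.
customers = [
    {"id": i, "segment": "retail" if i % 10 else "corporate"}
    for i in range(1000)
]
sample = stratified_sample(customers, key=lambda c: c["segment"], fraction=0.1)
corporate_in_sample = sum(1 for c in sample if c["segment"] == "corporate")
```

Stratifying only protects groups that are already known; a niche nobody has labeled can still be missed, which is why exploratory description of the full data remains useful.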

Step 2 (Explore): This is where the user searches for unanticipated trends and anomalies in order to gain a better understanding of the data set.

After sampling your data, the next step is to explore them visually or numerically for inherent trends or groupings.

Exploration helps refine and redirect the discovery process. If visual exploration does not reveal clear trends, one can explore the data through statistical techniques including factor analysis, correspondence analysis, and clustering. For example, in data mining for a direct mail campaign, clustering might reveal groups of customers with distinct ordering patterns. Limiting the discovery process to each of these distinct groups individually may increase the likelihood of exploring richer patterns that may not be strong enough to be detected if the whole dataset is to be processed together.
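As a rough illustration of clustering during exploration, the sketch below runs a basic k-means procedure on one variable. The order values and starting centroids are invented, and k-means is only one of many clustering methods.

```python
def k_means_1d(values, centroids, iterations=20):
    """Basic k-means on a single numeric variable."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (unchanged if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical average order values for eight customers.
order_values = [12, 15, 14, 13, 95, 102, 99, 98]
centroids, clusters = k_means_1d(order_values, centroids=[0.0, 50.0])
```

Here the procedure separates small-order customers from large-order customers, and each group could then be mined on its own.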


Step 3 (Modify): This is where the user creates, selects, and transforms the variables upon which to focus the model construction process. Based on the discoveries in the exploration phase, one may need to manipulate data to include information such as the grouping of customers and significant subgroups, or to introduce new variables. It may also be necessary to look for outliers and reduce the number of variables, to narrow them down to the most significant ones. Because data mining is a dynamic, iterative process, you can update data mining methods or models when new information is available.
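Looking for outliers can be sketched with the interquartile range rule. The 1.5 multiplier is a common convention, not something the text prescribes, and the order sizes are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # first, second, and third quartiles
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

order_sizes = [10, 12, 11, 13, 12, 14, 11, 250]   # hypothetical data
flagged = iqr_outliers(order_sizes)
```

A flagged value is a candidate for review, not automatically an error; a very large order may be exactly the pattern of interest.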

Step 4 (Model): This is where the user searches for a variable combination that reliably predicts a desired outcome. Once you prepare your data, you are ready to construct models that explain patterns in the data. Modeling techniques in data mining include artificial neural networks, decision trees, rough set analysis, support vector machines, logistic models, and other statistical models such as time series analysis, memory-based reasoning, and principal component analysis.

Each type of model has particular strengths, and is appropriate within specific data mining situations depending on the data. For example, artificial neural networks are very good at fitting highly complex nonlinear relationships, while rough set analysis is known to produce reliable results in uncertain and imprecise problem situations.

Step 5 (Assess): This is where the user evaluates the usefulness and the reliability of findings from the data mining process.

In this final step of the data mining process, the user assesses the models to estimate how well they perform. A common means of assessing a model is to apply it to a portion of the data set that was put aside during the sampling stage and not used during model building.

If the model is valid, it should work for this reserved sample as well as for the sample used to construct the model. Similarly, you can test the model against known data. For example, if you know which customers in a file had high retention rates and your model predicts retention, you can check the model's predictions against those known outcomes. In addition, practical applications of the model, such as partial mailings in a direct mail campaign, help prove its validity.

Fig. Poll results: data mining methodology (conducted by KDNuggets).

The data mining web-site KDNuggets provided the data shown in Fig. Both aid the knowledge discovery process. Once models are obtained and tested, they can then be deployed to gain value with respect to business or research application.
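The assessment on a reserved sample described above can be sketched as follows. The retention data, the `months_active` attribute, and the single-threshold model are all hypothetical stand-ins for a real model.

```python
def accuracy(rows, threshold):
    """Share of rows where 'months_active >= threshold' matches actual retention."""
    hits = sum((months >= threshold) == retained for months, retained in rows)
    return hits / len(rows)

def fit_threshold(rows):
    """Pick the cutoff that classifies the model-building rows best."""
    candidates = sorted({months for months, _ in rows})
    return max(candidates, key=lambda t: accuracy(rows, t))

# (months_active, retained) pairs; the holdout rows are never used for fitting.
build = [(2, False), (5, False), (9, True), (14, True), (7, False), (11, True)]
holdout = [(3, False), (12, True), (8, True), (4, False)]

threshold = fit_threshold(build)
build_accuracy = accuracy(build, threshold)
holdout_accuracy = accuracy(holdout, threshold)
```

In this made-up case the model is perfect on the data it was built from but not on the reserved sample, which is the more honest estimate of how it will perform.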

Example Data Mining Process Application

Nayak and Qiu demonstrated the data mining process in an Australian software development project. The project owner was an international telecommunication company which undertook over 50 software projects annually. Nayak and Qiu were interested in mining the company's problem reports. All problem reports were collected throughout the company (over 40, reports).

(Nayak, Tian Qiu. A data mining application: Analysis of problems occurring during a software project development process, International Journal of Software Engineering.)

For each report, the available data included the items shown in Table 2.

Goal Definition

Data mining was expected to be useful in two areas. The first involved the early estimation and planning stage of a software project, in which company engineers have to estimate the number of lines of code, the kind of documents to be delivered, and estimated times. Accuracy at this stage would vastly improve project selection decisions. Little tool support was available for these activities, and estimates of these three attributes were based on experience supported by statistics on past projects.

Thus projects involving new types of work were difficult to estimate with confidence. The second area of data mining application concerned the data collection system, which had limited information retrieval capability. Data was stored in flat files, and it was difficult to gather information related to specific issues.

Book Concept

Our intent is to cover the fundamental concepts of data mining, to demonstrate the potential of gathering large sets of data, and analyzing these data sets to gain useful business understanding.

We have organized the material into three parts. Part I introduces concepts. Part II contains chapters on a number of different techniques often used in data mining. Part III focuses on business applications of data mining. Not all of these chapters need to be covered, and their sequence could be varied at the instructor's discretion. The book will include short vignettes of how specific concepts have been applied in real practice. A series of representative data sets will be generated to demonstrate specific methods and concepts.

References to data mining software and sites such as www.

Part I: Introduction

Chapter 1 gives an overview of data mining, and provides a description of the data mining process. An overview of useful business applications is provided.

In fact, analyst judgment is critical to successful implementation of data mining.

Coincidence matrices provide a means of focusing on what kinds of errors particular models tend to make.