Oracle Data Mining - AbsoluteAstronomy.com

Oracle Data Mining (ODM) is an option of Oracle Corporation's Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Overview

Oracle Data Mining implements a variety of data mining algorithms inside the Oracle relational database. These implementations are integrated right into the Oracle database kernel, and operate natively on data stored in the relational database tables. This eliminates the need for extraction or transfer of data into standalone mining/analytic servers. The relational database platform is leveraged to securely manage models and efficiently execute SQL queries on large volumes of data. The system is organized around a few generic operations providing a general unified interface for data mining functions. These operations include functions to create, apply, test, and manipulate data mining models. Models are created and stored as database objects, and their management is done within the database, similar to tables, views, indexes and other database objects.

In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data is moved from relational tables into the analytical workbench - most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored right in the database. This way, the user/application developer can leverage the full power of Oracle SQL - in terms of the ability to pipeline and manipulate the results over several levels, and in terms of parallelizing and partitioning data access for performance.
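The split between building a model and scoring new data can be illustrated outside the database. The following is a minimal sketch in Python, not ODM code: it assumes a tiny categorical Naive Bayes classifier with Laplace smoothing, and the names `build_model`, `score`, and the sample columns are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, str]


def build_model(rows: List[Row], target: str):
    """Count class and per-attribute value frequencies from training rows."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    domains = defaultdict(set)           # attribute -> values seen in training
    for r in rows:
        for attr, val in r.items():
            if attr == target:
                continue
            value_counts[(attr, r[target])][val] += 1
            domains[attr].add(val)
    return class_counts, value_counts, domains


def score(model, row: Row) -> str:
    """Return the most probable class for one new row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total  # class prior
        for attr, val in row.items():
            if attr not in domains:
                continue  # assumption: ignore attributes unseen in training
            # Laplace-smoothed estimate of P(value | class)
            p *= (value_counts[(attr, label)][val] + 1) / (n + len(domains[attr]))
        if p > best_p:
            best_label, best_p = label, p
    return best_label


# Hypothetical training data: the target column is "credit_risk".
training = [
    {"income": "high", "owns_home": "yes", "credit_risk": "LOW"},
    {"income": "high", "owns_home": "no", "credit_risk": "LOW"},
    {"income": "low", "owns_home": "no", "credit_risk": "HIGH"},
    {"income": "low", "owns_home": "yes", "credit_risk": "HIGH"},
    {"income": "high", "owns_home": "yes", "credit_risk": "LOW"},
]
model = build_model(training, "credit_risk")

# Scoring: apply the stored model to rows whose target is unknown.
new_rows = [
    {"income": "high", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
predictions = [score(model, r) for r in new_rows]
```

In this sketch the model is an in-memory object and the new rows must be handed to it; the point of the in-database approach described above is that the model and the rows to be scored already live in the same place.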

Models can be created and managed by one of several means. Oracle Data Miner is a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application and tools developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adaptor interface. ODM offers a choice of well known machine learning approaches such as Decision Trees, Naive Bayes, Support vector machines, Generalized linear model (GLM) for predictive mining, Association rules, K-means and Orthogonal Partitioning Clustering, and Non-negative matrix factorization for descriptive mining. A minimum description length based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow text mining by accepting Text (unstructured data) attributes as input.

History

Oracle Data Mining was first introduced in 2002 and its releases are named according to the corresponding Oracle database release:

Oracle Data Mining 9iR2 (9.2.0.1.0 - May 2002)
Oracle Data Mining 10gR1 (10.1.0.2.0 - February 2004)
Oracle Data Mining 10gR2 (10.2.0.1.0 - July 2005)
Oracle Data Mining 11gR1 (11.1 - September 2007)
Oracle Data Mining 11gR2 (11.2 - September 2009)

Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by Thinking Machines Corporation in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from the ground up: while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development/deployment platform integrated into the Oracle database, along with the GUI.

Road Map - Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle Open World 2009. See the ODM Blog entry "Get Ready for Oracle Data Miner 11gR2 New Workflow GUI" for more information: http://blogs.oracle.com/datamining/2010/02/get_ready_for_the_new_oracle_data_miner_11gr2_gui_1.html

Functionality

As of release 11gR1, Oracle Data Mining contains the following data mining functions:

Data transformation and model analysis:
- Data sampling, binning, discretization, and other data transformations.
- Model exploration, evaluation and analysis.

Feature selection (Attribute Importance):
- Minimum description length (MDL).

Classification:
- Naive Bayes (NB).
- Generalized linear model (GLM) for Logistic regression.
- Support Vector Machine (SVM).
- Decision Trees (DT).

Anomaly detection:
- One-class Support Vector Machine (SVM).

Regression:
- Support Vector Machine (SVM).
- Generalized linear model (GLM) for Multiple regression.

Clustering:
- Enhanced k-means (EKM).
- Orthogonal Partitioning Clustering (O-Cluster).

Association rule learning:
- Itemsets and association rules (AM).

Feature extraction:
- Non-negative matrix factorization (NMF).

Text and spatial mining:
- Combined text and non-text columns of input data.
- Spatial/GIS data.
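
As one illustration of the functions listed above, association rules are conventionally graded by two measures: support (how often an itemset occurs) and confidence (how often the consequent occurs when the antecedent does). The following is a minimal Python sketch of those two measures under that standard definition; it is not ODM's implementation, and the function names and sample baskets are hypothetical.

```python
from typing import List, Set


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Set[str],
               consequent: Set[str]) -> float:
    """Support of antecedent and consequent together, divided by support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = support(transactions, antecedent | consequent)
    return both / support(transactions, antecedent)


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule under test: bread -> milk
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here two of the four baskets contain both bread and milk, and two of the three baskets containing bread also contain milk.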

Input sources and data preparation

Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a star schema). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data.

Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building such as outlier treatment, discretization, normalization and binning (sorting, in general terms).
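
Two of the preparation steps named above, normalization and binning, can be sketched in a few lines. The Python below is an illustration only, assuming min-max normalization and equal-width bins (the source does not say which variants the product's utilities use); the function names and the sample column are hypothetical.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Assign each value a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numerical attribute.
ages = [18.0, 25.0, 40.0, 62.0, 78.0]
normalized = min_max_normalize(ages)
binned = equal_width_bins(ages, 3)
```

With three bins over the range 18 to 78, each bin is 20 units wide, so 18 and 25 share the first bin, 40 falls in the second, and 62 and 78 fall in the third.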

Graphical user interface: Oracle Data Miner

Oracle Data Mining can be accessed using Oracle Data Miner, a GUI "client" that provides access to the data mining functions and structured templates called Mining Activities that automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of Java and/or SQL code associated with the data mining activities. The Java Code Generator is an extension to Oracle JDeveloper. There is also an independent interface: the Spreadsheet Add-In for Predictive Analytics, which enables access to the Oracle Data Mining Predictive Analytics PL/SQL package from Microsoft Excel.

PL/SQL and Java interfaces

Oracle Data Mining provides a native PL/SQL package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model:



BEGIN
  -- Build a classification model from the training table.
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'credit_risk_model',
    mining_function     => DBMS_DATA_MINING.classification,
    data_table_name     => 'credit_card_data',
    case_id_column_name => 'customer_id',
    target_column_name  => 'credit_risk',
    settings_table_name => 'credit_risk_model_settings');
END;

where 'credit_risk_model' is the model name, built for the express purpose of classifying future customers' 'credit_risk', based on training data provided in the table 'credit_card_data', each case distinguished by a unique 'customer_id', with the rest of the model parameters specified through the table 'credit_risk_model_settings'.

Oracle Data Mining also supports a Java API consistent with the Java Data Mining (JDM) standard for data mining (JSR-73) for enabling integration with web and Java EE applications and to facilitate portability across platforms.

SQL scoring functions

As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model:



SELECT customer_name
  FROM credit_card_data
 WHERE PREDICTION (credit_risk_model USING *) = 'LOW' AND customer_value = 'HIGH';

PMML

In Release 11gR2 (11.2.0.2), ODM supports the import of externally-created PMML for some of the data mining models. PMML is an XML-based standard for representing data mining models.

Predictive Analytics MS Excel Add-In

The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process including data preprocessing, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values (classification or regression), while EXPLAIN ranks attributes in order of influence in explaining a target column (feature selection). The new 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results or displayed for interpretation by end users.

External links

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
