In this two-day course, you’ll examine additional topics to aid in the preparation of data for a successful data mining project. You’ll learn how to partition records from files, handle missing data, modify fields and create new fields, and work with dates, strings and sequence data.

Course Outline

This follow-up course to Introduction to IBM SPSS Modeler and Data Mining is designed for anyone who wishes to become familiar with the full range of techniques available in IBM SPSS Modeler for data and file manipulation.

Pre-requisites

  • General computer literacy
  • Some experience using IBM SPSS Modeler including familiarity with the IBM SPSS environment, creating streams, reading data files, and doing simple data exploration and manipulation
  • Prior completion of Introduction to IBM SPSS Modeler and Data Mining is strongly encouraged

High-level Curriculum

Lesson 1: Introduction to Data Preparation

  • The Process of Data Mining
  • File Format for Analysis
  • Unit of Analysis for Modeling
  • Matching the Data to the Modeling Tool

Lesson 2: Sampling Data

  • Sample Node
  • Types of Sample
  • Simple Sampling
  • Complex Sampling

Lesson 3: Working with Dates

  • Reading data which includes dates
  • Calculations involving dates
  • Applying the same expression to multiple fields

Lesson 4: Working with String Data

  • Manipulating String Data
  • Example of String Manipulation

Lesson 5: Data Transformations

  • Using Summary Statistics with the Set Globals Node
  • Transforming Continuous Fields
  • Binning Fields

Lesson 6: Working with Sequence Data

  • Sequence Functions
  • Count and State Forms of the Derive Node
  • Restructuring Sequence Data Using the History Node

Lesson 7: Exporting Data Files

  • Using a Data File or Stream in Modeling
  • Types of Exported Files
  • Exporting Flat Files
  • Exporting to Databases

Lesson 8: Efficiency within Modeler

  • SQL Pushback
  • SQL Optimization
  • Node Order
  • Using Samples of Data
  • Maximum Set Size
  • Performance in Specific Nodes