Sampling Your Data

Another useful feature of SPSS Modeler is its built-in ability to sample data. It is typical to have hundreds of thousands of records to process, spread across one or more files. Running tests against the complete data set wastes your own time and is inefficient in terms of processing time and memory.

TM1

From a TM1 perspective, think of sampling as “the selection of a subset of records from within a statistical population (a file)”. Keep in mind that the objective isn’t simply to reduce the size of the file, but to create a smaller version of the file that is still representative of the characteristics of the whole population (the whole file).

Sampling Methods

SPSS Modeler offers two sampling methods:

Simple:

  • Select the first n records in the file
  • Select every nth record (where you specify n)
  • Select a random sample of r % of the records
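
As a rough illustration of what the three simple options do, here is a minimal Python sketch. The function names (first_n, one_in_n, random_pct) are hypothetical and are not part of SPSS Modeler. The sketch also assumes that a random % sample is drawn by deciding independently for each record, so the resulting sample size is only approximately r % of the file.

```python
import random

def first_n(records, n):
    """Keep only the first n records."""
    return records[:n]

def one_in_n(records, n):
    """Keep every nth record (the nth, the 2nth, the 3nth, ...)."""
    return records[n - 1::n]

def random_pct(records, pct, rng=None):
    """Keep each record independently with probability pct / 100.

    Assumption of this sketch: one independent draw per record, so the
    sample size is approximate rather than exact.
    """
    rng = rng or random.Random()
    return [r for r in records if rng.random() * 100 < pct]

# Stand-in for a file of 100 records, numbered 1 to 100.
records = list(range(1, 101))

head = first_n(records, 5)
every_tenth = one_in_n(records, 10)
roughly_a_fifth = random_pct(records, 20)
```

Note that the first two options are deterministic, so they are only representative if the file is not sorted in a way that correlates with what you are testing.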

Complex:

  • This option enables finer control of the sample, including clustered, stratified, and weighted samples.
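
To make one of the complex options concrete, the following is a minimal sketch of stratified sampling in Python: the records are grouped by a field, and the same fraction is drawn from each group so that small groups are not lost. The function name stratified_sample, the "region" field, and the rule of keeping at least one record per group are all assumptions made for illustration. This is not how SPSS Modeler implements it.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw the same fraction from every stratum (group) of records."""
    rng = random.Random(seed)

    # Group the records by the stratification field.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for group in strata.values():
        # Assumption: keep at least one record from every stratum.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 80 "East" records and 20 "West" records.
records = (
    [{"region": "East", "amount": i} for i in range(80)]
    + [{"region": "West", "amount": i} for i in range(20)]
)

sample = stratified_sample(records, key=lambda r: r["region"], fraction=0.1, seed=42)
```

With a 10 % fraction, the sample keeps the 80/20 split of the full file, which is the "representative of the whole population" property described earlier.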

Sampling in TM1

For the TM1 developer, nothing exists “out of the box” to create samples of data. Your only option (as usual) is “building it yourself” with TurboIntegrator (TI) scripts, perhaps using functions such as “ItemSkip” and “Rand”. (To be fair, sampling is not something that TM1 is “built for”.)
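
The build-it-yourself approach boils down to deciding, record by record, whether to skip the current record based on a random draw. The sketch below shows that logic in Python rather than TI, purely as an illustration. The names sample_rows and make_source and the two-column data are hypothetical, and the `continue` statement only plays a role analogous to skipping a record in a TI process.

```python
import csv
import io
import random

def sample_rows(lines, rate, seed=None):
    """Yield each CSV row with probability `rate`, one record at a time."""
    rng = random.Random(seed)
    for row in csv.reader(lines):
        if rng.random() >= rate:
            continue  # skip this record and move on to the next one
        yield row

def make_source():
    """Hypothetical stand-in for a data file: 1000 two-column records."""
    return io.StringIO("".join(f"{i},{i * 10}\n" for i in range(1000)))

# Keep roughly 5 % of the records without loading the whole file first.
kept = list(sample_rows(make_source(), rate=0.05, seed=1))
```

Because the rows are processed as a stream, the full file never has to fit in memory, which matters for the file sizes mentioned at the start of this post.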

The Sample Node

SPSS Modeler, by contrast, features the Sample node (found in the Record Ops palette), which offers various methods to sample records without any programming or scripting.

The procedure to sample records is:

  1. Place a Sample node in your stream.
  2. Edit the Sample node to set the options for sampling.

Sample node and the Settings Tab

Using the Settings tab in the Sample node, you can easily set the options for your data sampling:

  1. Select the Sampling method (Simple or Complex).
  2. The Mode allows you to either select records (include sample) or eliminate records (discard sample). If you select “Include sample”, you can also set the maximum size of the sample.
  3. The Sample option offers the three simple sampling methods listed above (first n records, every nth record, or a random %). When Random % sampling is requested, you can specify a random seed value so that the sample can be replicated in the future. (If no random seed is specified, a different random sample is drawn each time the Sample node is run with Random % selected, making it impossible to replicate earlier analyses.)

The Generate button, when clicked, will generate a new random seed value.
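
The effect of these settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function sample_node and its parameters (pct, mode, max_size, seed) are hypothetical names that mirror the options above, and cutting the sample off at max_size is an assumed behaviour, not a description of how the Sample node works internally.

```python
import random

def sample_node(records, pct, mode="include", max_size=None, seed=None):
    """Random % sampling with an include/discard mode and an optional seed."""
    rng = random.Random(seed)

    # One draw per record; with a fixed seed the draws repeat exactly.
    picked = [rng.random() * 100 < pct for _ in records]

    if mode == "include":
        out = [r for r, p in zip(records, picked) if p]
        # Assumption: a maximum size simply cuts the sample off.
        return out if max_size is None else out[:max_size]
    if mode == "discard":
        return [r for r, p in zip(records, picked) if not p]
    raise ValueError("mode must be 'include' or 'discard'")

records = list(range(1000))

# Same seed, same sample: the analysis can be replicated later.
run_1 = sample_node(records, pct=10, seed=12345)
run_2 = sample_node(records, pct=10, seed=12345)

# With the same seed, "discard" returns exactly the records "include" left out.
rest = sample_node(records, pct=10, mode="discard", seed=12345)
```

Recording the seed alongside your test results is what makes a sampled run repeatable.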

Conclusion

Cognos TM1 is a tool that encourages rapid prototyping. Dimensions and cubes can be snapped together and presented to subject matter experts for review. For best results, load realistic data samples into prototyped models. Using “made up” data, or entire sets of actual data, can be ineffectual. A realistic sample, drawn from the actual data files, increases the probability that “what you show” is a realistic representation of “what you will ultimately deliver”.

Clearly, SPSS Modeler handles sampling very well.

About the Author

Mr. Miller is an IBM certified and accomplished Senior Project Leader and Application/System Architect-Developer with over 30 years of extensive applications and system design and development experience. His current role is National FPM Practice Leader. His experience includes BI, Web architecture & design, systems analysis, GUI design and testing, Database modeling and systems analysis, design, and development of Client/Server, Web and Mainframe applications and systems utilizing: Applix TM1 (including TM1 rules, TI, TM1Web and Planning Manager), dynaSight - ArcPlan, ASP, DHTML, XML, IIS, MS Visual Basic and VBA, Visual Studio, PERL, Websuite, MS SQL Server, ORACLE, SYBASE SQL Server, etc. His Responsibilities have included all aspects of Windows and SQL solution development and design including: analysis; GUI (and Web site) design; data modeling; table, screen/form and script development; SQL (and remote stored procedures and triggers) development and testing; test preparation and management and training of programming staff. Other experience includes development of ETL infrastructure such as data transfer automation between mainframe (DB2, Lawson, Great Plains, etc.) systems and client/server SQL server and Web based applications and integration of enterprise applications and data sources. In addition, Mr. Miller has acted as Internet Applications Development Manager responsible for the design, development, QA and delivery of multiple Web Sites including online trading applications, warehouse process control and scheduling systems and administrative and control applications. Mr. Miller also was responsible for the design, development and administration of a Web based financial reporting system for a 450 million dollar organization, reporting directly to the CFO and his executive team. Mr. 
Miller has also been responsible for managing and directing multiple resources in various management roles including project and team leader, lead developer and applications development director. Specialties Include: Cognos/TM1 Design and Development, Cognos Planning, IBM SPSS and Modeler, OLAP, Visual Basic, SQL Server, Forecasting and Planning; International Application Development, Business Intelligence, Project Development. IBM Certified Developer - Cognos TM1 (perfect score 100% on exam) IBM Certified Business Analyst - Cognos TM1
