Interoperability and PMML

If you work within the rapidly expanding analytics space, you will need to think about defining and sharing statistical models between applications. PMML (Predictive Model Markup Language) is an XML-based language developed by the Data Mining Group (DMG) for exactly this purpose. I’d like to pass on some of the essentials:

The Basics

PMML provides a vendor-independent method for defining your models, eliminating the proprietary incompatibilities that otherwise arise when exchanging models between applications. It allows you to develop a model within one vendor’s application, and use other vendors’ applications to visualize, analyze, and evaluate it.

Based On XML

Since PMML is based upon XML, the structure of your model is described using an XML Schema, which defines:

  • the elements and attributes that can appear in the document,
  • which elements are child elements, along with their order and number,
  • whether an element is empty or can include text, and
  • the data types and the default and fixed values for elements and attributes.

PMML Producers and Consumers

A tool or application is a producer if it generates valid PMML documents for at least one type of model; an application is a consumer if it accepts valid PMML documents for at least one type of model. IBM SPSS is an excellent example of a modeling tool that both produces and consumes models using PMML documents.

Components of PMML

A PMML document that defines a model will contain a header, a data dictionary, data transformations, one or more models, a mining schema, and targets:
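
To see how these pieces fit together, here is a minimal sketch of a complete PMML document. The field names, values, version, and namespace below are illustrative only, not taken from any particular tool:

    <PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
      <Header description="Example risk model">
        <Application name="ExampleModeler" version="1.0"/>
        <Timestamp>2014-01-01T00:00:00</Timestamp>
      </Header>
      <DataDictionary numberOfFields="2">
        <DataField name="age" optype="continuous" dataType="double"/>
        <DataField name="risk" optype="continuous" dataType="double"/>
      </DataDictionary>
      <RegressionModel modelName="RiskModel" functionName="regression">
        <MiningSchema>
          <MiningField name="age" usageType="active"/>
          <MiningField name="risk" usageType="predicted"/>
        </MiningSchema>
        <RegressionTable intercept="1.5">
          <NumericPredictor name="age" exponent="1" coefficient="0.02"/>
        </RegressionTable>
      </RegressionModel>
    </PMML>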

In the header you provide the model’s description, the application used to generate the model, and a timestamp recording when the model was created.
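
For example, a header might look like this (all names, versions, and dates are illustrative):

    <Header copyright="Copyright (c) 2014" description="Customer risk model">
      <Application name="ExampleModeler" version="1.0"/>
      <Timestamp>2014-03-15T10:30:00</Timestamp>
    </Header>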

In the data dictionary you define all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal. Depending on this definition, the appropriate value ranges are defined, as well as the data type (such as string or double).
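
For instance, a continuous field can carry a value range while a categorical field lists its allowed values (field names and ranges here are illustrative):

    <DataDictionary numberOfFields="2">
      <DataField name="age" optype="continuous" dataType="double">
        <Interval closure="closedClosed" leftMargin="0" rightMargin="120"/>
      </DataField>
      <DataField name="gender" optype="categorical" dataType="string">
        <Value value="male"/>
        <Value value="female"/>
      </DataField>
    </DataDictionary>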

Using data transformations you map the user data into a form that can be used by the mining model. PMML defines several kinds of data transformations (a sample follows the list below):

  • Normalization: map values to numbers; the input can be continuous or discrete.
  • Discretization: map continuous values to discrete values.
  • Value mapping: map discrete values to discrete values.
  • Functions: derive a value by applying a function to one or more parameters.
  • Aggregation: summarize or collect groups of values.
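
As a sketch, here is how the first two look when expressed as PMML derived fields, a normalization and a discretization (field names and breakpoints are illustrative):

    <DerivedField name="incomeNorm" optype="continuous" dataType="double">
      <NormContinuous field="income">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="100000" norm="1"/>
      </NormContinuous>
    </DerivedField>
    <DerivedField name="ageGroup" optype="categorical" dataType="string">
      <Discretize field="age">
        <DiscretizeBin binValue="young">
          <Interval closure="closedOpen" leftMargin="0" rightMargin="30"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="adult">
          <Interval closure="closedOpen" leftMargin="30" rightMargin="65"/>
        </DiscretizeBin>
      </Discretize>
    </DerivedField>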

The model element contains the definition of the actual data mining model in the PMML document, using attributes such as Model Name, Function Name, Algorithm Name, Activation Function, and Number of Layers.
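
For instance, the opening tag of a neural network model carries such attributes. This is a fragment with illustrative values; the child elements defining the inputs and layers are omitted:

    <NeuralNetwork modelName="ChurnNet" functionName="classification"
                   algorithmName="backprop" activationFunction="logistic"
                   numberOfLayers="2">
      <!-- MiningSchema, NeuralInputs, NeuralLayers, NeuralOutputs go here -->
    </NeuralNetwork>
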
The mining schema lists all fields used in the model. (This can also be a subset of the fields defined in the data dictionary.) It contains specific information about each field (a sample schema appears after the list), such as:

  • Name: must refer to a field in the data dictionary.
  • Usage type: defines the way a field is to be used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
  • Outlier treatment: defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
  • Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
  • Missing value treatment: indicates how the missing value replacement was derived (e.g., as value, mean, or median).
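
Putting those attributes together, a mining schema for the fields sketched earlier might read (all values are illustrative):

    <MiningSchema>
      <MiningField name="age" usageType="active"
                   outliers="asExtremeValues" lowValue="0" highValue="120"
                   missingValueReplacement="35" missingValueTreatment="asMean"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>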

Finally, the Targets element allows for the scaling of predicted variables. It is a straightforward way to represent post-processing of raw model outputs.
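
For example, a target can rescale a raw prediction; the factor is applied first, then the constant is added (values are illustrative):

    <Targets>
      <!-- scaled prediction = raw prediction * 100 + 10 -->
      <Target field="risk" rescaleFactor="100" rescaleConstant="10"/>
    </Targets>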

Validation of PMML

PMML document validation requires two steps. The first step is XSD validation, whose purpose is to ensure that the PMML is well-formed XML and adheres to the appropriate version of PMML’s XML Schema Definition (XSD).
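
As an illustration, here is one way to run XSD validation outside of any vendor tool, using Python’s lxml library. The file names are placeholders; the PMML XSDs themselves are published by the DMG at dmg.org:

    from lxml import etree

    # Parse the published PMML schema and the document to be checked
    # (both file names here are placeholders).
    schema = etree.XMLSchema(etree.parse("pmml-4-4.xsd"))
    doc = etree.parse("model.pmml")

    if schema.validate(doc):
        print("PMML is well-formed and schema-valid")
    else:
        for error in schema.error_log:
            print(error.message)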

The second step requires that the PMML elements, in combination, are understandable to a properly implemented model consumer. To accomplish this, a different XML technology is used: XSLT (Extensible Stylesheet Language Transformations). A set of rules covering particular requirements of the PMML specification is embodied in an XSLT document; applying those transformations to a particular PMML file with an XSLT processor yields another document that lists any rules that were violated.
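
The same library can act as the XSLT processor for this step; a minimal sketch, assuming a rule-checking stylesheet named rules.xsl (a hypothetical file name):

    from lxml import etree

    # Apply the rule-checking XSL transformations to a PMML document;
    # the resulting document lists any violated rules.
    transform = etree.XSLT(etree.parse("rules.xsl"))
    report = transform(etree.parse("model.pmml"))
    print(str(report))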

Producing PMML

PMML is easily exported from many statistical tools. As mentioned above, the top analytics companies support exporting and importing PMML files in their products. For example, in IBM SPSS Statistics you can export a model as PMML by choosing to export it as an XML file (PMML is XML-based) after selecting the appropriate model parameters.
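
Open-source routes exist as well. As a sketch, the sklearn2pmml package (which requires a Java runtime) can export a scikit-learn pipeline as a PMML document:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    # Train a simple classifier and export it as a PMML document.
    X, y = load_iris(return_X_y=True, as_frame=True)
    pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier())])
    pipeline.fit(X, y)
    sklearn2pmml(pipeline, "DecisionTree.pmml")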

Consuming PMML

PMML allows predictive solutions to be shared as soon as the model-building phase is completed. For example, a model built in IBM SPSS Statistics can instantly be moved to another tool for building visualizations. PMML also allows you to shield end users from the complexity associated with statistical tools and models.
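
On the consuming side, a scoring engine needs only the PMML file itself. A minimal sketch using the open-source pypmml package; the file and field names are assumptions matching the illustrative examples above:

    from pypmml import Model

    # Load a PMML document produced by any tool and score a single record.
    model = Model.fromFile("model.pmml")
    print(model.predict({"age": 42}))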

Conclusion

As the need for reliable predictive solutions increases, PMML will play a significant role in any enterprise development and deployment strategy. Understanding PMML is not complicated, but the sooner you embrace it, the better.
