If you work within the rapidly expanding analytics space, you will need to think about defining and sharing statistical models between applications. PMML (or Predictive Model Markup Language) is an XML-based language developed by the Data Mining Group (DMG) for this purpose. I’d like to pass on some of the essentials:
The Basics
PMML provides a vendor-independent method for defining your models, so that proprietary formats and incompatibilities no longer get in the way when you exchange models between applications. It allows you to develop a model within one vendor’s application and use other vendors’ applications to visualize, analyze and evaluate it.
Based On XML
Because PMML is based on XML, the structure of your model is described using an XML Schema. The schema defines which elements and attributes can appear in the document, which elements are child elements, the order and number of child elements, whether an element is empty or can include text, the data types of elements and attributes, and their default and fixed values.
PMML Producers and Consumers
A tool or application is a producer if it generates valid PMML documents for at least one type of model; an application is a consumer if it will accept valid PMML documents for at least one type of model. IBM SPSS is an excellent example of a modeling tool that both produces and consumes models using PMML documents.
Components of PMML
A PMML document that defines a model will contain a header, a data dictionary, data transformations, one or more models, a mining schema and targets:
In the header you provide the model’s description, application used to generate the model and a timestamp which can be used to specify the date of model creation.
In the data dictionary you define all the possible fields used by the model. It is in the data dictionary that a field is defined as continuous, categorical, or ordinal. Depending on this definition, the appropriate value ranges are then defined, as well as the data type (such as string or double).
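To make the header and data dictionary concrete, here is a minimal Python sketch that builds both with the standard library. The field names, the description and the application name (`ExampleTool`) are invented for illustration, and I am assuming the PMML 4.4 namespace; the result is a skeleton, not a complete model.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and version (PMML 4.4); adjust to what your tools expect.
PMML_NS = "http://www.dmg.org/PMML-4_4"


def build_skeleton():
    """Build a PMML root that holds only a Header and a DataDictionary."""
    root = ET.Element("PMML", {"xmlns": PMML_NS, "version": "4.4"})

    # Header: description, generating application and creation timestamp.
    header = ET.SubElement(root, "Header",
                           {"description": "Illustrative house price model"})
    ET.SubElement(header, "Application",
                  {"name": "ExampleTool", "version": "1.0"})  # hypothetical tool
    ET.SubElement(header, "Timestamp").text = "2024-01-01T00:00:00"

    # DataDictionary: every field declares an optype and a dataType.
    dictionary = ET.SubElement(root, "DataDictionary", {"numberOfFields": "3"})
    ET.SubElement(dictionary, "DataField",
                  {"name": "area", "optype": "continuous", "dataType": "double"})
    region = ET.SubElement(dictionary, "DataField",
                           {"name": "region", "optype": "categorical",
                            "dataType": "string"})
    for allowed in ("north", "south"):
        # Value elements list the allowed categories of a categorical field.
        ET.SubElement(region, "Value", {"value": allowed})
    ET.SubElement(dictionary, "DataField",
                  {"name": "price", "optype": "continuous", "dataType": "double"})
    return root


skeleton = build_skeleton()
pmml_text = ET.tostring(skeleton, encoding="unicode")
print(pmml_text)
```

A real producer would go on to add the transformations and at least one model element.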
Using data transformations you map the user data into a form that can be used by the mining model. PMML defines several kinds of data transformations.
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions: derive a value by applying a function to one or more parameters.
- Aggregation: used to summarize or collect groups of values.
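The transformation kinds listed above are easier to picture as plain functions. The sketch below shows roughly what four of them compute; the function names and data shapes are my own and are not part of PMML, and real PMML adds options (such as handling of values outside the defined range) that I leave out here.

```python
def norm_continuous(x, points):
    """Normalization: piecewise-linear map through sorted (orig, norm) pairs."""
    for (o1, n1), (o2, n2) in zip(points, points[1:]):
        if o1 <= x <= o2:
            return n1 + (x - o1) * (n2 - n1) / (o2 - o1)
    raise ValueError("x lies outside the defined range")


def discretize(x, bins, default=None):
    """Discretization: bins is a list of (low, high, label) for [low, high)."""
    for low, high, label in bins:
        if low <= x < high:
            return label
    return default


def map_value(value, table, default=None):
    """Value mapping: look a discrete value up in a table of discrete values."""
    return table.get(value, default)


def aggregate_sum(rows, group_key, value_key):
    """Aggregation: total one field for each distinct value of another field."""
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    return totals


print(norm_continuous(50, [(0, 0.0), (100, 1.0)]))
print(discretize(35, [(0, 18, "minor"), (18, 65, "adult")]))
print(map_value("m", {"m": "male", "f": "female"}))
print(aggregate_sum([{"c": "a", "amt": 2}, {"c": "a", "amt": 3}], "c", "amt"))
```

The "Functions" kind is simply any derived value computed from one or more fields, so it needs no helper of its own.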
The model area is the definition of the actual data mining model in the PMML document. It is described using attributes such as Model Name, Function Name and Algorithm Name and, for neural networks, Activation Function and Number of Layers.
The Mining Schema lists all fields used in the model. (This can also be a subset of the fields defined in the data dictionary.) It contains specific information about each field, such as:
- Name: must refer to a field in the data dictionary.
- Usage type: defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier Treatment: defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
- Missing Value Replacement Policy: if this attribute is specified, then a missing value is automatically replaced by the given value.
- Missing Value Treatment: indicates how the missing value replacement was derived (e.g. as value, mean or median).
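Here is a rough Python sketch of how a consumer might apply a mining schema to one input record. The attribute names (`usageType`, `outliers`, `lowValue`, `highValue`, `missingValueReplacement`, `missingValueTreatment`) are the ones PMML uses; the field names and values are invented, and `prepare_input` is a hypothetical helper that only handles numeric fields.

```python
import xml.etree.ElementTree as ET

# Illustrative schema; the field names and numbers are made up.
MINING_SCHEMA = """
<MiningSchema>
  <MiningField name="area" usageType="active"
               outliers="asExtremeValues" lowValue="20" highValue="500"
               missingValueReplacement="120" missingValueTreatment="asMean"/>
  <MiningField name="price" usageType="predicted"/>
</MiningSchema>
"""


def prepare_input(record, schema_xml):
    """Apply outlier treatment, then missing value replacement, per field."""
    prepared = {}
    for field in ET.fromstring(schema_xml).findall("MiningField"):
        if field.get("usageType", "active") != "active":
            continue  # predicted and supplementary fields are not inputs
        name = field.get("name")
        value = record.get(name)
        outliers = field.get("outliers", "asIs")
        low, high = field.get("lowValue"), field.get("highValue")
        if value is not None and outliers != "asIs":
            too_low = low is not None and value < float(low)
            too_high = high is not None and value > float(high)
            if outliers == "asMissingValues" and (too_low or too_high):
                value = None
            elif outliers == "asExtremeValues":
                if too_low:
                    value = float(low)
                elif too_high:
                    value = float(high)
        replacement = field.get("missingValueReplacement")
        if value is None and replacement is not None:
            value = float(replacement)
        prepared[name] = value
    return prepared


print(prepare_input({"area": 900.0}, MINING_SCHEMA))  # clipped to highValue
print(prepare_input({}, MINING_SCHEMA))               # replacement applied
```

Note that `missingValueTreatment` is only read as documentation here: it records how the replacement value was derived and does not change the computation.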
Finally, the Targets element allows for the scaling of predicted variables. It is a straightforward way to represent post-processing of raw outputs.
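As an illustration of that post-processing, the sketch below mimics the Target attributes `min`, `max`, `rescaleFactor`, `rescaleConstant` and `castInteger`. The order of operations (clamp first, then rescale, then cast) is how I read the specification, so treat it as an assumption and check it against the PMML version you target. `apply_target` is a hypothetical helper, not part of any library.

```python
import math


def apply_target(raw, minimum=None, maximum=None,
                 rescale_factor=1.0, rescale_constant=0.0, cast_integer=None):
    """Post-process a raw prediction in the assumed order: clamp, rescale, cast."""
    value = raw
    if minimum is not None:
        value = max(value, minimum)
    if maximum is not None:
        value = min(value, maximum)
    # Multiply by the factor first, then add the constant.
    value = value * rescale_factor + rescale_constant
    if cast_integer == "round":
        value = math.floor(value + 0.5)  # round half up (my assumption)
    elif cast_integer == "ceiling":
        value = math.ceil(value)
    elif cast_integer == "floor":
        value = math.floor(value)
    return value


# 12.0 is clamped to 10.0, doubled to 20.0, then shifted to 21.0.
print(apply_target(12.0, minimum=0.0, maximum=10.0,
                   rescale_factor=2.0, rescale_constant=1.0))
```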
Validation of PMML
PMML document validation requires two steps. The first step is XSD validation. Its purpose is to ensure that the PMML is well-formed XML and adheres to the appropriate version of PMML’s XML Schema Definition (XSD).
The second step checks that the PMML elements, taken in combination, are understandable to a properly implemented model consumer. To accomplish this, a different XML technology is used: XSLT (Extensible Stylesheet Language Transformations). A set of rules covering particular requirements of the PMML specification is embodied in an XSL transformation document and applied to a particular PMML document using an XSLT processor. The result is another document that lists any rules that were violated.
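To give a feel for the two steps, here is a simplified Python stand-in. It is not XSD or XSLT: the first check only confirms the document is well-formed XML (full schema validation needs a schema-aware library, which the standard library does not provide), and the second applies a single cross-element rule, namely that every mining field must name a field in the data dictionary. The function names and sample documents are my own.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a namespace prefix such as '{http://www.dmg.org/PMML-4_4}'."""
    return tag.rsplit("}", 1)[-1]


def check_pmml(text):
    """Return a list of violation messages; an empty list means none fired."""
    try:
        root = ET.fromstring(text)  # step 1 (simplified): well-formedness only
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    # Step 2 (simplified): one rule that spans several elements.
    declared = {el.get("name") for el in root.iter()
                if local(el.tag) == "DataField"}
    violations = []
    for el in root.iter():
        if local(el.tag) == "MiningField" and el.get("name") not in declared:
            violations.append(
                f"MiningField '{el.get('name')}' is not in the DataDictionary")
    return violations


GOOD = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="area" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema><MiningField name="area"/></MiningSchema>
  </RegressionModel>
</PMML>"""
BAD = GOOD.replace('<MiningField name="area"/>', '<MiningField name="size"/>')

print(check_pmml(GOOD))
print(check_pmml(BAD))
```

Like the XSLT approach, the output is a report of violated rules rather than a simple pass or fail.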
Producing PMML
PMML is easily exported from many statistical tools. As mentioned above, modeling products such as IBM SPSS both export and import PMML files. For example, in IBM SPSS Statistics, you can export a PMML model by choosing to export the model as an XML file (PMML is XML-based) after you select all of the appropriate model parameters.
Consuming PMML
PMML allows predictive solutions to be easily shared as soon as the model building phase is completed. For example, a model built in IBM SPSS Statistics can be moved directly to another tool for building visualizations. PMML also allows you to shield end users from the complexity associated with statistical tools and models.
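To show what consuming a model can look like, here is a minimal Python sketch that reads a one-predictor linear regression from a PMML string and scores a record. The model, its field names and its coefficients are invented for illustration, and I am assuming the PMML 4.4 namespace; a real consumer would also honor the mining schema, transformations and targets.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_4"}  # assumed PMML 4.4 namespace

# Illustrative model: price = 50.0 + 2.0 * area
MODEL = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="area" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="area"/>
      <MiningField name="price" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="50.0">
      <NumericPredictor name="area" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def score(pmml_text, record):
    """Evaluate intercept + sum(coefficient * value ** exponent)."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        exponent = int(predictor.get("exponent", "1"))
        value = record[predictor.get("name")]
        result += float(predictor.get("coefficient")) * value ** exponent
    return result


print(score(MODEL, {"area": 100.0}))  # 50.0 + 2.0 * 100.0 = 250.0
```

The end user of `score` never sees the modeling tool that produced the file, which is the shielding effect described above.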
Conclusion
As the need for reliable predictive solutions increases, PMML will play a significant role in any enterprise development and deployment strategy. Understanding PMML is not complicated, and the sooner you embrace it, the better.