Perficient Business Intelligence Solutions Blog

Posts Tagged ‘Big Data’

“Accelerate your Insights” – Indeed!

I have to say, I was very excited today as I listened to Satya Nadella describe the capabilities of the new SQL 2014 Data Platform during the Accelerate your Insights event. My excitement wasn’t tweaked by the mechanical wizardry of working with a new DB platform, nor was it driven by a need to be the first to add another version label to my resume. Considering that I manage a national Business Intelligence practice, my excitement was fueled by seeing Microsoft’s dedication to providing a truly ubiquitous analytic platform that addresses the rapidly changing needs of the clients I interact with on a daily basis.

If you’ve followed the BI/DW space for any length of time you’re surely familiar with the explosion of data, the need for self-service analytics and perhaps even the power of in-memory computing models. You probably also know that the Microsoft BI platform has several new tools (e.g. PowerPivot, Power View, etc.) that run inside of Excel while leveraging the latest in in-memory technology.

But… to be able to expand your analysis into the Internet of Things (IoT) with a new Azure Intelligent Systems Service and apply new advanced algorithms, all while empowering your ‘data culture’ through new hybrid architectures… that was news to me!

OK, to be fair, part of that last paragraph wasn’t announced during the keynote; it came from meetings I attended earlier this week that I’m not at liberty to discuss. But suffice it to say, I see the vision!

What is the vision? The vision is that every company should consider what their Data Dividend is.

Diagram: Microsoft Data Dividend Formula

Why am I so happy to see this vision stated the way it is? Because for years I’ve evangelized to my clients to think of their data as a ‘strategic asset’. And like any asset, if given the proper care and feeding, you should expect a return on it! Holy cow and hallelujah, someone is singing my song!! :-)

What does this vision mean for our clients? From a technical standpoint it means the traditional DW, although still useful, is an antiquated model. It means hybrid architectures are our future. It means the modern DW may not be recognizable to those slow to adopt.

From a business standpoint it means that we are one step closer to being constrained only by our imaginations on what we can analyze and how we’ll do it. It means we are one step closer to incorporating ambient intelligence into our analytical platforms.

So, in future posts and an upcoming webinar on the modern DW, let’s imagine…

Three Big Data Best Practices

One of the benefits of Hadoop is its ability to be configured to address a number of diverse business challenges and integrated into a variety of different enterprise information ecosystems.  With proper planning, these analytical big data systems have proven to be valuable assets for companies.  However, without significant attention to data architecture best practices, this flexibility can turn into a crude April Fool’s joke: a system that is difficult to use and expensive to maintain.

At Perficient, we typically recommend a number of best practices for implementing Big Data. Three of these practices are:

  1. Establish and Adhere to Data Standards – A data scientist should be able to easily find the data he/she is seeking and not have to worry about converting code pages, changing delimiters, and unpacking decimals.   Establish a standard and stick to it, then convert the data to the standard encoding and delimiter during the ingestion process.
  2. Implement a Metadata Configured Framework – Remember when ETL was all hand-coded?   Don’t repeat the sins of the past and create a vast set of point-to-point custom Sqoop and Flume jobs. This will quickly become a support nightmare.   If the costs of a COTS ETL tool are prohibitive, then build a data ingestion and refining framework from a small number of components that can be configured using metadata.   The goal is for a new data feed to be added by configuring a few lines of metadata rather than by scripting or creating code for each feed.
  3. Organize Your Data – This practice may seem obvious; however, we have seen a number of Hadoop implementations that look like a network file share rather than a standards-driven data environment.   Establish a directory structure that allows for the different flavors of data.   Incremental data (aka deltas), consolidated data, data that has been transformed, user data, and data stored in Hive should be separated into different directory structures.   Leverage a directory naming convention, then publish the standard so that data scientists/users can find the data they are seeking.
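
To make the three practices concrete, here is a minimal Python sketch of a metadata-configured ingestion step. It is an illustration only, not a Hadoop implementation: the feed names, the `FEEDS` table, the `/data/<zone>/...` directory layout, and the choice of UTF-8 with a pipe delimiter as "the standard" are all hypothetical.

```python
import csv
import io

# Hypothetical house standard: every ingested file becomes UTF-8, pipe-delimited.
STANDARD_ENCODING = "utf-8"
STANDARD_DELIMITER = "|"

# Hypothetical feed metadata. Adding a new feed means adding one entry here,
# not writing a new point-to-point job.
FEEDS = {
    "gl_daily": {"encoding": "cp1252", "delimiter": ",", "zone": "incremental"},
    "crm_accounts": {"encoding": "utf-16", "delimiter": "\t", "zone": "consolidated"},
}


def target_path(feed_name, load_date):
    """Build the landing directory from the (assumed) published naming convention."""
    zone = FEEDS[feed_name]["zone"]
    return f"/data/{zone}/{feed_name}/load_date={load_date}"


def standardize(feed_name, raw_bytes):
    """Decode a feed using its source settings and re-emit it in the standard format."""
    meta = FEEDS[feed_name]
    text = raw_bytes.decode(meta["encoding"])
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=meta["delimiter"])
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=STANDARD_DELIMITER, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue().encode(STANDARD_ENCODING)
```

The point of the design is that `standardize` and `target_path` never change when a feed is added; only the metadata does.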

Addressing these three best practices will help ensure that your Big Data environment is usable and maintainable.   If you are implementing or considering a Big Data solution, Perficient has the thought-leadership, partnerships, and experience to make your Big Data program a success.

Tag Splunk, you’re it!

Splunk does a wonderful job of searching through all of the data you’ve indexed, based upon your search command pipeline. There are times though that you can add additional intelligence to the search that Splunk cannot add on its own – perhaps this information is specific to your organizational structure, like host names or server names. Rather than typing this information within the Search pipeline each time, you can create a knowledge object in the form of a Splunk search tag.

Search Tags

To help you search more efficiently for particular groups of event data, you can assign one or more tags to any field/value combination (including event type, host, source, or source type) and then do your searches, based on those tags.

Tagging field value pairs

You can use Splunk Web to create your tags directly from your search results. As an example, I’ve indexed multiple Cognos TM1 server logs into my Splunk server. These logs are generated from many different TM1 Admin servers but are all indexed by one Splunk server. If I’d like to be able to search a particular server source without having to qualify it in each of my searches, I can create a tag for that server.

In a resulting search, I can select any event that has the field value pair that I want to tag, then:

1. Click on the arrow next to that event.

2. Under Actions, click on the arrow next to that field value.

3. Now select Edit Tags.

4. Create your tag and click Save.

In my example, I created a tag named “TM1-2” that specifies a particular TM1 server source. In the future, I can then use that tag to further narrow my search and isolate events that occurred only in that server log:


tag=TM1-2 product x plan

You can use the tag to narrow down the search (like in my example above) by using the following syntax:

tag=<tagname>

Or, you can narrow down your search even further by associating your tag with a specific field, using the following syntax:

tag::<field>=<tagname>

Use wildcards to search for tags

As a Splunk Master, you can “get wild” and use the asterisk (*) as a wildcard when searching using your tags. For example, if you have multiple event-type tags for various types of TM1 servers, such as TM1-1 and TM1-99, you can search for all of them with:

tag::eventtype=TM1-*

If you wanted to find all hosts whose tags contain “22”, you can search for the tag:

tag::host=*22*

Here is an interesting example that I have yet to utilize (although you’ll find it in several places in the Splunk documentation): if you wanted to search for the events with event types that have no tags associated with them, you can search for the Boolean expression:

NOT tag::eventtype=*
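
To see what these tag searches are doing conceptually, here is a small Python model of tag matching with wildcards. It is a sketch under stated assumptions, not how Splunk implements tags: the `TAGS` table, the log file names, the host name, and the `matches_tag` helper are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical tag table: (field, value) -> set of tags assigned to that pair.
TAGS = {
    ("source", "tm1server_2.log"): {"TM1-2"},
    ("source", "tm1server_99.log"): {"TM1-99"},
    ("host", "appsrv22"): {"rack22"},
}


def matches_tag(event, pattern, field=None):
    """True if a tagged field/value pair in the event carries a tag matching pattern.

    When field is given, only tags attached to that field are considered,
    which mirrors qualifying a tag search with a field name.
    """
    for (tag_field, tag_value), tags in TAGS.items():
        if field is not None and tag_field != field:
            continue
        if event.get(tag_field) != tag_value:
            continue
        if any(fnmatchcase(tag, pattern) for tag in tags):
            return True
    return False
```

In this model, `matches_tag(event, "TM1-2")` plays the role of `tag=TM1-2`, and `not matches_tag(event, "*", field="eventtype")` plays the role of the Boolean expression above.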

Wildcards in general

Wildcard support makes searching very flexible; however, it is important to understand that the “more flexible” (or less specific) your Splunk searches are, the less efficient they will become. It is recommended that care be taken when using wildcards within your searches.

Splunk On! 


Searching with Splunk

I would be remiss, in a blog on Splunk searching, if I didn’t at least mention the version 6.0 dashboard.

The Search dashboard

If you take a look at the Splunk search dashboard (and you should), you can break it down into four areas:

  • Search Bar. The search bar is a long textbox that you can enter your searches into when you use Splunk Web.
  • Range Picker. Using the (time) range picker you set the period over which to apply your search. You are provided with a good supply of preset time ranges that you can select from, but you can also enter a custom time range.
  • How-To. This is a Splunk panel that contains links you can use to access the Search Tutorial and the Search Manual.
  • What-To. This is another Splunk panel that displays a summary of the data that is installed on this Splunk instance.

The New Search Dashboard

After you run a new search, you’re taken to the New Search page. The search bar and time range picker are still available in this view, but the dashboard updates with many more elements, including search action buttons, a search mode selector, counts of events, a job status bar, and results tabs for Events, Statistics, and Visualizations.

Generally Speaking

All searches in Splunk take advantage of the indexes that were set up on the data that you are searching. Indexes exist in every database, and Splunk is not an exception. Splunk’s indexes organize words or phrases in the data over time. Successful Splunk searches (those that yield results) return records (events) that meet your search criteria. The more matches Splunk finds in your data (the more events it returns), the greater the impact on overall search performance, so it is important to be as specific in your searches as you can.

Before I “jump in”, the following are a few things worth alerting you to:

  • Search terms are case insensitive.
  • Search terms are additive.
  • Only the time frame specified is queried.
  • Search terms are words, not parts of words.
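
The four rules above can be illustrated with a toy matcher in Python. This is a deliberate simplification for illustration: Splunk's real event segmentation and indexing are more involved, and the function names and the tuple layout of `events` are assumptions of this sketch.

```python
import re


def tokenize(text):
    """Split raw event text into lower-cased whole words."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def matches(event_text, search):
    """True when every search term appears in the event as a whole word.

    Lower-casing both sides gives case insensitivity; requiring all terms
    makes them additive; comparing against whole tokens means a term such
    as "plan" does not match the word "planning".
    """
    words = tokenize(event_text)
    return all(term in words for term in search.lower().split())


def search_events(events, search, earliest, latest):
    """Check only events whose timestamp falls inside [earliest, latest)."""
    return [
        text
        for timestamp, text in events
        if earliest <= timestamp < latest and matches(text, search)
    ]
```

Here `events` is assumed to be a list of `(timestamp, text)` pairs; anything outside the time window is never examined, however well it matches.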

Splunk Quick Reference Guide

To all of us future Splunk Masters: Splunk has a Splunk Language Quick Reference Card (updated for version 6.0) available for download in PDF format from the company website.

I recommend having a look!

To Master Splunk, you need to master Splunk’s search language, which includes an almost endless array of commands, arguments and functions. To help with this, Splunk offers its searching assistant.

The Splunk searching assistant uses “typeahead” to “suggest” search commands and arguments as you are typing into the search bar. These suggestions are based on the content of the datasource you are searching and are updated as you continue to type. In addition, the searching assistant also displays the number of matches for the search term, giving you an idea of how many search results Splunk will return.

The image below shows the Splunk searching assistant in action. I’ve typed “TM1” into the search bar and Splunk has displayed every occurrence of these letters it found within my datasource (various Cognos TM1 server logs) along with a “hit count”:


The search assistant uses Python to perform a reverse-url-lookup to return description and syntax information as you type. You can control the behavior of the searching assistant with UI settings in the Search-Bar module, but it is recommended that you keep the default settings and use it as a reference.

Some Basic Optimization

Searching in Splunk can be done from Splunk Web, from the command line interface (CLI) or the REST API. When searching using the Web interface you can (and should) optimize the search by setting the search mode (Fast, Verbose or Smart).

Depending on the search mode, Splunk automatically discovers and extracts fields other than the default fields, returns results as an events list or a table, and runs the calculations required to generate the event timeline. This “additional work” can affect performance, so the recommended approach is to use Fast mode while you conduct your initial search discovery (with the help of the searching assistant), and then move to either Verbose or Smart mode, depending upon your specific requirements and the outcome of your discovery searching.
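
For searches submitted through the REST API rather than Splunk Web, the same choices (time range, search mode) travel as request parameters. The Python sketch below only builds the request and does not send it. The endpoint path and parameter names (`search`, `earliest_time`, `latest_time`, `adhoc_search_level`, `output_mode`) reflect my reading of the Splunk REST API reference and should be treated as assumptions to verify against the documentation for your version; the host, port, and helper name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_MODES = {"fast", "smart", "verbose"}


def build_search_request(base_url, query, mode="fast", earliest="-24h", latest="now"):
    """Build (but do not send) a POST request that would create a search job."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown search mode: {mode!r}")
    # Assumption: a search string sent over REST starts with the "search"
    # command (or a leading pipe), unlike the Splunk Web search bar.
    if not query.lstrip().startswith(("search ", "|")):
        query = "search " + query
    payload = {
        "search": query,
        "earliest_time": earliest,
        "latest_time": latest,
        "adhoc_search_level": mode,  # assumed name of the search-mode parameter
        "output_mode": "json",
    }
    return Request(
        base_url.rstrip("/") + "/services/search/jobs",
        data=urlencode(payload).encode("ascii"),
        method="POST",
    )
```

Starting with `mode="fast"` for discovery and switching to `"smart"` or `"verbose"` later follows the approach recommended above. Authentication and sending the request are intentionally left out.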


I should probably stop here (before this post gets any longer) – but stay tuned; my next post is already written and “full of Splunk” …

Thank you for downloading Splunk Enterprise. Get started now…

Once you have found your way to the Splunk website and downloaded your installation file, you can initiate the installation process. By this point you will have received the “Thank You for Downloading” welcome email.

More than just a sales promotion, this email gives you valuable information about the limitations of your free Splunk Enterprise license, as well as links to help you get started quickly, including links to:

  • Online Tutorials
  • Free live training with Splunkers
  • Educational videos
  • Etc.

The (MS Windows) Installation

On MS Windows, once your download is complete, you are prompted to Run.


Give Me Splunk!

So you are ready to Splunk and you want to get started? Well…

Taking the First Step

Your first step, before you download any installation packages, is to review the Splunk Software License Agreement, which you can find on the Splunk website (and if you don’t check it there, the Splunk install drops a copy for you in the installation folder, in both .RTF and .TXT formats). Although you can download a free, full-featured copy of Splunk Enterprise, the agreement governs its installation and use, and it is incumbent upon you to at least be aware of the rules.

Next, as with any software installation, you must make time to review your hardware to make sure that you can run Splunk in such a way as to meet your expected objectives. Although Splunk is a highly optimized application, if you are planning to evaluate Splunk for eventual production deployment, you should use hardware typical of the environment you intend to deploy to. In fact, the hardware you use for your evaluation should meet or exceed the recommended hardware capacity specifications for the tool and (your) intentions (you can check the website or talk to a Splunk professional to be sure what these are).

Disk Space Needs

Beyond the physical footprint of the Splunk software (which is minimal), you will need some Splunk “operational space”. When you read data into Splunk, it creates a compressed/indexed version of that “raw data” (the rawdata file), and this file is typically about 10% of the size of the original data. In addition, Splunk then creates index files that “point” to the compressed file. These associated “index files” can range in size from approximately 10% to 110% of the rawdata file, based on the number of unique terms in the data. Again, rather than get into sizing specifics here, just note that if your goal is “education and exploration”, just go ahead and install Splunk on your local machine or laptop – it’ll be just fine.
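
As a rough worked example of those percentages, here is a small Python helper. It is a back-of-the-envelope sketch, not a sizing tool: it assumes the 10% to 110% range is measured against the compressed rawdata file (as the paragraph above states), and the function and parameter names are made up for the illustration.

```python
def estimate_disk_usage(original_gb, compression_ratio=0.10, index_low=0.10, index_high=1.10):
    """Estimate Splunk operational space for a given volume of original data.

    Returns (rawdata_gb, total_low_gb, total_high_gb), where the totals are
    the compressed rawdata plus the smallest and largest expected index files.
    """
    rawdata_gb = original_gb * compression_ratio
    total_low_gb = rawdata_gb * (1 + index_low)
    total_high_gb = rawdata_gb * (1 + index_high)
    return rawdata_gb, total_low_gb, total_high_gb


rawdata, low, high = estimate_disk_usage(100)
print(f"rawdata ~{rawdata:.0f} GB, total between ~{low:.0f} and ~{high:.0f} GB")
```

Under these assumptions, 100 GB of original data needs roughly 10 GB for the rawdata file and somewhere between about 11 GB and 21 GB in total, depending on how many unique terms the data contains.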

Go Physical or Virtual?

Most organizations today run a combination of both physical and virtual machines. Without getting into specifics here, it is safe to say that Splunk runs well on both; however (as with most software), it is important that you understand the needs of the software and be sure that your machine(s) are configured appropriately. The Splunk documentation reports:

“If you run Splunk in a virtual machine (VM) on any platform, performance does degrade. This is because virtualization works by abstracting the hardware on a system into resource pools from which VMs defined on the system draw as needed. Splunk needs sustained access to a number of resources, particularly disk I/O, for indexing operations. Running Splunk in a VM or alongside other VMs can cause reduced indexing performance”.

Let’s get the software!

Splunk Enterprise (version 6.0.2 as of this writing) can run on both MS Windows and Linux, but for this discussion I’m going to focus on only the Windows version. Splunk is available in both 32-bit and 64-bit architectures, and it is always advisable to check the product details to see which version is correct for your needs.

Assuming that you are installing for the first time (not upgrading), you can download the installation file (msi for Windows) from the company website. I recommend that you read through the release notes for the version that you intend to install before downloading. Release notes list the known issues along with potential workarounds, and being familiar with this information can save you plenty of time later.

[Note: If you are upgrading Splunk Enterprise, you need to visit the Splunk website for specific instructions before proceeding.]

Get an Account

To actually download (any) version of Splunk, you need to have a Splunk account (and user name). Earlier, I mentioned the idea of setting up an account that you can use for educational purposes and support. If you have visited the website and established your account, you are ready; if not, you need to set one up now.

  1. Visit the Splunk website
  2. Click on “Sign Up”

Once you have an account, you can click on the big, green button labeled “Free Download”. From there, you will be directed to the “Download Splunk Enterprise” page, where you can click on the link of the Splunk version you want to install.

From there, you will be redirected to the “Thank You for downloading…” page and be prompted to save the download to your location:

And you are on your way!

Check back and I’ll walk you through a typical MS Windows install (along with some helpful hints that I learned during my journey to Splunk Nirvana)!


Where and How to Learn Splunk

“Never become so much of an expert that you stop gaining expertise.” – Denis Waitley

In all professions, and especially information technology (IT), success and marketability depend upon an individual’s propensity for continued learning. With Splunk, there exist a number of options for increasing your knowledge and expertise. The following are just a few. We’ll start with the obvious choices:

  • Certifications,
  • Formal training,
  • Product documentation and
  • The company’s website.


Similar to most main-stream technologies, Splunk offers various certifications and as of this writing, Splunk categorizes certifications into the following generalized areas:

The Knowledge Manager

A Splunk Knowledge Manager creates and/or manages knowledge objects that are used in a particular Splunk project, across an organization or within a practice. Splunk knowledge objects include saved searches, event types, transactions, tags, field extractions and transformations, lookups, workflows, commands and views. A knowledge manager will not only have a thorough understanding of Splunk, the interface, general use of search and pivot, etc., but also possess the “big picture view” required to extend the Splunk environment through the management of the Splunk knowledge object library.

The Administrator

A Splunk Administrator is required to support the day-to-day “care and feeding” of a Splunk installation. This requires “hands-on” knowledge of best practices, configuration details as well as the ability to create and manage Splunk knowledge objects, in a distributed deployment environment.

The Architect

The Splunk Architect combines knowledge management expertise, administration know-how and the ability to design and develop Splunk Apps. Architects must also possess the ability to focus on larger deployments, learning best practices for planning, data collection, sizing and documenting in a distributed environment.

Data Indiscretions

Data loaded into a TM1 or SPSS model will, in most cases, include files consisting of thousands (or hundreds of thousands) of records. It is not reasonable, given the number of fields and records in files of this size, for you to visually inspect all fields in every record (of every file) for missing or invalid values.

Data Transformations

In TM1, data is usually loaded from an EDW or GL system, so (once the initial testing phase has been completed) the probability that incoming data contains unexpected values (should be) somewhat small. However, the data will (most likely) need to be transformed into a format that the TM1 model can use, or is more optimal for the model’s use. Additionally, if data is being manually input to the model (for example a user entering a (sales) forecast value), then inspecting the data for invalid values is required.

With SPSS Modeler, data may come from sources more “sketchy” – marketing surveys, multiple files merged into one or even manually typed-in data. Obviously, the need for auditing and transforming is critical.

Types of Data Violations

Generally, (with TM1 or SPSS) three different types of data violations may be found during the loading process:

  • Values do not comply (with a field’s defined storage type). For example, finding a string value in a numeric field.
  • Only a set of values or a range of values are allowed for a field and the incoming value does not exist within the defined set or is greater or less than the acceptable range.
  • Undefined values (no data!) encountered anywhere. SPSS refers to these as $null$ and has “rules” about how it handles them.
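
The three violation types can be expressed as a single check. The Python sketch below is tool-neutral and purely illustrative: the function name, its parameters, and the returned labels are hypothetical and do not correspond to any SPSS Modeler or TM1 API.

```python
def find_violation(value, storage=float, allowed=None, lower=None, upper=None):
    """Classify one incoming value against a field definition.

    Returns "undefined", "wrong_type", "out_of_set", "out_of_range",
    or None when the value is acceptable.
    """
    # Undefined: nothing there at all (no data).
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "undefined"
    # Storage type: e.g. a string found where a number is expected.
    try:
        typed = storage(value)
    except (TypeError, ValueError):
        return "wrong_type"
    # Set of allowed values.
    if allowed is not None and typed not in allowed:
        return "out_of_set"
    # Range of allowed values.
    if lower is not None and typed < lower:
        return "out_of_range"
    if upper is not None and typed > upper:
        return "out_of_range"
    return None
```

For example, `find_violation("abc")` reports a storage-type problem, while `find_violation("150", lower=0, upper=120)` reports a range problem.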


When an invalid value (a value violating one of the three rules) is found, one of five possible actions can be taken:

  • Nullify – in SPSS, you can convert the value to undefined ($null$); again, this has a special meaning for SPSS. For TM1, this is not usually a viable option, since TM1 does not work well with NULL values.
  • Coerce – you can convert the invalid value to a valid one. What counts as invalid and valid is usually determined by the measurement level of the field. In SPSS:
    • Flags: if not True or False, then it is False
    • Nominal, Ordinal: if invalid, it becomes the first member of the set’s values
    • Continuous: value less than lower bound -> lower bound, value greater than upper bound -> upper bound
    • Undefined ($null$) -> midpoint of range

In TM1, this process is more typically a “remapping” from one value to another. For example, translating a product’s code used in one system to that product’s code used in another.

  • Discard – delete the record. With SPSS, you can Discard the record; in TM1, you might use the ItemReject function.
  • Warn – invalids are reported in a message window in SPSS; minor errors may be written to the message log in TM1.
  • Abort – in SPSS, the first invalid value encountered can result in an error and the stream execution being aborted; with TM1, processes can be terminated with the ProcessError function.
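
The five actions, including the coercion rules listed for SPSS, can be sketched in a few lines of Python. This is an illustration of the logic only; it is neither SPSS Modeler nor TurboIntegrator code, and every name in it (`coerce`, `handle_invalid`, `LoadAborted`) is hypothetical.

```python
class LoadAborted(Exception):
    """Raised when the 'abort' action stops the load."""


def coerce(value, measurement, members=None, lower=None, upper=None):
    """Convert an invalid value to a valid one based on measurement level."""
    if measurement == "flag":
        # Anything that is not already True or False becomes False.
        return value if isinstance(value, bool) else False
    if measurement in ("nominal", "ordinal"):
        # Invalid members become the first member of the set.
        return value if value in members else members[0]
    if measurement == "continuous":
        if value is None:
            return (lower + upper) / 2  # undefined -> midpoint of range
        return min(max(value, lower), upper)  # clamp to the bounds
    raise ValueError(f"unknown measurement level: {measurement!r}")


def handle_invalid(record, field, action, warnings, **rules):
    """Apply one action to a record; returns None when the record is discarded."""
    if action == "nullify":
        return {**record, field: None}
    if action == "coerce":
        return {**record, field: coerce(record[field], **rules)}
    if action == "discard":
        return None
    if action == "warn":
        warnings.append(f"invalid value {record[field]!r} in field {field!r}")
        return record
    if action == "abort":
        raise LoadAborted(f"invalid value in field {field!r}")
    raise ValueError(f"unknown action: {action!r}")
```

Which action is appropriate depends on the tool and the field: as noted above, nullifying is rarely a good fit for TM1, whereas discarding or aborting map naturally onto ItemReject and ProcessError.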

Handling Data Violations

In SPSS Modeler, the Type node enables data checking and transforming. (Checking can also be done using the Types tab in a data source node). To check and transform, you need to specify 1) what the valid values are and 2) the action to be taken:

  1. Select the Types tab.
  2. Select the field to check and click in the corresponding cell in the Values column.
  3. Select Specify values and labels and enter the lower and upper bound.
  4. Select the action to take when an invalid value is encountered.

In TM1, checking for and handling (transforming) of invalid values is a little more work.

Using TurboIntegrator processes is the best approach for loading and transforming data in a TM1 model, and scripting logic (using predefined functions) is required to evaluate and take action on records being loaded into TM1:

  • Select the Data tab.
  • Type the script to use the appropriate TM1 functions. Some examples include:
    • Checking for invalid data types – Value_Is_String (does the cell contain a string value?)
    • Checking data against a set of values – DIMIX (does the value exist in a dimension?)
    • Checking data to ensure it is within a range – If (CapitalValue < MaxAllowed & CapitalValue > MinAllowed); (is the value within the allowed range?)
    • Checking for missing data (empty fields) – IF (value @= ''); (is the value empty?)
  • Save and Run the process.

In both tools you have similar objectives – load data, ensure that the data is “usable” and if it is not, perform an appropriate action (rather than break the model!). SPSS Modeler allows you to do some checking and transforming by selecting values in dialogs, while Cognos TM1 requires you to use TI scripting to accomplish these basic operations. Both can be straightforward or complex, and both can be automated and reused on future datasets.


Primary Practices for Examining Data

SPSS Data Audit Node

Once data is imported into SPSS Modeler, the next step is to explore the data and to become “thoroughly acquainted” with its characteristics. Most (if not all) data will contain problems or errors such as missing information and/or invalid values. Before any real work can be done using this data you must assess its quality (the higher the quality, the more accurate the predictions).

Addressing issues of data quality

Fortunately, SPSS Modeler makes it (almost too) easy! Modeler provides us several nodes that can be used for our integrity investigation. Here are a couple of things even a TM1 guy can do.

Auditing the data

After importing the data, do a preview to make sure the import worked and things “look okay”.

In my previous blog I talked about a college using predictive analytics to predict which students might or might not graduate on time, based upon their involvement in athletics or other activities.

From the Variable File Source node, it was easy to have a quick look at the imported file and verify that the import worked.

Another useful option is to run a table. This will show if field values make sense (for example, if a field like age contains numeric values and no string values). The Table node is cool – after dropping it into my stream and connecting my source node to it, I can open it up and click run (to see all of my data nicely fit into a “database like” table) or I can do some filtering using the real-time “expression builder”.

The expression builder lets me see all of the fields in my file, along with their level of measurement (shown as Type) and their Storage (integer, real, string). It also gives me the ability to select from SPSS predefined functions and logical operators to create a query expression to run on my data. Here I wanted to highlight all students in the file that graduated “on time”:

You can see the possibilities that the Table node provides – but of course it is not practical to visually inspect thousands of records. A better alternative is the Data Audit node.

The Data Audit node is used to study the characteristics of each field. For continuous fields, minimum and maximum values are displayed. This makes it easy to detect out of range values.

Our old pal measurement level

Remember measurement level (a field’s “use” or “purpose”)? Well, the data audit node reports different statistics and graphs, depending on the measurement level of the fields in your data.

For categorical fields, the data audit node reports the number of unique values (the number of categories).

For continuous fields, minimum, maximum, mean, standard deviation (indicating the spread in the distribution), and skewness (a measure of the asymmetry of a distribution; if a distribution is symmetric it has a skewness value of 0) are reported.

For typeless fields, no statistics are produced.
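
To show how measurement level drives the reported statistics, here is a small Python stand-in for that part of the audit. It is a sketch, not the Data Audit node: it uses population formulas for standard deviation and skewness (the third standardized moment), which may differ from the exact formulas Modeler applies, and the function name and dictionary keys are hypothetical.

```python
import statistics


def audit_field(values, measurement):
    """Return summary statistics chosen by the field's measurement level."""
    if measurement == "categorical":
        # Number of unique values, i.e. the number of categories.
        return {"unique": len(set(values))}
    if measurement == "continuous":
        mean = statistics.fmean(values)
        std_dev = statistics.pstdev(values)
        if std_dev == 0:
            skewness = 0.0
        else:
            n = len(values)
            skewness = sum((v - mean) ** 3 for v in values) / (n * std_dev ** 3)
        return {
            "min": min(values),
            "max": max(values),
            "mean": mean,
            "std_dev": std_dev,
            "skewness": skewness,
        }
    # Typeless fields: no statistics are produced.
    return {}
```

A symmetric set of values such as 1 through 5 gives a skewness of 0, while a field with a long right tail (a few very large household incomes, say) gives a positive value.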

“Distribution” or “Histogram”?

The data audit node also produces different graphs for each field (except for typeless fields, no graphs are produced for them) in your file (again based upon the field’s level of measurement).

For a categorical field (like “gender”) the Data Audit Node will display a distribution graph and for a continuous field (for example “household income”) it will display a histogram graph.

So, back to my college example: I added an audit node to my stream and took a look at the results.

First, I excluded the “ID” field (it is just a unique student identification number and has no real meaning for the audit node). Most of the fields in my example (gender, income category, athlete, activities and graduate on time) are qualified as “Categorical” so the audit node generated distribution graphs, but the field “household income” is a “Continuous” field, so a histogram was created for it (along with the meaningful statistics like Min, Max, Mean, etc.).

Another awesome feature – if you click on the generated graphs, SPSS will give you a close up of the graph along with totals, values and labels.


I’ve talked before about the importance of understanding field measurement levels. The fact that the statistics and chart types the data audit node generates are derived from the measurement level is another illustration of how Modeler lets measurement level determine the output.


Technology Confusion

While returning from a client presentation and reflecting on the meeting conversations I was struck by a similarity that seems to be creeping into the minds of our clients.

While discussing our approach to performing a strategy assessment for this new client we were reviewing an example architectural diagram and a question was raised. One of the business sponsors commented that the ‘Operational Data Store’ that was referenced on the diagram seemed like an ‘archaic’ term from the past that may not be appropriate for their new platform. I explained that they may need a hybrid environment and that each technology had its place.

However, on the plane ride home I realized that I had heard a similar question just a few weeks earlier. I was at a different client, manufacturing as opposed to software, in a different part of the country, speaking about a different type of proposal, although both would have resulted in architectural enhancements, and a stakeholder asked about ‘new data warehouse’ technology such as Hadoop replacing the ‘older’ data warehouse paradigm we were discussing.

On both occasions I knew that the client wasn’t challenging my ideas as much as wanting to understand my recommendations better. What I knew in my head, but had failed to initially describe to both clients, was the concept of ‘replacement’ technologies versus ‘complementary’ technologies. Honestly, it had never occurred to me that I needed to make such a designation since I wasn’t recommending both technologies. The client introduced the newer technology into the discussion, at which point I fell victim to the assumption that both clients had a base understanding of what the different technologies were used for.

To be clear, we’re talking about the Big Data technology Hadoop and the well-known process of building a data warehouse with the Kimball or Inmon approach. The former approach is relatively new and getting a lot of airplay as the latest thing. The latter approach is well known but has had its share of underwhelming successes.

So is Hadoop the new replacement for ‘traditional’ data warehousing? For that matter, is self-service BI a replacement for traditional dashboarding and reporting? How about Twitter, is it the replacement for traditional email or text messaging?

The answer is No. All of the technologies described are complementary technologies, not replacement technologies. These technologies offer additional capabilities with which to build more complete systems, but in some cases, certainly that of data warehousing, our clients are confusing them as carte blanche replacement options.

Considering that clear and concise messaging is fundamental to successful client engagements, I encourage all of our consultants to consider which category their respective technologies fall into and to make sure their clients understand that positioning within their organization.