Perficient Enterprise Information Solutions Blog


Posts Tagged ‘Big Data’

The Chief Analytics Officer

One of the key points I make in our Executive Big Data Workshops is that effective use of Big Data analytics will require transforming both business and IT organizations. Big Data, with access to cross-functional data, will transform the strategic processes within a company that guide long-term and year-to-year investments. With the ability to apply machine learning, data mining, and advanced analytics to view how different business processes interact with each other, companies now have empirical information for use in their strategic processes.

We are now seeing evidence of this transformation happening with the emergence of the Chief Analytics Officer position. As detailed in the InfoWorld article, Chief analytics officer: The ultimate big data job, it’s not about the data but what you do with the data. And it is important enough to create a new position, the CAO. I recommend reading this article.

The Best Way to Limit the Value of Big Data

A few years back I worked for a client that was implementing cell-level security on every data structure within their data warehouse. They had nearly 1,000 tables and 200,000 columns — yikes! Talk about administrative overhead. The logic was that data access should only be given on a need-to-know basis: users would have to request access to specific tables and columns.

Need-to-know is a term frequently used in military and government institutions that refers to granting access to sensitive information to cleared individuals. This is a good concept, but the key is the part about granting access to SENSITIVE data: the information has to be classified first, and only then is need-to-know (for cleared individuals) applied.

Most government documents are not sensitive. This allows the administrative resources to focus on the sensitive, classified information. The system for classifying information as Top Secret, Secret, and Confidential has relatively stringent rules for classification, but it also discourages the over-classification of information. This is because once a document is classified, its use becomes limited.

This same phenomenon is true in the corporate world. The more a set of data is locked down, the less it will be used. Unnecessarily limiting information workers' access to data obviously does not help the overall objectives of the organization. Big Data just magnifies this dynamic, and unnecessarily restricting access to Big Data is the best way to limit its value. Unreasonably lock down Big Data and its value will be severely limited.

One Cluster To Rule Them All!

In the Hadoop space we have a number of terms for the Hadoop File System used for data management. Data Lake is probably the most popular. I have heard it called a Data Refinery, as well as some other not-so-mentionable names. The one that has stuck with me is the Data Reservoir, mainly because it is the most accurate water analogy for what actually happens in a Hadoop implementation used for data storage and integration.

Consider that data is first landed in the Hadoop file system. This is the un-processed data, just like water running into a reservoir from different sources. The data in this form is only fit for limited use, like analytics by trained power users. The data is then processed, just as water is processed. Process water and you end up with water that is consumable. Go one step further and distill it, and you have water that is suitable for medical applications. Data is the same way in a Big Data environment. Process it enough and one ends up with conformed dimensions and fact tables. Process it even more, and you have data that is suitable for basing bonuses on or even publishing to government regulators.

Internet of Things and Enterprise Data Management…

 

It is amazing to see the terms we come up with to explain a new technology or trend. Consulting thought leadership coins these words to group a set of technologies or trends so that people have a context. However, the success and adoption of the technology or trend defines the term's reputation. For example, Data Warehouse was the in-thing, only to be shunned when it did not deliver on its promises. The industry quickly realized the mistake, called it Business Intelligence, and hid Data Warehouse behind BI until things settled. Now no one questions the value of a DW or EDW, or perceives it as a risky project.

Some terms are really great and are here to stay for a long time. Some wither away; some change and take on a different meaning. One such term that got my attention is IoT – the Internet of Things. What is this? It sounds like 'those things,' but really, what is this trend or technology?

Wikipedia gives you this definition:

“The Internet of Things (IoT) is the interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure. Typically, IoT is expected to offer advanced connectivity of devices, systems, and services that goes beyond machine-to-machine communications (M2M) and covers a variety of protocols, domains, and applications.[1] The interconnection of these embedded devices (including smart objects), is expected to usher in automation in nearly all fields, while also enabling advanced applications like a Smart Grid.[2]”


That is a lot of stuff. It looks like pretty much everything we do with the Internet, and I am sure this term will change and take shape. But let's look at how this relates to Enterprise Data Management. From an enterprise data perspective, let us consider a subset of IoT: machine-generated internet data and the consolidation of data from systems operating in the cloud. What we end up with is a whole lot of data that is new and also outside the traditional Enterprise Data framework. The impact and exposure are real, and much of the IoT data may live outside the firewalls.

In essence, Enterprise Data Management needs to deal with the added dimensions of architecture, technology, and governance for IoT. Considering IoT data as out of scope for Enterprise Data Management will create more issues than it solves, especially if you generate or depend on IoT data.

Realizing Agile Data Management …

Years of work went into building the elusive single version of truth. Despite all the attempts from IT and the business, Excel reporting and Access databases were impossible to eliminate. Excel is the number one BI tool in the industry, and for good reasons: accessibility, speed, and familiarity. Almost all BI tools export data to Excel for those reasons. The business will produce the insight it needs as soon as the data is available, manually or otherwise. It is time to come to terms with the fact that change is imminent and there is no such thing as perfect data, only data that is good enough for the business. As the saying goes:

‘Perfect is the enemy of Good!’

So waiting for all the business rules and perfect data before producing the report or analytics is too late for the business. Speed is of the essence: when the data is available, the business wants it; stale data is as good as not having it.


In the changing paradigm of Data Management, agile ideas and tools are in play. Waiting for months, weeks, or even a day to analyze the data from a data warehouse is a problem. Data discovery through agile BI tools that double as ETL offers a significant reduction in the time to data availability. Data virtualization provides real-time access to data, along with metadata, for quicker insights. In-memory data appliances produce analytics in a fraction of the time compared to a traditional data warehouse/BI stack.

We are moving from gourmet sit-down dining to a fast-food concept for data access and analytical insights. Both have their place, usage benefits, and shortcomings, and they complement each other in terms of use and the value they bring to the business. In the following series, let's look at this new set of tools and how they help agile data management throughout the life cycle.

  1. Tools in play:
    1. Data Virtualization
    2. In-Memory Database (appliances)
    3. Data Life Cycle Management
    4. Data Visualization
    5. Cloud BI
    6. Big Data (Data Lake & Data Discovery)
    7. Cloud Integration (on-prem and off-prem)
    8. Information Governance (Data Quality, Metadata, Master Data)
  2. Architectural changes: traditional vs. agile
  3. Data Management Impacts
    1. Data Governance
    2. Data Security & Compliance
    3. Cloud Application Management

DevOps Considerations for Big Data

Big Data is on everyone's mind these days. Creating an analytical environment involving Big Data technologies is exciting and complex: new technology, and new ways of looking at data that otherwise remained dark or unavailable. The challenging part of implementing a Big Data solution is making it production ready.

Once the enterprise comes to rely on the solution, dealing with typical production issues is a must. Expanding the data lake and adding multiple applications that access and change data, along with deploying new statistical learning solutions, can hit overall platform performance. In the end, user experience and trust will become an issue if the environment is not managed properly. Models that used to run in minutes may take hours or days as data changes and new algorithms are deployed. Having the right DevOps process framework is important to the success of Big Data solutions.

In many organizations the Data Scientist reports to the business and not to IT. Knowing the business and technological requirements and setting up the DevOps process accordingly is key to making the solutions production ready.

Key DevOps Measures for Big Data environment:

  • Data acquisition performance (ingestion to creating a useful data set)
  • Model execution performance (Analytics creation)
  • Modeling platform / Tool performance
  • Software change impacts (upgrades and patches)
  • Development to production deployment performance (application changes)
  • Service SLA Performance (incidents, outages)
  • Security robustness / compliance

 

One of the top issues is Big Data security. How secure is the data, and who has access to and oversight of the data? Putting together a governance framework to manage the data is vital for the overall health and compliance of Big Data solutions. Big Data is just getting traction, and many of the best practices for Big Data DevOps scenarios have yet to mature.

Creating Transactional Searches with Splunk

Transactions refer to a “unit of work” or “grouped information” that someone treats as a single, logical data point or singular target. Transactions are made up of multiple events or actions and may mean something entirely different when looked at as a group than when examined one by one.

Using either Splunk Web or its command line interface, you can search for and identify what are referred to as “related raw events” and group them into a “single event,” which you can then treat as a single Splunk transaction.

These events can be linked together by fields they have in common. In addition, transactions can be saved as transaction types for later reuse.

Your Splunk transactions can include:

  • Different events from the same source/same host.
  • Different events from different sources/same host.
  • Similar events from different hosts/different sources.

Some Conceptual Examples

To help understand the power of Splunk transactional searches, let’s consider a few conceptual examples for its use:

  • A certain server error triggers several events to be logged
  • All events that occur within a precise time frame
  • Events that share the same host or cookie value
  • Password change attempts that occurred near unsuccessful logins
  • All of the web addresses a particular IP address viewed, over a time range

To use Splunk transactions, you can either call a transaction type (that you configured via the Splunk configuration file: transactiontypes.conf), or define transaction constraints within your search (by setting the search options of the transaction command).

Here is the transaction command syntax:

transaction [<field-list>] [name=<transaction-name>] <txn_definition-opt>* <memcontrol-opt>* <rendering-opt>*

Splunk transactions are built from two key arguments: a field name (or a comma-delimited list of field names) and a name for the transaction, plus several other optional arguments.

Field Name/List

The field list is a string value made up of one or more field names whose values Splunk uses to group events into transactions.
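
For instance, here is a hypothetical search, assuming web access log events that carry clientip and JSESSIONID fields (these field names are not from the examples above):

sourcetype=access_* | transaction clientip, JSESSIONID maxspan=10m

This would roll events sharing the same client IP and session ID, and falling within a ten-minute window, into a single transaction.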

Transaction Name

This will be the ID (name) that your transaction will be referred to by, or the name of a transaction type from transactiontypes.conf.

Optional Arguments

If other configuration arguments (such as maxspan) are provided in your Splunk search, they override the values of those parameters specified in the transaction definition (within the transactiontypes.conf file). If those parameters are not specified in the file, Splunk uses the default values.
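
As a sketch of what such a definition might look like (the stanza name tm1_errors is hypothetical), a transactiontypes.conf entry could be:

# hypothetical transaction type for the TM1 error grouping shown below
[tm1_errors]
fields = date_month
maxspan = 90s

You could then call it in a search with | transaction name=tm1_errors, and any options supplied in the search itself would override these values.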

Here is an example

A simple example might be to define a transaction that groups Cognos TM1 ERROR events that appear in a message log, have the same value for the field “date_month” (in other words, errors that occur in the same month), and fall within a maximum span of 90 seconds:

sourcetype=tm1* ERROR | transaction date_month maxspan=90s

As always, never stop learning…

Sub-Searching – with Splunk

You’ll find that it is pretty typical to utilize the concept of sub-searching in Splunk.

A “sub search” is simply a “search within a search” or, a search that uses another search as an argument. Sub searches in Splunk must be contained in square brackets and are evaluated first by the Splunk interpreter.

Think of a sub search as being similar to a SQL subquery (a subquery is a SQL query nested inside a larger query).

Sub searches are mainly used for three purposes:

  • Parametrization (of a search, using the output of another search)
  • Appending (running a separate search, but stitching the output to the first search using the Splunk append command).
  • Conditions (To create a conditional search where you only see results of your search if it meets the criteria or perhaps threshold of the sub-search).

Normally, you’ll use a sub-search to take the results of one search and use them in another search, all in a single Splunk search pipeline. Because of how this works, the second search must be able to accept arguments, such as with the append command (as I’ve already mentioned).

Parametrization

sourcetype=TM1* ERROR [search earliest=-30d | top limit=1 date_mday | fields + date_mday]

The above Splunk search utilizes a sub search as a parametrized search of all TM1 logs indexed within a Splunk instance that have “error” events. The sub search (enclosed in the square brackets) filters the search first to the past 30 days and then to the day which had the most events.

Appending

The Splunk append command can be used to append the results of a sub-search to the results of the current search:

sourcetype=TM1* ERROR | stats dc(date_year), count by sourcetype | append [search sourcetype=TM1* | top 1 sourcetype by date_year]

The above Splunk search utilizes a sub search with an append command to combine two TM1 server log searches; these search through all indexed TM1 sources for “Error” events. The first search yields a count of events by TM1 source by year; the second (sub) search returns the top (or most active) TM1 source by year. The results of the two searches are then appended.

Conditional

sourcetype=access_* | stats dc(clientip), count by method | append [search sourcetype=access_* | where action="addtocart" | top 1 clientip by method]

The above Splunk search – which counts the number of different IP addresses that accessed a server and also finds the user who accessed the server the most for each type of page request (method) – is modified with a “where clause” to limit the counts to only those that are “addtocart” actions. (In other words, which user added the most to their online shopping cart, whether they actually purchased anything or not.)

Output Settings for Sub-searches

When performing Splunk sub searches you will often utilize the format command. This command takes the results of a sub-search and formats them into a single result.

Depending upon the search pipeline, the results returned may be numerous, which will impact the performance of your search. To remedy this you can change the number of results that the format command operates over in-line with your search by appending the following to the end of your sub-search:

| format maxresults=<integer>
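
For example, reusing the hypothetical TM1 sub-search from the Parametrization section above and capping the formatted sub-search results at 50 might look like this:

sourcetype=TM1* ERROR [search earliest=-30d | top limit=50 date_mday | fields + date_mday | format maxresults=50]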

I recommend that you take a very conservative approach and utilize the Splunk limits.conf file to enforce limits on all your sub-searches. This file exists in the $SPLUNK_HOME/etc/system/default/ folder (for global settings) or, for localized control, you may find (or create) a copy in the $SPLUNK_HOME/etc/system/local/ folder.

The file controls all Splunk searches (provided it is configured correctly for your environment) and also contains a section specific to Splunk sub-searches, titled “subsearch”.

Within this section, there are three important settings (a sample stanza follows this list):

  • maxout (the maximum number of results to return from a subsearch; the default is 100)
  • maxtime (the maximum number of seconds to run a subsearch before finalizing; the default is 60)
  • ttl (the time to cache a given subsearch's results; the default is 300)
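
As a minimal sketch, a [subsearch] stanza in a local limits.conf that simply restates the defaults described above would look like this:

# limits.conf [subsearch] stanza restating the documented defaults
[subsearch]
maxout = 100
maxtime = 60
ttl = 300

Lowering maxout and maxtime below these defaults is a straightforward, conservative way to keep runaway sub-searches in check.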

Splunk On!

 

An Architectural Approach to Cognos TM1 Design

Over time, I've written about keeping your TM1 model design "architecturally pure." What this means is that you should strive to keep a model's "areas of functionality" distinct within your design.

Common Components

I believe that all TM1 applications, for example, are made up of only four distinct "areas of functionality." They are absorption (of key information from external data sources), configuration (of assumptions about the absorbed data), calculation (where the specific "magic" happens, i.e. business logic is applied to the source data using the set assumptions) and consumption (of the information processed by the application, which is ready to be reported on).

Some Advantages

Keeping functional areas distinct has many advantages:

  • Reduces complexity and increases sustainability within components
  • Reduces the possibility of one component negatively affecting another
  • Increases the probability of reuse of particular (distinct) components
  • Promotes a technology independent design; meaning components can be built using the technology that best fits their particular objective
  • Allows components to be designed, developed and supported by independent groups
  • Diminishes duplication of code, logic, data, etc.
  • Etc.

Resist the Urge

There is always a tendency to "jump in" and "do it all" using a single tool or technology or, in the case of Cognos TM1, a few enormous cubes. Today, with every release of software, there are new "package connectors" that allow you to directly connect (even external) system components. In addition, you may "understand the mechanics" of how a certain technology works, which will allow you to "build" something, but without comprehensive knowledge of architectural concepts, you may end up with something that does not scale, has unacceptable performance, or is costly to sustain.

Final Thoughts

Some final thoughts:

  • Try white boarding the functional areas before writing any code
  • Once you have your “like areas” defined, search for already existing components that may meet your requirements
  • If you do decide to "build new", try to find other potential users for the new functionality. Could you partner and co-produce (and thus share the costs of) a component that you both can use?
  • Before building a new component, "try out" different technologies. Which best serves the component's objectives? (A rule of thumb: if you can find more than 3 other technologies or tools that fit your requirements better than the technology you planned to use, you're in trouble!)

And finally:

Always remember, just because you “can” doesn’t mean you “should”.

A Practice Vision

Vision

Most organizations today have had successes implementing technology, and they are happy to tell you about it. From a tactical perspective, they understand how to install, configure and use whatever software you are interested in. They are "practitioners." But how many can bring a "strategic vision" to a project, or to your organization in general?

An "enterprise" or "strategic" vision is based upon an "evolutionary roadmap" that starts with the initial "evaluation and implementation" (of a technology or tool), continues with "building and using," and finally (hopefully) arrives at the organization, optimization and management of all of the earned knowledge (with the tool or technology). You should expect that whoever you partner with can explain what their practice vision or methodology is or, at least, speak to the "phases" of the evolution process:

Evaluation and Implementation

The discovery and evaluation that takes place with any new tool or technology is the first phase of a practice's evolution. A practice should be able to explain how testing is accomplished and what it covers. How did they determine whether the tool or technology will meet or exceed your organization's needs? Once a decision is made, are they practiced at the installation, configuration and everything else that may be involved in deploying the new tool or technology for use?

Build, Use, Repeat

Once deployed, "building and using" components with that tool or technology begins. The efficiency with which these components are developed, as well as their quality, will depend upon the level of experience (with the technology) that a practice possesses. Typically, "building and using" is repeated with each successful "build," so how many times has the practice successfully used this technology? By human nature, once a solution is "built" and seems correct and valuable, it will be saved and used again. Hopefully, this solution will have been shared as a "knowledge object" across the practice. Although most may actually reach this phase, it is not uncommon to find:

  • Objects with similar or duplicate functionality (they reinvented the wheel over and over).
  • Poor naming and filing of objects (no one but the creator knows it exists or perhaps what it does)
  • Objects not shared (objects visible only to specific groups or individuals, not the entire practice)
  • Objects that are obsolete or do not work properly or optimally are being used.
  • Etc.

Manage & Optimize

At some point, usually while solutions are being developed (or after a certain number of them have been), a practice will "mature" its development or delivery process to the point that it begins investing time, and perhaps dedicating resources, to organizing, managing and optimizing its developed components (i.e. "organizational knowledge management," sometimes known as IP or intellectual property).

You should expect a practice to have a recognized practice leader and a “governing committee” to help identify and manage knowledge developed by the practice and:

  • inventory and evaluate all known (and future) knowledge objects
  • establish appropriate naming standards and styles
  • establish appropriate development and delivery standards
  • create, implement and enforce a formal testing strategy
  • continually develop “the vision” for the practice (and perhaps the industry)

 

More

As I've mentioned, a practice needs to take a strategic or enterprise approach to how it develops and delivers, and to do this it must develop its "vision." A vision will ensure that the practice is leveraging its resources (and methodologies) to achieve the highest rate of success today and over time. This is not simply "administrating the environment" or "managing the projects"; it involves structured thought, best practices and a continued commitment to ongoing improvement. What is your vision?