Perficient Enterprise Information Solutions Blog

DevOps Considerations for Big Data

Big Data is on everyone's mind these days. Creating an analytical environment with Big Data technologies is exciting and complex: new technology, and new ways of looking at data that would otherwise remain dark or unavailable. The real challenge in implementing a Big Data solution, however, is making it production ready.

Once the enterprise comes to rely on the solution, dealing with typical production issues is a must. Expanding data lakes, adding applications that access and change the data, and deploying new statistical learning solutions can all hit overall platform performance. In the end, user experience and trust will suffer if the environment is not managed properly. Models that used to run in minutes may take hours or days as data changes and algorithm changes are deployed. Having the right DevOps process framework is important to the success of Big Data solutions.

In many organizations the Data Scientist reports to the business and not to IT. Knowing both the business and the technological requirements, and setting up the DevOps process accordingly, is key to making these solutions production ready.

Key DevOps Measures for Big Data environment:

  • Data acquisition performance (ingestion to creating a useful data set)
  • Model execution performance (Analytics creation)
  • Modeling platform / Tool performance
  • Software change impacts (upgrades and patches)
  • Development to Production –  Deployment Performance (Application changes)
  • Service SLA Performance (incidents, outages)
  • Security robustness / compliance

 

One of the top issues is Big Data security. How secure is the data, and who has access to and oversight of it? Putting together a governance framework to manage the data is vital for the overall health and compliance of Big Data solutions. Big Data is just getting traction, and many of the best practices for Big Data DevOps scenarios have yet to mature.

Virtualization – THE WHY?

 

The speed at which we receive information from multiple devices, and the ever-changing customer interactions that create new kinds of customer experience, generate DATA! Any company that knows how to harness that data and produce actionable information is going to make a big difference to its bottom line. So why virtualization? The simple answer is business agility.

As we build the new information infrastructure and tools for modern Enterprise Information Management, we have to adapt and change. Over the last 15 years, the Enterprise Data Warehouse has matured, with proper ETL frameworks and dimensional models.

With the new 'Internet of Things' (IoT), a lot more data is created and consumed from external sources. Cloud applications create data that may not be readily available for analysis, and not having that data available for analysis greatly diminishes the critical insights that can be produced.

Major Benefits of Virtualization

[Figure: major benefits of virtualization]

Additional considerations

  • Address the performance impact of Virtualization on the underlying applications, and manage overall refresh delays appropriately
  • It is not a replacement for Data Integration (ETL), but it is a quicker way to get data access in a controlled way
  • It may not include all the business rules, which means Data Quality issues may still be a concern

In conclusion, having a Virtualization tool in the Enterprise Data Management portfolio of products will add more agility to Data Management. However, use Virtualization appropriately to solve the right kind of problem, and not as a replacement for traditional ETL.

Data Virtualization can make IT look good!

Data Virtualization offers a unique opportunity for IT and Business to leverage this technology to cut down the development time for adding new sources of data. The providers of this technology include top software vendors like IBM and Microsoft (see the Forrester Wave), along with new entrant Cisco (which recently bought Composite). This is not a complete list; there are other players in this market.

Many BI tools offer connectivity to different types of data sources as part of their interface (think ODBC), but that falls more on the ETL side of the offering. Virtualization provides a way to hide the physical names and presents a common, canonical model for business users' consumption.

Use case

ETL development for adding new data sources to an Enterprise Data Warehouse (EDW) takes a long time, simply because of the rigor needed for loading and validating the data. Business users want this new data for analysis, or even just for cross-checking, as soon as possible. Data Virtualization tools make it possible to add new data sources with a much shorter turnaround, in days rather than weeks or months.

Benefits of Data Virtualization:

  • Buys time for IT: Provides an intermediate solution to the business while IT takes the time to build the data integration with proper controls.
  • Assess the value of the data: Business users can validate the usability and overall quality of the data and help define the business rules for data cleansing.
  • Seamless deployment: IT can change the sources of data underneath the logical layer, without any interruption to services, when the data is ready for full integration.

IT can leverage Data Virtualization to give power users quick access to the data they need without compromising control. After the trustworthiness of the data is established, a bigger rollout can follow. Putting proper access processes in place and letting IT manage the metadata (logical) layer is a good way to maintain oversight of usage. These processes give IT the control it needs to manage data sources and avoid operational nightmares.


Cloud BI use cases

Cloud BI comes in different shapes and forms, ranging from visualization alone to a full-blown EDW combined with visualization and Predictive Analytics. The truth of the matter is that every niche product vendor offers some unique feature that other product suites do not. In most cases you need more than one BI suite to meet all the needs of the enterprise.

Decentralization definitely helps the business achieve agility and respond to market challenges quickly. By the same token, that is how companies end up with silos of information across the enterprise.

Let us look at some scenarios where a cloud BI solution is very attractive for departmental use.

Time to Market

Getting the business case built and approved for big CapEx projects is a time-consuming proposition. Wait times for hardware, software, and IT involvement mean much longer delays in scheduling the project, not to mention the push-back to use the existing reports or to wait for the next release, which is allegedly forever just around the corner.

 

Deployment Delays

Business users have an immediate need for analysis and decision-making. The typical turnaround for IT to bring in new sources of data is anywhere from 90 to 180 days. This is an absolute killer for a business that wants the data now. Spreadsheets are still the top BI tool for just this reason. With Cloud BI (not just the tool), business users get not only the visualization and other product features but also data that is not otherwise available; customer analytics with social media analysis, for example, is available as a third-party BI solution. For such value-added analytics there is a real business reason to go with these solutions.

 

Tool Capabilities

Power users need ways to slice and dice the data, and need to integrate non-traditional sources (Excel, departmental cloud applications) to produce a combined analysis. Many BI tools come with lightweight integration (mostly push integration) to make this a reality without much of an IT bottleneck.

So if we can add new capability without much delay and within a departmental budget, where is the rub?

The issue is not looking at enterprise information in a holistic way. Though speed is critical, it is equally important to engage governance and IT to secure the information and share it appropriately so it can be integrated into the enterprise data asset.

As we move into a future of cloud-based solutions, we will be able to remove many of these bottlenecks, but we will also have to deal with the security, compliance, and risk-mitigation implications of leaving data in the cloud. Forging a strategy to meet the enterprise's various BI demands with proper governance will yield the optimal use of resources and the right solution mix.

Simple Cognos TM1 Backup Best Practices

How do you create a recoverable backup for a TM1 server instance (TM1 service)? What is best practice? Here is some advice.

Note: as with any guideline or recommendation, there will be circumstances that support deviating from accepted best practice. In these instances, it is recommended that all key stakeholders involved agree that:

  • The reason for deviation is reasonable and appropriate
  • The alternative approach or practice being implemented is reasonable and appropriate

Definition of a Backup

“In information technology, a backup, or the process of backing up, refers to the copying and archiving of computer data so it may be used to restore the original after a data loss event. The verb form is to back up in two words, whereas the noun is backup” (http://en.wikipedia.org/wiki/Backup).

To be clear, what I mean to refer to here is the creation of an archived copy or image of a specified Cognos TM1 server instance at a specified moment in time that can be used to completely restore that TM1 server to the state it was in when the archive was created.

Procedure

The following outlines the steps recommended for creating a valid backup:

  1. Verify the current size of the TM1 server logs and database folders. Note that the location of these folders is specified in the TM1s.cfg file; look for “DataBaseDirectory” and “LoggingDirectory”. Should you restore from this backup, you should compare these sizes to the size totals after you complete the restore.
  2. Verify that there is available disk space to perform compression of the server logs and database folders and to save the resulting compressed file(s).
  3. Verify that you have appropriate access rights to:
    1. Stop and start TM1 services
    2. Create, save and move files on the appropriate file systems and servers
  4. Notify all TM1 users that the server will be shut down at a specified time
  5. Login to TM1 as a TM1 Admin (preferably the Admin ID, not a client ID granted admin access).
  6. Verify that all TM1 users have exited. (One way to do this is to right-click on the TM1 server (in TM1 Server Explorer) and select Server Manager…).
  7. Deactivate (turn off) any active or scheduled TM1 chores (Note: it is important to verify that you have available, up-to-date documentation on chore schedules before deactivating so that you can restore the correct chore schedule after the backup is complete).
  8. Make sure that any software that may have access to the TM1 logs and database folders (for example, virus scanning or automated backups) is temporarily disabled or not scheduled to run during the period of time that you will be creating a backup to avoid the chance of file lock conflicts.
  9. Perform a TM1 SaveDataAll.
  10. Logout of TM1.
  11. Stop the machine service for the TM1 server instance. Note: be sure that the service is not configured to “auto start”. Some environments may have services configured to startup automatically after a period of down time. It is imperative that the TM1 service does not start while a backup is being created.
  12. Verify that the service has stopped.
  13. Using a simple text editor such as MS Windows notepad, open and review the TM1 server log to verify that the TM1 service did stop and no errors occurred during shutdown.
  14. Using certified compression software such as 7-Zip, create a compressed file of the TM1 server logs folder
  15. Using certified compression software such as 7-Zip, create a compressed file of the TM1 server database folder
  16. Rename the compressed files, typically adding a “_date” to the file name for later reference. For example “Forecasting_2014_09_11.zip”.
  17. Move the compressed files to a “work area” and verify that the files can be uncompressed.
  18. Move the compressed files to an area specified for archiving backups, typically one that is subject to an automated network backup. These files should be saved for an appropriate amount of time.
  19. Restart the machine service for the TM1 server instance.
  20. When the TM1 server is available again, login as a TM1 Admin verifying the server is accessible.
  21. Using a simple text editor such as MS Windows notepad, open and review the TM1 server log to verify that the TM1 service did start successfully and no errors occurred during startup.
  22. Reactivate the appropriate TM1 chores (based upon available documentation).
  23. Notify all TM1 users that the server is now available.

Conclusion

Certainly some of the above steps could be eliminated in the process of creating a backup; however, in an enterprise environment where business processes depend upon availability and correctness, it is highly recommended that the outlined steps become standard operating procedure for creating your Cognos TM1 backups.

Common sense, right? Let’s hope so.

Creating Transactional Searches with Splunk

Transactions refer to a "unit of work" or "grouped information" that is treated as a logical data point or singular target. Transactions are made up of multiple events or actions, and may mean something entirely different when looked at as a group than when examined one by one.

Using either Splunk Web or its command-line interface, you can search for and identify what are referred to as "related raw events" and group them into a single event, which you can then denote as a single Splunk transaction.

These events can be linked together by fields they have in common. In addition, transactions can be saved as transactional types for later reuse.

Your Splunk transactions can include:

  • Different events from the same source/same host.
  • Different events from different sources/same host.
  • Similar events from different hosts/different sources.

Some Conceptual Examples

To help understand the power of Splunk transactional searches, let's consider a few conceptual examples of its use (a sample search follows the list):

  • A certain server error triggers several events to be logged
  • All events that occur within a precise time frame
  • Events that share the same host or cookie value
  • Password change attempts that occurred near unsuccessful logins
  • All of the web addresses a particular IP address viewed over a time range
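
As a quick sketch of the "same host or cookie value" scenario above (assuming web access events that carry a JSESSIONID cookie field, and a ten-minute span chosen purely for illustration), the following groups each session's events into a single transaction:

sourcetype=access_* | transaction JSESSIONID maxspan=10m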

To use Splunk transactions, you can either call a transaction type (that you configured via the Splunk configuration file: transactiontypes.conf), or define transaction constraints within your search (by setting the search options of the transaction command).

Here is the transaction command syntax:

transaction [<field-list>] [name=<transaction-name>] <txn_definition-opt>* <memcontrol-opt>* <rendering-opt>*

A Splunk transaction is typically built from two key arguments: a field name (or a comma-delimited list of field names) and your name for the transaction, plus several other optional arguments.

Field Name/List

The field list is a string of one or more field names whose values you want Splunk to use for grouping events into transactions.

Transaction Name

This is the ID (name) by which your transaction will be referred to, or the name of a transaction type from transactiontypes.conf.

Optional Arguments

If other configuration arguments (such as maxspan) are provided in your Splunk search, they overrule the values of that parameter that is specified in the transaction definition (within the transactiontypes.conf file). If those parameters are not specified in the file, Splunk will use the default value.
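
As a hedged illustration (the stanza name tm1_errors and its values here are hypothetical), a transaction type defined in transactiontypes.conf might look like this:

[tm1_errors]
fields = date_month
maxspan = 5m

A search such as sourcetype=tm1* ERROR | transaction name=tm1_errors maxspan=90s would then use that definition, with the inline maxspan=90s overriding the 5m value from the file.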

Here is an example

A simple example of a Splunk transaction might be one that groups Cognos TM1 ERROR events appearing in a message log that have the same value for the field "date_month" (in other words, errors that occur in the same month), with a maximum span of 90 seconds:

sourcetype=tm1* ERROR | transaction date_month maxspan=90s

pj

As always, never stop learning…

Why MDM should be part of CRM Strategy

Let's face it: companies capture Customer Information in multiple stages through multiple systems, even in the Small / Medium Business (SMB) segment. Not all of the latest customer information is readily available to all applications. Applications need Customer Master Data to link their transactions, and every application has a different data model for capturing the Customer Master. Companies embarking on a CRM strategy have a unique opportunity to streamline Master Data creation and consumption across the enterprise.

CRM may capture the majority of customer-related details, but it is not designed to capture all the information needed to master the Customer Data. Nor does it prevent creating duplicate data or capturing incomplete data. Adding Master Data Management to CRM will not only improve Data Quality but also provide the framework for improving the overall customer experience by sharing Master Data with enterprise applications.

Forrester recommends:

"Before Modernization of CRM, Shore up your foundation."

Also according to Forrester,

"Modern CRM deployments only succeed if the foundations are solid. CRM deployments must be scalable, highly performant, and must support security protocols, data privacy requirements, and third-party credentials like the Payment Card Industry (PCI) and the Health Insurance Portability and Accountability Act (HIPAA) that matter in your industry. All actions must be logged and be auditable. Master Data Management and Data Governance policies must be put in place."

 

[Figure: cloud application adoption survey]

Focusing on customer experience and managing the end-to-end journey involve multiple applications besides CRM. Having a Master Data strategy provides the framework for sharing the most current customer information across applications at the enterprise level. Several companies are already looking at cloud options for their CRM needs (see the figure above): CRM is at the top of CFOs' lists for cloud applications, with 41% saying they are considering the move to cloud. A Master Data strategy is vital when implementing a new CRM that addresses all customer touch points in order to provide a superior customer experience.

 

Key Benefits of MDM strategy along with CRM:

  • Enables CRM to deeply personalize customer interactions
  • Enhances the user experience and the ability to interact through multiple devices and avenues (see Uber's customer experience through its seamless customer interface)
  • Improves the customer journey at every touch point (imagine having to repeat your account details each time your call gets transferred to another department)
  • Provides seamless integration and sharing of the best available Customer Data across enterprise applications

Sub-Searching – with Splunk

You’ll find that it is pretty typical to utilize the concept of sub-searching in Splunk.

A "sub search" is simply a "search within a search," or a search that uses another search as an argument. Sub searches in Splunk must be contained in square brackets and are evaluated first by the Splunk interpreter.

Think of a sub search as being similar to a SQL subquery (a subquery is a SQL query nested inside a larger query).

Sub searches are mainly used for three purposes:

  • Parametrization (of a search, using the output of another search)
  • Appending (running a separate search, but stitching the output to the first search using the Splunk append command).
  • Conditions (To create a conditional search where you only see results of your search if it meets the criteria or perhaps threshold of the sub-search).

Normally, you’ll use a sub-search to take the results of one search and use them in another search all in a single Splunk search pipeline. Because of how this works, the second search must be able to accept arguments; such as with the append command (as I’ve already mentioned).

Parametrization

sourcetype=TM1* ERROR [search earliest=-30d | top limit=1 date_mday | fields + date_mday]

The above Splunk search utilizes a sub search as a parametrized search of all TM1 logs indexed within a Splunk instance that have “error” events. The sub search (enclosed in the square brackets) filters the search first to the past 30 days and then to the day which had the most events.

Appending

The Splunk append command can be used to append the results of a sub-search to the results of a current search:

sourcetype=TM1* ERROR | stats dc(date_year), count by sourcetype | append [search sourcetype=TM1* | top 1 sourcetype by date_year]

The above Splunk search utilizes a sub search with an append command to combine two TM1 server log searches; these search through all indexed TM1 sources for "Error" events. The first search yields a count of events by TM1 source by year; the second (sub) search returns the top (or most active) TM1 source by year. The results of the two searches are then appended.

Conditional

sourcetype=access_* | where action="addtocart" | stats dc(clientip), count by method | append [search sourcetype=access_* action=addtocart | top 1 clientip by method]

The above Splunk search – which counts the number of different IP addresses that accessed a server and also finds the user who accessed the server the most for each type of page request (method) – is modified with a "where clause" to limit the counts to only those that are "addtocart" actions (in other words, which user added the most to their online shopping cart, whether they actually purchased anything or not).

Output Settings for Sub-searches

When performing Splunk sub searches you will often utilize the format command. This command takes the results of a sub-search and formats them into a single result.

Depending upon the search pipeline, the results returned may be numerous, which will impact the performance of your search. To remedy this you can change the number of results that the format command operates over in-line with your search by appending the following to the end of your sub-search:

| format maxresults=<integer>
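
For instance, a minimal sketch that reuses the earlier parametrized TM1 search and caps the formatted sub-search output at a single result might look like this:

sourcetype=TM1* ERROR [search earliest=-30d | top limit=1 date_mday | fields + date_mday | format maxresults=1]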

I recommend that you take a very conservative approach and utilize the Splunk limits.conf file to enforce limits for all your sub-searches. This file exists in the $SPLUNK_HOME/etc/system/default/ folder (for global settings); for localized control, you may find (or create) a copy in the $SPLUNK_HOME/etc/system/local/ folder.

The file controls all Splunk searches (providing it is coded correctly, based upon your environment) but also contains a section specific to Splunk sub-searches, titled “subsearch”.

Within this section, there are three important settings (a sample stanza follows the list):

  • maxout (the maximum number of results to return from a subsearch; the default is 100)
  • maxtime (the maximum number of seconds to run a subsearch before finalizing; the default is 60)
  • ttl (the time, in seconds, to cache a given subsearch's results; the default is 300)
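
A minimal sketch of that stanza, using the default values listed above (tighten them as appropriate for your environment), placed in $SPLUNK_HOME/etc/system/local/limits.conf:

[subsearch]
maxout = 100
maxtime = 60
ttl = 300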

Splunk On!

 

Governance imperatives for Cloud Data

Today's cloud solutions range from platform services to full-blown ERP holding the keys to enterprise information. It is very easy to get carried away and forget the effort that went into creating the enterprise data and existing analytical capabilities. Eliminating the cost of managing hardware, software, and other operational expenses is only part of the story. Any company that does not take into account its exposure and the amount of data being transferred to cloud applications is making a serious mistake.

I learned from my CIO, during my tenure as data services manager for a hospital, how important it is to read the fine print when signing a contract with a hosting vendor. In healthcare especially, it is typical to have third-party hosting and a handful of employees managing the application. The same rigor applies to cloud applications: with vendor consolidations and without a proper SLA and contract, the impact to the business can be very high. The data implications are enormous if disruptions occur, and migrating from one application to another is not trivial, especially if IT is not in the loop.

Regular audits and compensation for exposure are often not negotiated upfront, causing overall concerns. As per Gartner, "Concerns about the risk ramifications of cloud computing are increasingly motivating security, continuity, recovery, privacy and compliance managers to participate in the buying process led by IT procurement professionals."

Here are four areas Information Governance should consider for cloud applications, especially if they are part of the enterprise strategy.

Security

  • Is security comparable to what you might have in your Enterprise?
  • What is your exposure if there is a security breach?
  • Do you have an SLA and expectations spelled out in your contract?

Accessibility

  • How granular is access to information?
  • What level of controls can you put in place to meet regulatory and other expectations?
  • Business continuity – plans, measures, SLAs

Enterprise Integration

  • How well is the data integrated with enterprise data?
  • What is the backup/recovery plan and access to your data?

Risk Management

  • Who is involved in selecting and benchmarking the cloud applications?
  • What is the process for signing up for cloud services? Is it departmental, or is the enterprise involved?
  • Are overall SLAs and a governance process defined for the cloud application?
  • What is the financial strength of the cloud vendor?

Governing the Cloud Analytics…

The new trend in the analytics world, Cloud Analytics, is slowly becoming the norm. Except for the cloud tag, companies have used cloud or external analytics for a long time: historically, Campaign Management has been partly outsourced and partly managed by Marketing, using external data alongside 'Enterprise Data'. Traditional data vendors and credit-score providers like D&B are expanding into Cloud BI (the acquisition of Indicee) to diversify their offerings. IBM's acquisition of Silverpop puts it on par with data vendors offering Campaign Management solutions, not to mention the pre-packaged analytics solutions it offers in this space.

All of these new tools and offerings make it easy for business users to adopt these solutions, bypassing IT. The challenge remains: how do we deal with this fast-changing norm for enterprise data? Artificially restricting and holding back the trend is not only impossible but will also put the company at a competitive disadvantage. Cloud Analytics is agile and reduces the time to information access, and cloud deployments are much faster than traditional EDW/BI, which mostly delivers 'what happened?' types of data. According to a report from Aberdeen, 'Large Enterprises utilizing Cloud Analytics obtain pertinent information 13% more often than all other large Enterprises', a significant capability advantage over the competition. (See the Aberdeen report: Cloud Analytics for the Large Enterprise: Fast Value, Pervasive Impact.)

Preparing to deal with this changing analytics trend is a must for IT and the enterprise as a whole. Data silos have been an issue in the past, are an issue at present, and will be an issue in the future. Creating an environment that can handle the overall information explosion is key to the survival of the company. Partnership between IT and the business will strengthen Information Governance, helping manage information-based decisions and leverage marketplace offerings wisely. Adding more sources of data in silos will restrict overall information use, but not allowing experimentation will also put the company at a disadvantage.

Having visibility into overall enterprise information, including the cloud, is a critical factor for successful Information Governance. Companies should forge strategies that put a framework in place to manage information in a rapidly changing environment, with the right people, processes, and technologies at the regional and enterprise level. Information Governance should create a balanced environment by providing appropriate oversight while fostering innovation in leveraging new offerings.