Perficient Enterprise Information Solutions Blog

Posts Tagged ‘analytics’

DevOps Considerations for Big Data

Big Data is on everyone’s mind these days. Creating an analytical environment involving Big Data technologies is exciting and complex. It brings new technology and new ways of looking at data that would otherwise remain dark or unavailable. The exciting part of implementing a Big Data solution is making it production ready.

Once the enterprise comes to rely on the solution, dealing with typical production issues is a must. Expanding the data lakes, and having multiple applications that access and change data and deploy new statistical learning solutions, can hurt overall platform performance. In the end, user experience and trust will become an issue if the environment is not managed properly. Models that used to run in minutes may take hours or days, depending on the data changes and algorithm changes deployed. Having the right DevOps process framework is important to the success of Big Data solutions.

In many organizations the Data Scientist reports to the business and not to IT. Knowing the business and technological requirements and setting up the DevOps process is key to making the solutions production ready.

Key DevOps measures for a Big Data environment:

  • Data acquisition performance (ingestion to creating a useful data set)
  • Model execution performance (Analytics creation)
  • Modeling platform / Tool performance
  • Software change impacts (upgrades and patches)
  • Development to Production –  Deployment Performance (Application changes)
  • Service SLA Performance (incidents, outages)
  • Security robustness / compliance
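To make the "model execution performance" measure concrete, here is a minimal sketch of how a team might flag jobs whose runtime has drifted from minutes toward hours. It assumes runtimes are already being logged somewhere; the function name, the job names, the sample numbers and the 2x threshold are all hypothetical and chosen only for illustration.

```python
from statistics import median


def flag_runtime_regressions(history, latest, threshold=2.0):
    """Return the jobs whose latest runtime exceeds threshold x their median past runtime.

    history: dict mapping job name -> list of past runtimes (minutes)
    latest:  dict mapping job name -> most recent runtime (minutes)
    """
    flagged = []
    for job, runtime in latest.items():
        past = history.get(job, [])
        if not past:
            continue  # no baseline yet, so nothing to compare against
        baseline = median(past)
        if runtime > threshold * baseline:
            flagged.append(job)
    return sorted(flagged)


# Hypothetical runtimes in minutes, for illustration only.
history = {
    "churn_model": [4.0, 5.0, 6.0],
    "ingest_clickstream": [30.0, 32.0, 31.0],
}
latest = {"churn_model": 95.0, "ingest_clickstream": 33.0}

print(flag_runtime_regressions(history, latest))
```

The median is used as the baseline so that a single unusually slow run in the history does not hide a later regression. The same pattern could be applied to the ingestion and deployment measures in the list.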

 

One of the top issues is Big Data security. How secure is the data, and who has access to and oversight of it? Putting together a governance framework to manage the data is vital for the overall health and compliance of Big Data solutions. Big Data is just gaining traction, and many of the best practices for Big Data DevOps scenarios have yet to mature.

Virtualization – THE WHY?

 

The speed at which we receive information from multiple devices, and the ever-changing customer interactions that provide new kinds of customer experience, create DATA! Any company that knows how to harness the data and produce actionable information is going to make a big difference to its bottom line. So why Virtualization? The simple answer is Business Agility.

As we build the new information infrastructure and the tools for modern Enterprise Information Management, one has to adapt and change. Over the last 15 years, the Enterprise Data Warehouse has matured, with proper ETL frameworks and dimensional models.

With the new ‘Internet of Things’ (IoT), a lot more data is created and consumed from external sources. Cloud applications create data which may not be readily available for analysis. Not having that data for analysis can greatly change the critical insights that come out of it.

Major Benefits of Virtualization

[Figure: Virtualization benefits]

Additional considerations

  • Address the performance impact of Virtualization on the underlying application, and the overall refresh delays, appropriately
  • It is not a replacement for Data Integration (ETL), but it is a quicker way to get data access in a controlled way
  • It may not include all the business rules, which means Data Quality may still be an issue

In conclusion, having a Virtualization tool in the Enterprise Data Management portfolio of products will add more agility in Data Management. However, use Virtualization appropriately to solve the right kind of problem, and not as a replacement for traditional ETL.

Cloud BI use cases

Cloud BI comes in different forms and shapes, ranging from just visualization to a full-blown EDW combined with visualization and Predictive Analytics. The truth of the matter is that every niche product vendor offers some unique feature which other product suites do not offer. In most cases you need more than one BI suite to meet all the needs of the Enterprise.

De-centralization definitely helps the business achieve agility and respond to market challenges quickly. By the same token, that is how companies may end up with silos of information across the enterprise.

Let us look at some scenarios where a cloud BI solution is very attractive for departmental use.

Time to Market

Getting the business case built and approved for big CapEx projects is a time-consuming proposition. Wait times for HW/SW and IT involvement mean much longer delays in scheduling the project. Not to mention the push back to use the existing reports, or to wait for the next release, which is allegedly always just around the corner.

 

Deployment Delays

Business users have an immediate need for analysis and decision-making. The typical turnaround for IT to bring in new sources of data is anywhere between 90 and 180 days. This is an absolute killer for a business which wants the data now for analysis. Spreadsheets are still the top BI tool for just this reason. With Cloud BI (not just the tool), business users get not only the visualization and other product features but also data which is not otherwise available. Customer analytics with social media analysis is available as a third-party BI solution. In the case of value-added analytics, there is a business reason to go for these solutions.

 

Tool Capabilities

Power users need ways to slice and dice the data, and need to integrate other non-traditional sources (Excel, departmental cloud applications) to produce a combined analysis. Many BI tools come with lightweight integration (mostly push integration) to make this a reality without too much of an IT bottleneck.

So if we can add new capability without much delay and within the departmental budget, where is the rub?

The issue is not looking at the Enterprise Information in a holistic way. Though speed is critical, it is equally important to engage Governance and IT to secure the information, share it appropriately, and integrate it into the Enterprise Data Asset.

As we move into the future of Cloud-based solutions, we will be able to solve many of the bottlenecks, but we will also have to deal with the security, compliance and risk mitigation management of leaving the data in the cloud. Forging a strategy to meet the various BI demands of the enterprise with proper Governance will yield the optimum use of resources and solution mix.

KScope14 Session: Empower Mobile Restaurant Operations Analytics

Perficient is exhibiting and presenting this week at KScope14 in Seattle, WA. On Monday, June 23 I presented my retail-focused solution offering built upon the success of Perficient’s Retail Pathways, but using the Oracle suite of products. In order to focus the discussion to fit within a one hour window I chose restaurant operations to represent the solution.

Here is the abstract for my presentation.

Multi-unit, multi-concept restaurant companies face challenging reporting requirements. How should they compare promotion, holiday, and labor performance data across concepts? How should they maximize fraud detection capabilities? How should they arm restaurant operators with the data they need to react to changes affecting day-to-day operations as well as over-time goals? An industry-leading data model, integrated metadata, and prebuilt reports and dashboards deliver the answers to these questions and more. Deliver relevant, actionable mobile analytics for the restaurant industry with an integrated solution of Oracle Business Intelligence and Oracle Endeca Information Discovery.

We have tentatively chosen to brand this offering as Crave – Designed by Perficient. Powered by Oracle. This way we can differentiate this new Oracle-based offering from the current Retail Pathways offering.

Read the rest of this post »

SAP HANA – A ‘Big Data’ Enabler

Some interesting facts and figures for your consideration:

  • 90% – of stored data in the world today was created in the past 2 years
  • 50% – annual data growth rate
  • 34,000 – tweets sent each minute
  • 9,000,000 – daily Amazon orders
  • 7,000,000,000 – daily Google Page Views
  • 2.5 exabytes – amount of data created every day (an exabyte is 1,000,000,000,000,000,000 B = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes)
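The unit conversions in the last bullet can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) units where each step up is a factor of 1,000; the constant and function names are my own, for illustration only.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18


def convert(num_bytes, unit):
    """Express a byte count in the given unit (integer division)."""
    return num_bytes // unit


# 2.5 exabytes per day, kept as an integer number of bytes.
daily_bytes = 25 * EXABYTE // 10

print(convert(EXABYTE, PETABYTE))      # petabytes in one exabyte
print(convert(EXABYTE, TERABYTE))      # terabytes in one exabyte
print(convert(EXABYTE, GIGABYTE))      # gigabytes in one exabyte
print(convert(daily_bytes, TERABYTE))  # terabytes created per day
```

Running this confirms the figures in the bullet: 1,000 petabytes, 1 million terabytes and 1 billion gigabytes per exabyte, so 2.5 exabytes a day works out to 2.5 million terabytes.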

Looking at these numbers it is easy to see why more and more technology vendors want to provide solutions to ‘Big Data’ problems.

In my previous blog, I mentioned how we’ll soon get to a place where it will be more expensive for a company not to store data than to store data – some pundits claim that we’ve already reached this pivotal point.

Either way, it would be greatly beneficial to come to terms with at least some of those technologies that have made a substantial investment in the Big Data space.

One such technology is SAP HANA – a Big Data enabler. I am sure that some of you have heard this name before… but what is SAP HANA exactly?

The acronym H.AN.A. in ‘SAP HANA’ stands for High-performance ANalytical Appliance. If I went beyond the name/acronym and described SAP HANA in one sentence, I would say that SAP HANA is a database on steroids, perfectly capable of handling Big Data in-memory, and one of the few in-memory computing technologies that can be used as an enabler of Big Data Solutions.

Dr. Berg and Ms. Silvia – both SAP HANA gurus – provide a comprehensive and accurate definition of SAP HANA:

“SAP HANA is a flexible, data-source-agnostic toolset (meaning it does not care where the data comes from) that allows you to hold and analyze massive volumes of data in real time, without the need to aggregate or create highly complex physical data models. The SAP HANA in-memory database solution is a combination of hardware and software that optimizes row-based, column-based, and object-based database technologies to exploit parallel processing capabilities. We want to say the key part again: SAP HANA is a database. The overall solution requires special hardware and includes software and applications – but at its heart, SAP HANA is a database”.

Or as I put it, SAP HANA is a database on steroids… but with no side-effects, of course. Most importantly though, SAP HANA is a ‘Big Data Enabler’, capable of:

  • Conducting Massively Parallel Processing (MPP), handling up to 100TB of data in-memory
  • Providing a 360 degree view of any organization
  • Safeguarding the integrity of the data by reducing or eliminating data migrations, transformations, and extracts across a variety of environments
  • Ensuring overall governance of key system points, measures and metrics

All with very large amounts of data, in-memory and in real time… could this be a good fit for your company? Or, if you are already using SAP HANA, I’d love to hear from you and see how you have implemented this great technology and what benefits you’ve seen working with it.

My next blog post will focus on SAP HANA’s harmonious, or almost harmonious, co-existence with Hadoop…

QlikView… QlikTech… Qlik…

Several years ago, when I started using QlikView (QlikTech’s flagship product), I had a strong preference for more traditional BI tools and platforms, mostly because I thought that QlikView was just a visualization tool. But after some first-hand experience with the tool, any bias I had quickly dissipated. I’ve been a QlikView fan, and have filled the role of Senior QlikView Architect on full lifecycle projects, for a while now.

Today, Qlik Technologies (also known as QlikTech or simply Qlik) is the 3rd fastest growing tech company in the US (according to a Forbes article), but my personal journey with QlikView, and probably QlikTech’s journey as well, has not always been easy – a paradigm shift in the way we look at BI is required. Most importantly, I understood, along with many others, that this isn’t a matter of QlikView or SAP BI, of QlikView’s Agile approach to BI or Traditional BI – it is NOT a matter of ORs, but rather a matter of ANDs.

It is a matter of striking the right balance with the right technology mix and doing what is best for your organization, setting aside personal preferences. At times QlikView may be all that is needed. In other cases, the right technology mix is a must. At times ‘self-service’ and ‘agile’ BI is the answer… and at times it isn’t. Ultimately, it all revolves around the real needs of your organization and creating the right partnerships.

So far, QlikTech has been able to create a pretty healthy ecosystem with many technology partners, from a wide variety of industries and with a global reach. QlikTech has been able to evolve over time and has continued to understand, act on and metabolize the needs of the market, along with the needs of end-users and IT – I wonder what’s next.

That’s one of the reasons why Qlik has been able to trail-blaze a new approach to BI: user-driven BI, i.e. Business Discovery. According to Gartner, ‘Qlik’s QlikView product has become a market leader with its capabilities in data discovery, a segment of the BI platform market that it pioneered.’

Gartner defines QlikView as ‘a self-contained BI platform, based on an in-memory associative search engine and a growing set of information access and query connectors, with a set of tightly integrated BI capabilities’. This is a great definition that highlights a few key points of this tool.

In coming blogs, we’ll explore some additional traits of QlikTech and its flagship product QlikView, such as:

  • An ecosystem of partnerships – QlikTech has been able to create partnerships with several Technology Partners and set in place a worldwide community of devotees and gurus
  • Mobility – QlikView was recently named ‘Hot Vendor’ for mobile Business Intelligence and ranks highest in customer assurance (see WSJ article here) with one of the best TCO and ROI
  • Cloud – QlikView has been selected as a cloud-based solution by several companies and it has also created strong partnerships with leading technologies in Cloud Computing, such as Amazon EC2 and Microsoft Azure
  • Security – provided at the document, row and field levels, as well as at the system level utilizing industry standard technologies such as encryption, access control mechanisms, and authentication methods
  • Social Business Discovery – Co-create, co-author and share apps in real time, share analysis with bookmarks, discuss and record observations in context
  • Big Data – Qlik has established partnerships with Cloudera and Hortonworks. In addition, according to the Wall Street Journal, QlikView ranks number one in BI and Analytics offering in Healthcare (see WSJ article here), mostly in connection with healthcare providers seeking “alternatives to traditional software solutions that take too long to solve their Big Data problems”

 

In future posts, I am going to examine and dissect each of these traits and more! I am also going to make sure we have some reality checks set in place in order to draw the line between fact and fiction.

What other agile BI or visualization topics would you like to read about or what questions do you have? Please leave comments and we’ll get started.

Three Attributes of an Agile BI System

In an earlier blog post I wrote that Agile BI was much more than just applying agile SDLC processes to traditional BI systems.  That is, Agile BI systems need to support business agility.   To support business agility, BI systems should address three main attributes:

  1. Usable and Extensible – In a recent TDWI webinar on business enablement, Claudia Imhoff said “Nothing is more agile than a business user creating their own report.” I could not agree more with Ms. Imhoff’s comments. Actually, I would go farther. Today’s BI tools allow users to create and publish all types of BI content, like dashboards and scorecards. They allow power users to conduct analysis and then storyboard, annotate, and interpret the results. Agile BI systems allow power users to publish content to portals, web browsers, and mobile devices. Finally, Agile BI systems do not confine users to data published in a data warehouse, but allow users to augment IT-published data with “user” data contained in spreadsheets and text files.  Read the rest of this post »

Top 5 Best Practices for an Actionable BI Strategy

In an earlier blog post, I pointed out that a number of companies complete a BI Strategy only to shelve it shortly after its completion. One main reason is that companies let their BI Strategy atrophy by not maintaining it; however, the other main cause of shelving a BI Strategy is that it was not actionable. That is, the BI Strategy did not result in a roadmap that would be funded and supported by the organization. As you formulate your BI Strategy, there are 5 best practices that will help result in a BI Strategy that is actionable and supported by your business stakeholders. These best practices are:

1. Address the Elephants in the Room – Many times, if management consultants are brought in to help with a BI Strategy, their objectivity is needed to resolve one or more disagreements within an organization. For example, the disagreement could be over a DW platform selection, the architectural approach for data integration, or the determination of business priorities for the BI program. The BI Strategy needs to resolve these issues or they will continue to fester within the organization, eventually undermining support for the BI Strategy.    Read the rest of this post »

Strategic thoughts when choosing new big data storage technology

Before 2000, the primary challenge for companies was to enable their systems to capture transactional data faster for organizational productivity. Now the focus has shifted toward delivering information to business users through reporting, analytical systems, actionable drill-down dashboards and so on. That information is stored as files, data, and audio and video streams on proprietary clustered and open source file systems, chosen according to business need and suitability.

Organizations have used storage technology for decades to store information on clustered file systems, which are usually (though not always) mounted on multiple servers, but the complexity of the underlying storage environment increases as new servers and systems are added for scalability.

Now, organizations that want to monetize, better analyze and capitalize on their information channels, and integrate them with the business, are facing challenges that depend on the big data storage they opted for in the past: large-scale data indexing, and on-demand availability with low latency. So some of them are choosing, changing to, or integrating with better enterprise-class large-scale data storage through some connector technology, in order to capture and monetize the value of their information in less time.

We should know the technology, protocols, and network challenges when thinking of adopting new big data storage and its features. There are a few architectural approaches to how clustering works in such scenarios.

Shared Disk: It uses a storage area network (SAN) at the block level. Here again there are a few approaches: some distribute information across all the servers in the cluster, and some employ a centralized metadata server. SGI CXFS, IBM GPFS, DataPlow, Microsoft CSV, Oracle CFS, Red Hat GFS, Sun QFS, VMware VMFS, Ceph etc. are the most widely used cluster file systems.

Distributed file system: It uses a network protocol. Lustre’s data storage technology is very popular here; Ceph also offers this, and Microsoft has DFS too.

NAS: It uses a file-based protocol, i.e. NFS or SMB (Server Message Block)/CIFS (Common Internet File System).

Shared Nothing Architecture: Each storage node communicates changes to the other nodes or to a master for replication. Ceph, Lustre and Hadoop are a few implementations of this.

Selecting the most suitable technology reduces time to solution and keeps budgets under control as well. So, based on the architectures above, let us list the general and most critical SLAs to expect from the solutions.

Common selection parameters for big data storage technology

  • High availability
  • Scalability of data storage with fixed IT budgets
  • Fault tolerance
  • Cost of ownership, commodity hardware
  • Global workload sharing
  • MapReduce algorithm support on high bandwidth
  • Reduced time to solution
  • Centralized storage management
  • Support for a wide range of hardware and software
  • High application I/O support for analytic systems
  • Event stream processor/storage
  • Caching of data for better performance
  • Holistic network design (Unified Ethernet Fabric)
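One way to turn a list of selection parameters like this into a comparison is a simple weighted scoring matrix. The sketch below is only an illustration of that idea: the criteria subset, the weights, the 1-to-5 scores and the candidate names (`option_a`, `option_b`) are all hypothetical and do not describe any real product.

```python
def weighted_score(scores, weights):
    """Weighted average of a candidate's scores; criteria missing from scores count as 0."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores.get(c, 0) for c in weights) / total_weight


# Hypothetical weights: how much each criterion matters to this organization.
weights = {
    "high_availability": 3,
    "fault_tolerance": 3,
    "cost_of_ownership": 2,
    "low_latency": 1,
}

# Hypothetical 1-5 scores for two made-up candidates.
candidates = {
    "option_a": {"high_availability": 4, "fault_tolerance": 4,
                 "cost_of_ownership": 5, "low_latency": 2},
    "option_b": {"high_availability": 5, "fault_tolerance": 5,
                 "cost_of_ownership": 2, "low_latency": 4},
}

# Rank candidates from highest to lowest weighted score.
ranked = sorted(
    ((name, weighted_score(scores, weights)) for name, scores in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The weights are where the business requirements and budget enter the decision: an organization that cares most about cost would raise `cost_of_ownership`, and the ranking could flip.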

List of Big Data storage technologies available in the market

The technologies listed below claim to support most of these features. The exception is Unified Ethernet Fabric, which is a separate case; Cisco has network-related offerings for scaling out big data storage.

HDFS

It is the de facto solution in big data technology for large-scale data processing over clusters of commodity hardware, and it is very suitable for that purpose. However, if you are trying to process dynamic datasets (data in motion), ad-hoc analysis or graph data structures, please stop and read about Google’s better alternatives to the MapReduce paradigm (Percolator, Dremel and Pregel). Cassandra and enterprise versions of HDFS are trying to provide improvements and solutions in this area.

GPFS

It has been available in the market since 1993 and thousands of organizations are using it (pharmaceuticals, financial institutions, life sciences, US national weather forecasting, the energy sector etc.). It runs on commodity hardware as well and supports many operating systems and platforms. It claims to work with low-latency ad-hoc analysis and streaming data at very high volumes. It is a proprietary offering from IBM and, of course, one of the very suitable big data storage options if licensing is not a concern. Cluster manager failure, file system manager failure, secondary cluster configuration server failure and rack failure are claimed to be addressed with GPFS SNC.

Lustre Distributed File System

This is a well-recognized, scalable cluster computing file storage system that is widely used by supercomputers, and it has open licensing. There are many commercial suppliers that bundle it with hardware, like NetApp and Dell. It also claims to fulfill all of the requirements listed above, including low latency for analytical systems.

Isilon’s OneFS by EMC

This is a major entrant in the big data storage arena, where companies like Oracle and IBM are also taming the big data beast. EMC re-engineered HDFS and created its own version of data storage later in 2011. It uses the MapR File System. The MapR File System claims to be an alternative file system to HDFS with full random-access read/write capability. Snapshots and mirrors are advanced features, and it addresses the single point of failure that the NameNode’s centralized metadata creates in HDFS.

NetApp’s RAID array on HDFS.

NetApp claims its improvements make HDFS faster and more reliable, but the solution still relies on HDFS.

Cleversafe’s Dispersed Computation file system is highly scalable object-based storage.

Appistry and KosmosFS are a few more computational big data storage options.

Conclusion

In order to monetize information and present actionable insights to the business for provocative organizational decision making, analytic systems rely heavily on data storage technologies, which drive how quickly and how frequently data is made available to front-end middleware applications. As we know, GFS-based HDFS is cheaper and rock solid, but for scalability beyond petabytes, perhaps the enterprise-class solution is one from EMC or IBM (GPFS), or one of the many others available in the market.

But remember that many commercial offerings don’t run on commodity hardware, and the cost advantage of HDFS and related bundles is fundamental to its current success and growing popularity. The low-latency issue with HDFS can be addressed with the right design and implementation by skilled big data technical experts. It is the organization’s choice whether it wants a pre-bundled commercial offering or open source HDFS, which can be customized flexibly to organizational needs; it all depends on the business requirements and the investment budget for the solution.

A Data Mining Workbench

Data mining provides organizations with a clearer view of current conditions as well as deeper insight into the future.

In many previous posts, I’ve talked about IBM SPSS Statistics and Modeler. Here is some basic information:

IBM® SPSS® Modeler Professional is the data mining workbench for the analysis of structured numerical data to model outcomes and to make predictions that inform business decisions with predictive intelligence.

Some Highlights Include:

  • Easily access, prepare and model structured data with this intuitive, visual data mining workbench.
  • Expand the benefits of business analytics through integration with IBM® Cognos® Business Intelligence software and IBM® InfoSphere® Warehouse.
  • Analyze data stored in legacy systems with IBM Classic Federation Server and IBM® zDB2® support.
  • Rapidly build and validate models using the most advanced statistical and machine-learning techniques available.
  • Efficiently deploy insight and predictive models.
  • Add more deployment options through zLinux, SuSE Linux Enterprise Server, and inclusion in IBM Smart Analytics System for Power.

Download the Fact Sheet for Modeler here:

IBM SPSS Modeler Professional