I hope everyone has found my articles useful and has used them to create some great-looking SSRS reports. This week we will take a look at creating a product breakdown report, or detailed sales report.
Let us refresh our memory. Our sales table has the following fields:
| Field | Description | Data Type |
| --- | --- | --- |
| ID | Identifier (primary key) | Number |
| Product_Type | Product category, such as candles, hand sanitizers, perfumes, etc. | Varchar2 |
| Product Detail | Fragrance of the product, such as Strawberry, Vanilla, Eucalyptus, etc. | Varchar2 |
| In Store | Timestamp of the product's arrival in the store | Date |
| Sold | Timestamp of when the product was sold | Date |
Imagine a scenario in which a manager wants a count of every product sold, broken down by product type. We can generate such a report by creating a tabular report in SSRS and driving it with SQL's GROUP BY clause, as sketched below.
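Here is a minimal sketch of the query that could drive such a report, assuming the table is named SALES and its columns follow the field list above (all names here are illustrative):

```sql
-- Count of items sold per product type (illustrative table and column names)
SELECT product_type,
       COUNT(*) AS units_sold
  FROM sales
 WHERE sold IS NOT NULL      -- only count items that have actually been sold
 GROUP BY product_type
 ORDER BY product_type;
```

Bound to a dataset in SSRS, each row of this result set becomes one row of the tabular report.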
If you missed my article on creating tabular reports, don't fret! This article will take you through all the steps required to create a great-looking sales report using tables in SSRS, and if you have already built one, it will serve as a great refresher course.
Once you have found your way to the Splunk.com website and downloaded your installation file, you can initiate the installation process. At this point you should also have received the "Thank You for Downloading" welcome email.
More than just a sales promotion, this email gives you valuable information about the limitations of your free Splunk Enterprise license, as well as links to help you get started quickly.
The (MS Windows) Installation
On MS Windows, once your download is complete, you are prompted to Run the installer.
So you are ready to Splunk and you want to get started? Well..
Taking the First Step
Your first step, before you download any installation packages, is to review the Splunk Software License Agreement, which you can find at splunk.com/view/SP-CASSSFA (and if you don't check it there, the Splunk install drops a copy for you in the installation folder, in both .RTF and .TXT formats). Although you can download a free, full-featured copy of Splunk Enterprise, the agreement governs its installation and use, and it is incumbent upon you to at least be aware of the rules.
Next, as with any software installation, you must make time to review your hardware to make sure that you can run Splunk in a way that meets your objectives. Although Splunk is a highly optimized application, if you are planning to evaluate Splunk for eventual production deployment, you should use hardware typical of the environment you intend to deploy to. In fact, the hardware you use for your evaluation should meet or exceed the recommended hardware capacity specifications for the tool and your intentions (check the Splunk.com website or talk to a Splunk professional to be sure what these are).
Disk Space Needs
Beyond the physical footprint of the Splunk software (which is minimal), you will need some Splunk "operational space". When you read data into Splunk, it creates a compressed/indexed version of that "raw data", and this file is typically about 10% of the size of the original data. In addition, Splunk will then create index files that "point" to the compressed file. These associated "index files" can range in size (from approximately 10% to 110% of the rawdata file) based on the number of unique terms in the data. Again, rather than get into sizing specifics here, just note that if your goal is "education and exploration", just go ahead and install Splunk on your local machine or laptop; it'll be just fine.
Most organizations today run a combination of both physical and virtual machines. Without getting into specifics here, it is safe to say that Splunk runs well on both; however, as with most software, it is important that you understand the needs of the software and make sure that your machine(s) are configured appropriately. The Splunk documentation reports:
“If you run Splunk in a virtual machine (VM) on any platform, performance does degrade. This is because virtualization works by abstracting the hardware on a system into resource pools from which VMs defined on the system draw as needed. Splunk needs sustained access to a number of resources, particularly disk I/O, for indexing operations. Running Splunk in a VM or alongside other VMs can cause reduced indexing performance”.
Let’s get the software!
Splunk Enterprise (version 6.0.2 as of this writing) can run on both MS Windows and Linux, but for this discussion I'm going to focus only on the Windows version. Splunk is available in both 32-bit and 64-bit architectures, and it is always advisable to check the product details to see which version is correct for your needs.
Assuming that you are installing for the first time (not upgrading), you can download the installation file (an .msi file for Windows) from the company website (www.splunk.com). I recommend that you read through the release notes for the version that you intend to install before downloading. Release notes list the known issues along with potential workarounds, and being familiar with this information can save you plenty of time later.
[Note: If you are upgrading Splunk Enterprise, you need to visit the Splunk website for specific instructions before proceeding.]
Get a Splunk.com Account
To actually download (any) version of Splunk, you need to have a Splunk account (and user name). Earlier, I mentioned the idea of setting up an account that you can use for educational purposes and support. If you have visited the website and established your account, you are ready; if not, you need to set one up now.
Once you have an account, you can click on the big, green button labeled “Free Download”. From there, you will be directed to the “Download Splunk Enterprise” page, where you can click on the link of the Splunk version you want to install.
From there, you will be redirected to the "Thank You for downloading…" page and be prompted to save the download to your location.
And you are on your way!
Check back and I’ll walk you through a typical MS Windows install (along with some helpful hints that I learned during my journey to Splunk Nirvana)!
As my 1st grade teacher would tell me whenever I ended a sentence with this preposition: "It's between the 'A' and the 'T'." Well, in this situation, it's between the "cloud" and the "on premise".
More and more companies are starting to explore and use Infrastructure as a Service (IaaS) as a viable option for developing and maintaining their data warehouse. There are many companies on the market that provide IaaS, like Amazon, AT&T, and Bluelock, to name only a few. We see this market taking off almost exponentially because providers are offering companies environments that are safe, secure, fast, redundant, and cheap. Also, without a doubt, many companies are already using Software as a Service (SaaS), where much of their data is also stored in the cloud (Salesforce, Workday, Facebook, Twitter, etc.).
Although much of a company's data is being relocated to and used in the cloud, there is a lot that is still on premise (On-Prem) and for all practical purposes will remain there. According to Chris Howard, managing vice president at Gartner, "Hybrid IT is the new IT and it is here to stay. While the cloud market matures, IT organizations must adopt a hybrid IT strategy that not only builds internal clouds to house critical IT services and compete with public CSPs, but also utilizes the external cloud to house noncritical IT services and data, augment internal capacity, and increase IT agility."
The issue now becomes: how do I manage a data environment that is both in the Cloud and On-Prem? And how do I keep the information in sync and current so that I can use the data where appropriate to make better business decisions?
Several software vendors on the market realized that this is something that needs to be addressed quickly (short of manual coding), and they now provide solutions in this area. Right now, Informatica is the market leader in data integration, and it also has solutions that easily manage the issues of a hybrid data environment (Cloud and On-Prem). Informatica has been recognized by ChannelWeb as the pioneer of Cloud data integration and by salesforce.com customers as the #1 integration application on AppExchange for the past 5 years.
So what makes managing data in both the Cloud and On-Prem so easy? From what I have seen with this product, since Informatica already offers connectivity to just about everything (well, maybe everything), it applies the same logic and thought process to extend the concept of data integration to everything in the Cloud. This concept includes data synchronization, data quality, Master Data Management, etc. Informatica has created connectors to many of the SaaS applications in the cloud, so a user of this solution does not need to hand-code anything to quickly connect and start using the service. Plus, if a person already knows how to use any of Informatica's On-Prem solutions (like PowerCenter, DQ, MDM, etc.), there is little to no learning curve in applying this knowledge to the Cloud solution.
With Informatica’s concept of VIBE (virtual data machine), a person can map once and deploy anywhere. What this means is that a developer can create data mappings in PowerCenter with the On-Prem solution and then run the mappings in the Cloud solution. These solutions can also be created directly in the Cloud product and then run On-Prem if needed.
So let's take a look at the architecture of the Informatica Cloud solution. The main point about how this works is that the company's data does not pass through Informatica's environment in the cloud to reach any destination, whether that destination is in the Cloud or On-Prem. When installing the Informatica Cloud product, a runtime agent is placed in the customer's environment (yep, behind the firewall if needed) and this is where all the work is done. Metadata about your environments is stored in the Informatica Cloud (data about the sources, targets, jobs, transformations, etc.), and managing and monitoring of your integration processes are performed through a web application. All the work and data movement is done in the customer's environment. The only actual data that goes to the Cloud is data that you choose to store in the cloud (e.g. Salesforce, your data warehouse in Amazon Redshift, etc.).
The product has prebuilt connectors to many Cloud-based solutions, so it's only a matter of selecting the application that you need to connect with in the Cloud, and the Informatica Cloud solution automatically understands its structure and how to access the data stored there. I was very surprised at how quickly and easily a job could be set up to keep On-Prem and Cloud data in sync.
Here is a diagram of the architecture that I mentioned earlier. The dotted line represents the management of the metadata in the Informatica Cloud. The company's actual data travels only between the On-Prem location and the Cloud applications that the company subscribes to… Well, there you go; I ended my blog with a preposition. Forgive me, Mrs. Rita Hart…
Image courtesy of Informatica
“Never become so much of an expert that you stop gaining expertise.” – Denis Waitley
In all professions, and especially information technology (IT), success and marketability depend upon an individual's propensity for continued learning. With Splunk, there are a number of options for increasing your knowledge and expertise. The following are just a few. We'll start with the obvious choices:
Like most mainstream technologies, Splunk offers various certifications, and as of this writing it categorizes them into the following generalized areas:
The Knowledge Manager
A Splunk Knowledge Manager creates and/or manages knowledge objects that are used in a particular Splunk project, across an organization or within a practice. Splunk knowledge objects include saved searches, event types, transactions, tags, field extractions and transformations, lookups, workflows, commands and views. A knowledge manager not only has a thorough understanding of Splunk (the interface, general use of search and pivot, etc.) but also possesses the "big picture view" required to extend the Splunk environment through the management of the Splunk knowledge object library.
A Splunk Administrator is required to support the day-to-day "care and feeding" of a Splunk installation. This requires "hands-on" knowledge of best practices and configuration details, as well as the ability to create and manage Splunk knowledge objects in a distributed deployment environment.
The Splunk Architect combines knowledge-management expertise, administration know-how and the ability to design and develop Splunk apps. Architects must also be able to focus on larger deployments, learning best practices for planning, data collection, sizing and documenting in a distributed environment.
Often I am asked to conduct a "performance review" of an implemented Cognos TM1 application "rather quickly" when, realistically, a detailed architectural review must be extensive and takes some time. Generally, if there is only a limited amount of time, you can use the following suggestions as appropriate areas to focus on (until such time as a formal review is possible):
Specific Review Areas
Based upon timing and the above outlined objectives, you might proceed by looking at the following:
Take a quick look at the server configuration settings (tm1s.cfg) and follow up on any non-default settings: who changed them and why? Do they make sense? Were they validated as having the expected effect?
Cubes and dimensions should be reviewed. Are there excessive views or subsets? Are there any cubes with an extraordinary number of dimensions? Has the dimension order been optimized? Is any dimension particularly large? And so on.
Security can be complex and compound, or simple and straightforward. Given limited time, I usually check the number of roles (groups) versus the number of users (clients); as a hint, you should never have more groups than clients. I also look at things like naming conventions and how security is maintained. In addition, I am always uneasy when I see cell-level security implemented.
As part of a "quick review", I recommend leveraging a file-search tool such as grep (or similar) to examine all of the application's TurboIntegrator script files for a specific function or logic pattern that you may want to look at more closely.
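For example, a quick (hypothetical) command along these lines would list every process that calls SaveDataAll, assuming the TurboIntegrator processes live as .pro files in the TM1 server's data directory (the path shown is illustrative):

```
# List every TI process file that references SaveDataAll (case-insensitive)
grep -l -i "SaveDataAll" /tm1/data/MyTM1Server/*.pro
```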
Saving Changed Data Methodology
Quickly scan the TI processes: do any call the SaveDataAll function to force in-memory changes to be committed to disk? SaveDataAll commits in-memory changes for all cubes in the instance or server. This creates a period of time during which the TM1 server is locked to all users. In addition, depending on the volume of changes made since the last save, the time the function requires to complete will vary and will increase processing time. Rather than using the SaveDataAll function, CubeSaveData should be used to serialize or commit in-memory changes for (only) a specific cube.
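As a sketch (the cube name is illustrative), the Epilog of a load process might commit only the cube it actually touched:

```
# Epilog: persist in-memory changes for just the loaded cube,
# instead of SaveDataAll, which locks the server while every cube is saved
CubeSaveData('SalesPlan');
```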
By default, TM1 logs all changes to all cubes as transactions. Transactional logging can be used to recover in the event of a server crash or other abnormal error or shutdown. A typical application does not need to log all transactions for all cubes. Transaction logging impacts overall server performance and increases processing time when changes are being made in a "batch". The best practice recommendation is to turn off logging for all cubes that do not require TM1 to recover lost changes. In other situations, it may be preferable to leave cube logging on for a particular cube but temporarily turn it off at the start of a (TurboIntegrator) process and then turn it back on after the process completes successfully.
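A common pattern (the cube name is illustrative) is to switch logging off in the Prolog and back on in the Epilog of the batch load process:

```
# Prolog: suspend transaction logging for the cube being batch loaded
CubeSetLogChanges('SalesPlan', 0);

# ... the Data tab performs the load ...

# Epilog: re-enable transaction logging once the load completes successfully
CubeSetLogChanges('SalesPlan', 1);
```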
Subset and View Maintenance
It is a best practice recommendation to avoid using the ViewDestroy and SubsetDestroy functions. These functions are memory intensive, can cause locking/rollback situations and impact performance. The appropriate approach is to use ViewExists and SubsetExists and, if the view or subset already exists, update it as required for the processing effort.
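Here is a sketch of that pattern; the cube, view, dimension, subset and element names are all illustrative:

```
# Prolog: reuse the existing view/subset rather than destroying and recreating them
vCube = 'SalesPlan';
vView = 'zSourceView';
vDim  = 'Period';
vSub  = 'zCurrentPeriod';

IF(SubsetExists(vDim, vSub) = 0);
  SubsetCreate(vDim, vSub);
ELSE;
  SubsetDeleteAllElements(vDim, vSub);
ENDIF;
SubsetElementInsert(vDim, vSub, 'Jan-2024', 1);

IF(ViewExists(vCube, vView) = 0);
  ViewCreate(vCube, vView);
ENDIF;
ViewSubsetAssign(vCube, vView, vDim, vSub);
```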
An additional good practice is to modify the view and subsets in the Epilog section to insert a single leaf element into each subset in the view, reducing its overall size in case a user accidentally opens one of these "not for users" views.
CellIsUpdateable is a TM1 function that lets you determine whether a particular cube cell can be written to. This function is useful but impacts performance, since it uses the same logic and internal resources as an actual cell write (CellPutN or CellPutS). It is usually used as a defensive measure to avoid write errors or to simplify processing logic. The best practice recommendation is to restrict or filter the data view being processed (eliminating the need for recurring CellIsUpdateable calls) if possible. This approach also decreases the volume or size of the data transactions to be processed.
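For the cases where filtering the source view is not practical, a guarded write looks roughly like this (cube, variable and measure names are illustrative):

```
# Data tab: write only when the target cell is actually writable
IF(CellIsUpdateable('SalesPlan', vVersion, vPeriod, 'Amount') = 1);
  CellPutN(nAmount, 'SalesPlan', vVersion, vPeriod, 'Amount');
ENDIF;
```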
Span of Data
Even the best TM1 model designs have limitations on the volume of data that can be loaded into a cube. An optimal approach is to limit a cube to three years of "active" data, with prior years available in archival cubes. This will reduce the size of the entire TM1 application and improve performance. Older data can still be available for processing (sourced from one or more archival cubes). Keep the "active" cube loaded with only those years of data that have the highest percentage chance of being required for a "typical" business process. Additionally, it is recommended that a formal process for both archiving and removing years of data be put in place.
TM1 caches views of data in memory for faster access (increasing performance). Once data changes, views become invalid and must be re-cached for viewing or processing. It may be beneficial to use the TM1 function ViewConstruct to "force" TM1 to pre-cache updated views before processing them. This function is useful for pre-calculating and storing large views so they can be quickly accessed after a data load or update.
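A one-line sketch (cube and view names are illustrative), typically placed in the Epilog of the load process:

```
# Pre-cache the large reporting view so the first user or process to open it is fast
ViewConstruct('SalesPlan', 'zReportingView');
```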
Real-time (RT) to Near-real-time (NRT)
Generally speaking, consolidation is one of TM1’s greatest features. The TM1 engine will perform in-memory consolidations and drill-downs faster than any rule or TI process.
All data that is not loaded from external sources should be maintained using the most efficient means: a consolidation, a TurboIntegrator process or a rule. From a performance and resource usage perspective, consolidations are the fastest and require the least memory, while rules are the slowest and require the most memory. Simply put, any data that cannot be calculated by a TM1 consolidation should be seriously evaluated to determine its change tempo (slow moving or fast moving). For example, data that changes little, changes only during a set period of time, or changes "on demand" is a good candidate for TI processing (or near-real-time maintenance) rather than a rule. If at all possible, moving rule calculation logic into a TurboIntegrator process should be the method used to maintain as much data as possible. This will reduce the size of the overall TM1 application and improve overall processing performance as well.
It is an architectural best practice to organize application logical components as separate or distinct from one another. This is known as “encapsulation” of business purpose. Each component should be purpose based and be optimized to best solve for its individual purpose or need. For example, the part of the application that performs calculating and processing of information should be separated from the part (of the application) that supports the consumption of or reporting on of information.
Applications whose architecture does not separate components by purpose are more difficult (and more costly) to maintain and typically develop performance issues over time.
Architectural components can be separated within a single TM1 server instance or across multiple server instances. The "multiple servers" approach can be ideal in some cases.
Profiling is the extrapolation of information about something, based on known qualities (baselines), to determine its behavior (performance) patterns. In practice, performance profiling means determining the average time and/or resources required to perform a particular task within an application. During performance testing, profiles are collected for selected application events that have established baselines. It is absolutely critical to follow the same procedure that was used to establish the event baseline when creating its profile. Application profiles are extremely valuable.
It is strongly recommended that a performance/stress test be performed, with the goal of establishing an application profile, as part of any "application performance" review.
Overall, most TM1 applications will have various "to-dos", such as fine-tuning specific feeders and other overall optimizations, which may already be in process or scheduled to occur as part of the project plan. If an extended performance review is not feasible at this time, it is recommended that at least each of these suggestions be reviewed and discussed to determine its individual feasibility, the expected level of effort to implement it, and its effect on the overall design approach (keeping in mind that these are only high-level, but important, suggestions).
The importance of making credible decisions can be the difference between profit and loss, or even survival and extinction.
Decision Support Systems (DSSs) serve the key decision makers of an organization, helping them to effectively assess predictors (which can be rapidly changing and not easily specified in advance) and make the best decisions, reducing risk.
SPLUNK as the DSS
Can Splunk be considered a true real-time decision support system? The answer, of course, is "Yes!"
Splunk does this through the features and functionality described in the sections below.
Splunk the Product
Splunk runs from either a standard command line or an interface that is totally web-based (which means that no thick-client application needs to be installed) and performs large-scale, high-speed indexing on both historical and real-time data.
To index, Splunk does not require a "re-store" of any of the original data; it stores a compressed copy of the original data (along with its indexing information), allowing you to delete or otherwise move (or remove) the original. Splunk then uses this "searchable repository" to efficiently graph, report, alert, dashboard and visualize in detail.
It just Works
After installation, Splunk is ready to be used. There are no additional "integration steps" required for Splunk to handle data from particular products. To date, Splunk simply works on almost any kind of data or data source you may have access to, but should you require assistance, there is a Splunk professional services team that can answer your questions or even deliver specific integration services.
The Big Data market as measured by vendor revenue derived from sales of related hardware, software and services reached $18.6 billion in calendar year 2013. That represents a growth rate of 58% over the previous year (according to Wikibon data).
Also (according to Wikibon), Splunk had over $283 million of that “big data revenue” and has an even brighter outlook for this year. More to come for sure…
The term “Big Data” is used to describe information so large and complex that it becomes almost impossible to process using traditional methods. Because of the volume and/or unstructured nature of this data, making it useful or turning it into what the industry is calling “operational intelligence” (OI) is extremely difficult.
According to information provided by International Data Corporation (IDC), unstructured data (generated by machines) may account for more than 90% of the data in today’s organizations.
This type of data (usually found in enormous and ever-increasing volumes) records some sort of activity, behavior, or measurement of performance. Today, organizations are missing the opportunities that big data can provide because they are focused on structured data, using traditional tools for business intelligence (BI) and data warehousing (DW).
Using these mainstream methods, such as relational or multidimensional databases, in an attempt to understand big data is problematic (to say the least!). Attempting to use these tools for big data solution development requires serious experience and the development of very complex solutions, and even then, in practice, they do not allow enough flexibility to "ask any question" or get those questions answered in real time, which is now the expectation, not a "nice to have" feature.
Splunk – “solution accelerator”
Splunk started by focusing on the information technology department, supporting the monitoring of servers, messaging queues, websites, etc., but it is now recognized for its ability to help with the specific challenges (and opportunities) of effectively organizing and managing massive amounts of any kind of machine-generated big data.
Getting "right down to it", Splunk reads (almost) any data (even in real time) into its internal repository, quickly indexes it and makes it available for immediate analysis and reporting.
Typical query languages depend on schemas. A (database) schema is how the data is to be "placed together" or structured. This structure is based upon knowledge of the possible applications that will consume the data, the facts or type of information that will be loaded into the database, or the (identified) interests of the possible end users. Splunk uses a "NoSQL" approach that is reportedly based on UNIX concepts and does not require any predefined schema.
Correlating of Information
Using Splunk Search, it is possible to easily identify relationships and patterns in your data and data sources.
Keep in mind that the powerful capabilities built into Splunk do not stop with flexible searching and correlating. With Splunk, users can also quickly create reports and dashboards with charts, histograms, trend lines, and many other visualizations, without the cost associated with structuring or modeling the data first.
Splunk has been emerging as a definitive leader for collecting, analyzing and visualizing machine big data. Its universal method of organizing and extracting insights from massive amounts of data, from virtually any source, has opened up and will continue to open up new opportunities for itself in unconventional areas. Bottom line: you'll be seeing much more from Splunk!
In this blog I wanted to take some time to describe certain “behaviors” of Cognos TM1 TurboIntegrator processes.
Access to Processes
As of version 10.2 (or as of this writing), TM1 Server lists all processes (in alphabetical order) under the consolidation “Processes”. The visibility of processes can be controlled by implementing TM1 security. TM1 groups can be set to have READ, WRITE or NONE/BLANK access to individual processes. Security must be set for each group and for each process individually. Process security (by group) is loaded and maintained in the TM1 control cube “}ProcessSecurity”.
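Rather than maintaining }ProcessSecurity by hand, the assignments can also be scripted from another TI process; a minimal sketch (the group and process names are illustrative):

```
# Grant the 'Planners' group READ access to the 'Load Sales' process
CellPutS('READ', '}ProcessSecurity', 'Load Sales', 'Planners');

# Make TM1 re-read the security cubes so the change takes effect immediately
SecurityRefresh;
```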
Groups with NONE/BLANK access will not have visibility to the process.
Groups with READ access will have visibility and have the ability to right-click on the process and select Run. This executes the process.
Groups with WRITE access will have visibility and have the ability to either run the process or edit it.
What is Splunk?
Splunk prides itself on “disrupting the conventional ways people look at and use data”, which is exciting. So what is Splunk?
Splunk is a horizontal technology used for application management, security and compliance, as well as business and web analytics. Splunk has over 5,200 licensed customers in 74 countries, including more than half of the Fortune 100. – Wikipedia 2013
I found (and you'll find too) that Splunk the technology applies to almost every industry: government, online services, education, financial, healthcare, retail, telecommunications (and more), so having Splunk experience as part of your repertoire will be relevant for a long time to come. So why am I excited about Splunk? What can a TM1 guy do with it? Well, using Splunk, you can create simple to robust custom real-time (and near-real-time) web-style applications for searching, monitoring, analyzing and visualizing your data. It turns out that what Splunk apps are particularly useful for is dealing with transactional data: logged transactions that are machine- or (in TM1's case) application-generated. Who hasn't had the mundane task of poring through literally thousands and thousands of TM1 transactions looking for a specific occurrence or a particular value, such as an individual client ID or error code?
How complicated would it be to jump into Splunk and use it for some TM1 transactional analysis? Let’s see:
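As a minimal sketch, assuming the TM1 transaction log files (tm1s*.log) have already been added to Splunk as a data input, and using an illustrative client name, a search like this would surface every logged change made by that client and chart the activity by hour:

```
source="*tm1s*.log" "AcmeClient" | stats count by date_hour
```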