Microsoft Fabric: NASDAQ stock data ingestion into Lakehouse via Notebook

Background

Microsoft Fabric is emerging as a one-stop solution for everything revolving around data. Before the introduction of Fabric, Power BI faced a few limitations related to data ingestion, since Power Query offers limited ETL & data transformation functionality. Power Query M scripting also lacks the ease of development of popular languages like Java / C# / Python, which complex scenarios may demand. The Lakehouse in Microsoft Fabric eliminates this downside by providing the power of Apache Spark, which can be used in Notebooks to handle complicated requirements. Traditionally, organizations had to provision multiple Azure services, like Azure Storage, Azure Databricks, etc. Fabric brings all the required services into a single platform.

Case Study

A private equity organization wants to keep a close eye on the equity stocks it has invested in for its clients. It wants to generate trends and predictions (using ML) and analyze data based on algorithms written in Python, developed by its portfolio management team in collaboration with data scientists. The reporting team wants to consume the data to prepare dashboards using Power BI. The organization has a subscription to a Market Data API, which can pull live market data. This data needs to be ingested in near real time into the warehouse for further use by the data scientist & data analyst teams.

Terminologies Used

Below are a few terms used in this blog. Visiting the respective websites for a better understanding of these is advisable:

  • Lakehouse: In layman's terms, this is the storehouse which stores unstructured data, like CSV files in folders, and structured data, i.e., tables (in Delta Lake format). To know more about Lakehouse, visit the official documentation: https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview
  • Notebook: A place to store our Python code along with supporting documentation (in Markdown format). Visit this link for details on Fabric Notebooks: https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook
  • PySpark: Apache Spark is an in-memory engine for analysis of big data. Spark supports languages like Java / Scala / SQL / Python / R. PySpark is the Python-based SDK for Spark. More information on Spark can be found on the official website: https://spark.apache.org/
  • Semantic Model: The Power BI Dataset has been renamed to Semantic Model.
  • Postman: A popular tool mostly used for API testing (a limited-feature free edition is available). Postman offers a graphical interface to make HTTP requests & inspect their responses in various formats like JSON / HTML etc.
  • Polygon.io: A market data platform offering an API to query stock prices & related information.

Flow Diagram

Below is the flow diagram to help understand how Fabric components are interlinked to each other to achieve the result.

Flow Diagram

API Feed Data Capture

In this case study, a free account was signed up at https://polygon.io, which allows querying end-of-day data with a cap of 5 API requests per minute. Considering this limitation, hourly data of only 3 securities has been ingested, to demonstrate the POC (proof of concept). Readers are encouraged to use a paid account, which supports real-time data with unlimited API requests, for their development / testing / production usage.

Below is a screenshot of the HTTP request and response made via Postman for a single security, to be implemented in the Notebook for data ingestion.

Postman Api Request

The JSON response contains a property named results: an object array containing the hourly status of the specific security.
o = open / c = close / h = high / l = low / v = traded volume / t = timestamp (in Unix style)
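For reference, a trimmed response for one symbol has roughly the following shape, sketched here as a Python dict with made-up values (fields outside results are abbreviated; refer to the Polygon.io documentation for the full schema):

sample_response = {
    'ticker': 'MSFT',   # echoed symbol
    'status': 'OK',
    'results': [        # one entry per hourly bar
        {'o': 420.10, 'c': 421.55, 'h': 422.00, 'l': 419.80,
         'v': 1523400, 't': 1711981800000},  # t = Unix epoch, in milliseconds
    ],
}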

Step 01: Create Fabric Capacity Workspace

For the POC, we will create a workspace named Security Market for our portfolio management division, using the New Workspace button (available to the Fabric Administrator), with settings as per the below screenshots.

Fabric Workspace Setting

Crucially, in the Premium tab of the settings, one needs to choose Fabric capacity (or Trial), which offers the Lakehouse (refer below screenshot).

Fabric Workspace Capacity

Once created, it should look as in the below screenshot.

Fabric Workspace Preview

Step 02: Setup Lakehouse

Next, we will create a new Lakehouse to host the data captured from the API feed. Click the New button and choose More options (if Lakehouse is not visible in the menu). A detailed page as shown in the screenshot below will appear.

Create Lakehouse Menu

Use the Lakehouse option to create a new Lakehouse. Rename this Lakehouse as per your choice.

A Lakehouse can host structured data as Tables and semi-structured / unstructured data as raw / processed files in sub-folders. We will create a sub-folder named EOD_Data to store the data received from API requests in CSV format, which in turn will be available to data scientists for further processing (refer below screenshot).

Lakehouse Create Folder Option

 

Step 03: Create Notebook

Once the Lakehouse is ready, we can proceed to the next step, where we write Python code to capture & ingest data. Click Open Notebook > New Notebook to initialize a blank Notebook (refer below screenshot).

Create Notebook Option

This will open a blank Notebook. Copy-paste the below Python code into a code cell as shown in the below screenshot.

import datetime as dt
import requests as httpclient
from notebookutils import mssparkutils

api_key = 'hfoZ81xxxxxxxxxxxxxxxx'  # Secret API key
symbol_list = ['MSFT', 'GOOG', 'PRFT']  # Symbols to ingest

target_date = dt.datetime.today()
file_content = 'symbol,timestamp,open,high,low,close,volume\n'  # insert CSV header
dt_YYYYMMDD = target_date.strftime('%Y-%m-%d')  # YYYY-MM-DD

for symbol in symbol_list:  # Iterate through each symbol (security)
    api_url = f'https://api.polygon.io/v2/aggs/ticker/{symbol}/range/1/hour/{dt_YYYYMMDD}/{dt_YYYYMMDD}/?apiKey={api_key}'
    resp_obj = httpclient.get(api_url).json()
    for r in resp_obj.get('results', []):  # Iterate through each hourly row of the security (skips symbols with no data)
        price_open, price_close, price_high, price_low, trade_volume = r['o'], r['c'], r['h'], r['l'], r['v']
        timestamp = dt.datetime.fromtimestamp(r['t']/1000).strftime('%Y-%m-%dT%H:%M:%S')  # decode Unix timestamp (milliseconds)
        file_content += f'{symbol},{timestamp},{price_open},{price_high},{price_low},{price_close},{trade_volume}\n'  # append CSV row

mssparkutils.fs.put(f'Files/EOD_Data/{dt_YYYYMMDD}.csv', file_content)  # Save file into the Lakehouse with a date identifier
df = spark.read.load(f'Files/EOD_Data/{dt_YYYYMMDD}.csv', format='csv', header=True, inferSchema=True)  # Read file into a dataframe
df.write.saveAsTable('nasdaq', mode='append')  # Append dataframe rows to the "nasdaq" table

Execute the above code after the NASDAQ market has closed. Let us understand, in a nutshell, what this Python code does:

  1. Every market data platform offers a secret API key, which needs to be provided in the URL or HTTP header (as defined in the API documentation).
  2. Just to experiment, we have selected 3 securities: MSFT (Microsoft Corp), GOOG (Alphabet Inc – Class C) and PRFT (Perficient Inc).
  3. The URL requires the date in YYYY-MM-DD format, which the variable dt_YYYYMMDD holds.
  4. Next, we run a loop over every security we want to query.
  5. An HTTP GET request is made to the market API platform by dynamically preparing the URL with the target date, security (symbol) and API key, setting the frequency of the returned data to hourly.
  6. In the JSON response, the results property holds an array of hourly changes in the security's attributes (like open / close / high / low / etc.), as depicted in the Postman request screenshot. Kindly refer to the respective market platform's API documentation to know this in detail.
  7. Next, an inner loop iterates over the hourly data and appends it, in comma-separated format, to a text variable named file_content to prepare our CSV content (notice the CSV header was already written when file_content was initialized).
  8. After both loops complete, a file named YYYY-MM-DD.csv is created under the sub-folder EOD_Data.
  9. Finally, the saved CSV file is read into a dataframe using the Spark reader, and the result is appended to a table named "nasdaq" (Spark will auto-create the table if not found).

Let's preview the data to ensure the Python script succeeded. Navigate to the Lakehouse, expand Tables, and ensure a table named "nasdaq" has been created. Refer to the below screenshot for sample data.

Lakehouse Table Preview
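Beyond the Lakehouse preview, the table can also be sanity-checked from a notebook cell. Below is a minimal sketch (it assumes the built-in spark session of a Fabric notebook and the nasdaq table created above):

# Quick sanity check: latest ingested rows across all symbols
df_check = spark.sql("""
    SELECT symbol, timestamp, open, close, volume
    FROM nasdaq
    ORDER BY timestamp DESC
    LIMIT 10
""")
df_check.show(truncate=False)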

 

Step 04: Schedule Job

This notebook code needs to run every day. Notebooks offer a feature to schedule the code to run automatically at a set frequency. This option is available in the Notebook under Run > Schedule.

Notebook Schedule Menu

This opens the detailed scheduling options page shown below. Assuming 4:00 pm EST as the closing time and adding a 30-minute buffer for safety, let us set a timer to execute this Notebook daily at 4:30 pm (refer below image).

Notebook Schedule Timer

The job will run daily, even on weekends when the market is closed. Ideally, this should not affect analytics, since the Friday end-of-day position carries over the weekend. Data scientists are free to delete weekend data or exclude it from their downstream calculation scripts; a sketch of such a filter follows.
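For example, a hedged sketch of a weekend filter in PySpark (assuming the nasdaq table created above; in Spark's dayofweek(), 1 = Sunday and 7 = Saturday):

from pyspark.sql import functions as F

nasdaq_df = spark.read.table('nasdaq')
weekday_df = nasdaq_df.filter(~F.dayofweek('timestamp').isin(1, 7))  # drop Saturday / Sunday rows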

 

Step 05: Generate Semantic Model

The Semantic Model (previously known as a Dataset) serves as the data source for Power BI reports. The Lakehouse provides an option to generate a semantic model, letting you choose the specific tables to be loaded into the model required by the BI developer (refer below screenshot).

Lakehouse Load Symantic Model

The BI developer can build further upon that semantic model, creating relationships & measures. The only limitation is that calculated columns cannot be added to tables from the model editor, as there is no Power Query in the backend. Columns need to be added in the Notebook instead.

 

Conclusion

The story does not end here but continues with authoring dashboards & reports in Power BI based on the semantic model produced by the Lakehouse. Fabric enables teams of data scientists, data engineers & data analysts to work together on a single unified platform. The Azure administrator just needs to provision Fabric Capacity, which is scalable just like a regular Azure workload based on CUs (Capacity Units) and can be tweaked on an hourly basis to accommodate peak workload hours. This blog intends to share a few capabilities of Fabric for dealing with a real scenario. There are many more components of Fabric, like Data Activator, ML Models, and Data Pipelines, suited to more complex use cases, which can be great to explore.

Identifying & Deletion of Orphan Members in OneStream via simple Excel hacks

Background

Orphan members in OneStream are members with no parent. Because of this, they are a bit difficult to locate using the Search Hierarchy feature, since technically they do not sit anywhere in the hierarchy. They do not even get captured in the grid view. Sometimes an organization might want to delete them, as they are no longer required, or align them back into an appropriate location in the hierarchy. This blog focuses on simple Excel & Notepad++ based techniques to populate the list of orphan members and delete them (if required).

 

Tools Required

The technique shared in this blog requires 2 pieces of software:

  1. Microsoft Excel
  2. Notepad++

Notepad++ is open-source software available free of cost. The Microsoft 365 version of Excel is required, as this technique uses the TEXTSPLIT( ) function, which was rolled out for the 365 version of Excel. Alternatively, Excel for the Web can be used, which is always up to date and available via free sign-up for a Microsoft Account.

 

Case Study Showcase

Below is a screenshot of the brand-wise Product hierarchy in UD1.

Ud1 Hierarchy Showcase

100+ members have become orphan nodes, as their relationships were removed to park them outside the hierarchy.

Ud1 Orphans

These members were created to incorporate the entire catalog but were never purchased / sold. The organization wants to delete them permanently to keep the data lightweight for better performance. Deleting 100+ members one by one would be a herculean task, wasting hours of effort. Let's see some simple hacks to populate the list of orphan members. Once the list is populated, those members can be deleted or re-aligned as desired, using the Load/Extract feature.

 

Population of Orphan Member list

Following are the steps to derive & populate the orphan member list (a scripted alternative is shown after the steps):

  1. Go to Application > Load/Extract > Extract. In the dropdown, select Metadata. In the Metadata hierarchy, select the desired dimension. Click extract button which will open file Save As dialog, to save the XML file.
    Ud1 Extract Xml
  2. Open the file using Notepad++ (right-click XML file & choose Edit with Notepad++).
  3. Next, we will select the value of the name attribute of all the members present in the XML of the dimension (which includes orphans too). In Notepad++, go to the menu Search > Mark. In Search Mode, select Regular Expression. In the Find what textbox, write the pattern member name="[\w\s_-]+" and then click the Mark All button. This will mark the text from the member tag with the value of the name XML attribute.
    (text matching the pattern is highlighted / marked with a red background color)
    Pattern reference: \w = any letter/digit/underscore, \s = any whitespace, _ = underscore, - = hyphen, + = one or more instances of these characters
    Npp Mark Regex
  4. Click Copy Marked Text button, to copy all pattern matching textual instances.
  5. Open Microsoft Excel, create 2 worksheets named member and relationship. Paste the copied text into worksheet member (as demonstrated below in cell A2)
    Excel Member Xml Copypaste
  6. In cell B2, write the formula =TEXTSPLIT(A2,"""") [refer below screenshot], which splits the text on double quotes. The split spans 3 cells, since the member name value is enclosed between two double quotes. Copy-paste this formula for all the cells (make use of the fill handle). This produces a list of all members (including orphans) in column C.
    Excel Member Textsplit Formula
  7. Get back to Notepad++ & press the Clear all marks button. Place the cursor on the first line.
  8. This time, invoke the Mark dialog again as illustrated in step 3 and perform the same actions with the expression parent="[\w\s_-]+" child="[\w\s_-]+", which will select the parent/child text from the relationship part of the XML (refer to the below screenshot and verify the accuracy of the selection by scrolling to the end of the document).
    Npp Mark Regex Relationship
  9. Repeat steps 4 to 6, this time copy-pasting the text into the worksheet relationship, as demonstrated in the below screenshot.
    Excel Relationship Textsplit Formula
  10. Column E contains the child member and column C contains the name of its parent, representing the hierarchy. Technically, every member (excluding the root) from the worksheet member should appear in column E of this list. A member mapped into an alternate hierarchy will appear multiple times in this list. But an orphan member will never find a place here, since it is missing from the relationships entirely. So, we will set up a VLOOKUP to carve out such instances, as illustrated below in cell F2 with the formula =VLOOKUP(C2,relationship!E:E,1,0), and copy-paste it till the last row.
    Excel Relationship Vlookup
  11. Apply a filter via the menu Data > Filter, and check for #N/A in the value filter (i.e., values not found in relationship), which denotes orphan members.
    Excel Relationship Filter Na
  12. Below is the resulting list of orphan members (refer below screenshot)
    Orphan Member Excel List
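For readers comfortable with scripting, the same orphan list can also be derived programmatically. Below is a minimal Python sketch using only the standard library; it assumes the metadata XML extracted above, with <member name="..."> elements and <relationship parent="..." child="..."> elements (elements are matched by tag name, so the exact nesting does not matter; the file name is hypothetical):

import xml.etree.ElementTree as ET

tree = ET.parse('UD1_extract.xml')  # hypothetical extract file name
root = tree.getroot()

members = {m.get('name') for m in root.iter('member')}          # every member in the dimension
children = {r.get('child') for r in root.iter('relationship')}  # every member used as a child

orphans = sorted(members - children)  # members that are never a child of anything
# Note: the dimension's top (root) member is never a child either, so exclude it from this list
for name in orphans:
    print(name)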

 

Deletion of Members

The orphan members identified in the above steps can be deleted (if required) by a simple Excel-based hack. Below are the steps (a scripted equivalent follows the list):

  1. Extend the following columns with formulas as shown in the screenshot below, to generate the member-deletion XML. Copy-paste the formula till the last row. Grab the formula text from below to avoid typing mistakes.
    ="<member name="""&C691&""" displayMemberGroup=""Everyone"" action=""delete""></member>"
    Excel Member Deletion Xml
  2. Create a new file in Notepad++ and copy-paste the XML header / footer from the exported XML file, as shown in the below screenshot, which serves as the bare bones for the deletion lines.
    Xml Bare Bones
  3. Copy-paste the deletion XML lines generated via the Excel formula for all the orphan members into this XML file, between the <members></members> tags, as shown in the below sample screen clipping for 4-5 members.
    Xml Deletion Lines
  4. Save the above file in Notepad++.
  5. Import this file into OneStream via the Load/Extract menu, which will delete those orphan members.
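The deletion lines themselves can likewise be generated from a script instead of Excel formulas. Below is a short continuation of the earlier sketch (the output file name is hypothetical; the XML header / footer from the exported file must still be wrapped around this fragment, as in step 2):

# Emit one deletion line per orphan member found earlier
with open('orphan_deletion_fragment.txt', 'w') as f:
    for name in orphans:
        f.write(f'<member name="{name}" displayMemberGroup="Everyone" action="delete"></member>\n')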

 

Precautions

  • Always back up the entire dimension hierarchy via the Load/Extract menu by exporting it to an XML file. This backup XML can be used to revert the deletion, provided no further changes were made to the hierarchy after deletion.
  • Plenty of online backup storage options are available at a cheap price. It is safe to upload multiple versions of the hierarchy backup XML file every time such modifications are made; they might come to the rescue in the future.
  • It is wise to double-check the member list being deleted. OneStream does not provide an Undo button like Excel / Word.
  • Do not forget to test things in the Development application first and then deploy to Production.

 

Additional Notes

  • This approach assumes that member names consist of letters / numbers / underscores / hyphens. Any other character used in a name would need to be included in the regular expression search pattern.
  • Orphan member deletion will fail if any data is found loaded against the member. Kindly verify this before running the deletion XML.
  • The deletion trick is generic & works for any member (irrespective of whether it is an orphan or not)
Power BI & Excel Connectivity: Scenarios Which Can Break Dashboard

Background

Excel is the most-used spreadsheet software in today's era, used at every level of an organization. Quite a huge amount of loosely organized data is maintained in Excel workbooks, owing to the ease of quickly creating, storing & sharing Excel files compared to a database. As a result, many Power BI reports / dashboards are based on Excel as a data source.

Excel's design enables it to act as a quasi-DBMS, with an individual worksheet acting as a table and a workbook as a database. But Excel, being spreadsheet-genre software, lacks enforceability, making it vulnerable to breaking down an entire Power BI report in certain scenarios. This blog showcases a few scenarios a developer needs to take care of while using Excel as a data source for a Power BI report.

 

Scenario: Data Type mismatch

Excel supports data types like text / number / date / time / logical. Unfortunately, it does not strongly enforce data types in their respective columns. For example, users are free to type text into date columns & so on. Excel's Data Validation rules can enforce this, but those rules can easily be deactivated or deleted in a few seconds.

Excel Invalid Datatype

As shown in the above screenshot, the data contained valid dates when the initial Power BI report was prepared. But one fine day, a user entered question marks (???) in the Date column, being unaware of the transaction date during data entry & deciding to fill in that information once it became available. Such placeholder values generate errors, as Power BI attempts to skip these rows.

Powerbi Invalid Datatype

Power BI Desktop takes care to surface the error, as shown in the above screenshot. But the Power BI Service might not show the error up front & silently skip the rows while loading the remaining data. This can affect reporting, since amounts on those rows are never added, as the rows were not imported into the data model.
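Such rows can also be caught before the data reaches Power BI. Below is a small pre-flight check sketch in Python (assuming pandas and openpyxl are installed; the file and column names are hypothetical):

import pandas as pd

df = pd.read_excel('transactions.xlsx')               # hypothetical source workbook
parsed = pd.to_datetime(df['Date'], errors='coerce')  # invalid entries become NaT
bad_rows = df[parsed.isna() & df['Date'].notna()]     # e.g. rows containing '???'
print(f'{len(bad_rows)} row(s) with invalid dates')
print(bad_rows)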

 

Scenario: Summary Rows at the bottom

Many people have a habit of calculating grand totals at the bottom of data in Excel (refer below screenshot).

Summary Total Habit

This can ruin reporting in Power BI, as this row also gets incorporated into the data, thereby inflating sum totals. Below are comparative images of a section of the Power BI report, with summary cards showing different figures before/after the summary row.

Total Before Summary Row

Total After Summary Row

Proper care needs to be exercised when such Excel data is intended as a data source for Power BI. The end user of the Excel workbook needs to be informed of this, and Excel summarization needs to be done in separate worksheets to prevent it.

 

Scenario: Gaps in Data

At times, users insert gaps in data rows (typically for printing purposes, to adjust the print preview range).

Gaps In Rows Excel

Power BI imports the data including blank rows. The majority of calculations are unaffected, except for a few DAX functions that go on to include blank rows in their results.

Dax Countrows Function

Above is the result using the COUNTROWS ( ) function, which also includes blank rows in the calculation result.

Dax Count Function

The calculation result differs a bit with other functions like COUNT ( ), since this function excludes blank cells while counting.

Some developers prefer the COUNTROWS ( ) function, as it yields results faster (it simply returns the row count of the table), whereas COUNT ( ) is relatively slow since it evaluates the value of each cell while calculating. Power BI report developers need to account for these scenarios & develop measures accordingly.

Gaps Blank Option Slicer Value

Gaps also create a blank option in slicer dropdowns, which does not appear professional.

The above mess can be avoided by adding an extra step that removes empty rows (refer to the below image).

Removing Row Gaps
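The same cleanup can also be done in a pre-flight script outside Power BI; below is a one-line pandas equivalent of the remove-empty-rows step (same hypothetical workbook as before):

df = pd.read_excel('transactions.xlsx').dropna(how='all')  # drop rows where every cell is blank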

 

Scenario: Renaming of Column(s)

Many times, end users change column titles for better understanding or readability. Some business users might not prefer the technical name of a column, so they may be tempted to rename it before creating pivot tables/charts. In the below example, the Excel workbook user changed the column name Amount to Amount (in Rs): since the organization has multi-currency reporting, the user wants the column title to convey that the amounts are in Indian Rupees.

Excel Column Title Rename

Renaming causes the dataset refresh for Power BI reports to fail, since while originally developing the report, the column was titled Amount. Power Query stores the column names derived from Excel in the M script that imports the Excel data.

Below is the error displayed when the Power BI report is opened in Power BI Desktop.

Column Rename Error Desktop

A report viewer needs to be a bit vigilant in monitoring refresh errors, since they show up as a small error icon, as shown in the below image.

Column Renaming Service Error Icon

On clicking the error icon, the message below is shown, which clarifies the error in detail.

Column Renaming Service Error Message

Report users should get their email IDs added to the refresh-failure notification triggers. Otherwise, Power BI will keep displaying the data from the last successful refresh in reports, which is even more disastrous.

Powerbi Service Dataset Failure Notification Configuration

 

Scenario: Cell Errors

At times, Excel formulas break due to the deletion of cells the formula referred to, or for other miscellaneous reasons. This results in a cell error (as shown in the below screenshot).

Excel Cell Error

Just like the data type mismatch discussed above, when the report is refreshed from Power BI Desktop, it displays the count of rows with errors; in the Power BI Service, these errors are silent. Power Query can perform a basic level of handling for these errors, like substituting errors with another value. But since the error originates at the source, fixing it in the source is more sensible than handling it in Power BI.

 

Scenario: Renaming / Moving of Excel file

Power BI stores an absolute path when referencing a source file (refer below screenshot).

Excel File Drive Reference

 

 

So, if the file is moved to some other folder, or renamed, then the path needs to be updated in the Power BI report too.

The same applies to Excel files referenced from SharePoint (refer below screenshot).

Excel File Sharepoint Reference

 

 

Renaming or moving the file to a different folder results in a change of the SharePoint URL, which then needs to be updated.

Report developers can introduce parameters & link the file path / URL to them; parameters are easy to update from the Power BI Service, without having to download, modify & re-publish the Power BI report. It is not a solution, but just an easy hack.

Google Sheets enjoys an advantage in this scenario compared to Microsoft Excel, as links to Google Sheets do not change on renaming or moving a file. Google Sheets assigns a unique identifier to the file, independent of its name or location. Power BI supports Google Sheets as a data source & one can leverage this if renaming/moving files is unavoidable & happens frequently as a normal business scenario.

 

Conclusion

Excel might be a preferred choice of data source, but one needs to think from a broader perspective when using it for analytical & reporting purposes. Moving some of the Excel-based data entry into Power Apps would be a strong solution, as forms have the capability to validate data before storing it. Power Apps uses Dataverse as a backend, which Power BI can connect to easily. At an organizational level, this approach provides a stronger reporting capability compared to Excel.

How to Query & Extract data from OneStream metadata XML using XPath & XSLT

Background

OneStream supports exporting metadata into an XML file for backup and restore purposes (via the menu Application > Tools > Load/Extract). This blog covers a technique to extract information from this metadata XML using XSLT (eXtensible Stylesheet Language Transformations), which can read the XML hierarchy & extract information from it.

 

Tools Required

Microsoft Visual Studio supports creating/editing XML & XSLT files, with built-in IntelliSense (auto-complete) and a validator that checks the correctness of the XSLT file. Visual Studio comes with an XSLT processor for the handy XML transformations developers might require. Microsoft offers a Community Edition of Visual Studio, available for free, suitable for lightweight development & tasks.

 

Data at a Glance

Below is the demo Account member hierarchy which we shall extract from the XML (screenshot below).

Account Member Hierarchy

Below is a screenshot of the metadata XML as it appears in Visual Studio, extracted via the Load/Extract menu.

Metadata Xml Vs Screenshot

 

Understanding XPath

An XML file contains hierarchical data. Querying tree-structured data is a trickier task than querying tabular data, which can be done easily using SQL. XPath is used to query XML data.

Say we want to query the description value of Account member 1001. Below is the XPath expression for this:

/OneStreamXF/metadataRoot/dimensions/dimension[@type='Account']/members/member[@name='1001']/@description

XML tags are addressed as /tag and XML attributes as @attribute. XPath supports filtering data by specifying a query condition in square brackets for that tag.
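Before drafting the full XSLT, an XPath can be tested quickly from Python as well. Below is a minimal sketch assuming the lxml package is installed and the extract is saved locally (the file name is hypothetical):

from lxml import etree

doc = etree.parse('metadata_extract.xml')  # hypothetical extract file name
desc = doc.xpath("/OneStreamXF/metadataRoot/dimensions/dimension[@type='Account']"
                 "/members/member[@name='1001']/@description")
print(desc)  # a list of matching attribute values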

 

Drafting XSLT

XSLT is used for querying & transforming data from an XML file and generating output in XML or text format. XSLT is written in XML itself, with a few XML tags instructing how to transform the data. Visual Studio ships an XSLT processor capable of executing on-the-fly transformations via the GUI menu. Below is the demo XSLT file, which extracts data from the above XML & generates textual output in tab-delimited format, which can be dumped into Excel or even imported into SQL Server easily.

Xslt Xml

The XSLT can be copy-pasted from below:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
>
    <xsl:output method="text" indent="yes"/>

    <xsl:template match="/">
        <xsl:text>Dimension&#x9;Name&#x9;Description&#x9;Account Type&#xA;</xsl:text>
        <xsl:for-each select="OneStreamXF/metadataRoot/dimensions/dimension[@type='Account']/members/member">
            <xsl:value-of select="../../@name"/>
            <xsl:text>&#x9;</xsl:text>
            <xsl:value-of select="@name"/>
            <xsl:text>&#x9;</xsl:text>
            <xsl:value-of select="@description"/>
            <xsl:text>&#x9;</xsl:text>
            <xsl:value-of select="./properties/property[@name='AccountType']/@value"/>
            <xsl:text>&#xA;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

Let's understand the various parts of the XSLT:

 

<xsl:output method="text" indent="yes"/>

This instructs the XSLT processor to generate output in textual format.

<xsl:text>Dimension&#x9;Name&#x9;Description&#x9;Account Type&#xA;</xsl:text>

This line inserts a static column header into the output file. XSLT, being XML internally, needs escaping for tab (&#x9;) & newline (&#xA;).

<xsl:for-each select="OneStreamXF/metadataRoot/dimensions/dimension[@type='Account']/members/member">

The above line runs a for-each loop over all the members under the dimension of type Account.

<xsl:value-of select="@description"/>

This line emits the content of the description attribute of the member tag from the XML.

<xsl:value-of select="../../@name"/>

../ is the XPath expression to fetch a value from the relative parent. So we go 2 levels up, to the dimension XML node, and extract the value of its name attribute.

<xsl:value-of select="./properties/property[@name='AccountType']/@value"/>

./ is the XPath expression to extract values from a relative child XML node.

 

Generating Output File

Steps to generate the textual output file:

  1. Open the XSLT file in Visual Studio
  2. Go to the Properties window, browse to the XML file under Input, and specify the location of the output file in the Output browse section
    Xslt Input Output File Browse
  3. Navigate to XML > Start XSLT without Debugging
    Xslt Input Output File Browse
  4. This will generate & save the output file and open it in Visual Studio
    Xslt Output
  5. This content can be copy-pasted into Excel or even imported into a database using the BULK INSERT statement
    Xslt Excel

 

Other Benefits

  • It can filter & extract data from an entire metadata backup XML containing multiple dimensions like Account, Entity, etc.
  • This approach is not limited to the Account dimension; it works for all dimensions like Entity, Scenario, etc., by just changing the XPath filter to [@type='Entity'] and so on.
  • This approach can be extended to pull further columns like IsIC, etc.
  • Multiple for-each loops can be initiated in a single XSLT file to scan all dimensions like Account, Entity, Scenario, etc., to generate a consolidated output to upload into a database or data lake.
  • XSLT transformations can be automated via a C#/VB.NET program (using XslCompiledTransform in .NET, as covered in the MSDN tutorial) or by invoking an XSLT compiler from the command line; a Python-based sketch follows below.
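For Python-based automation (as mentioned in the last point above), the lxml package can also run XSLT 1.0 transforms. Below is a minimal sketch with hypothetical file names:

from lxml import etree

xml_doc = etree.parse('metadata_extract.xml')                 # hypothetical input extract
transform = etree.XSLT(etree.parse('extract_accounts.xslt'))  # the stylesheet drafted above
result = transform(xml_doc)                                   # apply the transformation
with open('accounts.txt', 'w') as f:
    f.write(str(result))                                      # tab-delimited textual output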

 

Conclusion

XML is globally used for data interchange, enjoying compatibility with the majority of software. OneStream heavily leverages XML to back up almost every object / artifact available in the system. The objective of this blog is not just to perform the specific task covered in the case study, but to build a basic understanding of the concepts of XPath & XSLT. With a good command of XSLT, one can apply this technique even to re-create XML files with bulk modifications. Endless possibilities exist with the varied business use cases one can think of.

Accessing Power BI confidential data in Excel for internal organization users

Background

Microsoft Excel is the popular and preferred spreadsheet solution for quick, daily-use reporting in the majority of corporations and businesses in the world. Many times, corporate users need access to the organization's data in Excel for further development of MIS reports. Power Query is a powerful tool embedded in Excel which can connect to internal database servers, online CRM/ERP services, Excel files, etc., hosted within an organization's network or in the cloud. Apart from that, users also heavily hyperlink to other Excel files, reusing existing source data already prepared by other people.

 

Challenges / Issues

Numerous challenges or issues occur when a user attempts to refer to data stored in another Excel file (via formulas) or in a database server / ERP (via Power Query), as below:

  • There is a high probability that the original source Excel file to which references are made might get deleted by the owner, or its structure may get modified, eventually breaking cell references. This would certainly lead to errors in reports.
  • Data imported via Power Query arrives in the form of a table. Tables are generally not suitable for reporting, as various reporting authorities prescribe a fixed and pre-defined format in which to prepare and submit a report.
  • Power Query is a bit technical, and not every user might be able to grasp it.
  • A database connection requires a server IP address, credentials, etc. Exposing such critical information might lead to security lapses, allowing easy access to the organization's internal data.

 

Solutions

Microsoft offers 2 ways to connect to Power BI published data from Excel:

  1. Direct connectivity to a Power BI published dataset via Power Pivot, which supports creating DAX measures too.
    Power Bi Published Datasets
  2. Access to designated tables of Power BI datasets, which can be used as Organizational Data Types.
    Power Bi Organizational Datasets

 

Power PIVOT

A dataset is essentially a collection of tables co-related using the relationship feature of Power BI. Excel lacks an easy and direct way of analyzing relational data: one needs functions like VLOOKUP, XLOOKUP, MATCH, and INDEX to co-relate data. Power Pivot provides an easy way of reusing a Power BI published dataset, with relationships already defined, to design reports in the form of pivot tables or pivot charts directly in Excel itself.

The steps to analyze a Power BI published dataset using Power Pivot are as below:

  1. In Excel, navigate to the Data tab > Get Data > From Power BI <organization_name>
    Pv Step 01
  2. A right-hand pane will appear. Select the target dataset from the list.
    Pv Step 02
  3. Excel will create a new worksheet with a blank Pivot Table designer. Drag & drop fields into the pivot just like a regular Pivot Table.
    Pv Step 03

 

Organizational Data Types

Excel natively supports 3 data types:

  1. Text
  2. Number (includes Date-Time)
  3. Logical (True/False)

A formula is not a data type but evaluates to one of these 3 data types. The Office 365 version of Excel introduced support for a new data type called the Linked Data Type, which is of type record. A linked data type holds a reference to a record (or row) containing multiple fields. So, virtually, it is a cell which can hold multiple values internally (refer screenshot below).

Screenshot Linked Datatype Cell

(linked data types have an icon as a prefix in the cell value)

The value of a linked data type cell can be extracted into another cell by referencing the linked data type cell using the formula =cell_reference followed by a dot, which then enumerates the field names in that record-type cell (refer screenshot below).

Screenshot Linked Datatype Formula Reference

(evaluated screenshot below)

Screenshot Linked Datatype Formula Evaluated

Excel includes a few built-in linked data type sources, such as Stock Market, Currency, Geography, and so on. Apart from that, any Power BI dataset table can be promoted to a custom linked data type, available only to Excel users of that organization. Such data types are available as organizational data types.

Organizational linked data types can be created in Power BI by setting a table as a featured table and then publishing it (screenshot below).

Screenshot Powerbi Featured Table

 

Advantages of the Organizational Linked Data Types approach:

  • This approach prevents exposure of the source Excel file or database to the end user, thereby enforcing privacy.
  • The Power BI Service supports an elegant system of access control by designating workspace access to a specified user group, which also gets applied to Organizational Data Types.
  • The user does not need to import the source data again into a separate worksheet and set up VLOOKUPs to fetch the values of other fields. So, the resulting Excel file is lightweight, with a smaller size and fewer formulas in it (explained in the case study).
  • Organizational Data Types work seamlessly in Excel Online (browser-based Excel). The user does not even need to be on the organization's VPN to access the source data. Power Query or external Excel file references require the user to be on the organization's LAN, which is a downside.

 

Case Study: VLOOKUP vs Organization Data Types

Scenario:

The HR department of an organization is required to prepare an Excel file with some analysis for individual employees. The HR user exports an Employee Master Excel file from the ERP and then manually copies and pastes Employee Master data from that file every month. Currently, the user references and links to this master data using the VLOOKUP function (as per the below screenshot), and maintains this Master worksheet in many Excel files, updating each manually.

Screenshot Method Vlookup

In the above approach, if HR fails to update the Employee Master sheet, it can lead to incorrect reporting & decision-making. Also, if any column gets added to or removed from the Employee Master in the future, the VLOOKUP function's column references will need to be modified manually.

As a BI consultant, what solution can you offer?

Solution:

Power BI supports connectivity to popular ERPs, databases, Excel files, etc. We will simply create a dataset in Power BI, extracting this data from the ERP (via Power Query transformations, if required). Then, without creating any visualization, we will publish the dataset to the desired workspace of HR, setting the featured table in the modeling window of Power BI. This will enable HR to view the tables of Power BI datasets shared with them. Afterwards, we simply remove the VLOOKUP and replace it with cell references, as demonstrated below:

Screenshot Method Organization Data Type

 

Compatibility

All the things explained and demonstrated in this blog are compatible with the Office 365 version of Excel (Desktop + Web). The user needs to be on an Office 365 Business or Enterprise subscription. Office 365 Personal / Home subscriptions and perpetual editions of Office like 2013, 2016, 2019, 2021, etc., do not support all these features, as the features require an associated organizational domain, which is missing in those editions.

 

 

Power BI: Import vs Direct Query

Background

The core of any BI tool is to acquire business data from multiple sources and then co-relate it for quality reports & dashboards. Information technology has dominated business processes for the last 20 years, and the volume of data has grown a lot. Businesses have shifted to enterprise database platforms like SQL Server, Oracle, etc., which can store, fetch & process voluminous amounts of data efficiently and in an optimized manner.

Modes of connectivity in Power BI

Power BI supports 2 modes to connect with data: Import & Direct Query.

Import Mode: In this mode, Power BI connects with the underlying data source & downloads the entire data from it. This data is stored in the Power BI model. A fresh copy of the data can be downloaded by pressing the Refresh button. The PBIX file internally stores model data in a compressed format. When published to the Power BI Service, the dataset model is hosted by a managed Analysis Services engine in the Azure backend.

Import Mode

Direct Query Mode: In this mode, Power BI connects to the data source but does not download the entire data from it. Instead, it generates a SQL query (or an equivalent) for each visualization & fires it at the underlying data source, utilizing the optimization features of the underlying database engine.

Directquery Mode

Differences

  • Storage: Import stores data in the model; Direct Query does not store data.
  • Volume: Import can store at most 1 GB of data in the model; Direct Query has no restriction.
  • Performance: Import is fast with smaller volumes but degrades as volume increases; Direct Query is fast if proper indexes are created on the database, otherwise it might underperform.
  • Compatibility: Import is compatible with every type of data source; Direct Query is supported only for database-server types of data sources.
  • RLS (Row Level Security): Both support RLS, but with Direct Query proper care needs to be taken, since one might use a service account to connect to the database, which practically impersonates the identity of the Power BI user.
  • DAX & Transformations: Import supports all Power Query transformations & DAX; Direct Query supports only those transformations & DAX for which Power BI can generate an equivalent SQL query.
  • Refresh Schedule: With Import, the user can set automatic refresh of data at a scheduled interval, based on subscription; with Direct Query, scheduled refresh is not applicable since data is not stored in the model, and data is fetched from the server when the user opens the report.
  • Availability: With Import, if a data refresh fails because the data source is unavailable, the last data persisted in the model is used to prepare the visuals; with Direct Query, an unavailable data source leaves the entire report blank, as no data is stored in the model, and there is no option to fall back to a previous state.

Why does Direct Query outperform Import in the majority of scenarios?

To know the reason behind this, one needs to understand how visuals are prepared. Any visual, be it a column chart, line chart, card, matrix, etc., needs underlying tabular data. Power BI visually arranges/plots this tabular data on the visual.
Most of the time, Power BI is performing the operations below:
  • Merging
  • Grouping
  • Aggregating
  • Filtering
Let's understand this in detail. When we drag any field into the row/column of a matrix/table or the X/Y/Legend axis of a column/line chart, Power BI internally groups the data to arrive at the unique values in those fields. Upon dragging any field or measure into the values box, Power BI internally applies an aggregation operation like SUM, COUNT, AVERAGE, MIN, or MAX. When multiple tables are connected using relationships, dragging any field/measure from their columns causes Power BI to merge those tables in the background. And upon selection of value(s) in a slicer or filter, a filtering operation is performed.
Power BI has 2 mighty hands, DAX (powered by the SQL Server Analysis Services engine) and M script (powered by Power Query), to perform ETL jobs and visualization-related calculations. Both engines are efficient in their own ways of performing calculations. But when it comes to performing calculations in an optimized way, both underperform compared to database engines, since they are in-memory calculation engines best suited to playing with data. They lack the indexing of data, which a database handles while storing it. Also, a database engine re-uses the query results of the last few queries by keeping track of changing data, something complex for both DAX & Power Query. As the data size grows, the performance difference is quite noticeable.

In which scenarios does Import mode outperform Direct Query?

It's a myth that Direct Query is always faster than Import. There are scenarios wherein the reverse is observed:
  • Not using the indexing features of the database. This can create a situation wherein the database server takes too much time to read data from disk, whereas Power BI's own in-memory calculations might work a little faster.
  • Running the database server with very low resources (CPU, RAM, etc.). In this scenario, the database server might struggle badly for the resources to fetch data. Although the Power BI Service runs on shared Azure resources, it might outperform in this case.
  • Many concurrent users querying the database. Power BI practically fires multiple SQL queries for multiple visuals. If there are locks on a table, this may lead to long waits, thereby freezing the Power BI visuals.
  • Long connection times to the database. This happens when the database connection request is made to an on-premises server over a VPN with high latency. If the volume of data is not very large, it would be wise to import the data instead of using Direct Query. With Import mode, one always enjoys the good service of Azure resources in the backend of the Power BI Service.

Is Direct Query real-time?

No; one should make a distinction between live and real-time. Direct Query is a live connection: whenever the report is opened, a fresh connection to the database server is established to pull the latest data. But the visual will not update automatically like a stock-market ticker; after opening, the screen becomes static. Only pressing refresh or re-opening the report will load fresh data.
Power Query – Intro

Background

In the 21st century, data is the most valuable and critical asset of any business. To convert this data into knowledge, analytical processes need to be applied to it. Every level of management requires analysis of data, but the context differs for each. Previously, data analytics was primarily done by the business ERP itself using programming languages & SQL queries, but that approach requires technical knowledge. This created demand for a less technical scripting language for data analysis: easy to learn for people from non-technical backgrounds & focused more on transforming data for quick results.

Power Query Ideology

Power Query is Microsoft's answer to the above industry demand. The core of Power Query is the Mashup Engine, which performs ETL (Extract -> Transform -> Load). The Mashup Engine has its own scripting language, termed M. It is a GUI-based tool with the ability to auto-generate the script for a transformation. The majority of transformations can be done via the available toolbar; just a handful of scenarios require manually tweaking the M script via its editor window.

At a Glance

Power Query ETL Diagram

All icons used in the image are for educational purposes. 1) Microsoft Inc. & Salesforce.com reserve exclusive ownership of their respective product icons used in the image. 2) Open-source icon credit: icons8.com

Extract

Power Query supports extraction of data from a massive number of sources (around 75+). At present, only a handful of ETL solutions exist which support querying data from so many data sources. Some popular sources, logically categorized, are as below:

  • Database (like SQL Server, MySQL, PostgreSQL, Oracle Server, IBM DB2, etc.)
  • Files (Excel, CSV, PDF, JSON, XML, etc.)
  • ERP / CRM (Salesforce, Dynamics 365, QuickBooks, Zoho, etc.)
  • Big Data (like Azure Databricks, Parquet, Apache Spark, Hadoop, etc.)
  • External Scripting (Python, R scripts)
  • Social Media & Online Feeds (LinkedIn Sales, Google Analytics, etc.)
  • Misc (Web, ODBC, SharePoint, OData, etc.)

Power Query Data Sources

Transform

Transformation, or data shaping, is the crucial task performed by the M engine. The M engine library has a great many functions (around 250+), categorized logically by type. Functions follow the syntax category_name.function_name, similar to .NET library style. The result of each transformation step is stored in a variable, which is then referenced by the next step in a linear manner. The user gets both a GUI-based interface (for basic users) & a raw editor window (for advanced users) to write M code for transformations.

Power Query Screenshot

Load

The final role of Power Query, after transformation, is returning the output to the primary service / software that invoked it for ETL services. Currently, the following are some popular Microsoft services/products which utilize the Power Query engine:

  • Power BI: Power BI is a data analysis & visualization tool supporting DAX queries. But the base data in its model comes from Power Query, which is an integral part of it.

  • Excel: Power Query was shipped as an extension for Excel 2010/2013. But with increasing usage, Microsoft embedded it inside Excel for ease. A limited number of data source connectors are available for Power Query in Excel.

  • Power Apps: Power Apps also supports Power Query based ETL via Dataflows, which can be automated on a schedule. It is used for handy data insertion/updates in Dataverse (Common Data Service).

  • Azure Data Factory: Data Factory is an ETL tool for Azure products. It too supports Power Query as one of the transformation tools for its tasks.

 

Pros

  1. Embedded / Plugged-in Module: Power Query is not circulated as independent software; instead, it is embedded as an add-on inside the main solution. In Excel 2010/2013, Power Query was circulated as an optional Excel extension; from Office 2016 onward, Microsoft embedded it inside Excel. Azure Data Factory is itself an ETL solution, yet it too contains Power Query, plugged in for transformation.

  2. Step-by-Step Transform / Result Evaluation: In real life, we break any long or complicated task into multiple smaller divisible tasks, for easy management and fool-proof results. Power Query follows the same rule. Every transformation operation is treated as a step, displaying the post-processing result of that step for evaluation.

  3. Source-independent Transformation: In ETL, data can arrive from various sources, so there is a high chance that data from two different sources is structurally completely different, like tabular vs tree structures, posing difficulty in co-relating it. Traditionally, the transformation languages for tabular & tree structures differ: for tabular data SQL is preferred, whereas for tree structures XPath / XQuery is mostly used. Power Query removes the burden of knowing all these languages.

  4. Object Structure: Power Query supports extraction of data from object (i.e., key-value) type sources. JSON is a widely used format for data exchange, with the increasing development of web-based solutions. Power Query provides support for transformation of such object types, thereby reducing the headache of writing code to transform such data.

  5. Reusable Scripting: Power Query supports converting a partial portion of a script into a reusable function, thereby reducing scripting effort. It supports function capabilities similar to software development languages.

  6. Recursion for Scanning Hierarchies: Many ETL solutions generally do not support recursion, which is typically required for tree-structured data sources, where decoding the hierarchy is often crucial. Power Query supports calling M functions recursively to scan the tree.

  7. Automatic SQL Generation: Power Query supports generating a SQL query from the transformation steps when Direct Query mode is selected (applicable to selected RDBMSs only), for the transformation actions supported by the underlying database server. This is a welcome feature for people from a non-technical background, who may find composing SQL queries a bit difficult.

Cons

  1. Lack of Optimization: Power Query performs well up to a certain threshold of data volume, but with large datasets it lags. A database server uses indexing for optimized output, which Power Query lacks. Caching of results is also missing in Power Query, and it keeps re-calculating the same transformation numerous times (just like Excel). Direct Query mode, which uses database-side optimization for Power Query, also has limited support.

  2. High Memory Usage: Power Query attempts to perform calculations in memory with minimal disk usage. This improves performance when the data size is small to medium. But with increasing volume of data, it eats up considerable system memory (RAM), eventually depriving other applications of memory.

  3. In-progress Documentation: Microsoft's official MSDN (Microsoft Developer Network) website covers all its products / services with robust documentation and examples. The Power Query portion of the documentation is currently in progress, so developers need to rely on the Power BI community platform for knowing any function in depth.

 
