Databricks Articles / Blogs / Perficient
https://blogs.perficient.com/category/partners/databricks/

SAP and Databricks: Better Together
https://blogs.perficient.com/2024/11/17/sap-and-databricks-better-together/
Sun, 17 Nov 2024

Across industries like manufacturing, energy, life sciences, and retail, data drives decisions on durability, resilience, and sustainability. A significant share of this critical data resides in SAP systems, which is why so many businesses have invested in SAP Datasphere. SAP Datasphere is a comprehensive data service that enables seamless access to mission-critical business data across SAP and non-SAP systems. It acts as a business data fabric, preserving the semantic context, relationships, and logic of SAP data. Datasphere empowers organizations to unify and analyze their enterprise data landscape without the need for complex extraction or rebuilding processes.

No single platform architecture can satisfy all the needs and use cases of a large, complex enterprise, so SAP partnered with a small handful of companies to extend the scope of its offering. Databricks was selected to deliver bi-directional integration with its Databricks Lakehouse platform. This blog explores the key features of SAP Datasphere and Databricks, their complementary roles in modern data architectures, and the business value they deliver when integrated.

What is SAP Datasphere?

SAP Datasphere is designed to simplify data landscapes by creating a business data fabric. It enables seamless and scalable access to SAP and non-SAP data with its business context, logic, and semantic relationships preserved. Key features of the data fabric include:

  • Data Cataloging
    Centralized metadata management and lineage.
  • Semantic Modeling
    Retaining relationships, hierarchies, and KPIs for analytics.
  • Federation and Replication
    Choose between connecting or replicating data.
  • Data Pipelines
    Automated, resilient pipelines for SAP and non-SAP sources.

What is Databricks?

Databricks is built around the data lakehouse, a unified platform that combines the scalability and flexibility of a data lake with the structure and performance of a data warehouse. It is designed to store all types of data (structured, semi-structured, unstructured) and support diverse workloads, including business intelligence, real-time analytics, machine learning, and artificial intelligence. Key characteristics of the lakehouse include:

  • Unified Data Storage
    Combines the scalability and flexibility of a data lake with the structured capabilities of a data warehouse.
  • Supports All Data Types
    Handles structured, semi-structured, and unstructured data in a single platform.
  • Performance and Scalability
    Optimized for high-performance querying, batch processing, and real-time analytics.
  • Simplified Architecture
    Eliminates the need for separate data lakes and data warehouses, reducing duplication and complexity.
  • Advanced Analytics and AI
    Provides native support for machine learning, predictive analytics, and big data processing.
  • ACID Compliance
    Ensures reliability and consistency for transactional and analytical workloads using features like Delta Lake (see the short sketch after this list).
  • Cost-Effectiveness
    Reduces infrastructure and operational costs by consolidating data architectures.
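
To make the ACID compliance point concrete, here is a minimal Delta Lake sketch in Scala. It is an illustration only: the table path, column names, and values are hypothetical and are not taken from the post.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical incoming changes to apply to an existing Delta table
val updates = Seq((1, "Widget", 19.99), (4, "Gadget", 5.49))
  .toDF("product_id", "name", "price")

// MERGE runs as a single ACID transaction, so readers never see a half-applied update
DeltaTable.forPath(spark, "/mnt/lakehouse/products")
  .as("target")
  .merge(updates.as("source"), "target.product_id = source.product_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()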

How do they complement each other?

While each architecture has pros and cons, the point of this partnership is that these two architectures are better together. Consider a retail company that combines SAP Datasphere’s enriched sales and inventory data with Databricks Lakehouse’s real-time analytics capabilities. By doing so, they can optimize pricing strategies based on demand forecasts while maintaining a unified view of their data landscape. Data-driven enterprises can achieve the following goals by combining these two architectures.

  • Unified Data Access Meets Unified Processing Power
    A data fabric excels at connecting data across systems while retaining semantic context. Integrating with a lakehouse allows organizations to bring this connected data into a platform optimized for advanced processing, AI, and analytics, enhancing its usability and scalability.
  • Advanced Analytics on Connected Data
    While a data fabric ensures seamless access to SAP and non-SAP data, a lakehouse enables large-scale processing, machine learning, and real-time insights. This combination allows businesses to derive richer insights from interconnected data, such as predictive modeling or customer 360° analytics.
  • Data Governance and Security
    Data fabrics provide robust governance by maintaining lineage, metadata, and access policies. Integrating with a lakehouse ensures these governance frameworks are applied to advanced analytics and AI workflows, safeguarding compliance while driving innovation.
  • Simplified Data Architectures
    Integrating a fabric with a lakehouse reduces the complexity of data pipelines. Instead of duplicating or rebuilding data in silos, organizations can use a fabric to federate and enrich data and a lakehouse to unify and analyze it in one scalable platform.
  • Business Context for Data Science
    A data lakehouse benefits from the semantic richness provided by the data fabric. Analysts and data scientists working in the lakehouse can access data with preserved hierarchies, relationships, and KPIs, accelerating the development of business-relevant models. On top of that, additional use cases enabled by generative AI are still emerging.

Conclusion

The integration of SAP Datasphere and the Databricks Lakehouse represents a transformative approach to enterprise data management. By uniting the strengths of a business data fabric with the advanced analytics and scalability of a lakehouse architecture, organizations can drive better decisions, foster innovation, and simplify their data landscapes. Whether it’s unifying SAP and non-SAP data, enabling real-time insights, or scaling AI initiatives, this partnership provides a roadmap for the future of data-driven enterprises.

Contact us to learn more about how SAP Datasphere and Databricks Lakehouse working together can help supercharge your enterprise.

 

Omnichannel Analytics Simplified – Optimizely Acquires Netspring
https://blogs.perficient.com/2024/10/09/omnichannel-analytics-optimizely-netspring/
Wed, 09 Oct 2024

Recently, the news broke that Optimizely acquired Netspring, a warehouse-native analytics platform.

I’ll admit, I hadn’t heard of Netspring before, but after taking a closer look at their website and capabilities, it became clear why Optimizely made this strategic move.

Simplifying Omnichannel Analytics for Real Digital Impact

Netspring is not just another analytics platform. It is focused on making warehouse-native analytics accessible to organizations of all sizes. As businesses gather more data than ever before from multiple sources – CRM, ERP, commerce, marketing automation, offline/retail – managing and analyzing that data in a cohesive way is a major challenge. Netspring simplifies this by enabling businesses to conduct meaningful analytics directly from their data warehouse, eliminating data duplication and ensuring a single source of truth.

By bringing Netspring into the fold, Optimizely has future-proofed its ability to leverage big data for experimentation, personalization, and analytics reporting across the entire Optimizely One platform.

Why Optimizely Acquired Netspring

Netspring brings significant capabilities that make it a best-in-class tool for warehouse-native analytics.

With Netspring, businesses can:

  • Run Product Analytics: Understand how users engage with specific products.
  • Analyze Customer Journeys: Dive deep into the entire customer journey, across all touchpoints.
  • Access Business Intelligence: Easily query key business metrics without needing advanced technical expertise or risking data inconsistency.

This acquisition means that data teams can now query and analyze information directly in the data warehouse, ensuring there’s no need for data duplication or exporting data to third-party platforms. This is especially valuable for large organizations that require data consistency and accuracy.


 


Ready to capitalize on these new features? Contact Perficient for a complimentary assessment!


The Growing Importance of Omnichannel Analytics

It’s no secret that businesses today are moving away from single analytics platforms. Instead, they are combining data from a wide range of sources to get a holistic view of their performance. It’s not uncommon to see businesses using a combination of tools like Snowflake, Google BigQuery, Salesforce, Microsoft Dynamics, Qualtrics, Google Analytics, and Adobe Analytics.
How?

These tools allow organizations to consolidate and analyze performance metrics across their entire omnichannel ecosystem. The need to clearly measure customer journeys, marketing campaigns, and sales outcomes across both online and offline channels has never been greater. This is where warehouse-native analytics, like Netspring, come into play.

Why You Need an Omnichannel Approach to Analytics & Reporting

Today’s businesses are increasingly reliant on omnichannel analytics to drive insights. Some common tools and approaches include:

  • Customer Data Platforms (CDPs): These platforms collect and unify customer data from multiple sources, providing businesses with a comprehensive view of customer interactions across all touchpoints.
  • Marketing Analytics Tools: These tools help companies measure the effectiveness of their marketing campaigns across digital, social, and offline channels. They ensure you have a real-time view of campaign performance, enabling better decision-making.
  • ETL Tools (Extract, Transform, Load): ETL tools are critical for moving data from various systems into a data warehouse, where it can be analyzed as a single, cohesive dataset.

The combination of these tools allows businesses to pull all relevant data into a central location, giving marketing and data teams a 360-degree view of customer behavior. This not only maximizes the return on investment (ROI) of marketing efforts but also provides greater insights for decision-making.

Navigating the Challenges of Omnichannel Analytics

While access to vast amounts of data is a powerful asset, it can be overwhelming. Too much data can lead to confusion, inconsistency, and difficulties in deriving actionable insights. This is where Netspring shines – its ability to work within an organization’s existing data warehouse provides a clear, simplified way for teams to view and analyze data in one place, without needing to be data experts. By centralizing data, businesses can more easily comply with data governance policies, security standards, and privacy regulations, ensuring they meet internal and external data handling requirements.

AI’s Role in Omnichannel Analytics

Artificial intelligence (AI) plays a pivotal role in this vision. AI can help uncover trends, patterns, and customer segmentation opportunities that might otherwise go unnoticed. By understanding omnichannel analytics across websites, mobile apps, sales teams, customer service interactions, and even offline retail stores, AI offers deeper insights into customer behavior and preferences.

This level of advanced reporting enables organizations to accurately measure the impact of their marketing, sales, and product development efforts without relying on complex SQL queries or data teams. It simplifies the process, making data-driven decisions more accessible.

Additionally, we’re looking forward to learning how Optimizely plans to leverage Opal, their smart AI assistant, in conjunction with the Netspring integration. With Opal’s capabilities, there’s potential to further enhance data analysis, providing even more powerful insights across the entire Optimizely platform.

What’s Next for Netspring and Optimizely?

Right now, Netspring’s analytics and reporting capabilities are primarily available for Optimizely’s experimentation and personalization tools. However, it’s easy to envision these features expanding to include content analytics, commerce insights, and deeper customer segmentation capabilities. As these tools evolve, companies will have even more ways to leverage the power of big data.

A Very Smart Move by Optimizely

Incorporating Netspring into the Optimizely One platform is a clear signal that Optimizely is committed to building a future-proof analytics and optimization platform. With this acquisition, they are well-positioned to help companies leverage omnichannel analytics to drive business results.

At Perficient, an Optimizely Premier Platinum Partner, we’re already working with many organizations to develop these types of advanced analytics strategies. We specialize in big data analytics, data science, business intelligence, and artificial intelligence (AI), and we see firsthand the value that comprehensive data solutions provide. Netspring’s capabilities align perfectly with the needs of organizations looking to drive growth and gain deeper insights through a single source of truth.

Ready to leverage omnichannel analytics with Optimizely?

Start with a complimentary assessment to receive tailored insights from our experienced professionals.

Connect with a Perficient expert today!
Contact Us

Dreamforce 2024 Session Recap: Data Cloud + Databricks: As Good Together as PB&J
https://blogs.perficient.com/2024/10/08/dreamforce-2024-session-recap-data-cloud-databricks-as-good-together-as-pbj/
Tue, 08 Oct 2024

At Dreamforce 2024, Perficient explored the integration of Databricks and Salesforce Data Cloud, focusing on an insurance industry use case. This session showcased data processing, customer engagement, and AI-driven insights, offering real-world value to enterprises.

Here’s a comprehensive recap of the session, highlighting the key takeaways and technical depth discussed.

Speakers 

Two of Perficient’s top experts, Eric Walk (Director, Data Strategy Consulting) and Johnathon Rademacher “J.R.” (Principal, Salesforce Global Operations), led the session.

Both speakers brought years of expertise to the discussion, focusing on helping enterprises become more data-driven with AI and cloud-based technologies.

Business Scenario: Insurance with Real-Time Customer Data 

The session featured a real-world auto insurance scenario. The story centered on Roberta, a customer of Acme Insurance, and her son Ricky, who is flagged for risky driving behaviors. Acme’s use of telematics and a safe-driving tracker, combined with real-time insights from Databricks and Data Cloud, allowed Acme’s customer service team to proactively engage the family.

This outreach not only enhances customer satisfaction but also offers potential savings on insurance premiums.

Technical Integration: Data Cloud and Databricks

Attendees discovered how these technologies work together to:

  • Process massive data pipelines in real time, leveraging Databricks’ Lakehouse architecture.
  • Use ACID transactions and data governance to maintain data integrity while benefiting from the flexibility of data lakes.
  • Drive personalized customer experiences with AI and machine learning models that can be quickly deployed using the Databricks and Salesforce Data Cloud platforms.

Key Features of the Integration 

  • Lakehouse Architecture: This hybrid system combines data lakes and warehouses to allow for both structured and unstructured data, enhancing scalability and flexibility.
  • Data Harmonization: The integration unifies data from various sources, providing a consistent view across the organization.
  • AI Integration with Salesforce: With tools like Einstein GPT, the combined platform makes it easier to derive actionable insights from data, improving both sales and service operations.

AI and Data Cloud Advancements 

Eric and J.R. highlighted Salesforce’s paradigm shift, focusing on how the combination of Data + AI + CRM is set to transform customer relationship management. This includes Salesforce’s Einstein GPT, which leverages large language models and real-time data to automate tasks, deliver insights, and improve customer experience.

The addition of Databricks’ data processing capabilities allows for sophisticated data modeling and activation, giving enterprises the power to engage customers more meaningfully.

Technical Breakdown: Demo Architecture

A major part of the session included a demo showcasing how Databricks and Data Cloud work together. The demo architecture’s key components included:

  • Data Ingestion: Bringing in large volumes of telemetry and customer data in real time.
  • Data Harmonization: Consolidating disparate data into unified customer profiles, enabling a 360-degree view of the customer.
  • Actionable Insights: Using predictive analytics to drive real-time customer engagement, including proactive alerts for risky driving behaviors.

The architecture leveraged Salesforce’s Service Cloud to provide customer support teams with the right tools to manage customer interactions. This holistic platform not only simplifies data management but also accelerates the time it takes to extract actionable insights, making it a key tool for data-driven companies.

PACE Framework 

A significant part of Perficient’s service offerings is their P.A.C.E. Framework, designed to operationalize AI responsibly:

  • Policies: Setting guidelines for AI usage.
  • Advocacy: Promoting AI adoption.
  • Controls: Ensuring governance and risk management of AI systems.
  • Enablement: Offering tools and resources for AI deployment.

Final Takeaways: The Future of Data and AI in Customer Engagement

The session closed with an emphasis on the future possibilities of combining Databricks and Data Cloud. Businesses, especially in industries like insurance, can now engage with customers in real time, leveraging AI to deliver personalized and proactive experiences.

Much like how peanut butter and jelly combine to create a classic sandwich, the integration of Databricks and Data Cloud creates a powerful combination that’s greater than the sum of its parts.

Perficient + Salesforce 

We are a Salesforce Summit Partner with more than two decades of experience delivering digital solutions in the manufacturing, automotive, healthcare, financial services, and high-tech industries. Our team has deep expertise in all Salesforce Clouds and products, artificial intelligence, DevOps, and specialized domains to help you reap the benefits of implementing Salesforce solutions.

Missed Dreamforce? 

Don’t worry! Schedule some time with us and let our experts fill you in. And stay tuned to our Salesforce blog for all our post-conference insights.

 

Perficient Colleague Attains Champion Status
https://blogs.perficient.com/2024/07/12/perficient-colleague-attains-champion-status/
Fri, 12 Jul 2024

Databricks has recognized David Callaghan as a Partner Champion. As the first Perficient colleague to be included in the program, David is paving the way for others to get their footing with the partner.

Program Overview

To be a Databricks Partner champion, one must:

  1. Display Thought Leadership
  2. Harness Technical Expertise
  3. Become a Community Leader
  4. Demonstrate Innovation

Individuals who show promise and interest are invited into an intensive multi-step program that leads to becoming an official Databricks Partner Champion. The program recognizes the best and brightest on the Databricks platform. It deepens an existing understanding of the platform by offering exclusive training and ample growth potential for individuals who commit to the program. Those who advance through the rigor come away with an advanced understanding of the Lakehouse and Databricks.

How Did I Get Here?

David Callaghan, a Senior Solutions Architect, is on a mission to bring trusted data to complex regulated industries and has been deep in the Databricks weeds developing innovative solutions that are widely applicable. He participated in a Databricks Architect Panel and presented some of Perficient’s Databricks Accelerators; as a result, program leadership approached him to take the next steps toward Partner Champion recognition and induction into the program. David has since helped develop the Perficient Migration Factory, a Databricks Brickbuilder solution, and plans to leverage his expertise to shepherd a global team of Partner Champions at Perficient.

 

“I would like to bolster a global team of Databricks Partner Champions and establish a talent pool that brings diverse strengths to Databricks data and analytics platform to deliver value across enterprises and industries through training and mentorship by Perficient’s Databricks Center of Excellence.”

– David Callaghan, Senior Solutions Architect

What’s Next?

David is spearheading the development of a new life sciences solution related to one of Perficient’s most successful Databricks engagements in this space. Our Migration Factory is a unique approach to migrating legacy data platforms into the Lakehouse and has set the tone for new, innovative Brickbuilder solutions to be developed by Perficient’s experts.

More About our Partnership

At Perficient, we are a trusted Databricks consulting partner and our passion for creating custom data engineering, data science and advanced analytics knows no limits. With over 50 Databricks certified consultants, we build end-to-end solutions that empower our customers to gain more value from their data.

Learn more about the practice here.

The Quest for Spark Performance Optimization: A Data Engineer’s Journey
https://blogs.perficient.com/2024/06/18/the-quest-for-spark-performance-optimization-a-data-engineers-journey/
Tue, 18 Jun 2024

In the bustling city of Tech Ville, where data flows like rivers and companies thrive on insights, there lived a dedicated data engineer named Tara. With over five years of experience under her belt, Tara had navigated the vast ocean of data engineering, constantly learning and evolving with the ever-changing tides.
One crisp morning, Tara was called into a meeting with the analytics team at the company she worked for. The team had been facing significant delays in processing their massive datasets, which was hampering their ability to generate timely insights. Tara’s mission was clear: optimize the performance of their Apache Spark jobs to ensure faster and more efficient data processing.
The Analysis
Tara began her quest by diving deep into the existing Spark jobs. She knew that to optimize performance, she first needed to understand where the bottlenecks were. She started with the following steps:
1. Reviewing Spark UI: Tara meticulously analyzed the Spark UI for the running jobs, focusing on stages and tasks that were taking the longest time to execute. She noticed that certain stages had tasks with high execution times and frequent shuffling.

2. Examining Cluster Resources: She checked the cluster’s resource utilization. The CPU and memory usage graphs indicated that some of the executor nodes were underutilized while others were overwhelmed, suggesting an imbalance in resource allocation.

The Optimization Strategy
Armed with this knowledge, Tara formulated a multi-faceted optimization strategy:

1. Data Serialization: She decided to switch from the default Java serialization to Kryo serialization, which is faster and more efficient.
conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

2. Tuning Parallelism: Tara adjusted the level of parallelism to better match the cluster’s resources. By setting `spark.default.parallelism` and `spark.sql.shuffle.partitions` to a higher value, she aimed to reduce the duration of shuffle operations.
conf = conf.set("spark.default.parallelism", "200")
conf = conf.set("spark.sql.shuffle.partitions", "200")
3. Optimizing Joins: She optimized the join operations by leveraging broadcast joins for smaller datasets. This reduced the amount of data shuffled across the network.
small_df = spark.read.parquet("hdfs://path/to/small_dataset")
large_df = spark.read.parquet("hdfs://path/to/large_dataset")
small_df_broadcast = broadcast(small_df)
result_df = large_df.join(small_df_broadcast, "join_key")

4. Caching and Persisting: Tara identified frequently accessed DataFrames and cached them to avoid redundant computations.
df = spark.read.parquet("hdfs://path/to/important_dataset").cache()
df.count()  # triggers the cache action

5. Resource Allocation: She reconfigured the cluster’s resource allocation, ensuring a more balanced distribution of CPU and memory resources across executor nodes.
conf = conf.set("spark.executor.memory", "4g")
conf = conf.set("spark.executor.cores", "2")
conf = conf.set("spark.executor.instances", "10")
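
The snippets above are PySpark-flavored; as a rough Scala sketch of how these settings might come together in one session (the values are the illustrative ones from the story, not recommendations), a configured pipeline could look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Illustrative settings only; tune them for your own workload and cluster
val spark = SparkSession.builder()
  .appName("optimized-pipeline")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.default.parallelism", "200")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.executor.memory", "4g")
  .config("spark.executor.cores", "2")
  .config("spark.executor.instances", "10")
  .getOrCreate()

// Broadcast the small side of the join and cache the frequently reused result
val smallDf = spark.read.parquet("hdfs://path/to/small_dataset")
val largeDf = spark.read.parquet("hdfs://path/to/large_dataset")
val resultDf = largeDf.join(broadcast(smallDf), "join_key").cache()
resultDf.count()  // materializes the cache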

The Implementation
With the optimizations planned, Tara implemented the changes and closely monitored their impact. She kicked off a series of test runs, carefully comparing the performance metrics before and after the optimizations. The results were promising:
– The overall job execution time reduced by 40%.
– The resource utilization across the cluster was more balanced.
– The shuffle read and write times decreased significantly.
– The stability of the jobs improved, with fewer retries and failures.
The Victory
Tara presented the results to the analytics team and the management. The improvements not only sped up their data processing pipelines but also enabled the team to run more complex analyses without worrying about performance bottlenecks. The insights were now delivered faster, enabling better decision-making, and driving the company’s growth.
The Continuous Journey
While Tara had achieved a significant milestone, she knew that the world of data engineering is ever-evolving. She remained committed to learning and adapting, ready to tackle new challenges and optimize further as the data landscape continued to grow.
And so, in the vibrant city of Tech Ville, Tara’s journey as a data engineer continued, navigating the vast ocean of data with skill, knowledge, and an unquenchable thirst for improvement.

Data & Dragons: Perficient Attends Data + AI Summit
https://blogs.perficient.com/2024/06/04/data-dragons-perficient-attends-data-ai-summit/
Tue, 04 Jun 2024

Dancing with Data

It was but a fortnight into 2024 AC (After Conquest) when the great council gathered to decide who would succeed Perficient’s 2023 Data & AI Summit attendees. Many claims were heard, but only a few were considered.  The council was assembled to prevent a war from being fought over the succession, for all knew the only thing that could tear down the house of Perficient, was itself.

As interesting as it would be to see a hypothetical war fought over annual conference attendees, it is certainly a stretch of the truth for this House of the Dragon fan. Databricks’ Data and AI Summit will be held on June 10th-13th at the Moscone Center in San Francisco, CA. Shortly after on June 16th, this author will be journeying back to Westeros as season two of HBO’s House of the Dragon will be returning to screens.

Meet the Heirs

After much deliberation, the names of the newest attendee council were proclaimed. Five individuals will represent Perficient at this year’s Data and AI Summit.


  • Grand Chancellor; Senior Vice President & Data Solutions GM, Santhosh Nair
  • Master of Data; Databricks Practice Director, Nick Passero
  • Master of Partnerships; Alliance Manager, Kyla Faust
  • Master of Coin; Portfolio Specialist, Brian Zielinski
  • Master of Commerce; Portfolio Specialist, Al Muse

The council of heirs will set off for San Francisco, California, for a full week of cutting-edge content, networking, and collaboration. The Perficient Council will have the opportunity to discover new use cases and capabilities on the Databricks platform that can be brought back to their loyal subjects (customers) to strengthen Perficient’s ability to serve and deliver quality solutions.

Decrees of the Council

“The hour of Data and AI Summit approaches, and I am afire with anticipation. The secrets of data shall unfold before us, like the ancient tomes of Old Valyria. Let the banners of Perficient and Databricks fly high, for this shall be a gathering remembered in the annals of our time!”

Databricks Practice Director,  Nick Passero

 

“I am honored to stand among such great minds where the arcane arts of data and AI are revealed—’twill be a journey more thrilling than any battle. Knowledge shall flow as freely as the rivers of the Trident.”

Alliance Manager, Kyla Faust

 

“The Data and AI Summit approaches, and my excitement burns brighter than a dragon’s flame. This gathering shall echo through time as a beacon of innovation and power.”

Portfolio Specialist, Brian Zielinski

See you there!

The Perficient council of heirs would love to meet you if you will be at this year’s conference! Please reach out to Kyla Faust to organize a meeting with the team.

Check out more information about the Perficient, Databricks practice here.

Salesforce Data Cloud – What Does noETL / noELT Mean for Me?
https://blogs.perficient.com/2024/04/30/salesforce-data-cloud-what-does-noetl-noelt-mean-for-me/
Tue, 30 Apr 2024

In the realm of data management and analytics, the terms ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) have been commonplace for decades. They describe the processes involved in moving data from one system to another, transforming it as needed along the way. However, with the advent of technologies like Salesforce Data Cloud, a new concept is gaining traction: “noETL / noELT.” But what does this mean for you, especially if you’re not knee-deep in the technical jargon of data integration? Let’s explore.

Understanding ETL and ELT

First, a quick refresher on ETL and ELT:

  • ETL (Extract, Transform, Load): This traditional approach involves extracting data from various sources, transforming it into a usable format, and then loading it into a target system, often a data warehouse or analytics platform.
  • ELT (Extract, Load, Transform): ELT reverses the transformation step, loading raw data directly into the target system and then transforming it as needed within that system.

Both ETL and ELT have their pros and cons, but they can be complex and time-consuming processes, requiring specialized skills and infrastructure.

Enter noETL / noELT

Now, let’s talk about noETL / noELT, as championed by platforms like Salesforce Data Cloud. The “no” in noETL / noELT signifies a departure from the traditional data integration approaches. Here’s what it means for you:

1. Faster Time to Insights

  • With noETL / noELT, data can be accessed and utilized more quickly. Instead of waiting for data to go through multiple transformation stages, you can start analyzing it almost immediately.

2. Real-Time or Near Real-Time Analytics

  • By eliminating the upfront transformation step, noETL / noELT enables real-time or near real-time analytics. This means you can make decisions based on the most current data available.

3. Simplified Data Integration

  • For end-users, noETL / noELT translates to simplified data integration. You don’t need to worry about intricate data pipelines or complex transformation logic. Data becomes more accessible and usable.

4. Scalability and Cost Efficiency

  • Scalability is often improved with noETL / noELT, as it reduces the overhead associated with managing large-scale data integration processes. This can result in cost savings as well.

What it Means for You

If you’re a business user, analyst, or decision-maker leveraging Salesforce Data Cloud or similar technologies, here’s what noETL / noELT means for you:

  • Ease of Use: You can focus more on extracting value from data rather than managing its integration.
  • Quicker Insights: Rapid access to data means quicker insights, enabling faster and more informed decision-making.
  • Adaptability: NoETL / noELT architectures are often more adaptable to changing data sources and analytical needs.

Where are we today?

As of April 2024, there are two platforms that are Generally Available (GA) and can be used this way with Salesforce Data Cloud.

  1. Snowflake
  2. Google BigQuery

There are two other platforms that are in Pilot mode as of April 2024.  We are excited to see those move from Pilot to GA.

  1. Databricks
  2. Amazon RedShift

Looking ahead, as mentioned in this article at cio.com, Salesforce Data Cloud plans to leverage these two capabilities:

  1. Allowing for data lakes that use Apache Iceberg to surface in Data Cloud with direct file access at the storage level.
  2. Salesforce Data Cloud will also add zero-copy support to the Data Kits that ISVs use to distribute datasets and enrich customers’ data in Salesforce Data Cloud.

What we are so excited about at Perficient is that we can bring expertise to both sides of a project involving these technologies.  We have two different business units that focus on each side…

  1. A Salesforce Business Unit with experts in Salesforce Data Cloud
  2. A Data Solutions Business Unit to help with the Data Lake solutions like Snowflake, Google BigQuery, Databricks and Amazon Redshift.
    1. Here is a recent blog post from a colleague of mine in that Data Solutions business unit.

In conclusion, the rise of noETL / noELT represents a significant shift in how we approach data integration and analytics. It promises to democratize data access and streamline processes for users across organizations. As these technologies continue to evolve, staying informed about their implications will be crucial for maximizing their benefits. Embrace the simplicity and agility that noETL / noELT brings, and harness the power of data more effectively in your day-to-day operations.

ELT IS DEAD. LONG LIVE ZERO COPY.
https://blogs.perficient.com/2024/04/29/elt-is-dead-long-live-zero-copy/
Mon, 29 Apr 2024

Imagine a world where we could skip Extract and Load and just do our data Transformations, connecting directly to sources no matter what data platform we use.

Salesforce has taken significant steps over the last 2 years with Data Cloud to streamline how you get data in and out of their platform and we’re excited to see other vendors follow their lead. They’ve gone to the next level today by announcing their more comprehensive Zero Copy Partner Network.

By using industry standards, like Apache Iceberg, as the base layer, it means it’s easy for ALL data ecosystems to interoperate with Salesforce. We can finally make progress in achieving the dream of every master data manager, a world where the golden record can be constructed from the actual source of truth directly, without needing to rely on copies.

This is also a massive step forward for our clients as they mature into real DataOps and continue beyond to full site reliability engineering operational patterns for their data estates. Fewer copies of data mean increased pipeline reliability, data trustability, and data velocity.

This new model is especially important for our clients when they choose a heterogeneous ecosystem combining tools from many partners (maybe using Adobe for DXP and marketing automation, and Salesforce for sales and service). In that situation, they struggle to build consistent predictive models that can power them all, and their customers end up getting different personalization from different channels. When we can bring all the data together in the Lakehouse faster and simpler, it makes it possible to build one model that can be consumed by all platforms. This efficiency is critical to the practicality of adopting AI at scale.

Perficient is unique in our depth and history with Data + Intelligence, and our diversity of partners. Salesforce’s “better together” approach is aligned precisely with our normal way of working. If you use Snowflake, RedShift, Synapse, Databricks, or Big Query, we have the right experience to help you make better decisions faster with Salesforce Data Cloud.

Apache Spark: Merging Files using Databricks
https://blogs.perficient.com/2024/03/30/apache-spark-merging-files-using-databricks/
Sat, 30 Mar 2024

In data engineering and analytics workflows, merging files is a common task when managing large datasets distributed across multiple files. Databricks provides a powerful platform for processing big data, and Scala is one of its primary languages. In this blog post, we’ll delve into how to merge files efficiently using Scala on Databricks.

Introduction:

Merging files entails combining the contents of multiple files into a single file or dataset. This operation proves necessary for various reasons, such as data aggregation, data cleaning, or preparing data for analysis. Databricks streamlines this task by providing a distributed computing environment conducive to processing large datasets using Scala.

Prerequisites:

Before embarking on the process, ensure you have access to a Databricks workspace, and a cluster configured with Scala support. Additionally, you should have some files stored in a location accessible from your Databricks cluster.

Let’s explore the Merging through an example:

In the example below, we have three files – a header file, a detail file, and a trailer file – which we will merge using Spark Scala on Databricks.

The header file needs to be written first, followed by the detail file and the trailer file.

Preparing the files:

Detail File:

The detail file contains the main data; in this case, it lists countries and their corresponding capitals.

Detail Dataframe

Header File:

The header file identifies the kind of file, sometimes includes the date the file was generated, and provides the column headers for the content in the detail file.

Header Dataframe

Trailer File:

The trailer file typically contains the count of rows present in the detail file.

Trailer Dataframe

Merging Approach:

We read the files in the appropriate order and then write them into a single file, as sketched below. Finally, we remove the source files we used, which is good housekeeping.

Merging File Spark Scala
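
The merging code itself appears only as a screenshot above, so here is a minimal Scala sketch of the approach described. The DBFS paths are hypothetical, and dbutils assumes a Databricks notebook environment. Note that DataFrame union does not formally guarantee row order, so for large inputs you may want to add an explicit ordering column.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical DBFS paths for the three input files
val headerDF  = spark.read.text("dbfs:/FileStore/merge/header.txt")
val detailDF  = spark.read.text("dbfs:/FileStore/merge/detail.txt")
val trailerDF = spark.read.text("dbfs:/FileStore/merge/trailer.txt")

// Combine in the required order and coalesce to a single output part file
headerDF.union(detailDF).union(trailerDF)
  .coalesce(1)
  .write
  .mode("overwrite")
  .text("dbfs:/FileStore/merge/output/")

// Remove the source files once the merged output is written (Databricks-only utility)
Seq("header.txt", "detail.txt", "trailer.txt").foreach { name =>
  dbutils.fs.rm(s"dbfs:/FileStore/merge/$name")
}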

Merged File:

Below is the output merged file were all the header, detail and trailer are displayed in the order.

Merged File Output

References:

Check out the blog on writing a DataFrame into files, and the one on using DBFS: DBFS (Databricks File System) in Apache Spark / Blogs / Perficient

Check out more about Databricks here: Databricks documentation | Databricks on AWS

Conclusion:

Effectively merging files is pivotal for data processing tasks, especially when grappling with large datasets. In this blog post, we’ve shown how to merge files using Scala on Databricks. Depending on your specific use case and the size of your dataset, you can adapt this approach to merge files efficiently. Databricks’ distributed computing capabilities, coupled with Scala’s flexibility, make it a potent combination for handling big data tasks.

Introduction to Star and Snowflake schema
https://blogs.perficient.com/2024/03/29/introduction-to-star-and-snowflake-schema/
Fri, 29 Mar 2024

In the world of data warehousing and business intelligence, two key concepts are fundamental: Snowflake and Star Schema. These concepts play a pivotal role in designing effective data models for analyzing large volumes of data efficiently. Let’s delve into what Snowflake and Star Schema are and how they are used in the realm of data warehousing.

Snowflake Schema

The Snowflake Schema is a type of data warehouse schema that consists of a centralized fact table that is connected to multiple dimension tables in a hierarchical manner. The name “Snowflake” stems from its resemblance to a snowflake, where the fact table is at the center, and dimension tables branch out like snowflake arms. In this schema:

  • The fact table contains quantitative data or measures, typically numeric values, such as sales revenue, quantity sold, or profit.
  • Dimension tables represent descriptive attributes or perspectives by which data is analyzed, such as time, geography, product, or customer.


The key characteristics of a Snowflake Schema include:

  • Normalization: Dimension tables are normalized, meaning redundant data is minimized by breaking down the dimension into multiple related tables.
  • Complex Joins: Analytical queries may involve complex joins between the fact table and multiple dimension tables to retrieve the desired information.

Snowflake Schema is particularly useful when dealing with large and complex datasets. However, the downside is that it can introduce more complex query logic due to the need for multiple joins.

Star Schema

The Star Schema is another widely used schema for data warehousing that consists of a single fact table connected directly to multiple dimension tables. In this schema:

  • The fact table contains quantitative data or measures, similar to the Snowflake Schema.
  • Dimension tables represent descriptive attributes, similar to the Snowflake Schema.


The key characteristics of a Star Schema include:

  • Denormalization: Dimension tables are denormalized, meaning redundant data is included directly in the dimension tables, simplifying query logic.
  • Simpler Joins: Analytical queries typically involve simpler joins between the fact table and dimension tables compared to the Snowflake Schema.

Star Schema is known for its simplicity and ease of use. It is well-suited for simpler analytical queries and is often favored for its performance benefits in query execution.

Key Differences

The main difference between Star and Snowflake schemas lies in their approach to storing dimensional data. Star schemas are simpler, with denormalized dimension tables, making them well-suited for fast query performance and simpler analytical queries. On the other hand, Snowflake schemas prioritize data integrity and storage efficiency through normalization but may result in slightly slower query performance due to additional joins.
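
To make the join-complexity difference concrete, here is a hedged Spark SQL sketch. The tables fact_sales, dim_product, and dim_category and their columns are hypothetical and not part of the original post; in the star version the product dimension is denormalized and already carries category_name, while in the snowflake version it only carries a category_id pointing at a separate sub-dimension.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Star schema: the fact table joins directly to a denormalized product dimension
val starRevenue = spark.sql("""
  SELECT d.category_name, SUM(f.sales_amount) AS revenue
  FROM fact_sales f
  JOIN dim_product d ON f.product_id = d.product_id
  GROUP BY d.category_name
""")

// Snowflake schema: the product dimension is normalized, so the same question
// needs an extra join through the category sub-dimension
val snowflakeRevenue = spark.sql("""
  SELECT c.category_name, SUM(f.sales_amount) AS revenue
  FROM fact_sales f
  JOIN dim_product  d ON f.product_id  = d.product_id
  JOIN dim_category c ON d.category_id = c.category_id
  GROUP BY c.category_name
""")

starRevenue.show()
snowflakeRevenue.show()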

Conclusion

Both Snowflake and Star Schema are essential concepts in the field of data warehousing, each with its own set of advantages and use cases. Choosing between them depends on the specific requirements of your data analysis tasks, the complexity of your data, and the performance considerations of your analytical queries. By understanding these schemas, you can design effective data models that cater to the needs of your business intelligence initiatives, enabling you to derive valuable insights from your data efficiently.


Spark DataFrame: Writing into Files
https://blogs.perficient.com/2024/03/06/spark-dataframe-writing-into-files/
Thu, 07 Mar 2024

This blog post explores how to write Spark DataFrame into various file formats for saving data to external storage for further analysis or sharing.

Before diving into this blog, have a look at my other blog posts discussing how to create and manipulate a DataFrame, as well as how to write a DataFrame into tables and views.

Dataset:

Below is the dataset we will use to demonstrate writing a DataFrame into a file.

Dataset

Writing Spark DataFrame to File:

CSV Format:

Below is the syntax to write a Spark DataFrame into a CSV file.

df.write.csv("output_path")

Let’s go over writing the DataFrame to a file using examples and scenarios.

Example:

The snapshot below shows a sample of writing a DataFrame into a file.

Spark DataFrame Write to File - Display From The Path

After writing the DataFrame to the path, the files in that path are displayed. The part files are where the data is actually stored; Spark writes one part file per partition of the DataFrame (in this small example, roughly one per row), so multiple files are created. We can repartition the DataFrame to produce a single file.

DataFrame Repartition:

Spark DataFrame Write into csv - Display From The Path Repartition

After repartitioning, we observe that all the part files are combined into a single file. We also notice other files besides the part files, which we can prevent from being created by using the Spark configurations below. These files are created even when writing the data into file formats other than CSV.

Removing _committed and _started Files:

We can use the Spark configuration below so that the files starting with _committed and _started are not created.

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

Spark DataFrame Write to File -Display From The Path Commit Protocol

Removing _SUCCESS File:

We can use the Spark configuration below to stop the _SUCCESS file from being generated.

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Spark DataFrame Write to File -Display From The Path Makesuccessfulljobs

Data in the File:

With all the additional files removed, we can see the data that was loaded into the file. Notice that by default Spark does not write a header into the file; we can change this by using option/options. Let’s also look at the other options available when writing a DataFrame into a file.

Output From The File

Header Option:

Dataframe Write With Header
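
The call behind the screenshot above is, as a sketch with a hypothetical output path, roughly:

df.write.option("header","true").mode("overwrite").csv("dbfs:/FileStore/df_write/")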

By adding the header option, we observe that the header is populated in the file. Similarly, we have an option to change the delimiter.

Delimiter Option:

Dataframe Write With Delimiter

 

We can change the delimiter to our desired character by adding the delimiter option, or we can also use sep (syntax provided below).

df.write.option("header","true").option("sep","|").mode("overwrite").csv("dbfs:/FileStore/df_write/")

nullValue Option:

From the previous output, we can see that the capital for Tonga is null in the DataFrame, though in the CSV it would be written as an empty value. We can have it written out as null by using the nullValue option.

Dataframe Write With Nullvalue
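
As a sketch (the output path is hypothetical), the nullValue option can be added like this:

df.write.option("header","true").option("nullValue","null").mode("overwrite").csv("dbfs:/FileStore/df_write/")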

With this option, we observe that null is retained.

emptyValue Option:

In some scenarios we may need to populate null for empty values, in that case we can use the below option.

Dataframe Write With Emptyvalue
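
As a sketch (the output path is hypothetical), the emptyValue option can be set like this:

df.write.option("header","true").option("emptyValue","null").mode("overwrite").csv("dbfs:/FileStore/df_write/")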

From the output above, we observe that Denmark previously had an empty value populated for its capital, but it is now being populated with null.

ignoreLeadingWhiteSpaces and ignoreTrailingWhiteSpaces Option:

If we need to retain the spaces before or after the value in a column, we can use the below options.

Dataframe Write With Ignoreleadtrail
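
As a sketch (the output path is hypothetical), leading and trailing spaces can be retained by turning both options off:

df.write.option("header","true").option("ignoreLeadingWhiteSpace","false").option("ignoreTrailingWhiteSpace","false").mode("overwrite").csv("dbfs:/FileStore/df_write/")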

Different Way to use Multiple Options:

If we have to use the same set of options for multiple files, we can keep the options for the file format in a common variable and reuse it whenever needed.

Dataframe Write With Multile Options

We have created a variable writeOptions of Map type that stores the options, and we can use it whenever we need that set of output options, as sketched below.
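
A minimal sketch of what that might look like (the keys, values, and path are illustrative, not the exact ones from the screenshot):

val writeOptions = Map("header" -> "true", "sep" -> "|", "nullValue" -> "null")
df.write.options(writeOptions).mode("overwrite").csv("dbfs:/FileStore/df_write/")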

JSON Format:

We can use the below syntax and format to write into a JSON file from the DataFrame.

Dataframe Write With Json
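
As a sketch, the JSON equivalent of the earlier CSV writes looks like this (path hypothetical):

df.write.mode("overwrite").json("dbfs:/FileStore/df_write/")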

Other Formats:

ORC Format:

Below is the syntax for writing the DataFrame in ORC Format:

df.write.mode("overwrite").orc("dbfs:/FileStore/df_write/")

Parquet Format:

Below is the syntax for writing the DataFrame in Parquet format:

df.write.mode("overwrite").parquet("dbfs:/FileStore/df_write/")

Similar to the above, there are several more formats, examples, and syntaxes that you can reference in the Spark and Databricks documentation.

In this blog post, we covered the basics of writing Spark DataFrame into different file formats. Depending on your specific requirements and use cases, you can choose the appropriate file format and configuration options to optimize performance and compatibility.

Spark SQL Properties
https://blogs.perficient.com/2024/03/05/spark-sql-properties/
Wed, 06 Mar 2024

The spark.sql.* properties are a set of configuration options specific to Spark SQL, a module within Apache Spark designed for processing structured data using SQL queries, DataFrame API, and Datasets. These properties allow users to customize various aspects of Spark SQL’s behavior, optimization strategies, and execution environment. Here’s a brief introduction to some common spark.sql.* properties:

spark.sql.shuffle.partitions

The spark.sql.shuffle.partitions property in Apache Spark determines the number of partitions to use when shuffling data during operations like joins or aggregations in Spark SQL. Shuffling involves redistributing and grouping data across partitions based on certain criteria, and the number of partitions directly affects the parallelism and resource utilization during these operations. The default behavior splits DataFrames into 200 unique partitions when shuffling data.

Syntax:

// Setting the number of shuffle partitions to 200
spark.conf.set("spark.sql.shuffle.partitions", "200")

spark.sql.autoBroadcastJoinThreshold

The spark.sql.autoBroadcastJoinThreshold property in Apache Spark SQL determines the threshold size beyond which Spark SQL automatically broadcasts smaller tables for join operations. Broadcasting involves replicating a smaller DataFrame or table to all executor nodes to avoid costly shuffling during join operations.

Syntax:

// Setting the autoBroadcastJoinThreshold to 10MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

spark.sql.execution.arrow.enabled

In Apache Spark SQL, the spark.sql.execution.arrow.enabled property determines whether Arrow-based columnar data transfers are enabled for DataFrame operations. Arrow is a columnar in-memory data format that can significantly improve the performance of data serialization and deserialization, leading to faster data processing.

Syntax:

// Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.sql.sources.partitionOverwriteMode

The spark.sql.sources.partitionOverwriteMode property in Apache Spark SQL determines the mode for overwriting partitions when writing data into partitioned tables. This property is particularly relevant when updating existing data in partitioned tables, as it specifies how Spark should handle the overwriting of partition directories. By default, partitionOverwriteMode is set to static.

Syntax:

// Setting the partition overwrite mode to "dynamic"
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql.statistics.histogram.enabled

The spark.sql.statistics.histogram.enabled property in Apache Spark SQL determines whether Spark SQL collects histograms for data statistics computation. Histograms provide additional insights into the distribution of data in columns, which can aid the query optimizer in making better execution decisions. By default, the config is set to false.

Syntax:

// Enable collection of histograms for data statistics computation
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

spark.sql.streaming.schemaInference

The spark.sql.streaming.schemaInference property in Apache Spark SQL determines whether schema inference is enabled for streaming DataFrames. When enabled, Spark SQL automatically infers the schema of streaming data sources during runtime, simplifying the development process by eliminating the need to manually specify the schema.

Syntax:

// Enable schema inference for streaming DataFrames
spark.conf.set("spark.sql.streaming.schemaInference", "true")

spark.sql.adaptive.skewJoin.enabled

The spark.sql.adaptive.skewJoin.enabled property in Apache Spark SQL determines whether adaptive query execution is enabled for skew join optimization. When enabled, Spark SQL automatically detects and mitigates data skewness in join operations by dynamically adjusting the join strategy to handle skewed data distributions more efficiently. By default, skew join handling is set to true.

Syntax:

// Enable adaptive query execution for skew join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

spark.sql.inMemoryColumnarStorage.batchSize

The spark.sql.inMemoryColumnarStorage.batchSize property in Apache Spark SQL configures the batch size for columnar caching. This property defines the number of rows that are processed and stored together in memory during columnar caching operations. By default, the batch size is 10000.

Syntax:

// Setting the batch size for columnar caching to 1000 rows
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")

spark.sql.adaptive.coalescePartitions.enabled

The spark.sql.adaptive.coalescePartitions.enabled property in Apache Spark SQL determines whether adaptive partition coalescing is enabled. When enabled, Spark SQL dynamically adjusts the number of partitions during query execution to optimize resource utilization and improve performance.

Syntax:

// Enable adaptive partition coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Example

Here’s an example demonstrating the usage of all the mentioned Spark SQL properties along with a SQL query:

// Importing necessary Spark classes
import org.apache.spark.sql.{SparkSession, DataFrame}

// Setting Spark SQL properties
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760") // 10 MB
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.conf.set("spark.sql.streaming.schemaInference", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// Creating DataFrames for the tables
val employeesData = Seq((1, "Aarthii", 1000), (2, "Gowtham", 1500), (3, "Saranya", 1200))
val departmentsData = Seq((1000, "HR"), (1200, "Engineering"), (1500, "Finance"))
val employeesDF = spark.createDataFrame(employeesData).toDF("emp_id", "emp_name", "dept_id")
val departmentsDF = spark.createDataFrame(departmentsData).toDF("dept_id", "dept_name")

// Registering DataFrames as temporary views
employeesDF.createOrReplaceTempView("employees")
departmentsDF.createOrReplaceTempView("departments")

// Executing a SQL query using the configured properties
val result = spark.sql(
"SELECT emp_name, dept_name FROM employees e JOIN departments d ON e.dept_id = d.dept_id"
)

// Showing the result
result.show()

OUTPUT:

Spark Sql Properties

In this example:

  • We import the necessary Spark classes, including SparkSession and DataFrame.
  • We use the SparkSession object named spark (as provided in a Databricks notebook or spark-shell).
  • We set various Spark SQL properties using the spark.conf.set() method.
  • We create DataFrames for two tables: “employees” and “departments”.
  • We register the DataFrames as temporary views using createOrReplaceTempView().
  • We execute a SQL join query between the “employees” and “departments” tables using spark.sql().
  • Finally, we display the result using show().

These properties provide fine-grained control over Spark SQL’s behavior and optimization techniques, enabling users to tailor the performance and functionality of Spark SQL applications to specific requirements and use cases.

Reference: https://spark.apache.org/docs/latest/configuration.html
