Databricks strengthens MosaicAI with Lilac

Databricks has acquired LilacAI as it continues to strengthen its end-to-end data intelligence platform. The 2023 acquisition of MosaicML gave Databricks significant capabilities in the Generative AI space, with the ability to train and deploy Large Language Models (LLMs) at scale. Next, Databricks purchased Arcion to provide native real-time data ingestion into its AI and ML models as well as into its Lakehouse. The Einblick acquisition followed, providing strong Natural Language Processing (NLP) capabilities. Databricks has strong open-source roots, and this led to a partnership with the open-source MistralAI. How does Databricks think LilacAI will strengthen its MosaicAI platform?

Easier Unstructured

Databricks has always provided a platform for structured, semi-structured and unstructured data. Working with unstructured data, particularly at scale, can be difficult. Unstructured data lacks labels or even useful metadata. Even “good” is harder to define for unstructured data than for structured data. The stated mission of LilacAI is to

make unstructured data visible, quantifiable and malleable, leading to higher-quality AI models, better control and visibility of model bias, and better actionability when AI models fail.
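As a rough illustration of what "quantifiable" could mean for raw text, the sketch below attaches a few simple quality signals to each document: word count, exact-duplicate detection after normalization, and a crude check for email addresses. This is not Lilac's API; the function names (`normalize`, `profile`), the chosen signals and the sample documents are all assumptions made for the example, and it uses only the Python standard library.

```python
import hashlib
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def profile(docs: list[str]) -> list[dict]:
    """Return one dict of illustrative quality signals per document."""
    # Hash the normalized text so duplicates can be counted cheaply.
    hashes = [hashlib.sha256(normalize(d).encode("utf-8")).hexdigest() for d in docs]
    counts = Counter(hashes)
    rows = []
    for doc, digest in zip(docs, hashes):
        rows.append({
            "n_words": len(normalize(doc).split()),
            "is_duplicate": counts[digest] > 1,
            # Crude pattern; a real pipeline would use a proper PII detector.
            "has_email": bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", doc)),
        })
    return rows


# Hypothetical sample documents.
docs = [
    "Reset your password from the account page.",
    "reset your   password from the account page.",
    "Contact jane.doe@example.com for a refund.",
    "ok",
]
report = profile(docs)
for doc, row in zip(docs, report):
    print(row, "<-", doc)
```

Signals like these give unlabeled text something closer to metadata, so that records can be filtered or reviewed before they reach a training or retrieval pipeline.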

The stated goal of Databricks MosaicAI is to provide enterprise customers with end-to-end tooling to develop Generative AI solutions with their own data. It's interesting to note how they called out preparing datasets for RAG as well as pre-training, monitoring the output of LLMs and fine-tuning. There is a persistent problem with training LLMs, particularly on relatively small enterprise-sized datasets: LLMs are prone to hallucinations.

Cleanup with a RAG

LLMs use deep learning to process large amounts of text from multiple sources using Natural Language Processing (NLP) to learn patterns and connections between words. Some of this data is accurate, up-to-date, internally consistent and does not contain secure data; some less so. Conflating so many disparate data sources with varying levels of data quality can lead to results that are untrue, out of date, inappropriate, insecure or just nonsense. Practical solutions have not been easy to come by.

Training an LLM is an extremely resource-intensive and expensive process. For very large models, this could involve months of runtime on dozens of high-end GPUs. From a practical perspective, this means that most publicly available LLMs are working on outdated information. For smaller, enterprise-level datasets, poor quality can have an outsized impact on the utility, and even safety, of the data.

Currently, the responsibility for limiting hallucinations is left to individual AI/ML teams. Fine-tuning an LLM manually requires substantial specialized work in model training. Recently, Meta developed an approach to improve the quality of LLM output called Retrieval-Augmented Generation (RAG).

RAG can improve (augment) the generation of a result by retrieving more current information and supplying it to the existing model at query time, without retraining the whole model. Once you realize that the high costs associated with fully retraining a model are prohibitive, augmenting the model with more current information seems like a reasonable approach to continuous quality improvement. LangChain makes doing an initial POC with RAG pretty straightforward. You can see a more advanced implementation by cloning Meta’s Fusion-in-Decoder research project.
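The retrieve-then-augment idea can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain or Fusion-in-Decoder code: it swaps embedding-based retrieval for simple word-count cosine similarity, it stops at building the augmented prompt rather than calling a model, and the names `tokenize`, `cosine`, `retrieve` and `build_prompt`, along with the sample passages, are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Augment the question with retrieved context before it goes to an LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical knowledge base that is newer than the model's training data.
passages = [
    "The refund window was extended to 60 days in the latest policy update.",
    "Support is available by chat on weekdays.",
    "Shipping is free for orders over 50 dollars.",
]
question = "How many days is the refund window?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, passages, k=1)
print(prompt)
```

Updating the model's knowledge then becomes a matter of updating the passage store, which is why the quality of that store matters so much. A production setup would typically replace the word-count similarity with vector embeddings and a vector index.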

Conclusion

Enterprises, particularly in regulated industries, rely on robust, effective and ubiquitous data quality pipelines. Unlocking the value in unstructured data has been a goal since the Hadoop days, and an elusive one at best. By including Lilac in its MosaicAI platform, Databricks is recognizing how difficult and how critical data quality on unstructured data is to building business-ready LLMs, and is making that work easier for the dev team.

Contact us to learn more about how to build a robust GenAI pipeline in Databricks for your organization.

David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating Modern Data products such as HBase, Cassandra and Neo4J as the off-blockchain repository.