Databricks has acquired LilacAI as it continues to strengthen its end-to-end data intelligence platform. The 2023 acquisition of MosaicML gave Databricks significant capabilities in the Generative AI space, with the ability to train and deploy Large Language Models (LLMs) at scale. Next, Databricks purchased Arcion to provide native real-time data ingestion into its AI and ML models as well as into its Lakehouse. The Einblick acquisition followed, providing strong Natural Language Processing (NLP) capabilities. Databricks has strong open-source roots, and this led to a partnership with the open-source MistralAI. How does Databricks think LilacAI will strengthen its MosaicAI platform?
Easier Unstructured
Databricks has always provided a platform for structured, semi-structured and unstructured data. Working with unstructured data, particularly at scale, can be difficult. Unstructured data often lacks labels or even useful metadata. Even “good” is harder to define for unstructured data than for structured data. The stated mission of LilacAI is to
make unstructured data visible, quantifiable and malleable, leading to higher quality AI models and providing better control and visibility of model bias and better actionability when AI models fail.
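One way to read "visible" and "quantifiable" is that every raw document gets a few measurable signals attached to it. The sketch below is plain Python and is not Lilac's API; the function names and the choice of signals (word count, empty flag, exact-duplicate flag) are assumptions made purely for illustration.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def profile(texts):
    """Attach simple, hypothetical quality signals to each raw document."""
    seen = set()
    rows = []
    for text in texts:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        rows.append({
            "n_words": len(norm.split()),
            "is_empty": norm == "",
            # True only for the second and later copies of the same text.
            "is_duplicate": digest in seen,
        })
        seen.add(digest)
    return rows


# Illustrative sample data, not taken from any real dataset.
sample = [
    "Hello  world",
    "hello world",
    "",
    "A longer support ticket about billing",
]
report = profile(sample)
```

Once signals like these exist, "good" can at least be discussed in terms of thresholds and filters, even if the right thresholds still depend on the use case.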
The stated goal of Databricks MosaicAI is to provide enterprise customers with end-to-end tooling to develop Generative AI solutions with their own data. It's interesting to note that Databricks called out preparing datasets for RAG, pre-training and fine-tuning, as well as monitoring the output of LLMs. There is a persistent problem with training LLMs, particularly on relatively small enterprise-sized datasets: the resulting models are prone to hallucinations.
Cleanup with a RAG
LLMs use deep learning to process large amounts of text from multiple sources using Natural Language Processing (NLP) to learn patterns and connections between words. Some of this data is accurate, up-to-date, internally consistent and free of sensitive information; some less so. Conflating so many disparate data sources with varying levels of data quality can lead to results that are untrue, out of date, inappropriate, insecure or just nonsense. Practical solutions have not been easy to come by.
Training an LLM is an extremely resource-intensive and expensive process. For very large models, this could involve months of runtime on dozens of high-end GPUs. From a practical perspective, this means models are retrained infrequently, so most publicly available LLMs are working on outdated information. For smaller, enterprise-level datasets, poor data quality can have an outsized impact on the utility, and even safety, of the resulting model.
Currently, the responsibility for limiting hallucinations is left to individual AI/ML teams. Fine-tuning an LLM manually requires substantial specialized work in model training. In 2020, researchers at Meta (then Facebook AI Research) and their collaborators introduced an approach to improve the quality of LLM output called Retrieval-Augmented Generation (RAG).
RAG can improve (augment) the generation of a result by retrieving more current information at query time and supplying it to the existing model as context, without retraining the whole model. Once you realize that the high costs associated with fully retraining a model are prohibitive, augmenting the model with more current information seems like a reasonable approach to continuous quality improvement. LangChain makes doing an initial POC with RAG pretty straightforward. You can see a more advanced implementation by cloning Meta’s Fusion-in-Decoder research project.
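To make the retrieve-then-augment idea concrete, here is a minimal sketch in plain Python. It does not use LangChain or Fusion-in-Decoder; the bag-of-words retriever, the function names and the sample documents are all assumptions chosen for illustration, and the final call to an LLM is deliberately left out. A real system would typically use embeddings and a vector index instead of word counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=2):
    """Return up to top_k documents that share vocabulary with the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(question, passages):
    """Augment the question with retrieved passages before it goes to the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical enterprise documents that are newer than the model's training data.
DOCS = [
    "The travel policy was updated in March: economy class is required for flights under six hours.",
    "Expense reports must be submitted within 30 days of travel.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]

question = "What class should I book for flights?"
passages = retrieve(question, DOCS, top_k=1)
prompt = build_prompt(question, passages)
# The prompt would now be sent to whichever LLM you use; that call is omitted here.
```

The model's weights never change in this flow. Only the document store is updated, which is why the approach is cheap compared with retraining.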
Conclusion
Enterprise data, particularly in regulated industries, relies on robust, effective and ubiquitous quality pipelines. Being able to unlock the value in unstructured data has been a goal since the Hadoop days. An elusive goal, at best. By adding Lilac to its MosaicAI platform, Databricks is recognizing how difficult and how critical data quality on unstructured data is when building business-ready LLMs, and is making that work easier for the dev team.
Contact us to learn more about how to build a robust GenAI pipeline in Databricks for your organization.