Databricks acquires Lilac to supercharge data quality efforts for gen AI apps
Today, Databricks announced the acquisition of Lilac, a Boston-based applied research startup offering tools for data understanding and manipulation. The terms of the deal were not disclosed.
The Ali Ghodsi-led data giant plans to bring Lilac’s team and technology to its data intelligence platform, formerly known as the data lakehouse, giving users across domains a more seamless way to improve the quality of their datasets for developing production-quality large language model (LLM) applications.
The deal comes as the latest effort from Databricks to become the one-stop-shop for not only data but also all things generative AI. Just recently, it also invested an undisclosed sum in Mistral, the generative AI startup that raised Europe’s largest seed round last year and has become a strong player in the gen AI domain.
How Lilac will make exploring data easy
When Databricks acquired MosaicML in a massive deal last year, the company shifted gears towards an AI-driven future, where users would use the data securely hosted on its platform to build generative AI applications. Since then, the company has made several developments in the space and even rolled out multiple open models to give customers everything they need to build, deploy and maintain high-quality large language model (LLM) apps targeting different business use cases.
However, as is widely said in the industry, data remains critical to all AI efforts, including LLM systems. Teams have to make sure that they have high-quality data for training the models as well as for testing how they perform in the real world, covering aspects like bias and hallucinations. This is what Lilac helps with and will tackle with Databricks.
Traditionally, teams have had to use time-consuming manual methods to explore unstructured data and address its gaps. Lilac, founded by former Google engineers Daniel Smilkov and Nikhil Thorat in 2023, addresses this challenge with a scalable open-source solution that offers an intuitive UI and AI-driven features to analyze, understand and modify unstructured text data.
According to the company’s website, data scientists and AI researchers can do a lot with Lilac when handling unstructured data: clustering documents and assigning them categories, running semantic and keyword searches, detecting personal information or duplicates, and making the edits needed to remove them (with a comparison view) and tailor the dataset.
“The team behind Lilac specifically built their product to enable an analysis of model outputs for bias or toxicity, and preparation of data for RAG and fine-tuning or pre-training LLMs,” Databricks executives Matei Zaharia, Naveen Rao, Jonathan Frankle, Hanlin Tang and Akhil Gupta wrote in a joint blog post.
They added that Lilac’s entire tech stack will come under Databricks’ Mosaic AI tooling to give developers a way to better curate datasets for custom gen AI systems. While the specifics of the integration remain undisclosed at this stage, the technology is expected to do the same job: simplify data tailoring to make it easier for teams to evaluate and monitor the outputs of their LLMs as well as prepare datasets for RAG, fine-tuning and pre-training.
“We believe that bringing the real-time, interactive data curation experience of Lilac to Databricks’ enterprise-scale platform will enable businesses to have much more visibility and control over their unstructured data. This will enable world-class, customizable AI products that serve end-users. Joining forces with Databricks will enable an entirely new class of enterprise developers to unlock the potential of their data with generative AI, with just a few clicks,” the startup wrote in a separate post published on its website.
The acquisition, as mentioned above, marks a notable step from Databricks to provide its customers with end-to-end tooling to develop high-quality gen AI apps using their own data. As of now, users on the Databricks platform have everything they need to build LLM-powered systems.
This includes open models from players like Meta, Stability and Mistral as well as dedicated Mosaic tools to experiment with them, use them as optimized model endpoints or customize them with their proprietary data hosted on the platform (Mosaic AI Foundation Model Adaptation) to target a specific use case.
Snowflake, the company’s major competitor, is also moving in the same direction and has introduced Cortex, a fully managed service to help its customers build apps driven by powerful open models.