Natural language AI has made its way into many of today’s applications and platforms. One of the most in-demand use cases is getting quick answers to questions about what’s hidden within organizational data, such as operational, financial, or other enterprise data. Combining the latest advancements in GenAI with enterprise data warehouses is therefore valuable. The SelectAI feature of the Oracle Autonomous Database (ADB) does exactly this. It removes the complexity of calling various large language models (LLMs) from within the database itself. From an end user perspective, SelectAI is as easy as asking the question, without having to worry about GenAI prompt generation, data modeling, or LLM fine-tuning.
In this post, I will summarize my findings on implementing ADB SelectAI and share some tips on what worked best and what to look out for when planning your implementation.
Several GenAI Models: Which One to Use?
What I like about SelectAI is that switching the underlying GenAI model is simple. This matters over time, as it lets us stay up to date and take advantage of the best that LLMs have to offer at the most suitable cost. We can also set up SelectAI with multiple LLMs simultaneously, for example to cater to different user groups at varying levels of service. There will always be a better LLM down the road, but at this time these findings are based on trials of the Oracle Cloud Infrastructure (OCI) shared Cohere Command model, the OpenAI GPT-3.5-Turbo model, and the OpenAI GPT-4 model. Here is a summary of how each worked out:
Cohere Command:
This model worked well for simple, well-phrased questions whose nouns relate to the metadata, but it didn’t work well when the questions got more complex. Rather than giving a wrong answer, it returned a message apologizing for the inability to generate one: “Sorry, unfortunately a valid SELECT statement could not be generated…”. At the time of this writing, the Command R+ model had just become generally available, but it wasn’t tried as part of this exercise. It remains to be seen how effective the newer R+ model is in comparison to the others.
OpenAI GPT-4:
This LLM worked a lot better than Cohere Command in that it answered all the questions that Command couldn’t. However, it comes at a higher cost.
OpenAI GPT-3.5-Turbo:
This one is my favorite so far as it also answered all the questions that Command couldn’t and is roughly 50 times less expensive than GPT-4. It is also a lot faster to respond compared to the OCI shared Cohere Command. There were some differences though at times in how the answers are presented. Below is an example of what I mean:
Sample Question: Compare sales for package size P between the Direct and Indirect Channels
Responses Generated by Each Model:
- Cohere command: Sorry, unfortunately, a valid SELECT statement could not be generated
- OpenAI gpt-3.5-turbo: This was able to generate a good result set based on the following query, but the results weren’t automatically grouped in a concise manner.
SELECT s.PROD_ID, s.AMOUNT_SOLD, s.QUANTITY_SOLD, s.CHANNEL_ID, p.PROD_PACK_SIZE, c.CHANNEL_CLASS
FROM ADW_USER.SALES_V s
JOIN ADW_USER.CHANNELS_V c ON s.CHANNEL_ID = c.CHANNEL_ID
JOIN ADW_USER.PRODUCTS_V p ON s.PROD_ID = p.PROD_ID
WHERE p.PROD_PACK_SIZE = 'P' AND c.CHANNEL_CLASS IN ('Direct', 'Indirect');
- OpenAI gpt-4: This provided the best answer, and the results were best suited to the question, as it grouped by Channel Class to make the sales easy to compare.
SELECT c.CHANNEL_CLASS AS Channel_Class, SUM(s.AMOUNT_SOLD) AS Total_Sales
FROM ADW_USER.SALES_V s
JOIN ADW_USER.PRODUCTS_V p ON s.PROD_ID = p.PROD_ID
JOIN ADW_USER.CHANNELS_V c ON s.CHANNEL_ID = c.CHANNEL_ID
WHERE p.PROD_PACK_SIZE = 'P' AND c.CHANNEL_CLASS IN ('Direct', 'Indirect')
GROUP BY c.CHANNEL_CLASS;
Despite this difference, most of the answers were similar between GPT-4 and GPT-3.5-Turbo, which is why I recommend starting with GPT-3.5-Turbo and experimenting with your schemas at minimal cost.
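As a rough sketch of how little it takes to switch models, the block below defines one profile per OpenAI model and picks between them per session. The profile names (GENAI_GPT35, GENAI_GPT4) and the credential name (OPENAI_CRED) are hypothetical; the credential is assumed to already exist, and the owner-only object_list entry is assumed to expose every object in the ADW_USER schema.

```sql
-- Sketch only: profile and credential names are hypothetical.
-- Assumes a credential named OPENAI_CRED holding an OpenAI API key already exists.
BEGIN
  -- Lower-cost profile for everyday questions
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'GENAI_GPT35',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "model": "gpt-3.5-turbo",
                      "object_list": [{"owner": "ADW_USER"}]}');

  -- Higher-cost profile for questions that need better-shaped answers
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'GENAI_GPT4',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "model": "gpt-4",
                      "object_list": [{"owner": "ADW_USER"}]}');
END;
/

-- Choose a profile for the current session (EXEC is SQL*Plus/SQLcl shorthand),
-- then ask for the generated SQL without running it.
EXEC DBMS_CLOUD_AI.SET_PROFILE('GENAI_GPT35');
SELECT AI showsql compare sales for package size P between the Direct and Indirect channels;

-- Switch models by switching profiles and ask the same question again.
EXEC DBMS_CLOUD_AI.SET_PROFILE('GENAI_GPT4');
SELECT AI showsql compare sales for package size P between the Direct and Indirect channels;
```

Comparing the two showsql outputs side by side is a cheap way to decide which questions, if any, justify the more expensive model.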
Another great aspect of the OpenAI GPT models is that they support conversational questions that follow up on earlier ones in a thread-like manner. So, after I ask for total sales by region, I can ask a follow-up question in the same conversation and say, for example, “keep only Americas”. The query gets updated to restrict the previous results to my new request.
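A follow-up exchange like this might look as follows. This is a sketch under assumptions: the profile name GENAI_CHAT and credential name OPENAI_CRED are hypothetical, and it relies on the profile's conversation attribute being available in your ADB version, which is worth confirming in the DBMS_CLOUD_AI documentation.

```sql
-- Sketch only: GENAI_CHAT and OPENAI_CRED are hypothetical names.
BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'GENAI_CHAT',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "model": "gpt-3.5-turbo",
                      "conversation": "true",
                      "object_list": [{"owner": "ADW_USER"}]}');
END;
/

EXEC DBMS_CLOUD_AI.SET_PROFILE('GENAI_CHAT');

-- First question establishes the context
SELECT AI what are the total sales by region;

-- Follow-up in the same session refines the previous request
SELECT AI keep only Americas;
```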
Tips on Preparing the Schema for GenAI Questions
No matter how capable an LLM you pick, the experience of using GenAI won’t be pleasant unless the database schemas are well prepared for natural language. Thanks to the Autonomous Database SelectAI, we don’t have to worry about the metadata every time we ask a question. The setup is done once, up front, and applies to all questions. Here are some schema prep tips that make a big difference in the overall data Q&A experience.
Selective Schema Objects:
Limit SelectAI to the most relevant set of tables/views in your ADB. For example, exclude any intermediate, temporary, or irrelevant tables and enable SelectAI on only the reporting-ready set of objects. This is important because SelectAI automatically builds a prompt containing the schema information and sends it to the LLM together with the question. Sending metadata that leaves out unnecessary database objects narrows the focus for the LLM as it generates an answer.
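One way to scope SelectAI is to list the reporting-ready objects explicitly in the profile's object_list. The sketch below uses the view names that appear in the queries in this post; the profile and credential names are hypothetical.

```sql
-- Sketch only: GENAI_REPORTING and OPENAI_CRED are hypothetical names.
-- Only the listed views are described to the LLM; anything else in
-- ADW_USER (staging, temporary, intermediate objects) is left out of the prompt.
BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'GENAI_REPORTING',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "model": "gpt-3.5-turbo",
                      "object_list": [
                        {"owner": "ADW_USER", "name": "SALES_V"},
                        {"owner": "ADW_USER", "name": "PRODUCTS_V"},
                        {"owner": "ADW_USER", "name": "CHANNELS_V"},
                        {"owner": "ADW_USER", "name": "CUSTOMERS_V"},
                        {"owner": "ADW_USER", "name": "COUNTRIES_V"},
                        {"owner": "ADW_USER", "name": "CUSTOMER_DEMOGRAPHICS_V"}
                      ]}');
END;
/
```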
Table/View Joins:
To get correct joins between tables, give the join columns the same name. For example, SALES.CHANNEL_ID = CHANNELS.CHANNEL_ID. Foreign key and primary key constraints don’t affect how tables are joined, at least at the time of writing this post, so we need to rely on consistently naming the join columns in the database objects.
Create Database Views:
Creating database views is very useful for SelectAI in several ways.
- Views allow us to reference tables in other schemas, so we can set up SelectAI on one schema that references objects in several other schemas.
- We can easily rename columns with a view to make them more meaningful for natural language processing.
- When creating a view, we can exclude unnecessary columns that don’t add value to SelectAI and limit the size of the LLM prompt at the same time.
- Rename columns in views so the joins are on identical column names.
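The view-based approach could look like the sketch below. The source schema and its table and column names (SRC_APP.CHNL_DIM, SRC_APP.ORDERS_FACT, CHNL_KEY, and so on) are made up for illustration; only the ADW_USER view and column names come from the queries in this post. ADW_USER is assumed to have SELECT privileges on the source tables.

```sql
-- Sketch only: SRC_APP and its table/column names are hypothetical.

-- Expose a table from another schema with natural-language-friendly column names
CREATE OR REPLACE VIEW ADW_USER.CHANNELS_V AS
SELECT d.CHNL_KEY   AS CHANNEL_ID,      -- renamed so it matches SALES_V.CHANNEL_ID
       d.CHNL_DESC  AS CHANNEL_DESC,
       d.CHNL_CLASS AS CHANNEL_CLASS
FROM   SRC_APP.CHNL_DIM d;              -- audit and ETL columns are left out

-- Keep only the columns that are useful for questions, with matching join names
CREATE OR REPLACE VIEW ADW_USER.SALES_V AS
SELECT f.PRODUCT_KEY  AS PROD_ID,
       f.CUSTOMER_KEY AS CUST_ID,
       f.CHNL_KEY     AS CHANNEL_ID,    -- same name as CHANNELS_V.CHANNEL_ID
       f.QTY          AS QUANTITY_SOLD,
       f.AMT          AS AMOUNT_SOLD
FROM   SRC_APP.ORDERS_FACT f;
```

Because both views expose CHANNEL_ID, the LLM has a clear hint for the join even though no constraint information is used.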
Comments:
Adding comments makes a huge difference in how much more effective SelectAI is. Here are some tips on what to do with comments:
- Comment on table/view level: Describe what type of information a table or view contains. For example, a view called “Demographics” may have a comment as follows: “Contains demographic information about customer education, household size, occupation, and years of residency”
- Comment on column level: For security purposes SelectAI (in a non-Narrate mode) doesn’t send data over to the GenAI model. Only metadata is sent over. That means if a user asks a question about a specific data value, the LLM has no visibility into where that value exists in the database. To enhance the user experience where sending some data values to the LLM is not a security concern, include the important data values in the comment. This enables the LLM to know where that data is. For example, following is a comment on a column called COUNTRY_REGION: “region. some values are Asia, Africa, Oceania, Middle East, Europe, Americas”. Or for a channel column, a comment like the following can be useful by including channel values: “channel description. For example, tele sales, internet, catalog, partners”
- Explain certain data values: Sometimes data values are coded and require translation. Following is an example of when this can be helpful: comment on column Products.VALID_FLAG: “indicates if a product is active. the value is A for active”
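The comments described above can be added with standard COMMENT statements, sketched below. The comment text is taken from the examples in this post; the fully qualified view names and the CHANNEL_DESC column are assumptions, as is the profile name GENAI_GPT35. The final call assumes the profile needs its comments attribute switched on for comments to be included in the prompt, which is worth confirming for your ADB version.

```sql
-- Sketch only: view names are assumed from the queries in this post.
-- COMMENT ON TABLE is also the statement used for views.
COMMENT ON TABLE ADW_USER.CUSTOMER_DEMOGRAPHICS_V IS
  'Contains demographic information about customer education, household size, occupation, and years of residency';

-- Column comments that tell the LLM which data values live where
COMMENT ON COLUMN ADW_USER.COUNTRIES_V.COUNTRY_REGION IS
  'region. some values are Asia, Africa, Oceania, Middle East, Europe, Americas';

COMMENT ON COLUMN ADW_USER.CHANNELS_V.CHANNEL_DESC IS
  'channel description. For example, tele sales, internet, catalog, partners';

-- Column comment that translates a coded value
COMMENT ON COLUMN ADW_USER.PRODUCTS_V.VALID_FLAG IS
  'indicates if a product is active. the value is A for active';

-- Ask SelectAI to include comments in the metadata it sends (hypothetical profile name)
BEGIN
  DBMS_CLOUD_AI.SET_ATTRIBUTE(
    profile_name    => 'GENAI_GPT35',
    attribute_name  => 'comments',
    attribute_value => 'true');
END;
/
```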
Is There a Better Way of Asking a Question?
While the aforementioned guidance is tailored for the upfront administrative setup of SelectAI, here are some tips for the SelectAI end user.
- Use double quotations for data values consisting of multiple words: This is useful for example when we want to filter data on particular values such as a customer or product name. The quotation marks also help pass the right case sensitivity of a word. For example: what are the total sales for “Tele Sales” in “New York City”.
- Add the phrase “case insensitive” at the end of your question to help find an answer. For example: “calculate sales for the partners channel case insensitive”. The SQL query condition generated in this case is: WHERE UPPER(c.CHANNEL_CLASS) = 'PARTNERS', which simply means ignore case sensitivity when looking for information about partners.
- If the results are filtered, add a statement like the following at the end of the question to avoid unnecessary filters: “Don’t apply any filter condition”. This was more applicable to the Cohere Command model than to the OpenAI models.
- Starting the question with “query” instead of “what is”, for instance, worked better with the Cohere Command model.
- Be field specific when possible: Instead of just asking for information by customer or by product, be more field specific such as “customer name” or “product category”.
- Add additional instructions to your question: You can follow the main question with specific requests, for example to filter the results or to control how they are returned. Here is an example of how this can be done:
“what is the average total sales by customer name in northern america grouped by customer. Only consider Direct sales and customers with over 3 years of residency and in farming. case insensitive.”
Results are returned based on the following automatically generated SQL query:
SELECT c.CUST_FIRST_NAME || ' ' || c.CUST_LAST_NAME AS CUSTOMER_NAME, AVG(s.AMOUNT_SOLD)
FROM ADW_USER.SALES_V s JOIN ADW_USER.CUSTOMERS_V c ON s.CUST_ID = c.CUST_ID
JOIN ADW_USER.COUNTRIES_V co ON c.COUNTRY_ID = co.COUNTRY_ID
JOIN ADW_USER.CHANNELS_V ch ON s.CHANNEL_ID = ch.CHANNEL_ID
JOIN ADW_USER.CUSTOMER_DEMOGRAPHICS_V cd ON c.CUST_ID = cd.CUST_ID
WHERE UPPER(co.COUNTRY_SUBREGION) = 'NORTHERN AMERICA'
AND UPPER(ch.CHANNEL_CLASS) = 'DIRECT'
AND cd.YRS_RESIDENCE > 3
AND UPPER(cd.OCCUPATION) = 'FARMING'
GROUP BY c.CUST_FIRST_NAME, c.CUST_LAST_NAME;
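For reference, a question like the one above could be submitted in a few different ways, sketched below. The profile name GENAI_GPT35 is hypothetical and is assumed to exist already.

```sql
-- Sketch only: GENAI_GPT35 is a hypothetical, previously created profile.
EXEC DBMS_CLOUD_AI.SET_PROFILE('GENAI_GPT35');

-- Show the generated SQL without running it (useful while tuning the question)
SELECT AI showsql what is the average total sales by customer name in northern america grouped by customer. Only consider Direct sales and customers with over 3 years of residency and in farming. case insensitive;

-- Run the generated SQL and return the result set
SELECT AI runsql what is the average total sales by customer name in northern america grouped by customer. Only consider Direct sales and customers with over 3 years of residency and in farming. case insensitive;

-- Same request through the PL/SQL API, for clients that cannot send SELECT AI directly
SELECT DBMS_CLOUD_AI.GENERATE(
         prompt       => 'what is the average total sales by customer name in northern america grouped by customer. Only consider Direct sales and customers with over 3 years of residency and in farming. case insensitive',
         profile_name => 'GENAI_GPT35',
         action       => 'showsql') AS generated_sql
FROM   dual;
```

Checking the showsql output first makes it easy to see whether extra instructions such as “case insensitive” were translated into the expected UPPER(...) conditions.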
It’s impressive to see how GenAI can take the burden off the business in finding quick and timely answers to questions that may come up throughout the day, all without data security risks. Contact us if you’re looking to unlock the power of GenAI for your enterprise data.