Kanchan Bawane, Author at Perficient Blogs

Sending Real-Time Logs to Splunk Cloud Using Universal Forwarder

This comprehensive guide walks you through the process of setting up Splunk Universal Forwarder to send real-time logs to Splunk Cloud. Learn how to enhance your organization’s log management capabilities, from installation to troubleshooting.

Introduction to Splunk Cloud and Universal Forwarder

What is Splunk Cloud?

Splunk Cloud is a powerful cloud-based platform for collecting, analyzing, and visualizing machine-generated data from various sources. It offers robust tools for searching, monitoring, and analyzing log data, making it indispensable for IT operations, security, and business analytics.

The Role of Universal Forwarder

To efficiently get logs into Splunk Cloud, you need Splunk Universal Forwarder (UF). This lightweight version of Splunk collects and forwards log data to Splunk Cloud in real-time, bridging the gap between your data sources and the cloud platform.

Why Use Universal Forwarder for Log Forwarding?

  • Real-Time Data Monitoring: Enables quick insights by forwarding logs to Splunk Cloud in real-time.
  • Efficiency: Lightweight design minimizes impact on system resources.
  • Scalability: Capable of forwarding logs from multiple sources, suitable for small to large-scale deployments.

Getting Started: Prerequisites and Setup

Before diving into the setup process, ensure you have:

  • Splunk Cloud Account: An active account (free trial or paid subscription). If you don’t have an account yet, you can follow the instructions provided in the blog, “Understanding Splunk and Setting Up Splunk Cloud,” to get started.
  • Universal Forwarder Installation Package: Downloaded for your specific operating system.
  • Access to Log Files: Identified and accessible log files for monitoring.

Step 1: Setting Up the Splunk Universal Forwarder

Download and Install Splunk Universal Forwarder

  1. Visit the Splunk Cloud platform and navigate to the Universal Forwarder option.
  2. Go to the Splunk Downloads web page and choose the download for your operating system.
  3. Run the installer or extract the package to your desired location.
  4. Follow on-screen prompts to complete the installation.
  5. The Universal Forwarder service should start automatically after installation.

Step 2: Configuring Universal Forwarder

Basic Configuration

Follow these steps to connect Universal Forwarder to your Splunk Cloud instance:

  1. Download the Splunk Cloud certificate from the platform.
  2. Install the certificate using the command:
    splunk.exe install app C:\tmp\splunkclouduf.spl
  3. Restart Splunk:
    splunk restart
  4. Configure the forwarder to send data to Splunk Cloud:
    splunk add forward-server prd-xyz.splunkcloud.com:9997

    Replace the server address with your specific Splunk Cloud instance.

  5. Enable the forwarder to receive data:
    splunk enable listen 9997
  6. Add a data input:
    splunk add monitor C:/KANCHAN/POCs/scf.log

Step 3: Verifying Data Indexing and Customizing Indexes

Checking Data Indexing

  1. Log in to your Splunk Cloud instance.
  2. Navigate to the Search & Reporting app.
  3. Run a search query, e.g.:
    index= "main" source="C:\\KANCHAN\\POCs\\scf.log"


Configuring Custom Indexes

  1. Locate or create the inputs.conf file in $SPLUNK_HOME/etc/apps/search/local/.
  2. Add the following configuration:
    [monitor://C:/KANCHAN/POCs/scf.log]
    disabled = false
    index = test
    sourcetype = log_file
  3. Save the file and restart Universal Forwarder.
  4. Verify the new index is receiving data.
    index=test sourcetype=log_file

    Remember to create the custom index in your Splunk Cloud instance if it doesn’t exist.

Troubleshooting Common Issues

If you encounter data transmission problems, check these common culprits:

  1. Firewall settings: Ensure outbound connections on the specified port are allowed.
  2. Network configuration: Verify network connectivity to Splunk Cloud (a scripted alternative is sketched after this list):
    telnet <splunk_cloud_url> 9997
  3. Certificate issues: Validate that certificates are properly configured.
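
If telnet is not available on the host, the same connectivity check can be scripted with Python's standard socket module. This is a minimal sketch; the hostname below reuses the example Splunk Cloud instance from earlier and should be replaced with your own endpoint.

import socket

# Example endpoint from earlier in this guide; substitute your own Splunk Cloud instance and port
host, port = "prd-xyz.splunkcloud.com", 9997

try:
    # Attempt a TCP connection with a short timeout
    with socket.create_connection((host, port), timeout=5):
        print(f"Connection to {host}:{port} succeeded")
except OSError as err:
    print(f"Connection to {host}:{port} failed: {err}")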

Understanding Real-Time Log Forwarding

Before wrapping up, it’s crucial to understand what we mean by “real-time log forwarding” in the context of Splunk Universal Forwarder.

What is Real-Time Log Forwarding?

Real-time log forwarding refers to the process where log files are continuously monitored for changes, and any new data is immediately sent to Splunk Cloud. This means that as soon as new log entries are written to a file, they are detected and forwarded, ensuring that your Splunk Cloud instance always has the most up-to-date information.

Let’s break this down using our example:

splunk add monitor C:/KANCHAN/POCs/scf.log

When you use this command, you’re telling the Universal Forwarder to:

  1. Continuously Monitor: The Universal Forwarder will keep a constant watch on the scf.log file.
  2. Detect Changes: Whenever the scf.log file is modified (i.e., new log entries are added), the Universal Forwarder immediately detects these changes.
  3. Forward New Data: As soon as new log entries are detected, they are forwarded to Splunk Cloud without any manual intervention.
  4. Maintain File Position: The Universal Forwarder keeps track of where it last read in the file, ensuring that only new data is sent and no duplicates are created.

This real-time nature of log forwarding is incredibly powerful because it means:

  • You always have the most current data in Splunk Cloud for analysis.
  • You can set up near real-time alerts and dashboards in Splunk Cloud, as the data is being continuously updated.

For example, if your scf.log file is an application log that records user activities, errors, or system events, any new entries will be almost immediately available in Splunk Cloud. This allows for rapid detection of issues, real-time monitoring of user activities, or instant alerts on critical events.

It’s important to note that while we use the term “real-time,” there is always a small delay due to factors like network latency and processing time. However, in most cases, this delay is negligible, and the data can be considered to be available in near real-time.
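
To see this behavior for yourself, you can append a few test entries to the monitored file and watch them arrive in Splunk Cloud within seconds. The sketch below simply writes timestamped lines to the scf.log path used earlier; the path and the message format are only examples.

import time
from datetime import datetime

# Path monitored earlier with "splunk add monitor"
LOG_PATH = r"C:\KANCHAN\POCs\scf.log"

# Append a handful of timestamped test entries, one per second
with open(LOG_PATH, "a", encoding="utf-8") as log_file:
    for i in range(5):
        log_file.write(f"{datetime.now().isoformat()} INFO test event {i}\n")
        log_file.flush()  # make sure the line hits disk so the forwarder picks it up
        time.sleep(1)

Running the search from Step 3 right afterwards should show the new events almost immediately.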

Splunk Cloud Capabilities

Setting up real-time log forwarding to Splunk Cloud using Universal Forwarder significantly enhances your organization’s log management capabilities. By following this guide, you’ve taken a crucial step towards more efficient and insightful data analysis. Continue exploring Splunk Cloud’s features to unlock the full potential of your log data.

Sending Data to Splunk Cloud Using HTTP Event Collector (HEC)

In our previous blog, we explored how to set up Splunk Cloud and index dummy data using the upload option. Now that you have your Splunk Cloud environment up and running, let’s take it a step further. In this blog, we’ll dive into the HTTP Event Collector (HEC), a powerful feature in Splunk that allows you to send data to Splunk over HTTP or HTTPS. This is particularly useful for real-time data ingestion from various sources, such as applications or cloud services. Let’s get started!

Setting Up HTTP Event Collector in Splunk Cloud

1: Enable HEC in Splunk Cloud

  • Login to your Splunk Cloud dashboard.
  • Navigate to the Settings menu and select Data Inputs.
  • Click on HTTP Event Collector and then Global Settings.
  • Enable the HEC by toggling the switch to Enabled.
  • Click Save to apply the changes.

2: Create a New Token

  • In the HTTP Event Collector page, click on New Token.
  • Enter a name for your token and configure the necessary settings, such as source type and index.
  • Click Next and review your settings.
  • Click Finish to create the token. Make sure to copy the token value as you’ll need it to send data.

Important: Make sure to copy and securely store the token value. You’ll need this to authenticate when sending data to Splunk Cloud.

3: Sending Data to Splunk Cloud Using HEC

  1. Prepare Your Data: Format your data as a JSON payload. Here’s an example:
    {
    "event": "Hello, world!",
    "sourcetype": "sourcetype-test",
    "index": "your_index"
    }
    
  2. Send Data Using cURL: Use the following cURL command to send data to Splunk:
    curl --location "https://<splunk-cloud-url>:8088/services/collector/event" --header "Authorization: Splunk <your-token>" --header "Content-Type: application/json" --data "{\"event\": \"Hello, world!\", \"sourcetype\": \"sourcetype-test\", \"index\": \"<your-index>\"}" -k

Replace <splunk-cloud-url> with your actual Splunk Cloud URL, <your-token> with the token you created earlier, and <your-index> with the index you configured for the token.
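
The same request can also be sent from Python with the requests library, which is convenient when you want to emit events from application code. This is a sketch under the same assumptions as the cURL example: the URL, token, and index are placeholders that you must replace.

import requests

# Placeholders: use your own Splunk Cloud URL, HEC token, and index
HEC_URL = "https://<splunk-cloud-url>:8088/services/collector/event"
HEC_TOKEN = "<your-token>"

payload = {
    "event": "Hello, world!",
    "sourcetype": "sourcetype-test",
    "index": "<your-index>",
}

# verify=False mirrors the -k flag in the cURL example; prefer a proper CA bundle in production
response = requests.post(
    HEC_URL,
    json=payload,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    verify=False,
)
print(response.status_code, response.text)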

4: Verifying Data in Splunk Cloud

After sending data via HEC, it’s crucial to verify that it has been successfully indexed in Splunk Cloud.

  • Log in to your Splunk Cloud dashboard.
  • Use the search bar to query your indexed data. For example, you can search for index="logs".

By following these steps, you’ve successfully set up the HTTP Event Collector in Splunk Cloud and sent data using cURL. This powerful feature allows you to seamlessly integrate data from various sources in real time, making it easier to monitor and analyze your data streams.

Understanding Splunk and Setting Up Splunk Cloud

Splunk is a powerful platform designed for searching, monitoring, and analyzing machine-generated data via a web-style interface. It captures, indexes, and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards, and visualizations. It’s designed for anyone who wants to gain insights from their data without the need to manage the underlying infrastructure. Splunk Cloud provides the same powerful features as Splunk Enterprise but is hosted and managed by Splunk, so you don’t have to worry about maintenance or scalability.

What is Splunk?

Splunk is a software platform that enables organizations to gain valuable insights from their machine data. It helps in:

  • Data Collection: Aggregating data from various sources.
  • Indexing: Storing data in a searchable format.
  • Search and Analysis: Querying data to find patterns, anomalies, and trends.
  • Visualization: Creating dashboards and reports for better understanding.

Why Choose Splunk Cloud?

  • Easy to Set Up: No need to install software or manage servers.
  • Scalable: Start small and scale up as your data needs grow.
  • Secure: Built-in security features to protect your data.
  • Accessible from Anywhere: Access your data and dashboards from any device with an internet connection.

Setting Up Splunk Cloud

Splunk Cloud is a cloud-based service that provides all the features of Splunk Enterprise, without the need to manage infrastructure. Here’s how to set it up:

1: Sign Up for Splunk Cloud

  • Navigate to the Splunk Cloud website. (https://www.splunk.com/en_us/download.htm)
  • Review the available plans. Splunk often offers a free trial, which is an excellent way to explore the platform’s capabilities.
  • Select the plan that best fits your needs and click “Get Started” or “Start Free Trial”.

2: Create Your Splunk Cloud Account

  1. Fill out the registration form with your details.
  2. Agree to the terms of service and click “Create Account.”
  3. Verify your email address by clicking the link sent to your inbox.
  4. Once verified, log in to access your new Splunk Cloud dashboard.

3: Configure Data Inputs

Bringing Your Data into Splunk

  • Log in to Splunk Cloud: Use your credentials to log in.
  • Add Data: Navigate to the “Add Data” section.
  • Select Data Source: Splunk supports various data sources. Choose the type of data you want to index (e.g., files, directories, network ports). For this example, let’s add a sample log file.
  • Configure Data Inputs: Follow the wizard to configure your data inputs.
      • Upload a File: Click on “Upload” and select a log file from your computer. If you don’t have a log file, you can download a sample file from the internet.
      • Index Your Data: Choose or create an index where your data will be stored. The index helps you organize your data and makes it easier to search later.
      • Review and Submit: After configuring your data source and index, review your settings and click “Submit”.

4: Start Searching and Analyzing Your Data

Once your data is indexed, you’re ready to start searching. Click on “Start Searching”.


By following the steps outlined above, you can quickly set up Splunk Cloud and start gaining insights from your data.

Stay tuned for more updates on Splunk in upcoming blogs, where we’ll explore advanced features and help you get the most out of your Splunk experience. Happy Learning!

Building a Conversational Search Application with Azure Cognitive Search and OpenAI Embedding

Introduction 

In this blog, we will show you how to build a conversational search application that can interact with Azure Cognitive Search (ACS) and retrieve relevant content from a web-scraped index by asking natural language questions, requesting summary information, and using vector search. The application will also use OpenAI embeddings, which are pre-trained models that can embed queries and documents into vectors, and Azure Chat OpenAI, which is a service that can generate natural language responses using OpenAI models.

Vector search is a technique that uses deep learning models to represent queries and documents as vectors in a high-dimensional space. These vectors capture the semantic meaning and context of the queries and documents, and can be compared using a similarity measure, such as cosine similarity, to find the most relevant matches. Vector search enables you to perform semantic search, which can understand the intent and meaning of the user’s query, rather than just matching keywords. 
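
As a concrete illustration of the similarity measure mentioned above, cosine similarity between two embedding vectors can be computed in a few lines of NumPy. The vectors here are tiny made-up examples; real embeddings, such as those produced by the ADA model, have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.3, 0.5])    # toy "query" embedding
doc_vec = np.array([0.2, 0.25, 0.55])    # toy "document" embedding
print(cosine_similarity(query_vec, doc_vec))  # values close to 1.0 indicate high semantic similarity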

  • Azure Cognitive Search: A cloud-based service that provides a rich set of capabilities to index, query, and analyze structured and unstructured data. We will use ACS to create and manage our web-scraped index, as well as to perform vector search using the built-in semantic ranking feature. 
  • OpenAI: A research organization that develops and provides access to cutting-edge artificial intelligence models and tools. We will use OpenAI to create and deploy a custom language model that can generate natural language responses based on the search results, as well as to condense follow-up questions into standalone questions. 
  • ConversationalRetrievalChain: A Python class that implements a conversational retrieval pipeline using a large language model (LLM) and a vector-based retriever. We will use this class to orchestrate the interaction between ACS and OpenAI, and to handle the user input and output. 

Prerequisites 

To follow along with this blog post, you will need the following: 

  • An Azure subscription 
  • An Azure Cognitive Search service 
  • An Azure Cognitive Search index with some data. You can use any data source that you want, but for this example, I will use a scraped blog index that contains some blog posts from various websites. You can find the instructions on how to create and populate this index [here].
  • Azure OpenAI service: You will need this to access the OpenAI embeddings and Azure Chat OpenAI services.

Conversational Search Application

Here are the steps to use the Conversational Retrieval Chain to fetch data from Azure Cognitive Search and generate responses using OpenAI ADA model:

  1. Import the necessary modules and classes from LangChain.

    from langchain.chains  import ConversationalRetrievalChain  
    from langchain.vectorstores.azuresearch import AzureSearch 
    from langchain.chat_models import AzureChatOpenAI 
    from langchain.embeddings.openai import OpenAIEmbeddings 
    from langchain.prompts  import PromptTemplate
  2. Set up the OpenAIEmbeddings class, which will be used to embed queries and documents. You need to provide the following parameters for the Azure OpenAI embedding deployment:

    • deployment: the name of the deployment that hosts the model. 
    • model: the name of the model that generates the embeddings. 
    • openai_api_base: the base URL of the OpenAI API. 
    • openai_api_type: the type of the OpenAI API (set to azure for Azure OpenAI). 
    • chunk_size: the number of sentences to process at a time.
      embeddings = OpenAIEmbeddings(deployment="ada_embedding_deployment_name",
                                    model="text-embedding-ada-model-name",
                                    openai_api_base="https://abc.openai.azure.com/",
                                    openai_api_type="azure",
                                    chunk_size=1)
  3. Set up the AzureSearch class, which will access the data in Azure Cognitive Search. You need to provide the following parameters for the Azure Cognitive Search service:

    • azure_search_endpoint: the endpoint of the Azure Cognitive Search service. 
    • azure_search_key: the key to authenticate with the Azure Cognitive Search service. 
    • index_name: the name of the index that contains the data. 
    • embedding_function: the embedding function is the same as the one we created in step 2 using the OpenAIEmbeddings class, so we can use the embeddings object that we already have.
      vector_store: AzureSearch = AzureSearch( 
              azure_search_endpoint="https://domain.windows.net", 
              azure_search_key="your_password", 
              index_name="scrapped-blog-index", 
              embedding_function=embeddings.embed_query)
  4. Configure the AzureChatOpenAI class, which will be used to generate natural language responses using the OpenAI Ada model. You need to provide the following parameters for the OpenAI Ada model:

    • deployment_name: the name of the deployment that hosts the model. 
    • model_name: the name of the model that generates the responses. 
    • openai_api_base: the base URL of the OpenAI API. 
    • openai_api_version: the version of the OpenAI API. 
    • openai_api_key: the key to authenticate with the OpenAI API. 
    • openai_api_type: the type of the OpenAI API (set to azure for Azure OpenAI).
      llm = AzureChatOpenAI(deployment_name="gpt_deployment_name",
                            model_name="open_gpt_model-name",
                            openai_api_base="https://model.openai.azure.com/",
                            openai_api_version="2023-07-01-preview",
                            openai_api_key=OPENAI_API_KEY,
                            openai_api_type="azure")
  5. Define the PromptTemplate class, which will be used to rephrase the user’s follow-up questions to be standalone questions. You need to provide a template that takes the chat history and the follow-up question as inputs and outputs a standalone question.

    CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template("""Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question. 
        Chat History: 
        {chat_history} 
        Follow Up Input: {question} 
        Standalone question:""")

    Construct the ConversationalRetrievalChain class, which will be used to generate responses based on the user’s questions and the data in Azure Cognitive Search. You need to provide the following parameters for this class:  

    • llm: the language model that generates the natural language responses 
    • retriever: the retriever that fetches the relevant documents from the Azure Cognitive Search service. 
    • condense_question_prompt: the prompt template that rephrases the user’s follow up questions to be standalone questions. 
    • return_source_documents: the option to return the source documents along with the responses. 
    • verbose: the option to print the intermediate steps and results.

      qa = ConversationalRetrievalChain.from_llm(llm=llm, 
                                                  retriever=vector_store.as_retriever(), 
                                                  condense_question_prompt=CONDENSE_QUESTION_PROMPT, 
                                                  return_source_documents=True, 
                                                  verbose=False)
  6. Define a function called search, which will take the user’s input as a parameter, and return a response.

    def search(user_input): 
        query = user_input[-1]['content'] 
        history = [] 
        if len(user_input) == 1: 
            chat_history = "" 
            result = qa({"question": query, "chat_history": chat_history})   
            response = result["answer"] 
        else:  
            for item in user_input[:-1]: 
                history.append(item["content"]) 
            chat_history = [(history[i], history[i+1]) for i in range(0, len(history), 2)] 
            result = qa({"question": query, "chat_history": chat_history})   
            response = result["answer"] 
        return response
  7. Test the function with some sample inputs and see the outputs in the notebook; a follow-up-question example is sketched after this list.

    user_input = [{"content": "Tell me about Perficient’s blog posts about Generative AI"}] 
    response = search(user_input) 
    print(response)

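
Assuming the first call succeeds, a follow-up question can be tested by passing the running conversation back into search; the earlier messages become the chat history that the condense-question prompt uses. The follow-up question text is just an example, and response is the answer returned by the first call above.

# Follow-up turn: previous user message, previous answer, then the new question
user_input = [
    {"content": "Tell me about Perficient's blog posts about Generative AI"},
    {"content": response},  # answer returned by the first call
    {"content": "Can you summarize the most recent one?"},
]
follow_up = search(user_input)
print(follow_up)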

Conclusion 

In this blog, we have demonstrated how to build a conversational search application that can leverage the power of Azure Cognitive Search and OpenAI embeddings to provide relevant and natural responses to the user’s queries. 

By combining Azure Cognitive Search, OpenAI, and ConversationalRetrievalChain, we have been able to create a conversational search application that can understand the intent and meaning of the user’s query, rather than just matching keywords. We hope you have enjoyed this blog and learned something new. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!


How to use Azure Blob Data and Store it in Azure Cognitive Search along with Vectors

Introduction

In the previous blog post, we showed you how to scrape a website, extract its content using Python, and store it in Azure Blob Storage. In this blog post, we will show you how to use the Azure Blob data and store it in Azure Cognitive Search (ACS) along with vectors. We will use some popular libraries such as OpenAI Embeddings and Azure Search to create and upload the vectors to ACS. We will also show you how to use the vectors for semantic search and natural language applications.

By following this blog post, you will learn how to: 

  • Read the data from Azure Blob Storage using the BlobServiceClient class. 
  • Create the vectors that ACS will use to search through the documents using the OpenAI Embeddings class. 
  • Load the data along with vectors to ACS using the AzureSearch class. 

Read Data from Azure Blob Storage:

The first step is to read the data from Azure Blob Storage, which is a cloud service that provides scalable and secure storage for any type of data. Azure Blob Storage allows you to access and manage your data from anywhere, using any platform or device. 

To read the data from Azure Blob Storage, you need to have an Azure account and a storage account. You also need to install the Azure Storage SDK for Python, a library that provides a simple way to interact with Azure Blob Storage using Python.

To install the Azure Storage SDK for Python, you can use the following command:

pip install azure-storage-blob

To read the data from Azure Blob Storage, you need to import the BlobServiceClient class and create a connection object that represents the storage account. You also need to get the account URL, the credential, and the container name from the Azure portal. You can store these values in a .env file and load them using the dotenv module. 

For example, if you want to create a connection object and a container client, you can use: 

from azure.storage.blob import BlobServiceClient 
from dotenv import load_dotenv 
import os 

# Load the environment variables 
load_dotenv() 

# Get the account URL, key, and container name 
account_url = os.getenv("STORAGE_ACCOUNT_URL") 
credential = os.getenv("STORAGE_ACCOUNT_KEY") 
container = os.getenv("CONTAINER_NAME") 

# Create the client objects and list the blobs in the container 
blob_service_client_instance = BlobServiceClient(account_url=account_url, credential=credential) 
container_client = blob_service_client_instance.get_container_client(container=container) 
blob_list = container_client.list_blobs()

Load the Documents and the Vectors to ACS:

The final step is to load the documents and the vectors to ACS, which is a cloud service that provides a scalable and secure search engine for any type of data. ACS allows you to index and query your data using natural language and semantic search capabilities. 

To load the documents and the vectors to ACS, you need to have an Azure account and a search service. You also need to install the Azure Search library, which provides a simple way to interact with ACS using Python.

To install the Azure Search library, you can use the following command: 

pip install azure-search-documents

To load the documents and the vectors to ACS, you need to import the AzureSearch class and create a vector store object that represents the search service. You also need to get the search endpoint, the search key, and the index name from the Azure portal. You can store these values in a .env file and load them using the dotenv module. 

For example, if you want to create a vector store object and an index name, you can use: 

from langchain.vectorstores.azuresearch import AzureSearch 
from dotenv import load_dotenv 
import os 

# Load the environment variables 
load_dotenv() 

# Get the search endpoint, the search key, and the index name 
vector_store_address : str = os.getenv("VECTOR_STORE_ADDRESS") 
vector_store_password : str = os.getenv("VECTOR_STORE_PASSWORD") 
index_name : str = os.getenv("INDEX_NAME") 

# Create a vector store object 
vector_store: AzureSearch = AzureSearch( 
    azure_search_endpoint=vector_store_address, 
    azure_search_key=vector_store_password, 
    index_name=index_name, 
    embedding_function=embeddings.embed_query, 
)
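
Note that embedding_function above references an embeddings object that must be created first. A minimal sketch, assuming the same Azure OpenAI ADA deployment used elsewhere in this series (the deployment, model, and endpoint names are placeholders):

from langchain.embeddings.openai import OpenAIEmbeddings

# Placeholders: use your own Azure OpenAI deployment, model, and endpoint
embeddings = OpenAIEmbeddings(
    deployment="ada_embedding_deployment_name",
    model="text-embedding-ada-model-name",
    openai_api_base="https://abc.openai.azure.com/",
    openai_api_type="azure",
    chunk_size=1,
)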

Then, you can load the documents and the vectors to ACS using the add_documents method. This method takes a list of documents as input and uploads them to ACS along with their vectors. A document is an object that contains the page content and the metadata of the web page.

For example, if you want to load the documents and the vectors to ACS using the data stored in Blob Storage, you can use the code snippet below, which utilizes the container_client and blob_list objects from above:

import json 
from langchain.docstore.document import Document 

def loadDocumentsACS(index_name, container_client, blob_list): 
    docs = [] 
    for blob in blob_list: 
        # Read the blob and parse it as JSON 
        blob_client = container_client.get_blob_client(blob.name) 
        streamdownloader = blob_client.download_blob() 
        fileReader = json.loads(streamdownloader.readall()) 

        # Process the data and create the document list 
        text = fileReader["content"] + "\n author: " + fileReader["author"] + "\n date: " + fileReader["date"] 
        metafileReader = {'source': fileReader["url"], "author": fileReader["author"], "date": fileReader["date"], "category": fileReader["category"], "title": fileReader["title"]} 
        if fileReader['content'] != "": 
            doc = Document(page_content=text, metadata=metafileReader) 
            docs.append(doc) 

    # Load the documents to ACS 
    vector_store.add_documents(documents=docs)
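
With the helper in place, indexing the scraped blobs is a single call that reuses the objects created earlier; the index name is whatever you configured in your .env file.

# index_name, container_client, and blob_list were created in the earlier snippets
loadDocumentsACS(index_name, container_client, blob_list)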

You can verify whether your data has been indexed by checking the index in the Azure Cognitive Search (ACS) service on the Azure portal.

Conclusion:

This blog post has guided you through the process of merging Azure Blob data with Azure Cognitive Search, enhancing your search capabilities with vectors. This integration simplifies data retrieval and empowers you to navigate semantic search and natural language applications with ease. As you explore these technologies, the synergy of Azure Blob Storage, OpenAI Embeddings, and Azure Cognitive Search promises a more enriched and streamlined data experience. Stay tuned for the next part, where we step into utilizing vectors and generating responses and performing vector search on user queries. 


How to Scrape a Website and Extract its Content

Introduction

The web is a vast source of information, but it is not always easy to access and use for natural language applications.

In this blog post, we will show you how to crawl and scrape the target URL, extract and clean the content, and store it in Azure Blob Storage. We will use Python as the programming language, and some popular libraries such as requests, asyncio, BeautifulSoup, and lxml.

By following this blog post, you will learn how to:

  • Make asynchronous HTTP requests to get the HTML content of a web page.
  • Use different libraries to parse and extract the content from the HTML.
  • Compare the advantages and disadvantages of BeautifulSoup and lxml.
  • Clean and normalize the extracted content.
  • Store the content in Azure Blob Storage using the Azure Storage Python library.

Crawl and Scrape the Target URL

Scraping is a method to extract information from HTML content, but to do this we must first know the structure of the page we want to extract information from. The first thing you need to do when scraping a web page is to get the HTML content through an HTTP request so that you can process it. The native Python library for working with HTTP requests is requests.

import requests

The main problem with this library is that it doesn’t support asynchronous requests directly. To solve this issue and use asynchronous calls we use another library called asyncio, which allows us to use tasks and async/await.

import asyncio

Now we can use both to make an async request to get the HTML:

async def getHTML(url: str): 
    loop : asyncio.AbstractEventLoop = asyncio.get_event_loop() 
    try: 
        future = loop.run_in_executor(None, requests.get, url) 
        return await future 
    # Handle exceptions related to the requests module 
    except requests.exceptions.RequestException as e: 
        pass 
    # Handle all other exceptions 
    except Exception as e: 
        print("An error occurred:", e)

Libraries to Extract the Content

Once we get the HTML content, we need to process it with a parser. For this, there are several libraries; the most used are BeautifulSoup and lxml. This project uses BeautifulSoup, but there is also another class developed with lxml for experimentation purposes.

Extract the Content using BeautifulSoup 

At first, you must import the corresponding library: 

from bs4 import BeautifulSoup

With the HTML that the request returned, you must build an object that will be used to process the HTML. 

soup = BeautifulSoup(response.content, "html.parser", from_encoding="iso-8859-1")

To Get the Information, the Most Used Functions Are: 

  • find: The .find() returns the first element that matches your query criteria. 
  • find_all: The .find_all() returns an array of elements that you can then parse individually. 
  • select_one: The .select_one() returns the first element that matches your query criteria using CSS selectors.

For example, if you want to get the title of the web page, you can use: 

title = soup.find("title").text

Or, if you want to get all the links in the web page, you can use: 

links = soup.find_all("a") 
for link in links: 
    print(link["href"])

Or, if you want to get the first paragraph with the class intro, you can use: 

intro = soup.select_one("p.intro").text

Extract the Content using lxml 

At first, you must import the corresponding library: 

from lxml import html

With the HTML that the request returned, you must build an object that will be used to process the HTML. 

parsed_content = html.fromstring(content)

To get the information, the function to use is .xpath(), where the parameter is an XPath string. XPath is a syntax for defining parts of an XML document. You can use XPath expressions to select nodes or node-sets in an XML document. 

For example, if you want to get the title of the web page, you can use:

title = parsed_content.xpath("//title/text()")[0]
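
Mirroring the BeautifulSoup link example above, all link targets can be pulled out with a single XPath expression:

links = parsed_content.xpath("//a/@href")
for link in links:
    print(link)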

Difference between BeautifulSoup and lxml 

BeautifulSoup is recommended for scenarios where flexibility in the search is necessary, for example searching by two CSS classes without any particular order. lxml, on the other hand, uses XPath to perform the search, so it is stricter and less flexible.

However, lxml has some advantages over BeautifulSoup, such as: 

  • It is faster and more memory efficient. 
  • It supports XML namespaces and validation. 
  • It has better support for XPath and XSLT. 

Therefore, the choice of the library depends on your needs and preferences. You can try both and see which one works better for you. 

After extracting the content from the HTML, you may need to clean and normalize it before storing it in Azure Blob Storage. 
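
What “clean and normalize” means depends on your site, but a typical first pass strips leftover whitespace and collapses blank runs. A small sketch of such a helper (the rules here are only an example):

import re

def clean_text(raw: str) -> str:
    # Collapse runs of whitespace (tabs, newlines, repeated spaces) into single spaces
    text = re.sub(r"\s+", " ", raw)
    # Trim leading and trailing whitespace
    return text.strip()

print(clean_text("  Hello,\n\tworld!   This is   scraped   text.  "))
# -> "Hello, world! This is scraped text."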

Store the Extracted Content in Azure Blob Storage 

The final step is to store the extracted content in Azure Blob Storage, which is a cloud service that provides scalable and secure storage for any type of data. Azure Blob Storage allows you to access and manage your data from anywhere, using any platform or device. 

To use Azure Blob Storage, you need to have an Azure account and a storage account. You also need to install the Azure Storage SDK for Python, which provides a simple way to interact with Azure Blob Storage.

To install the Azure Storage SDK for Python, you can use the following command:

pip install azure-storage-blob

To use the Azure Storage SDK for Python, you need to import the BlobServiceClient class and create a connection object that represents the storage account. You also need to get the connection string and the container name from the Azure portal. You can store these values in a .env file and load them using the dotenv module. 

For example, if you want to create a connection object and a container client, you can use: 

from azure.storage.blob import BlobServiceClient 
from dotenv import load_dotenv 
import os 
# Load the environment variables 
load_dotenv() 

# Get the connection string and the container name 
AZURE_BLOB_CONNECTION_STRING : str = os.getenv("AZURE_BLOB_CONNECTION_STRING") 
AZURE_PAGE_CONTAINER = os.getenv("AZURE_PAGE_CONTAINER") 

# Create a connection object 
blobServiceClient = BlobServiceClient.from_connection_string(AZURE_BLOB_CONNECTION_STRING) 

# Create a container client 
container_client = blobServiceClient.get_container_client(AZURE_PAGE_CONTAINER)

Then, you can upload the extracted content to Azure Blob Storage as a JSON document using the upload_blob method. You need to create a blob client that represents the blob that you want to upload and provide the data as a JSON string. You also need to generate a unique file name for the blob, which can be based on the current date and time. 

Example:

If you want to upload the content from the previous steps, you can use: 

import json 
from datetime import datetime 

# Create a document with the extracted content 
document = { 
    "title": title, 
    "summary": summary, 
    "texts": texts 
} 

# Convert the document to a JSON string 
json_document = json.dumps(document) 

# Create a blob client 
dt = datetime.now() 
fileName = dt.strftime("%Y%m%d_%H%M%S%f") + ".json" 
blob = blobServiceClient.get_blob_client(container=AZURE_PAGE_CONTAINER, blob=fileName) 

# Upload the content 
blob.upload_blob(json_document)

You can also download the content from Azure Blob Storage as a JSON document using the download_blob method. Again, you need to create a blob client that represents the blob you want to download and provide the file name as a parameter. You can then read the data as a JSON string and parse it into a Python object.

For example, if you want to download the content with a given file name, you can use: 

# Create a blob client 
blob = blobServiceClient.get_blob_client(container=AZURE_PAGE_CONTAINER, blob=fileName) 

# Download the content 
data = blob.download_blob().readall() 
document = json.loads(data) 
print(document)

What You Can Achieve: 

By following this blog post, you will gain the skills to crawl, scrape, and extract content from websites efficiently and store web content securely in Azure Blob Storage. The code provided utilizes both BeautifulSoup and LXML, giving you a comprehensive understanding of the two widely used libraries. The asynchronous approach enhances performance, making it suitable for large-scale web scraping tasks. 

Conclusion: 

Web scraping is not only about data extraction but also about making that data usable. In this blog post, we’ve explored the intricacies of crawling, scraping, and storing web content. Stay tuned for the next part, where we step into utilizing Azure Blob Data and storing it in ACS along with vectors.


Building a Private ChatBot with Langchain, Azure OpenAI & Faiss Vector DB for Local Document Query

In this blog, we will explore how we can effectively utilize Langchain, the Azure OpenAI text-embedding-ada model, and the Faiss vector store to build a private chatbot that can query a document uploaded from local storage. A private chatbot is a chatbot that can interact with you using natural language and provide you with information or services that are relevant to your needs and preferences. Unlike a public chatbot, a private chatbot does not rely on external data sources or APIs, but rather uses your own local document as the source of knowledge and content. This way, you can ensure that your chatbot is secure, personalized, and up to date.

Why use Langchain, Azure OpenAI, and Faiss Vector Store?

Langchain, Azure OpenAI, and Faiss Vector Store are three powerful technologies that can help you build a private chatbot with ease and efficiency.

  • Langchain is a Python library that allows you to create and run chatbot agents using a simple and intuitive syntax. Langchain provides you with various classes and methods that can handle the common tasks of chatbot development, such as loading text, splitting text, creating embeddings, storing embeddings, querying embeddings, generating responses, and defining chains of actions. Langchain also integrates with other popular libraries and services, such as Faiss, OpenAI, and Azure OpenAI Service, to enable you to leverage their functionalities within your chatbot agent.
  • Azure OpenAI Service is a cloud-based service that allows you to access the powerful natural language processing capabilities of OpenAI, such as GPT-4, Codex, and DALL-E. Azure OpenAI Service enables you to encode text into embeddings, decode embeddings into text, and generate text based on a prompt or a context. You can use Azure OpenAI Service to create high-quality natural language responses for your chatbot, as well as to create embeddings for your local document that capture its semantic meaning.
  • Faiss Vector Store is a vector database that allows you to store and retrieve embeddings efficiently and accurately. Faiss Vector Store uses a state-of-the-art algorithm called Product Quantization (PQ) to compress and index embeddings, which reduces the storage space and improves the search speed. You can use Faiss Vector Store to store the embeddings of your local document and to query them for the most relevant chunks based on the user input.

Build a Private Chatbot with Langchain, Azure OpenAI, and Faiss Vector Store

Now that you have an idea of what these technologies are and what they can do, let’s use them to build a private chatbot with Langchain, Azure OpenAI, and Faiss Vector Store for local document queries. The steps are as follows:

  • Step 1: Install Langchain and its Dependencies

    You need to install Langchain and its dependencies, such as Faiss, OpenAI, and Azure OpenAI Service, on your machine. You also need to import the required libraries and modules for your chatbot.

  • Step 2: Load your Local Document

    You need to load your local document using Langchain’s document loader classes. You can use any text format, such as PDF, HTML, or plain text, as long as it is readable by Langchain. For example, you can load a PDF document using the PyPDFLoader class and a DOCX document using the Docx2txtLoader class, as shown in the following code snippet:

    from langchain.document_loaders import PyPDFLoader
    from langchain.document_loaders import Docx2txtLoader
    
    # Load a PDF
    loader = PyPDFLoader(file_path=tmp_file_path)
    
    # Load a DOCX
    loader = Docx2txtLoader(file_path=tmp_file_path)
    
  • Step 3: Split your Document into Smaller Chunks

    You need to split your document into smaller chunks using Langchain’s CharacterTextSplitter or SentenceTextSplitter classes. The code snippet below splits your document into sentences using the SentenceTextSplitter class.

    # Split the document into sentences using SentenceTextSplitter
    splitter = SentenceTextSplitter()
    chunks = splitter.split(document)
    

    You can also use the load_and_split method of the loaders to split your document into chunks automatically, based on the file format and the structure of your document.
    Below is the code snippet of PyPDFLoader to split your PDF document into pages:

    # Split the PDF document into pages using PyPDFLoader
    loader = PyPDFLoader("my_document.pdf")
    pages = loader.load_and_split()
    

    Similarly, you can use Docx2txtLoader to split your DOCX document into paragraphs, as shown in the following code snippets:

    # Split the DOCX document into chunks using Docx2txtLoader
    loader = Docx2txtLoader("my_document.docx")
    paragraphs = loader.load_and_split()
    
  • Step 4: Create Embeddings and Store them in a Faiss Vector Database

    You can use the FAISS class to create a Faiss vector database from your local document, which will store the embeddings locally and allow you to query them later.

    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    
    embeddings = OpenAIEmbeddings(deployment=OPENAI_ADA_DEPLOYMENT_NAME,
                                  model=OPENAI_ADA_MODEL_NAME,
                                  openai_api_base=OPENAI_DEPLOYMENT_ENDPOINT,
                                  openai_api_type="azure",
                                  chunk_size=1)
    
    db = FAISS.from_documents(documents=pages, embedding=embeddings)
    db.save_local("./dbs/documentation/faiss_index")
    
  • Step 5: Create a Chatbot Agent

    You need to create the language model for your chatbot agent using Langchain’s AzureChatOpenAI class. This class allows you to interact with the Azure OpenAI Service and generate natural language responses based on the user input and the retrieved chunks. You can choose the model and the parameters of the Azure OpenAI Service according to your preference. Create an instance of the AzureChatOpenAI class for your chatbot, as shown in the following code snippet:

    from langchain.chat_models import AzureChatOpenAI
    
    llm = AzureChatOpenAI(deployment_name=OPENAI_DEPLOYMENT_NAME,
                          model_name=OPENAI_MODEL_NAME,
                          openai_api_base=OPENAI_DEPLOYMENT_ENDPOINT,
                          openai_api_version=OPENAI_DEPLOYMENT_VERSION,
                          openai_api_key=OPENAI_API_KEY,
                          openai_api_type="azure")
    
  • Step 6: Define a Chain of Actions for your Chatbot Agent

    You need to define a chain of actions for your chatbot agent using Langchain’s Chain class. A chain is a sequence of calls that can be executed by the agent to perform a specific task. For example, you can define a chain that takes the user input, queries the Faiss vector database for the most relevant chunks, and generates a response using the Azure OpenAI Service.
    Use ConversationalRetrievalChain from Langchain to create a chatbot agent that can answer questions using Azure OpenAI Service models and the local document retriever.

    from langchain.vectorstores import FAISS
    from langchain.chains import ConversationalRetrievalChain
    from langchain.chains.question_answering import load_qa_chain
    
    #load the faiss vector store we saved locally 
    vectorStore = FAISS.load_local("./dbs/documentation/faiss_index", embeddings)
    
    #use the faiss vector store we saved to search the local document
    retriever = vectorStore.as_retriever(search_type="similarity", search_kwargs={"k":2})
        
    qa = ConversationalRetrievalChain.from_llm(llm=llm,
                                                retriever=retriever,
                                                condense_question_prompt=QUESTION_PROMPT,
                                                return_source_documents=True,
                                                verbose=False)
    
  • Step 7: Run your Chatbot Agent

    You can use the Streamlit library to create a web-based interface for your chatbot, as shown in the following code snippet:

    import streamlit as st

    # Process user query and get response
    def ask_question_with_context(qa, question, chat_history):
        result = qa({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
        return chat_history


    # Keep the running conversation so follow-up questions have context
    chat_history = []
    user_query = st.text_input("Ask a question:")

    if st.button("Submit"):
        if user_query:
            st.write("User Query:", user_query)
            chat_history = ask_question_with_context(qa, user_query, chat_history)
            response = chat_history[-1][1] if chat_history else "No response"
            st.write("Answer:", response)
    


What are the Benefits of Using a Private Chatbot with Langchain, Azure OpenAI, and Faiss Vector Store?

By using a private chatbot with Langchain, Azure OpenAI, and Faiss Vector Store for local document query, you can achieve the following benefits:

  • Security: You can keep your local document private and secure, as you do not need to upload it to any external server or service. You can also control the access and usage of your chatbot, as you do not need to share it with anyone else.
  • Personalization: You can customize your chatbot according to your needs and preferences, as you can choose the text format, the chunk size, the embedding model, the index type, the generation model, and the chain of actions for your chatbot.
  • Real-Time Updates: You can ensure that your chatbot is always up-to-date, as you can update your local document and your chatbot whenever you want. You can also leverage the latest natural language processing technologies, such as OpenAI, to create high-quality natural language responses for your chatbot.

How to Use Your Private Chatbot with Langchain, Azure OpenAI, and Faiss Vector Store?

Once you have built your private chatbot with Langchain, Azure OpenAI, and Faiss Vector Store for local document query, you can use it for various purposes, such as:

  • Learning: You can use your chatbot to learn new information or skills from your local document, such as a textbook, a manual, or a tutorial. You can ask your chatbot questions, request summaries, or request examples from your local document.
  • Researching: You can use your chatbot to research a topic or a problem from your local document, such as a paper, a report, or a case study. You can ask your chatbot to provide you with relevant facts, arguments, or evidence from your local document.
  • Creating: You can use your chatbot to create new content or products from your local document, such as a blog, a presentation, or a prototype. You can ask your chatbot to generate ideas, suggestions, or solutions from your local document.

Conclusion

In this blog, we have learned how to build a private chatbot with Langchain, Azure OpenAI, and Faiss Vector Store for local document query. The integration of these technologies enables the development of a secure and personalized private chatbot. This approach offers enhanced security, personalization, and access to local document knowledge. By following the provided steps, you can create a chatbot tailored to your needs, ensuring privacy and control over your data. This technology stack holds great potential for various applications, including learning, research, and content creation.

Thank you for reading!


Coveo Headless Library Integration with SAPUI5 Framework: Development Environment Setup – Phase I

In this blog, we will explore how to integrate Coveo Headless, a powerful search and relevance platform, with OpenUI5, a popular UI framework for building web applications. As search functionality becomes increasingly crucial for modern applications, this integration will allow us to create an advanced search experience within OpenUI5 projects.

Introduction

Coveo Headless is a search and relevance platform that offers a set of APIs to build tailored search experiences. It leverages machine learning and AI to deliver personalized results, making it a powerful tool for enhancing search functionality.

OpenUI5 is a UI framework based on JavaScript that facilitates the development of responsive web applications. It provides a collection of libraries and tools for creating consistent and visually appealing user interfaces.

By integrating Coveo Headless with OpenUI5, we can combine the strengths of Coveo’s advanced search capabilities with OpenUI5’s flexible UI components, resulting in a comprehensive and user-friendly search experience.

Requirements

Before we dive in, it’s essential to ensure you have the following prerequisites:

  • Basic knowledge of Coveo and OpenUI5 components.
  • Familiarity with JavaScript and Node.js.
  • Node.js version >= 18.12.0 installed (you can use Node Version Manager, NVM, for this).

Setting Up the Development Environment

In this section, we’ll guide you through the process of setting up your development environment to integrate Coveo Headless with OpenUI5. This includes cloning a sample OpenUI5 repository, upgrading your Node.js version, installing required dependencies, adding dependencies to the  package.json file, and configuring shims for compatibility.

Clone Sample OpenUI5 Repository:

To get started, clone the OpenUI5 sample application repository from GitHub.

Repository URL: https://github.com/SAP/openui5-sample-app

This sample repository provides a basic structure for an OpenUI5 application and will serve as the foundation for integrating the Coveo Headless library.

Configurations

Step-01: Add Dependencies to package.json:

Open the package.json file in your project directory. Add the following dependencies to the “dependencies” section:

"dependencies": {
    "@coveo/headless": "^1.109.0",
    "http-proxy": "^1.18.1",
    "openui5-redux-model": "^0.4.1"
}

Step-02: Add Shim Configuration:

In your ui5.yaml configuration file, add the shim configuration for the Coveo Headless package. This configuration ensures that OpenUI5 correctly loads the Coveo Headless module:

---
specVersion: "2.5"
kind: extension
type: project-shim
metadata:
  name: ui5-ts-shim-showcase.thirdparty
shims:
  configurations:
    "@coveo/headless":
      specVersion: "2.5"
      type: module
      metadata:
        name: "@coveo/headless"
      resources:
        configuration:
          paths:
            "/resources/@coveo/headless/": ""

Step-03: Install Dependencies:

Run the following commands in your project directory to install the newly added dependencies.

npm install
cd webapp
yarn install

Please note that the installation might take some time.

Step-04: Configure Component.js:

Open your Component.js file located within the webapp folder and add the following code. It ensures that Coveo Headless is properly mapped and recognized as a module by OpenUI5:

sap.ui.loader.config({
  map: {
    "*": {
      "@coveo/headless": "@coveo/headless/dist/browser/headless"
    }
  },
  shim: {
    "@coveo/headless/": {
      "amd": true,
      "deps": [],
      "exports": "CoveoHeadless"
    }
  }
});

sap.ui.define(["sap/ui/core/UIComponent", "sap/ui/core/ComponentSupport", "@coveo/headless"], function(UIComponent) {
  "use strict";
  return UIComponent.extend("sap.ui.demo.todo.Component", {
    metadata: {
      manifest: "json"
    }
  });
});

Start a local server and run the application (http://localhost:8080/index.html).

npm start or ui5 serve -o index.html

This setup ensures that Coveo Headless is correctly loaded and available within your OpenUI5 project. You can also verify this in your browser’s console.


Now you can use the CoveoHeadless variable within your OpenUI5 project to initialize the Coveo search engine and start building advanced search functionality.

Summary

By performing the above steps, you will have successfully prepared your development environment to integrate Coveo Headless with OpenUI5. The sample OpenUI5 application and the added dependencies will serve as the basis for building your enhanced search functionality.

Build a Search Interface Using SAPUI5 Framework with Coveo Headless Library- Phase II https://blogs.perficient.com/2023/10/12/build-a-search-interface-using-sapui5-framework-with-coveo-headless-library-phase-ii/ https://blogs.perficient.com/2023/10/12/build-a-search-interface-using-sapui5-framework-with-coveo-headless-library-phase-ii/#respond Fri, 13 Oct 2023 03:22:49 +0000 https://blogs.perficient.com/?p=344007

In our previous blog post, “Coveo Headless Library Integration with OPENUI5 Framework: Development Environment Setup – Phase I,” we started the integration between Coveo Headless and OpenUI5.

Coveo Headless is a search and relevance platform, and OpenUI5 is a dynamic UI framework for web application development. Phase I establishes the foundation for what we’re about to explore in this Phase II edition, so if you haven’t had a chance to read it, we highly recommend doing so.

In Phase I, we set up our development environment, ensuring that all the prerequisites were met to integrate Coveo Headless with OpenUI5. We covered everything from cloning the sample OpenUI5 repository to configuring shims for compatibility. Now that our development environment is ready, it’s time to build the search interface using OpenUI5 controls and Coveo Headless controller instances.

Initializing the Search Engine using CoveoHeadless:

To get things rolling, we need to initialize the Coveo search engine using the buildSearchEngine function from the Headless library inside the onInit function of the App.controller.js file. This is where we define the necessary configurations.

For demonstration purposes, we will use a sample configuration:

// Initialize the Coveo search engine
this.searchEngine = CoveoHeadless.buildSearchEngine({
    configuration: {...CoveoHeadless.getSampleSearchEngineConfiguration()}
});

For detailed insight into how to customize and add your configuration parameters, refer to the documentation.

After building the search engine, you can verify that it was successfully initialized by using console.log(this.searchEngine) in the same file. Check the browser’s console to see if the search engine object is displayed without any errors.

[Screenshot: the initialized search engine object logged in the browser console]

Creating a UI5 Component with Coveo Headless:

Whenever you’re constructing a new UI5 component that incorporates Coveo Headless capabilities, the following steps must be followed:

  1. Create a controller instance.
  2. Create a fragment and include the necessary UI5 control.
  3. Bind the value to the UI5 control.
  4. Perform necessary actions.

Building a Search Box Component:

Let’s walk through a practical example of creating a search box component using OpenUI5 with Coveo Headless functionality. In this example, you’ll gain insight into the process of building a search box that allows users to input search queries and receive relevant search results.

Step 1: Create a Controller Instance

In your OpenUI5 controller, start by creating a controller instance that will manage the behaviour of your search box component.

sap.ui.define([
    "sap/ui/core/mvc/Controller",
    "@coveo/headless"
], function(Controller, CoveoHeadless) {
    "use strict";

    return Controller.extend("your.namespace.ControllerName", {
        onInit: function() {
            // Initialize the Coveo search engine (as shown in the snippet above)
            this.searchEngine = CoveoHeadless.buildSearchEngine({
                configuration: {...CoveoHeadless.getSampleSearchEngineConfiguration()}
            });
            // Create a search box controller instance
            this.buildSearchBox();
        },

        buildSearchBox: function() {
            // Options for the Coveo Headless search box controller
            const searchBoxOptions = {
                enableQuerySyntax: true,
                numberOfSuggestions: 5,
                id: "main-searchBox",
                clearFilters: false
            };
            this.searchBox = CoveoHeadless.buildSearchBox(this.searchEngine, {
                options: searchBoxOptions
            });
        },

        // Other methods and event handlers...
    });
});

Step 2: Create a UI5 Fragment

Next, create a fragment that includes the UI5 controls for your search box component’s interface. We’ll use the SearchField control to allow users to input search queries.

<!-- SearchBox.fragment.xml -->
<core:FragmentDefinition xmlns="sap.m" xmlns:core="sap.ui.core">
    <SearchField
        id="searchField"
        width="100%"
        placeholder="Search for..."
        enableSuggestions="true"
        search=".onSearch"
    ></SearchField>
</core:FragmentDefinition>

Step 3: Bind UI5 Control Values

While not needed for the SearchField control, you might need to bind values to other UI5 controls in your fragment to ensure synchronization with the controller.

Step 4: Implement Search Functionality

Implement the onSearch function in your controller. This function will be triggered when users interact with the search field. It will update the search box text using the updateText method and trigger the search query using this.searchBox.submit().

onSearch: function(oEvent) {
    var sSearchQuery = oEvent.getParameter("query");
    if (sSearchQuery && sSearchQuery.length > 0) {
        this.searchBox.updateText(sSearchQuery);
        this.searchBox.submit();
    }
}

Verifying Network Calls after Successful Search Box Integration

After integrating the search box component using OpenUI5 and Coveo Headless, it’s important to ensure that the search functionality is working as expected. One way to verify this is by checking the network calls made between your application and the Coveo search engine.

Here’s how you can do it:

  • Input a search query in the search box.
  • Press Enter or trigger the search action.
  • When you initiate a search, check the “Network” tab in your developer tools.
  • You will see network calls between your application and the Coveo search engine.
  • These calls include the search query and the corresponding responses.
  • Responses may include search results, suggestions, and other related data.

[Screenshot: the search query request visible in the Network tab]

By Analyzing the Network Calls, You Can:

  • Confirm that the search query is being sent to the Coveo search engine.
  • Review the response to ensure that relevant search results or suggestions are being received.

Keep in mind that the specific URLs and details of the network calls will depend on your Coveo Headless configuration and the API endpoints you’re using.

Through network call verification, you can ensure that your search box component successfully communicates with the Coveo search engine and receives the desired search results or suggestions. This step is crucial to confirming the successful integration of the search functionality into your OpenUI5 application.


The complete code can be found in the repository that’s attached below.

Conclusion:

We’ve gone into more detail in this Phase II of our blog series about how to integrate Coveo Headless with the OpenUI5 framework to create a powerful search interface for your web applications. From setting up the Coveo search engine to creating a search box component, we have covered all the necessary processes. These steps will help you add advanced search capabilities to your OpenUI5 applications. Stay tuned for more insights in our ongoing series!

To be Continued…

Generate Embeddings using OpenAI Service https://blogs.perficient.com/2023/09/07/generate-embeddings-using-openai-service/ https://blogs.perficient.com/2023/09/07/generate-embeddings-using-openai-service/#respond Thu, 07 Sep 2023 05:46:20 +0000 https://blogs.perficient.com/?p=344195

Introduction:

Embeddings are essential in the fields of natural language processing (NLP) and machine learning because they convert words and phrases into numerical vectors. By successfully capturing semantic linkages and contextual meanings, these vectors help machines comprehend and process human language. We will examine the idea of embeddings in this blog, learn about their uses, and investigate how to create and incorporate them using Azure Cognitive Search.

Organizations can create advanced search solutions using Azure Cognitive Search, an effective cloud-based search and AI service. Combined with OpenAI’s models and the power of embeddings, these solutions become even more precise and efficient.

What are Embeddings?

Word, phrase, or document embeddings are multi-dimensional vector representations that capture semantic meaning and contextual relationships. Embeddings detect complexity that conventional approaches frequently miss by mapping words onto numerical vectors in a dense vector space. Similar words are placed closer together in this area, allowing algorithms to comprehend and compare textual material more effectively.

Applications of embeddings include sentiment analysis, recommendation engines, semantic search, and many more.

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
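
To make “placed closer together” concrete, embedding vectors are typically compared with cosine similarity: the closer the score is to 1, the more semantically related the two texts are. Here is a minimal sketch, assuming two embeddings have already been generated (the short vectors below are illustrative placeholders, not real model output):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative placeholder vectors standing in for real embeddings
embedding_laptop = [0.021, -0.013, 0.044, 0.008]
embedding_notebook = [0.019, -0.011, 0.041, 0.010]
embedding_invoice = [-0.030, 0.052, -0.007, 0.025]

print(cosine_similarity(embedding_laptop, embedding_notebook))  # higher score: related meanings
print(cosine_similarity(embedding_laptop, embedding_invoice))   # lower score: unrelated meanings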

We first need to generate the embeddings and then send them to Azure Cognitive Search, which can leverage them to make search more effective.

Generating Embeddings with OpenAI Services:

Generating embeddings involves utilizing pre-trained models or training custom models on specific datasets. OpenAI’s API provides endpoints for generating embeddings as follows:

API: https://api.openai.com/v1/embeddings: The default endpoint from OpenAI for generating embeddings without deployment-specific information.

Headers:

Content-Type: application/json
Authorization: Bearer your-openai-api-key

[Screenshots: the Authorization and Content-Type headers configured in the request]

Navigate to the URI below to generate an OpenAI API key:

https://platform.openai.com/account/api-keys

[Screenshots: creating a new secret key on the OpenAI API keys page]
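
To illustrate, here is a minimal sketch of calling this endpoint with Python’s requests library. The input text is just an example, and the snippet assumes your API key is stored in the OPENAI_API_KEY environment variable rather than hard-coded:

import os
import requests

api_key = os.environ["OPENAI_API_KEY"]  # assumes the key is exported as an environment variable

response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    json={
        "model": "text-embedding-ada-002",
        "input": "Azure Cognitive Search with embeddings",
    },
)
response.raise_for_status()

embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # number of dimensions in the returned vector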

The complete JSON response object that contains these embeddings is given as follows:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "index": 0,
            "embedding": [
                -0.022749297,
                0.018456243,
                -0.0120750265,
                0.013086683,
                -0.0018012022……
            ]
        }
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {
        "prompt_tokens": 2,
        "total_tokens": 2
    }
}

The size of the generated embedding depends on the model generation: the first-generation Ada embedding models return vectors of roughly 1,024 floats, while text-embedding-ada-002 (the model shown in the response above) returns vectors of 1,536 floats.

In other words, each text embedding is represented as a single vector of floating-point values, and those values collectively make up the embedding for the given text.

The input text must also stay within the model’s token limit: about 2,048 tokens (roughly 2-3 pages of text) for the first-generation models and 8,191 tokens for text-embedding-ada-002. Please ensure that your inputs are within this limit before initiating a request.
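
One way to verify this before calling the API is to count tokens locally with the tiktoken library. A minimal sketch, assuming the cl100k_base encoding used by text-embedding-ada-002:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-ada-002

text = "Azure Cognitive Search with embeddings"
token_count = len(encoding.encode(text))

print(token_count)
if token_count > 8191:
    print("Input exceeds the model's limit; split or truncate the text before sending it.")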

By incorporating these generated embeddings into your search solution (for example, by indexing them in Azure Cognitive Search, as sketched below), you’re not only improving the search experience but also taking a significant step toward more advanced, AI-driven search.
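
The sketch below uses the azure-search-documents Python SDK to upload an embedding alongside its source text. It assumes an index with a vector field named content_vector already exists; the service endpoint, admin key, index name, and field names are placeholders rather than values from this setup:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder connection details for an existing Azure Cognitive Search service and index
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="docs-index",
    credential=AzureKeyCredential("<your-admin-key>"),
)

# Replace with the vector returned by the embeddings API for this document's content
embedding = [0.0] * 1536

document = {
    "id": "1",
    "content": "Azure Cognitive Search with embeddings",
    "content_vector": embedding,
}

result = search_client.upload_documents(documents=[document])
print(result[0].succeeded)  # True if the document was indexed successfully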

Conclusion:

Embeddings act as a bridge between what we say and what computers understand. They use numerical vectors to capture the meaning of words and how they fit together, making it easier for machines to figure out what we mean. In this blog, we’ve seen how to generate these embeddings using the OpenAI service.

These embeddings are very useful for improving search results, understanding what people mean by their search terms, suggesting things you might like, and helping search engines understand what you’re looking for.

Translating Different Content Types Using Helsinki-NLP ML Model from Hugging Face https://blogs.perficient.com/2023/09/04/translating-different-content-types-using-helsinki-nlp-ml-model-from-hugging-face/ https://blogs.perficient.com/2023/09/04/translating-different-content-types-using-helsinki-nlp-ml-model-from-hugging-face/#respond Mon, 04 Sep 2023 06:37:23 +0000 https://blogs.perficient.com/?p=343666

Introduction

In Hugging Face, a translation model is a pre-trained deep learning model that can be used for machine translation tasks. These models are pre-trained on large amounts of multilingual data and fine-tuned on translation-specific datasets.

To use a translation model in Hugging Face, we typically load the model using the from_pretrained() function, which fetches the pre-trained weights and configuration. Then, we can use the model to translate text by passing the source language text as input and obtaining the translated text as output.

Hugging Face’s translation models are implemented in the Transformers library, which is a popular open-source library for natural language processing (NLP) tasks. The library provides a unified interface and a set of powerful tools for working with various NLP models, including translation models.

Let’s start by implementing a translation model using the Helsinki-NLP model from Hugging Face:

  1. Install the necessary libraries: Install the transformers library, which includes the translation models; use pip to install it.
    pip install transformers
  2. Load the translation model: Use the from_pretrained() function to load a pre-trained translation model. You need to specify the model’s name or identifier. For example, to load the English-to-French translation model, we can use the following code.
    from transformers import MarianMTModel, MarianTokenizer
    
    model_name = "Helsinki-NLP/opus-mt-en-fr"
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    
  3. Tokenize the input: Before translating the text, we need to tokenize it using the appropriate tokenizer. The tokenizer splits the text into smaller units, such as words or subwords, that the model understands.
    source_text = "Translate this English text to French."
    encoded_input = tokenizer.encode(source_text, return_tensors="pt")
    
  4. Translate the text: Pass the encoded input to the translation model to obtain the translated output.
    translated_output = model.generate(encoded_input)
    
  5. Decode the output: Decode the generated tokens with the tokenizer to obtain the translated text.
    translated_text = tokenizer.decode(translated_output[0], skip_special_tokens=True)
    

    These steps provide a basic outline of implementing a translation model in Hugging Face.

Handling Metadata, HTML Body, and Plain Text

With the fundamentals of the Helsinki-NLP Hugging Face model in hand, let’s get started by translating various forms of content, including plain text, HTML body content, and metadata.

It is essential to determine the type of content we will be dealing with before we start translating.

Methods used to differentiate between plain text, HTML, and metadata are as follows:

  • is_plain_text(content): By looking for the presence of HTML tags and Python string identifiers, this function can tell if the content is plain text.
  • is_html_content(content): Identifies HTML content by checking for the presence of the <html> tag.
  • is_python_string(content): Recognizes metadata in Python strings based on specific delimiters.

Approaches that demonstrate the translation of different content types:

  1. Translating Metadata Content: Metadata often consists of structured data in the form of key-value pairs such as name, title, etc. This function translates just the values of the metadata object while leaving the keys as they are:
    def translate_metadata_content(metadata,model,tokenizer,fields_to_translate):
                translated_metadata = {}
                # Loop through each field and perform the translation process
                for key, value in metadata.items():
                    # Translate the value if it is a string and included in fields_to_translate
                    if isinstance(value, str) and key in fields_to_translate:
                        value_tokens = tokenizer.encode(value, return_tensors='pt')
                        translated_value_tokens = model.generate(value_tokens, max_length=100)
                        translated_value = tokenizer.decode(translated_value_tokens[0], skip_special_tokens=True)
                    else:
                        translated_value = value
    
                    translated_metadata[key] = translated_value
    
                return json.dumps(translated_metadata)
    
  2. Translating Plain Text Content: Plain text translation is a simple technique. To translate plain text from one language to another, we’ll use our translation model:
    def translate_plainText(content,model,tokenizer):
                # Tokenize the plain text content
                encoded = tokenizer(content, return_tensors="pt", padding=True, truncation=True)
    
                # Translate the text
                translated_tokens = model.generate(**encoded, max_length=1024, num_beams=4, early_stopping=True)
                return tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    
  3. Translating HTML Body Content: Due to the presence of markup, HTML body content requires special processing. This method focuses on translating HTML body text:
    def translate_html_content(content,model,tokenizer):
                 # Parse the HTML content
                 soup = BeautifulSoup(content, 'html.parser')
    
                 # Translate the text
                 translated_text = model.generate(**tokenizer(content, return_tensors="pt", padding=True, truncation=True),
                                              max_length=1024, num_beams=4, early_stopping=True)
                 translated_text = tokenizer.decode(translated_text[0], skip_special_tokens=True)
    
                 # Create a new soup with the translated text
                 new_soup = BeautifulSoup(translated_text, 'html.parser')
    
                 # Replace the text in the original HTML structure
                 for original_tag, translated_tag in zip(soup.find_all(), new_soup.find_all()):
                     if original_tag.string:
                         original_tag.string = translated_tag.get_text()
                 return soup.prettify()
    

Putting it All Together

To bring everything together, here is a central function that selects the appropriate translation method according to the content type:

import subprocess
import json

# Install the necessary packages if they are not already installed
try:
    import transformers
    import sacremoses
except ImportError:
    subprocess.check_call(['pip', 'install', 'torch', 'transformers', 'sacremoses'])
    import transformers
    import sacremoses

from transformers import MarianMTModel, MarianTokenizer
from bs4 import BeautifulSoup



def translate_content(content):
    # Load the translation model and tokenizer
    model_name = f'Helsinki-NLP/opus-mt-en-fr'
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    tokenizer.src_tokenizer = sacremoses.MosesTokenizer()
    tokenizer.tgt_tokenizer = sacremoses.MosesTokenizer()

    # Check if the input is HTML or plain text or metadata
    if is_html_content(content):
        translated_content = translate_html_content(content,model,tokenizer)     
    elif is_python_string(content):
        print("Content is a Python string expression.")
        fields_to_translate=['title','name']
        content = content.replace("null", "None")
        metadata = eval(content)
        translated_content = translate_metadata_content(metadata,model,tokenizer,fields_to_translate)
    elif is_plain_text(content):
        translated_content = translate_plainText(content,model,tokenizer)
        
    return translated_content          
       
def translate_metadata_content(metadata,model,tokenizer,fields_to_translate):
    # Utilize the code snippet from the first point above to translate metadata values.
    ...

def translate_plainText(content,model,tokenizer):
    # Utilize the code snippet from the second point above to translate plain text.
    ...

def translate_html_content(content,model,tokenizer):
    # Utilize the code snippet from the third point above to translate HTML content.
    ...

def is_html_content(content):
    return "<html>" in content.lower()

def is_plain_text(content):
    return "<html>" not in content.lower() and not is_python_string(content)
    
def is_python_string(content):
    return (content.startswith("'") and content.endswith("'")) or \
           (content.startswith('"') and content.endswith('"')) or \
           (content.startswith("{") and content.endswith("}"))

Example Usage

Here are examples of using the provided functions with different content types.

# Example usage with HTML content:
html_content = """
<html>
<head>
    <title>Example HTML</title>
</head>
<body>
    <h1>Hello, world!</h1>
    <p>This is a sample HTML content to be translated.</p>
</body>
</html>
"""
translated_html = translate_content(html_content)
print(translated_html)


# Example usage with plain text:
plain_text = "plain text content for testing translation functionality "

translated_text = translate_content(plain_text)
print(translated_text)


# Example usage with metadata
metadata ="{'title':'title for testing translation of metadata value'}"


translated_metadata = translate_content(metadata)
print(translated_metadata)

You just need to run this script separately using the command below.

python your_file_name.py

Conclusion:

The Helsinki-NLP models from Hugging Face are powerful tools for translating different types of content, including plain text, HTML body content, and metadata. Using the pre-trained MarianMT models in the Transformers library, we can easily translate text from one language to another.

Getting Started with Coveo: A Comprehensive Overview for Tech Enthusiasts https://blogs.perficient.com/2023/07/23/getting-started-with-coveo-a-comprehensive-overview-for-tech-enthusiasts-2/ https://blogs.perficient.com/2023/07/23/getting-started-with-coveo-a-comprehensive-overview-for-tech-enthusiasts-2/#comments Sun, 23 Jul 2023 12:18:18 +0000 https://blogs.perficient.com/?p=340674

Introduction:

Welcome to this comprehensive blog post about COVEO – a cutting-edge technology that is revolutionizing the way we search and access information. In this blog, we will delve into what COVEO is, its key features, benefits, and how it enhances the overall user experience. Whether you’re an IT professional, a developer, or simply curious about the latest tech innovations, this blog will offer valuable insights into the world of COVEO.

What is COVEO?

COVEO is an AI-powered enterprise search solution that revolutionizes how businesses interact with data and information. It goes beyond traditional keyword-based searches, leveraging machine learning algorithms to deliver personalized and relevant results in real-time. Whether you are looking for documents, files, customer information, or insights from vast knowledge repositories, COVEO’s intelligent capabilities make finding information effortless.

By analyzing user behavior, preferences, and context, COVEO learns from interactions to continually improve the search results and cater to individual needs. Its ability to index and process vast amounts of data ensures quick and accurate retrieval of information.

Key Features of COVEO

  • Intelligent Search: COVEO’s intelligent search capabilities enable users to find the most relevant information quickly, even from extensive databases, by understanding the context of the search query and providing real-time suggestions.
  • Machine Learning-Powered Recommendations: Leveraging machine learning algorithms, COVEO offers personalized content recommendations, enhancing user engagement and overall satisfaction.
  • Unified Content Access: COVEO integrates seamlessly with various platforms, such as CRM systems, e-commerce websites, and knowledge bases, providing a unified access point for all critical data sources. This consolidation ensures that users can access all the necessary information from a single interface, streamlining workflows and saving time.
  • Advanced Analytics: Gain valuable insights into user behavior, content usage patterns, and search trends with COVEO’s robust analytics, helping organizations optimize their content and improve decision-making.
  • Natural Language Processing: COVEO’s natural language processing capabilities enable users to interact with the system using everyday language, simplifying the search process and reducing the learning curve.
  • Real-time Updates: COVEO ensures that users have access to the latest and most up-to-date information. Real-time indexing and constant data synchronization mean that users won’t miss any crucial updates or changes.

How COVEO Enhances User Experience

  • Personalization: COVEO’s ability to understand user preferences and past interactions allows it to deliver personalized search results, increasing user satisfaction and productivity.
  • Faster Access to Information: With its lightning-fast search capabilities, COVEO significantly reduces search time, empowering users to find the information they need in mere seconds.
  • Contextual Relevance: By considering the context of a user’s query, COVEO ensures that search results are not only accurate but also contextually relevant, leading to more informed decision-making.
  • AI-Driven Recommendations: COVEO’s AI-powered content recommendations anticipate user needs, presenting relevant information proactively, and facilitating a smooth user journey.

Benefits of Implementing COVEO

  • Increased Productivity: With quick and accurate access to information, employees can complete tasks faster and be more efficient in their daily work.
  • Enhanced Customer Experience: COVEO’s personalization and recommendation features extend to customer-facing platforms, improving customer satisfaction and retention rates.
  • Data-Driven Decision Making: The insights generated by COVEO’s analytics empower businesses to make data-driven decisions, identify knowledge gaps, and optimize content.
  • Reduced Support Costs: By enabling users to find answers to their queries independently, COVEO reduces the burden on support teams, leading to lower support costs.

Real-World Use Cases of COVEO

  • E-commerce: COVEO enhances online shopping experiences by providing personalized product recommendations, boosting conversions and revenue for businesses.
  • Customer Support: By empowering support agents with access to relevant information, COVEO improves first-call resolution rates and customer satisfaction in call center environments.
  • Employee Intranet: Organizations can deploy COVEO on their intranet, allowing employees to find internal resources, documents, and company information with ease.

Conclusion:

In conclusion, COVEO is a game-changing technology that significantly improves the way we search for and access information. With its advanced AI-driven capabilities, personalized recommendations, and seamless integration with various platforms, COVEO is a valuable asset for businesses seeking to enhance user experiences and streamline operations. Whether it’s in the realm of e-commerce, customer support, or internal knowledge management, COVEO has proven to be a versatile solution with countless benefits. Embracing COVEO is a step towards empowering users, optimizing processes, and staying ahead in the ever-evolving tech landscape. So, dive into the world of COVEO and unlock the true potential of your organization’s data and content management.
