In today’s data-driven era, we deal with massive amounts of data coming from different sources in different formats, which creates a very big challenge, especially in terms of efficient search and information retrieval. Diving into this deep ocean and locating the correct data is not an easy task without AI.
In this article, I’ll show you how you can integrate Azure SQL Database, FAISS, and advanced models from Hugging Face to enhance your search capabilities.
Let’s take a quick look at each of these pillars.
Understanding Terminology
Azure SQL Database
Azure SQL Database is a cloud-based service that provides a fully managed database system. Its high availability, scalability, and security make it a great option for handling large amounts of data.
FAISS
FAISS is short for Facebook AI Similarity Search. It is a library developed by Facebook AI Research that enables efficient similarity search and clustering of dense vectors, and it supports various indexing methods. You can find more details about it here.
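To give a feel for the API, here is a minimal, self-contained sketch (using randomly generated dummy vectors, not the article's actual data) that builds a flat L2 index and runs a nearest-neighbor search:

import numpy as np
import faiss

# Create 1,000 random 64-dimensional vectors as dummy data
dimension = 64
vectors = np.random.random((1000, dimension)).astype('float32')

# Build a flat (brute-force) L2 index and add the vectors
index = faiss.IndexFlatL2(dimension)
index.add(vectors)

# Search for the 5 nearest neighbors of the first vector
distances, indices = index.search(vectors[:1], 5)
print(indices)  # the first hit should be the query vector itself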
Hugging Face
Hugging Face is very well-known for its collection of natural language processing (NLP) models. These models can produce embeddings for text, which can then be stored in FAISS and retrieved from it. You can find more details about the model we are going to use here.
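As a quick illustration (a minimal sketch; the sample sentence is made up), encoding a piece of text with a sentence transformer returns a fixed-size dense vector:

from sentence_transformers import SentenceTransformer

# Load the pre-trained model (it is downloaded on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode a sample sentence into a dense vector
embedding = model.encode("FAISS makes similarity search fast.")
print(embedding.shape)  # (384,) for all-MiniLM-L6-v2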
Problem Statement
This article walks you through the process of querying your data stored in Azure SQL DB, generating embeddings using the sentence transformer model 'all-MiniLM-L6-v2' from Hugging Face, indexing the embeddings with the FAISS library, and then answering the user’s question based on a similarity search over the indexed embeddings.
Prerequisites
To replicate this flow, make sure that your data is available in Azure SQL Database and that you have your connection string handy.
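For reference, a pyodbc connection string for Azure SQL Database typically looks like the sketch below; the server, database, and credential values are placeholders you'll need to replace with your own, and the driver name assumes ODBC Driver 18 for SQL Server is installed:

# Placeholder values; replace them with your own server, database, and credentials
connection_string = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;"
    "Uid=<your-username>;"
    "Pwd=<your-password>;"
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)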
Integrating Azure SQL DB, FAISS, and Hugging Face
Here is the step-by-step walkthrough of how we can make these three systems talk to each other.
Import the Required Python Packages
Let's kick off the process by importing the required packages; if anything is missing, install it first (see the command below).
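A typical install command looks like this (assuming you want the CPU build of FAISS):

pip install numpy pandas faiss-cpu sentence-transformers pyodbc

Once the packages are installed, the imports look like this.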
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import pyodbc
Grab the Data from the Azure SQL Database
Here, the recommendation is to pick only those columns that actually contribute to answering the user's query. Here is the code to do so.
# Establish a connection
conn_str = "CONNECTION_STRING"
conn = pyodbc.connect(conn_str)

# Define the SQL query to retrieve the required columns
sql_query = 'SELECT title, description FROM dbo.MoviesAndSeries'

# Execute the query and read the results into a Pandas DataFrame
df = pd.read_sql_query(sql_query, conn)

# Close the connection
conn.close()
Here are a few rows from my data. I've taken this data set from Kaggle, and it contains data about TV shows and movies on Netflix.
Select the Perfect Model from Hugging Face
I'm going with a sentence transformer model named all-MiniLM-L6-v2, which maps sentences and paragraphs to a 384-dimensional dense vector space. Here is the link to the model.
You can load this pre-trained model using the code below.
model = SentenceTransformer('all-MiniLM-L6-v2')
Convert Text to Vectors
Here, I am converting only the values from the description column to vectors; the title doesn't need an embedding, as I can get it directly from the DataFrame named df.
# Convert all descriptions to vectors
description_vectors = model.encode(df['description'].tolist(), convert_to_numpy=True).astype('float32')

# Normalize the vectors so that inner product corresponds to cosine similarity
faiss.normalize_L2(description_vectors)

# Create a FAISS index
dimension = description_vectors.shape[1]
index = faiss.IndexFlatIP(dimension)

# Add vectors to the index
index.add(description_vectors)
If you want to know about the theoretical aspects of IndexFlatIP, you can check out its official documentation.
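The reason for the normalization step above: for L2-normalized vectors, the inner product that IndexFlatIP computes is exactly the cosine similarity. Here is a tiny NumPy sketch (with made-up vectors) verifying that:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity computed directly
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalized vectors
inner = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(np.isclose(cosine, inner))  # True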
Now, the next two lines are optional, depending on whether you want to save your indexed data and the data frame for future usage.
# Save the index to a file
faiss.write_index(index, "indexes/title_movie.index")

# Save the DataFrame to a CSV file for later use
df.to_csv("indexes/title_movie.csv", index=False)
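If you do save them, you can load both artifacts back later (for example, in a separate script) like this; the file paths are the same ones used above:

index_file_path = "indexes/title_movie.index"

# Reload the FAISS index and the DataFrame saved above
index = faiss.read_index(index_file_path)
df = pd.read_csv("indexes/title_movie.csv")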
Search the Top Matching Similar Records
Here comes the search part, wherein we will do multiple things:
- Generate the embedding of the user's query.
- Search the indexed data for the top_k matches.
- For each of the top_k matches, grab the respective title and description from the DataFrame.
- Display the results to the user.
Here are the two functions that take care of all of this.
def get_data(row_number):
    # Fetch the title and description for the given DataFrame row
    row = df.iloc[row_number]
    response = {}
    response['title'] = row['title']
    response['description'] = row['description']
    return response

def search(query, top_k, index, model):
    # Embed and normalize the query so inner product matches cosine similarity
    query_vector = model.encode([query], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_vector)

    # Search the index for the top_k nearest neighbors
    distances, indices = index.search(query_vector, top_k)

    # De-duplicate the matched row ids (note: np.unique also sorts them)
    top_k_ids = list(np.unique(indices[0]))

    # Map the matched row ids back to titles and descriptions
    results = [get_data(row_id) for row_id in top_k_ids]
    return results
Answer the User's Query
Now that we have all the setup ready, it is time to answer the user's query.
query = "Which movie is based on the story of Ali Baba and the Forty Thieves"

results = search(query, top_k=2, index=faiss.read_index(index_file_path), model=model)

for result in results:
    print('\n', result)
And here is the outcome:
I hope this gave you an idea of how we can use an open-source model with an open-source similarity search library to answer questions over data stored in Azure SQL Database.