Databricks Vector Search
Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.
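To make "return the most similar vectors" concrete, here is a minimal brute-force sketch in plain Python. It is not how Databricks Vector Search works internally (the service uses its own index and distance metric). The cosine metric, the three-dimensional toy vectors, and the names `cosine_similarity`, `most_similar`, and `toy_index` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "index": made-up vectors, not real embeddings.
toy_index = {
    "dinosaurs": [0.9, 0.1, 0.0],
    "dreams": [0.1, 0.9, 0.2],
    "toys": [0.0, 0.2, 0.9],
}

top = most_similar([0.8, 0.2, 0.1], toy_index, k=2)
print(top)
```

A real index avoids comparing the query against every stored vector, but the input and output have the same shape: a query vector goes in, and the closest stored items come out.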
In this walkthrough, we'll demo the SelfQueryRetriever with a Databricks Vector Search index.
Create a Databricks vector store index
First, we'll create a Databricks vector store index and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
Note: The self-query retriever requires you to have lark installed (pip install lark) along with integration-specific requirements.
%pip install --upgrade --quiet langchain-core langchain-community databricks-vectorsearch langchain-openai tiktoken lark
Note: you may need to restart the kernel to use updated packages.
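After restarting the kernel, one quick way to confirm that the packages are importable is to look them up without importing them. This is only a convenience sketch: `is_installed` and the `required` list are hypothetical names, and the list covers top-level module names only.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Top-level module names assumed to correspond to the packages installed above.
required = ["lark", "langchain_core", "langchain_openai", "tiktoken"]
missing = [name for name in required if not is_installed(name)]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```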
We want to use OpenAIEmbeddings, so we need an OpenAI API key.
import getpass
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
databricks_host = getpass.getpass("Databricks host:")
databricks_token = getpass.getpass("Databricks token:")
OpenAI API Key: ········
Databricks host: ········
Databricks token: ········
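If you run the notebook repeatedly, prompting every time gets tedious. The sketch below reads a secret from an environment variable and prompts only when it is unset. `read_secret` is a hypothetical helper, and `DEMO_SECRET` is a placeholder variable used so the example runs without prompting.

```python
import getpass
import os


def read_secret(env_var, prompt):
    """Return the value of env_var, prompting only if it is unset or empty."""
    value = os.environ.get(env_var)
    if not value:
        value = getpass.getpass(prompt)
        os.environ[env_var] = value  # cache for later cells in this session
    return value


# Placeholder value so this demo does not block on input.
os.environ.setdefault("DEMO_SECRET", "not-a-real-secret")
demo = read_secret("DEMO_SECRET", "Demo secret:")
print(len(demo))
```

In the walkthrough, the same helper could be called with whichever variable names you choose for the Databricks host and token.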
from databricks.vector_search.client import VectorSearchClient
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
emb_dim = len(embeddings.embed_query("hello"))
vector_search_endpoint_name = "vector_search_demo_endpoint"
vsc = VectorSearchClient(
    workspace_url=databricks_host, personal_access_token=databricks_token
)
vsc.create_endpoint(name=vector_search_endpoint_name, endpoint_type="STANDARD")
[NOTICE] Using a Personal Authentication Token (PAT). Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().
index_name = "udhay_demo.10x.demo_index"
index = vsc.create_direct_access_index(
    endpoint_name=vector_search_endpoint_name,
    index_name=index_name,
    primary_key="id",
    embedding_dimension=emb_dim,
    embedding_vector_column="text_vector",
    schema={
        "id": "string",
        "page_content": "string",
        "year": "int",
        "rating": "float",
        "genre": "string",
        "text_vector": "array<float>",
    },
)
index.describe()
index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=index_name)
index.describe()
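The schema passed to the index declares a type for every column, including a string primary key and a fixed-length embedding column. As a rough local sanity check before writing rows, you could compare candidate rows against that schema. The sketch below is an assumption-laden illustration, not part of the Databricks client: `TYPE_CHECKS`, `validate_row`, and the toy embedding dimension of 3 are all hypothetical.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Hypothetical mapping from the schema's type names to Python-side checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": _is_number,
    "array<float>": lambda v: isinstance(v, list) and all(_is_number(x) for x in v),
}

schema = {
    "id": "string",
    "page_content": "string",
    "year": "int",
    "rating": "float",
    "genre": "string",
    "text_vector": "array<float>",
}


def validate_row(row, schema, embedding_column, embedding_dimension):
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for column, type_name in schema.items():
        if column not in row:
            continue  # assumption of this sketch: absent columns are allowed
        if not TYPE_CHECKS[type_name](row[column]):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {type_name}, got {actual}")
    vector = row.get(embedding_column)
    if isinstance(vector, list) and len(vector) != embedding_dimension:
        problems.append(
            f"{embedding_column}: expected {embedding_dimension} values, got {len(vector)}"
        )
    return problems


good_row = {
    "id": "6",
    "page_content": "Toys come alive and have a blast doing so",
    "year": 1995,
    "rating": 9.3,
    "genre": "animated",
    "text_vector": [0.1, 0.2, 0.3],
}
bad_row = {"id": 6, "year": "1995", "text_vector": [0.1, 0.2]}

good_problems = validate_row(good_row, schema, "text_vector", 3)
bad_problems = validate_row(bad_row, schema, "text_vector", 3)
print(good_problems)
print(bad_problems)
```

Note that an integer `id` is flagged here because the schema declares the primary key as a string.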
from langchain_core.documents import Document
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"id": 1, "year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"id": 2, "year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"id": 3, "year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"id": 4, "year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"id": 5, "year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"id": 6, "year": 1995, "genre": "animated", "rating": 9.3},
    ),
]
from langchain_community.vectorstores import DatabricksVectorSearch
vector_store = DatabricksVectorSearch(
    index,
    text_column="page_content",
    embedding=embeddings,
    columns=["year", "rating", "genre"],
)
vector_store.add_documents(docs)
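Conceptually, adding documents turns each one into an index row: the text goes into the text column, its embedding goes into the vector column, and only the metadata keys listed in `columns` are carried over. The sketch below illustrates that mapping with stand-ins. It is not the integration's actual implementation: `Doc`, `fake_embed`, and `to_rows` are hypothetical, and deriving the primary key from the metadata `id` is a choice made for this sketch only.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_embed(text, dim=4):
    """Deterministic placeholder for an embedding model; carries no meaning."""
    return [float((len(text) + i) % 7) for i in range(dim)]


def to_rows(docs, columns, text_column="page_content",
            vector_column="text_vector", primary_key="id"):
    """Build one row per document, keeping only the listed metadata columns."""
    rows = []
    for doc in docs:
        row = {
            # Sketch-only choice: cast the metadata id to match a string key.
            primary_key: str(doc.metadata["id"]),
            text_column: doc.page_content,
            vector_column: fake_embed(doc.page_content),
        }
        row.update({k: v for k, v in doc.metadata.items() if k in columns})
        rows.append(row)
    return rows


sample = [
    Doc(
        page_content="Toys come alive and have a blast doing so",
        metadata={"id": 6, "year": 1995, "genre": "animated", "rating": 9.3},
    )
]
rows = to_rows(sample, columns=["year", "rating", "genre"])
print(sorted(rows[0]))
```

Metadata keys that are not listed in `columns` are dropped from the row, which is why the filterable fields need to be named when the vector store is constructed.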