➡️ Click to try the Live Wikipedia Search App! ⬅️
Overview
This guide demonstrates the steps required to ingest a large text corpus, like the Wikipedia 41M sample dataset, for efficient semantic search and RAG.
We will cover:
- Connecting to LanceDB
- Defining your data structure
- Adding data in batches
- Creating vector indexes for fast search
To use it yourself, you can visit this repo and follow the instructions.
Performance
When running this workflow on Modal with 50 GPUs:
- Ingestion: Complete 41M records in ~11 minutes
- Indexing: Vector index creation in ~30 minutes
By storing Wikipedia chunks and their embeddings in LanceDB, you create a powerful retriever for:
- Semantic search applications
- Retrieval-Augmented Generation (RAG) pipelines
- Knowledge discovery tools
Setup
Before starting, ensure you have:
- LanceDB Python package
pip install lancedb
- Other required libraries
pip install datasets sentence-transformers
- Dataset: Access to the Wikipedia dataset.
- Embedding Model: A Sentence Transformer model.
# Example using Sentence Transformers
from sentence_transformers import SentenceTransformer

# Ensure MODEL_NAME is defined (e.g., 'all-MiniLM-L6-v2')
model = SentenceTransformer(MODEL_NAME)
VECTOR_DIM = model.get_sentence_embedding_dimension()
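For reference, embedding a single chunk of text looks roughly like this (a minimal sketch; normalize_embeddings=True is an assumption that pairs well with the cosine metric used for the index later):
# Sketch: embed one chunk of text and check its dimensionality
embedding = model.encode("LanceDB is a vector database.", normalize_embeddings=True)
print(len(embedding))  # should equal VECTOR_DIM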
These are the implementation steps.
1. Connecting to LanceDB
Establish a connection to your LanceDB database URI. You’ll need to create a project on LanceDB Cloud: visit cloud.lancedb.com to create a new project and get the project URI and API key needed to initialize a connection.
import lancedb
# Connect to LanceDB Cloud using your project's URI, API key, and region
db = lancedb.connect(
uri=LANCEDB_URI,
api_key=LANCEDB_API_KEY,
region=LANCEDB_REGION
)
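To confirm the connection works, you can list the tables in the project (a minimal sanity check; the list may be empty for a new project):
# Quick sanity check: list existing tables in the project
print(db.table_names())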
2. Defining the Data Schema
LanceDB supports both PyArrow and Pydantic for defining schema. Here we’ll use Pydantic:
from lancedb.pydantic import LanceModel, Vector

class WikipediaSchema(LanceModel):
    vector: Vector(VECTOR_DIM)  # Stores the vector embedding
    identifier: int             # An id for the chunk
    chunk_index: int            # The chunk number within the article
    content: str                # The content of the chunk
    url: str                    # The link to the article
    title: str                  # The title of the article
- vector: Stores the embedding. Ensure list_size matches your VECTOR_DIM.
- Metadata: Include fields useful for filtering searches or providing context later.
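If you prefer PyArrow over Pydantic, an equivalent schema would look roughly like this (a sketch; the field names and types mirror the Pydantic model above):
import pyarrow as pa

# Equivalent PyArrow schema; the fixed-size list column holds the embedding
wikipedia_schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), VECTOR_DIM)),
    pa.field("identifier", pa.int64()),
    pa.field("chunk_index", pa.int64()),
    pa.field("content", pa.string()),
    pa.field("url", pa.string()),
    pa.field("title", pa.string()),
])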
3. Creating (or Opening) the Table
Use db.create_table
to initialize your table with the defined schema. It’s good practice to handle cases
where the table might already exist in case you’re running ingestion operation in distributed fashion.
def get_or_create_table(table_name: str = None):
    """Get an existing table or create a new one if it doesn't exist."""
    db = get_db_connection()  # helper wrapping lancedb.connect(...) as shown above
    table_name = table_name or DEFAULT_TABLE_NAME
    try:
        return db.open_table(table_name)
    except Exception:
        return db.create_table(table_name, schema=WikipediaSchema)
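Usage is then a one-liner (the table name here is purely illustrative):
# Example usage; "wikipedia_41m" is an illustrative table name
table = get_or_create_table("wikipedia_41m")
print(table.count_rows())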
4. Adding Data in Batches
Ingest data efficiently using table.add(). Prepare data matching the schema (including the computed embedding vector) and add it in reasonably sized batches. For this example we used batches of 200K rows.
# Assume 'all_processed_chunks' is a list of dictionaries like:
# { "content": "...", "identifier": ..., ..., "vector": [0.1, 0.2, ...] }
# In the repo, the add_batch_data helper does roughly this:
BATCH_SIZE_LANCEDB = 200_000

for i in range(0, len(all_processed_chunks), BATCH_SIZE_LANCEDB):
    batch_to_add = all_processed_chunks[i : i + BATCH_SIZE_LANCEDB]
    try:
        table.add(batch_to_add)
    except Exception as e:
        print(f"Error adding batch to LanceDB: {e}")
Calling table.add(list_of_dicts) is much faster than adding records individually. Adjust BATCH_SIZE_LANCEDB based on memory and performance.
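For completeness, here is a rough sketch of how each chunk dictionary might be assembled before ingestion (field names follow the schema above; the chunking itself is handled in the repo, and this helper is an illustration, not the repo's code):
# Sketch: turn one article's chunks into schema-shaped dicts with embeddings
def build_rows(article, chunks, start_id):
    vectors = model.encode(chunks, normalize_embeddings=True)
    return [
        {
            "vector": vec.tolist(),
            "identifier": start_id + i,
            "chunk_index": i,
            "content": text,
            "url": article["url"],
            "title": article["title"],
        }
        for i, (text, vec) in enumerate(zip(chunks, vectors))
    ]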
5. Creating a Vector Index
table.create_index(
    metric="cosine",              # Use "dot" or "cosine"
    vector_column_name="vector",  # Must match the schema field name
    index_type="IVF_PQ",          # IVF_PQ balances speed and accuracy well; LanceDB also supports other vector indexes
    # Optional tuning parameters below:
    # num_partitions=256,   # Adjust based on dataset size
    # num_sub_vectors=96,   # Adjust based on dataset size
)
# Create a full-text search index
table.create_fts_index(["content", "title"])
Once index creation is done, you’ll see index labels appear on the indexed columns in the LanceDB Cloud table view. Index creation runs asynchronously; to block until the index is ready, pass the wait_timeout parameter to create_index() or call wait_for_index() on the table. Indexing is crucial for query performance.
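With the indexes in place, a retrieval query looks roughly like this (a minimal sketch; the query text and result limit are illustrative):
# Sketch: semantic search over the ingested chunks
query = "history of the printing press"
query_vector = model.encode(query, normalize_embeddings=True)

results = (
    table.search(query_vector)
    .limit(5)
    .select(["title", "url", "content"])
    .to_list()
)
for r in results:
    print(r["title"], r["url"])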
The core pattern is: parallelize data loading, chunking, and embedding generation, then use table.add(batch) within each parallel worker to write to LanceDB. LanceDB’s design efficiently handles these concurrent additions. This example uses Modal to perform distributed embedding generation and ingestion, as sketched below.
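As a local stand-in for the Modal setup, the worker pattern can be sketched with concurrent.futures (an illustrative simplification, not the repo's actual Modal code; embed_chunks and shards are hypothetical names for the embedding helper and the pre-split chunk lists):
from concurrent.futures import ProcessPoolExecutor

def ingest_shard(shard):
    """Embed one shard of chunks and write it to LanceDB."""
    table = get_or_create_table()  # each worker opens its own table handle
    rows = embed_chunks(shard)     # hypothetical helper producing schema-shaped dicts
    table.add(rows)
    return len(rows)

# 'shards' is a hypothetical pre-split list of chunk lists
with ProcessPoolExecutor(max_workers=8) as pool:
    totals = list(pool.map(ingest_shard, shards))
print(f"Ingested {sum(totals)} rows")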
For brevity, other common steps like embedding generation and normalization are not covered here, but you can follow along on the GitHub repo for a complete implementation, including:
- Distributed embedding generation
- Data preprocessing
- Error handling
- Performance optimization