DuckDB

In Python, LanceDB tables can also be queried with DuckDB , an in-process SQL OLAP database. This means you can write complex SQL queries to analyze your data in LanceDB.

This integration is done via Apache Arrow , which provides zero-copy data sharing between LanceDB and DuckDB. DuckDB is capable of passing down column selections and basic filters to LanceDB, reducing the amount of data that needs to be scanned to perform your query. Finally, the integration allows streaming data from LanceDB tables, allowing you to aggregate tables that won’t fit into memory. All of this uses the same mechanism described in DuckDB’s blog post DuckDB quacks Arrow .

We can demonstrate this by first installing duckdb and lancedb.

code
pip install duckdb lancedb

We will re-use the dataset created previously :

python
import lancedb

db = lancedb.connect("data/sample-lancedb")
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
]
table = db.create_table("pd_table", data=data)

The to_lance method converts the LanceDB table to a LanceDataset, which is accessible to DuckDB through the Arrow compatibility layer. To query the resulting Lance dataset in DuckDB, all you need to do is reference the dataset by the same name in your SQL query.

python
import duckdb

arrow_table = table.to_lance()

duckdb.query("SELECT * FROM arrow_table")
code
┌─────────────┬─────────┬────────┐
   vector      item    price  
   float[]    varchar  double 
├─────────────┼─────────┼────────┤
 [3.1, 4.1]   foo        10.0 
 [5.9, 26.5]  bar        20.0 
└─────────────┴─────────┴────────┘

You can very easily run any other DuckDB SQL queries on your data.

code
duckdb.query("SELECT mean(price) FROM arrow_table")
code
┌─────────────┐
│ mean(price) │
│   double    │
├─────────────┤
│        15.0 │
└─────────────┘