SQL Console on HF Datasets

Published on Sep 15, 2024

#ai#data#sql#duckdb

This past week, I had the pleasure to ship one of my first big features. It’s the SQL Console on Hugging Face Datasets.

After today, you should see this nifty little button on almost every one of the datasets on Hugging Face.

Hugging Face SQL Console

  • Powered by DuckDB WASM / 100% local 🦆
  • Shareable sessions 🔗
  • Export results to Parquet 📄

What is it?

It’s a SQL console that lets you interact with your dataset. There’s no backend, server, or anything like that. It’s all run locally in your browser.

It’s powered by DuckDB WASM, which has really grown and opened up a lot of possibilities.

You may be thinking “I could have just used DuckDB locally”, and you’re correct. This is essentially just an add-on to that. The main benefit here is that it’s a lot easier to interact with the dataset. You don’t need to download the dataset, unzip it, and copy it to your local machine. You can just use the SQL console to interact with it.

Example

Hugging Face SQL Example

Example of the SQL Console on the Alpaca dataset with a query to convert the dataset to a conversation format.

In this example, without the SQL console, I would have to download the dataset, write a Python script to convert it to the conversation format, and then run the script. With the SQL console, I can just write a query to do the conversion.

It will get easier and easier to share 1 click SQL queries with others and even natural language at a later point.

How does it work?

It works so well since Hugging Face converts all the datasets to the Parquet format which is a columnar format. Since Parquet is columnar, it’s really easy to read and query.

DuckDB has amazing support for Parquet, and can use byte ranges to read only the data you need. I wrote a neat blog post on reading remote Parquet files here if you’re interested.

Datasets are the future of ML. As datasets grow in size and we move towards a world of small, specialized models, the ability to interact with the data will become more and more important.