Daft: The Distributed Python Dataframe

Daft is a fast and scalable Python dataframe for Complex Data and Machine Learning workloads.

Get Started

Get started by installing Daft with pip:

$ pip install getdaft
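
Once installed, a first dataframe takes only a few lines. The following is a minimal sketch, not an official quickstart: the column names `x` and `doubled` and the helper `daft_is_installed` are made up for illustration, and the sketch assumes the `getdaft` package is imported as `daft`.

```python
import importlib.util


def daft_is_installed() -> bool:
    """Hypothetical helper: True if the `daft` module can be imported."""
    return importlib.util.find_spec("daft") is not None


if daft_is_installed():
    import daft

    # Build a small in-memory dataframe from a Python dict of columns.
    df = daft.from_pydict({"x": [1, 2, 3]})

    # Add a derived column using a column expression.
    df = df.with_column("doubled", daft.col("x") * 2)

    # Materialize the result back into plain Python lists.
    result = df.to_pydict()
    print(result)
else:
    result = None
    print("Daft is not installed; run `pip install getdaft` first.")
```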

Community

Daft Blog

Daft is a fast, Pythonic, and scalable open-source dataframe library. Check out https://getdaft.io

By Sammy Sidhu

More Resources

10-minutes to Daft

A 10-minute walkthrough of all of Daft's major functionality.

Tutorials

Hosted examples of Daft in common use cases.

Docs

Developer documentation for referencing Daft APIs.

Integrations

Daft is open source, and you can use any Python library when processing data in a dataframe. It also integrates with many other open-source technologies, plugging directly into your current infrastructure and systems.

Data Science & Machine Learning

Cloud Platforms' Storage

Use Cases

# Data Science Experimentation
Daft enables data scientists and engineers to work from their preferred Python notebook environment for interactive experimentation on complex data.
# Complex Data Warehousing
The Daft Python dataframe efficiently pipelines complex data from raw data lakes to clean, queryable datasets for analysis and reporting.
# Machine Learning Training Dataset Curation
Modern machine learning is data-driven and relies on clean data. The Daft Python dataframe integrates with data-loading frameworks such as Ray and PyTorch to feed data to distributed model training.
# Machine Learning Model Evaluation
Evaluating the performance of machine learning systems is challenging, but Daft Python dataframes make it easy to run models and SQL-style analyses at scale.

Key Features

# User-Defined Functions
Daft supports running Python User-Defined Functions (UDFs) on columns of Python objects: if Python supports it, Daft can handle it!
# Interactive Computing
Daft embraces Python's dynamic and interactive nature, enabling fast, iterative experimentation on data in your notebook and on your laptop.
# Distributed Computing
Daft integrates with frameworks such as Ray to run petabyte-scale dataframes on a cluster of machines in the cloud.
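
As an illustration of the User-Defined Functions feature above, here is a hedged sketch of a UDF applied to a column of arbitrary Python objects. The `Box` class, the `area` function, and the column names are hypothetical and exist only for this example. The sketch assumes the `daft.udf` decorator with a `return_dtype` argument, and falls back to plain Python when Daft is not installed so that it stays runnable either way.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical Python object stored in a dataframe column."""
    width: int
    height: int


def area(box: Box) -> int:
    """Plain Python logic that the UDF applies to each object."""
    return box.width * box.height


boxes = [Box(2, 3), Box(4, 5)]

try:
    import daft
except ImportError:
    daft = None

if daft is not None:
    # Wrap the Python logic as a UDF; it receives a whole column at a time
    # and returns one output value per input row.
    @daft.udf(return_dtype=daft.DataType.int64())
    def area_udf(col):
        return [area(b) for b in col.to_pylist()]

    df = daft.from_pydict({"box": boxes})
    df = df.with_column("area", area_udf(daft.col("box")))
    areas = df.to_pydict()["area"]
else:
    # Fallback without Daft: same logic as an ordinary list comprehension.
    areas = [area(b) for b in boxes]

print(areas)
```

The same dataframe code is intended to run locally or on a cluster. Switching to a distributed runner such as Ray is a configuration step rather than a rewrite of the UDF; the exact call depends on the installed Daft version.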