Git is a version control tool used for tracking and managing changes to source code. It is designed primarily for software development, where the focus is on the code itself. For data science use cases, where both the code and the data matter, Git is not the most efficient solution because it is not optimized to handle large data files. Hence the need for data versioning tools, and lakeFS is one such solution.

lakeFS is a data version control tool that helps manage data as code using Git-like operations and helps achieve reproducible, high-quality data pipelines. lakeFS stores data in object stores and supports storage services such as AWS S3, Azure Blob Storage, and Google Cloud Storage, which helps users manage their data lake operations with high precision and repeatability. Moreover, lakeFS is format agnostic, i.e., it supports different kinds of data, such as structured data, unstructured data, and open table formats.
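
To make the idea of Git-like addressing more concrete, here is a minimal sketch of how a versioned object might be referenced. It assumes the `lakefs://repository/ref/path` addressing scheme used by the lakeFS command-line tooling; the helper `parse_lakefs_uri`, the repository name, and the file paths are hypothetical and are not part of any lakeFS library.

```python
from typing import NamedTuple


class LakeFSUri(NamedTuple):
    repository: str  # name of the repository
    ref: str         # branch name, tag, or commit ID
    path: str        # object key under that ref (may be empty)


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split a lakefs://repository/ref/path string into its parts.

    Hypothetical helper for illustration only.
    """
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"URI needs a repository and a ref: {uri!r}")
    path = parts[2] if len(parts) == 3 else ""
    return LakeFSUri(parts[0], parts[1], path)


# The same object path can be addressed on two different branches.
main_uri = parse_lakefs_uri("lakefs://example-repo/main/raw/events.parquet")
exp_uri = parse_lakefs_uri("lakefs://example-repo/experiment/raw/events.parquet")
```

Because the branch is part of the address, a pipeline can be pointed at a different version of the data by changing only the ref, whatever the file format.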

lakeFS has a distributed architecture consisting of several logical services. Its server is stateless, so more instances can easily be added to handle additional load. lakeFS keeps its metadata in a key-value store (with support for databases such as PostgreSQL and DynamoDB), which lets it manage data versions in a scalable manner.
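
The relationship between a stateless server and a shared metadata store can be sketched in a few lines. This is a toy model, not lakeFS code: `SharedKVStore`, `StatelessServer`, and the key layout are all hypothetical, and an in-memory dictionary stands in for a database such as PostgreSQL or DynamoDB.

```python
from typing import Dict, Optional


class SharedKVStore:
    """Stand-in for an external key-value database that holds metadata."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class StatelessServer:
    """Keeps no metadata of its own; every request goes to the shared store."""

    def __init__(self, name: str, store: SharedKVStore) -> None:
        self.name = name
        self._store = store

    def set_branch_head(self, repo: str, branch: str, commit_id: str) -> None:
        # Hypothetical key layout: "<repo>/branches/<branch>" -> commit ID.
        self._store.put(f"{repo}/branches/{branch}", commit_id)

    def get_branch_head(self, repo: str, branch: str) -> Optional[str]:
        return self._store.get(f"{repo}/branches/{branch}")


# Two server instances share one store, so either can serve any request.
store = SharedKVStore()
server_a = StatelessServer("a", store)
server_b = StatelessServer("b", store)

server_a.set_branch_head("example-repo", "main", "commit-123")
head_seen_by_b = server_b.get_branch_head("example-repo", "main")
```

Since neither instance holds state between requests, adding a third instance only requires giving it access to the same store.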


Advantages of using lakeFS

  • lakeFS allows users to roll back to previous commits in case of bad data.
  • lakeFS is useful in data science, data engineering, and data operations workflows.
  • lakeFS facilitates collaboration among developers.
  • It supports robust data preprocessing, including outlier handling and filling in missing values.
  • It makes implementing CI/CD pipelines for data easier by automating data checks and validations, which can be triggered by certain data operations.
  • Developers can also run different experiments in parallel and choose the best-performing model.
  • lakeFS branches allow the creation of test environments, which reportedly helps reduce testing time by as much as 80%.
  • lakeFS helps reduce storage costs because each developer can get an isolated data lake for their own use without copying the underlying data.
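
The rollback and branching points in the list above can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not the lakeFS implementation or API: `ToyRepo`, its methods, and the object addresses are hypothetical, and rollback is modeled by simply moving a branch pointer back to an earlier commit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    id: int
    parent: Optional[int]
    message: str
    snapshot: Dict[str, str]  # path -> address of the object in storage


class ToyRepo:
    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {0: Commit(0, None, "initial", {})}
        self.branches: Dict[str, int] = {"main": 0}        # branch -> commit ID
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # Only a pointer is copied; no data objects are duplicated.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch: str, path: str, address: str) -> None:
        self.staged[branch][path] = address

    def commit(self, branch: str, message: str) -> int:
        parent = self.commits[self.branches[branch]]
        snapshot = {**parent.snapshot, **self.staged[branch]}
        new = Commit(len(self.commits), parent.id, message, snapshot)
        self.commits[new.id] = new
        self.branches[branch] = new.id
        self.staged[branch] = {}
        return new.id

    def rollback(self, branch: str, commit_id: int) -> None:
        # Simplified rollback: point the branch at an earlier commit.
        self.branches[branch] = commit_id

    def read(self, branch: str, path: str) -> Optional[str]:
        return self.commits[self.branches[branch]].snapshot.get(path)


repo = ToyRepo()
repo.put("main", "sales.csv", "obj-001")
good = repo.commit("main", "load sales data")

# An isolated experiment branch does not affect main.
repo.create_branch("experiment", "main")
repo.put("experiment", "sales.csv", "obj-002")
repo.commit("experiment", "clean outliers")

# A bad load lands on main and is then rolled back.
repo.put("main", "sales.csv", "obj-bad")
repo.commit("main", "bad load")
repo.rollback("main", good)
```

In this model a branch costs one dictionary entry rather than a copy of the data, which is the intuition behind isolated test environments that do not multiply storage.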

Limitations of lakeFS

  • lakeFS has some problems with deleting data: users report difficulty removing commits, and there is no data deduplication, so identical files are stored under different scrambled names.
  • There is also some complexity involved with lakeFS, as some of its features require technical expertise.
  • Some users are also unclear about the value of pre-commit and pre-merge hooks for their data pipelines.
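
To make the value of a pre-merge hook more concrete, here is a minimal sketch of the idea: a set of data checks runs before new rows are merged, and the merge is blocked if any check fails. It is plain Python and does not use the lakeFS hooks configuration; the function names, the `price` column, and the thresholds are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], List[str]]  # returns a list of error messages


def no_missing_values(column: str) -> Check:
    def check(rows: List[Row]) -> List[str]:
        return [
            f"{column}: missing value in row {i}"
            for i, row in enumerate(rows)
            if row.get(column) is None
        ]
    return check


def within_range(column: str, low: float, high: float) -> Check:
    def check(rows: List[Row]) -> List[str]:
        errors = []
        for i, row in enumerate(rows):
            value = row.get(column)
            # Missing values are left to the no_missing_values check.
            if isinstance(value, (int, float)) and not (low <= value <= high):
                errors.append(f"{column}: value {value} in row {i} out of range")
        return errors
    return check


def merge_if_valid(target: List[Row], incoming: List[Row],
                   checks: List[Check]) -> List[str]:
    """Append incoming rows to target only if every check passes."""
    errors: List[str] = []
    for check in checks:
        errors.extend(check(incoming))
    if not errors:
        target.extend(incoming)
    return errors


main_rows: List[Row] = [{"price": 10.0}]
checks = [no_missing_values("price"), within_range("price", 0, 1000)]

blocked = merge_if_valid(main_rows, [{"price": None}, {"price": -5.0}], checks)
accepted = merge_if_valid(main_rows, [{"price": 20.0}], checks)
```

The first merge is rejected and leaves the target untouched, while the second goes through, which is the kind of guard that keeps bad data away from the branch consumers read from.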

In conclusion, lakeFS is a data version control tool that makes it easy to manage changes to data and helps achieve reproducible, high-quality data pipelines. It is built around Git-like operations and is useful in data science, data engineering, and data operations workflows. Some users have, however, reported limitations, such as difficulty removing commits and the need for a technical understanding of some features. Nevertheless, the tool is under active development and is expected to improve over time.
