Databricks Open Sources All of Delta Lake

Databricks has now made all of Delta Lake open source, including all APIs. The product’s storage layer was made open source in 2019. Delta Lake can be used to create data lakehouses, which enable data warehousing and machine learning directly on the data lake.


Delta Lake manages the stage where data is fed into an organization’s data lake. It stores data in Apache Parquet format and is designed for use in HDFS-based data lakes and cloud storage.

Databricks was founded by the original developers of Apache Spark and specializes in commercial technologies that use Spark. Delta Lake is a unified analytics engine and associated table format built on top of Apache Spark, and until it was made open source it was available only as part of Databricks Delta, the company's proprietary stack.

Since the storage layer was made open source, the project has attracted more than 190 contributors from more than 70 organizations. Nearly two-thirds of those contributors are from outside Databricks, including developers at companies such as Apple, IBM, Microsoft, Disney, Amazon and eBay.

Delta Lake comes with standalone readers/writers that allow any Python, Ruby, or Rust client to write data directly to Delta Lake without requiring a big data engine such as Apache Spark, as well as open source connectors for engines including Apache Flink, Presto, and Trino. The open source announcement unlocks features that until now were available only in Databricks.

Delta Lake 2.0, the latest version of Delta Lake, features enhancements including support for Z-Order, Change Data Feed, Dynamic Partition Overwrites, and Dropped Columns. Z-Ordering is a technique for grouping related information into the same set of files. Delta Lake uses this co-locality in its data-skipping algorithms, and the developers say it greatly reduces the amount of data that Delta Lake on Apache Spark has to read.
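
To make the idea concrete, here is a minimal pure-Python sketch of why clustering rows along a Z-order curve helps an engine skip files using per-file minimum and maximum statistics. It is a toy model, not Delta Lake's implementation: the function names (interleave_bits, write_files, files_to_read), the two integer columns and the four-rows-per-file layout are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (x, y): two hypothetical integer columns


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


def write_files(rows: List[Row], rows_per_file: int, z_order: bool) -> List[List[Row]]:
    """Split rows into fixed-size 'files', sorted by Z-value or by (x, y)."""
    if z_order:
        ordered = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
    else:
        ordered = sorted(rows)  # plain sort: x first, then y
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]


def files_to_read(files: List[List[Row]], x_lo: int, x_hi: int, y_lo: int, y_hi: int) -> int:
    """Count files whose min/max ranges overlap the query box (the rest are skipped)."""
    count = 0
    for f in files:
        xs = [r[0] for r in f]
        ys = [r[1] for r in f]
        overlaps_x = min(xs) <= x_hi and max(xs) >= x_lo
        overlaps_y = min(ys) <= y_hi and max(ys) >= y_lo
        if overlaps_x and overlaps_y:
            count += 1
    return count


# Toy table: every (x, y) pair on an 8 x 8 grid, stored four rows per file.
rows = [(x, y) for x in range(8) for y in range(8)]
plain = write_files(rows, 4, z_order=False)
zfiles = write_files(rows, 4, z_order=True)

# Filter on y only (0 <= y <= 1), any x.
print("plain sort, filter on y:", files_to_read(plain, 0, 7, 0, 1))
print("z-order,    filter on y:", files_to_read(zfiles, 0, 7, 0, 1))

# Filter on x only (0 <= x <= 1), any y.
print("plain sort, filter on x:", files_to_read(plain, 0, 1, 0, 7))
print("z-order,    filter on x:", files_to_read(zfiles, 0, 1, 0, 7))
```

In this toy grid the plain sort has to open 8 of the 16 files for the filter on y, while the Z-ordered layout opens 4. For the filter on x both layouts open 4. The point of the sketch is that sorting on one column only helps filters on that column, whereas interleaving the bits of several columns keeps the min/max ranges narrow for all of them.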

Delta Lake 2.0 is available now.

More information

Databricks website

Delta Website

Related Articles

Databricks Delta Lake now open source

Databricks Delta adds faster parquet import

Databricks runtime for machine learning

Databricks adds ML model export

Spark gets the NLP library

Apache Spark with structured streaming

Spark BI gets fine-grained security

Spark 2.0 released


Steven L. Nielsen