Cloudera Adopts Apache Iceberg Tables to Show OS Commitment • The Register

Cloud data lake provider Cloudera has announced the general availability of Apache Iceberg in its data platform.

Developed by the Apache Software Foundation, Iceberg offers an open table format, designed for high performance on Big Data workloads while supporting query engines such as Spark, Trino, Flink, Presto, Hive, and Impala.

Iceberg started as a Netflix project before being donated to the Apache Software Foundation in 2018.

In a blog post, Cloudera, the data platform provider with roots in Hadoop-based systems, said its goal was to enable multi-functional analytics across data lakes: repositories that support both structured and unstructured data. The introduction of the lakehouse concept encourages users to run analytics and BI on data lake systems.

“However, it still remains driven by table formats tied to core engines and often single vendors. Enterprises, on the other hand, have continued to demand highly scalable and flexible analytical engines and services across the data lake, without vendor lock-in,” Cloudera said.

Iceberg’s deployment in Cloudera Data Platform (CDP) includes Cloudera Data Warehousing, Cloudera Data Engineering, and Cloudera Machine Learning. “These tools allow analysts and data scientists to easily collaborate on the same data, with their choice of analytical tools and engines,” Cloudera said.

Benefits are set to include support for schema and partition changes in a single command, time travel with point-in-time queries for forensic visibility and regulatory compliance, and concurrent multi-function scans to meet end-to-end data lifecycle needs. Performance should also improve with aggressive sharding to handle very large-scale datasets, Cloudera said.
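For readers unfamiliar with the idea, a point-in-time query simply reads the table as it stood at an earlier commit. The toy Python sketch below illustrates that concept only. The `SnapshotTable` class and its `commit` and `as_of` methods are hypothetical names invented for this illustration; they are not part of Iceberg or of any Cloudera product, and a real table format tracks snapshots through metadata files rather than in-memory copies.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class SnapshotTable:
    """Toy table that keeps every committed version (hypothetical, NOT the Iceberg API)."""

    # Parallel lists: commit timestamps in ascending order and the rows visible at each.
    _timestamps: list = field(default_factory=list)
    _snapshots: list = field(default_factory=list)

    def commit(self, timestamp: int, rows: list) -> None:
        """Record a new immutable snapshot; timestamps must strictly increase."""
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("commits must have increasing timestamps")
        self._timestamps.append(timestamp)
        self._snapshots.append(tuple(rows))

    def as_of(self, timestamp: int) -> tuple:
        """Return the rows as they were at `timestamp` (a point-in-time read)."""
        # Find the latest commit whose timestamp is <= the requested time.
        idx = bisect_right(self._timestamps, timestamp)
        if idx == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._snapshots[idx - 1]


table = SnapshotTable()
table.commit(100, ["order-1"])
table.commit(200, ["order-1", "order-2"])

# Reads the state as of t=150, which is the snapshot committed at t=100.
print(table.as_of(150))
```

Because earlier snapshots are never modified, an auditor can reproduce exactly what a report saw at a given moment, which is the forensic and compliance use the announcement refers to.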

Arm wrestling open source technophiles

However, Cloudera isn’t the only data lake or lakehouse provider to embark on an open source path.

Databricks, the company that began as the vendor behind Apache Spark, has also donated its storage format layer to the open source community. The latest iteration, Delta Lake 2.0, was announced last week at the Data and AI Summit.

“Delta Lake 2.0 will bring unparalleled query performance to all Delta Lake users and enable everyone to build a high-performing data lake on open standards. Through this contribution, Databricks customers and the open source community will benefit from all the features and improved performance of Delta Lake 2.0,” said Databricks.

Speaking to The Register, Joel Minnick, vice president of marketing for Databricks, said, “After the opening of Delta Lake, we continued to build many performance improvements and features within the Databricks platform. We’ve always been an open source company at heart and if we made these improvements, we really wanted to be able to give them back to the community.”

Minnick said the improvements were in “data processing, data warehousing.”

Delta Lake 2.0 was donated to the Linux Foundation this week. ®


Steven L. Nielsen