
One of the big advances in data engineering over the past seven to eight years has been the advent of table formats. Typically layered on top of columnar Parquet files, table formats such as Apache Iceberg, Delta, and Apache Hudi offer important benefits for big data operations, including the introduction of transactions. However, table formats also introduce additional costs that customers should be aware of.
Each of the three major table formats was developed by a different group, making their origin stories unique. However, they were all developed primarily in response to similar technical limitations in the big data landscape that affect all types of business operations.
For example, Apache Hudi was originally created in 2016 by Uber’s data engineering team, which was a large-scale user (and large-scale developer) of big data technologies. Hudi, short for Hadoop Upserts, Deletes, Incrementals, was born out of a desire to improve file processing for Uber’s large-scale Hadoop data lake.
Apache Iceberg, on the other hand, emerged in 2017 from Netflix, another big user of big data technology. The company’s engineers were frustrated by the limitations of the Apache Hive metastore, which could lead to corruption and incorrect answers when the same files were accessed by different query engines.

Image source: Apache Software Foundation
Similarly, the engineers at Databricks developed Delta in 2017, when too many data lakes were turning into data swamps. A key component of Databricks’ Delta Lake, the Delta table format allows users to get data warehousing-like quality and precision for data stored in their S3 or HDFS data lakes, or lakehouses.
As a data engineering automation provider, Nexla supports all three table formats, which it turned to as its clients’ big data repositories grew and the company realized it needed better data management for analytics use cases.
A big advantage of all table formats is the ability to see how records have changed over time, a feature that has been common in transactional use cases for decades but is fairly new for analytical use cases, says Avinash Shahdadpuri, the CTO and co-founder of Nexla.
“The Parquet format didn’t really have any history,” he tells Datanami in an interview. “If you had a record and you wanted to see how this record changed over a period of time across two versions of a Parquet file, it was very difficult to do that.”
Adding a new metadata layer within the table format is what gives users ACID transaction capabilities on data stored in Parquet files, which have become the dominant format for storing columnar data in S3 and HDFS data lakes (other big data formats include ORC and Avro).
“That’s where ACID comes in a little bit. It gives you a history of how this record has changed over time, so you can roll back with more certainty,” says Shahdadpuri. “You can now essentially version control your data.”
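To make that concrete, here is a minimal sketch of table history, time travel, and rollback using Delta Lake’s Python API as one example; the S3 path and version numbers are hypothetical, and Iceberg and Hudi expose similar history and rollback hooks through their own APIs.

```python
# Minimal sketch of table-format version history and rollback,
# shown with Delta Lake's Python API (path and versions hypothetical).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("time-travel-demo")
    # Delta requires these two settings on a stock Spark session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table = DeltaTable.forPath(spark, "s3://my-bucket/events")  # hypothetical path

# The metadata layer records every commit: what operation ran, and when.
table.history().select("version", "timestamp", "operation").show()

# "Time travel": read the table exactly as it existed at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("s3://my-bucket/events")

# Roll the table back to that version with more certainty.
table.restoreToVersion(0)
```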

Image source: Snowflake
This ability to roll data back to a previous version is useful in certain situations, such as data sets that are continually updated. It is less relevant when new data is simply appended to the end of the file.
“When data is not just appended, which is the case in 95% of these traditional Parquet file use cases, when data can be deleted, merged, and updated, this method tends to be much more efficient than what could be done using classic Parquet files,” says Shahdadpuri.
Table formats allow users to perform more data operations directly on the data lake, much as they would in a database. That saves customers the time and money of taking data out of the lake, manipulating it, and putting it back in, Shahdadpuri says.
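As a sketch of what those database-like operations on the lake look like, here is a hedged upsert (MERGE) committed directly against lake storage, again using Delta Lake’s Python API; the paths, the incoming DataFrame, and the customer_id join key are hypothetical, and Iceberg and Hudi offer comparable MERGE support through their engines.

```python
# Hypothetical upsert performed directly on data lake files,
# sketched with Delta Lake's Python API.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3://my-bucket/customers")  # hypothetical
updates = spark.read.parquet("s3://my-bucket/incoming/")        # hypothetical

(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert brand-new rows
    .execute()
)
# No export/transform/reload cycle: the change lands in place,
# committed atomically as a new table version.
```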
Of course, users could keep their data in a database, but traditional databases cannot scale to petabytes, while distributed file systems like HDFS and object stores like S3 do so easily. And the addition of a table format means users no longer have to compromise on transactionality and accuracy to get that scale.
Table formats are not without their disadvantages, however. There are always tradeoffs in computer architecture, and table formats come with their own costs. According to Shahdadpuri, those costs come in the form of increased storage and complexity.

Image source: Databricks
In terms of storage, the extra metadata that table formats keep can add as little as 10% of overhead, but can incur up to a 2x penalty for continuously changing data, says Shahdadpuri.
“It can increase your storage costs quite a bit, because before you were just storing Parquet. Now you’re storing versions of Parquet,” he says. “Now you’re storing metadata files on top of what you already had in Parquet, which increases costs, and that ultimately becomes a trade-off.”
Customers should ask themselves whether they really need the additional functionality that table formats bring. If you don’t need the transactional or time-travel capabilities that ACID brings, for example because most of your data is append-only, then sticking with plain Parquet may be the better option, he says.
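For comparison, an append-only pipeline that skips the table-format layer entirely needs nothing more than a plain Parquet write; the new_batch DataFrame and path below are hypothetical.

```python
# Hypothetical append-only workload: plain Parquet, no table format.
# Each batch simply lands as new files; there are no versioned metadata
# files to store, but also no ACID history to roll back to.
new_batch.write.mode("append").parquet("s3://my-bucket/logs/")
```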
“Having this additional layer definitely adds complexity, and it adds it in a number of ways,” Shahdadpuri says. “So Delta can be a little more performance-intensive than Parquet. All of these formats cost a little more in performance. You’re paying for that somewhere, right?”
There is no single best table format, he says. Instead, the most suitable format emerges after analyzing each client’s specific needs. “It’s up to the customer. It depends on the use case,” says Shahdadpuri. “We want to be independent. As a solution, we support each of these things.”
That said, Nexla officials have observed certain trends in table format adoption. A big factor is how customers align around the big data giants Databricks and Snowflake.
As the creator of Delta, Databricks is firmly in that camp, while Snowflake has backed Iceberg. Hudi is not backed by a major big data player, but by Onehouse, a startup founded by Hudi creator Vinoth Chandar. Iceberg is also backed by Tabular, a startup co-founded by Ryan Blue, who helped create Iceberg at Netflix.
Larger companies will likely end up with a mix of different table formats, Shahdadpuri says. That leaves room for companies like Nexla to step in and provide tools that automate the integration of these formats, or for consulting firms to manually piece them together.
Related Items:
Big Data File Formats Demystified
Open Table Formats Square Off in Lakehouse Data Smackdown
Data Lakehouses Are on the Horizon, But It’s Not Smooth Sailing Yet
Acid, ACID transactions, Apache Hudi, Apache Iceberg, big data, data management, delta, delta lake, delta table format, Hadoop, rollback, s3, table format