What is Confluent TableFlow?

Author: Sam Ward
Release Date: 13/09/2024

Understanding the Fundamentals: Apache Iceberg and Data Structures

To understand what TableFlow does, and how Confluent makes use of it, we first need to dive into what Apache Iceberg is, what Data Lakehouses are, and how we can take advantage of them. To define a Data Lakehouse, we must first delve into what Data Lakes and Data Warehouses are.

Data Lakes and Data Warehouses: A Holy Matrimony

A data warehouse is a collection of structured, processed data, stored in the infrastructure of a relational database. One of the main benefits of employing this data management system is that it allows organisations to perform powerful analytics on incredibly large sets of data, as well as reporting and data mining. ETL (Extract, Transform, Load) tools are used to extract data from a source system, transform it by cleaning or enriching it with other data, and then load it into your data warehouse. Essentially, ETL is the process of taking data from many different sources and linking it together in such a way that it is in a suitable format to be loaded onto a data warehouse.
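
To make the ETL flow concrete, here is a minimal sketch in Python, assuming a hypothetical CSV export as the source system and a local SQLite database standing in for the warehouse:

```python
# Minimal ETL sketch: extract from a hypothetical CSV export, transform the
# rows into a clean shape, and load them into a SQLite "warehouse".
import csv
import sqlite3

# Extract: pull raw rows out of the source system
with open("orders_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean and normalise the data into the warehouse's expected shape
transformed = [
    (int(r["order_id"]), r["customer_email"].strip().lower(), float(r["amount"]))
    for r in raw_rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write the cleaned rows into the warehouse's relational table
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
warehouse.commit()
```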

A data lake is a collection of data stored in its original format, without structure, in one huge centralised repository, at any scale. One of the key differences between data lakes and warehouses is that data lakes don’t require ETL tools to clean/structure data. While this makes data lakes a lot simpler, it also means that the data quality is significantly reduced (unless you perform data cleaning/filtering prior to uploading the data to the data lake).

IBM defines the typical data warehouse as having “four main components: a central database, ETL tools, metadata tools and access tools” that are “engineered for speed so that you can get results quickly and analyse data on the fly.” While true, there are obviously some limits, such as the time it takes to analyse very large datasets. The main way data is organised in a data warehouse is through schemas, and there are two main types: the Star Schema and the Snowflake Schema.

But how do these mix together? Well, a Data Lakehouse combines the best capabilities of a data lake and warehouse, allowing you to store your data in either a structured or an unstructured format. This means that data can be stored for analytical purposes in an organised format, while also enabling the use of the raw data in machine learning models. If a company needs both to train systems using data and to create reports and gain insights from it, the lakehouse is the ideal technology. With a Data Lakehouse, you don’t need to maintain both a data lake and a data warehouse, reducing costs and providing data scientists/analysts with an easy-to-navigate system, where all the data they need is in one single, unified system.

Data Warehouse Schemas: Star/Snowflake Additional Context

Star Schemas store data in a “star” format, where there is a central table, often referred to as the “fact table”, and any number of directly connected tables, which we call “dimension tables” - the fact table stores metrics, while the dimension tables store descriptive attributes.
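
As a rough illustration, the sketch below builds a tiny star schema in Python with SQLite, using hypothetical retail tables (a fact_sales fact table surrounded by customer, product and date dimensions):

```python
# A hypothetical retail star schema: one central fact table of metrics,
# joined directly to each descriptive dimension table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

    -- Fact table: metrics, keyed directly to every dimension (the "star")
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        quantity    INTEGER,
        revenue     REAL
    );
""")

# A typical analytical query joins the fact table straight to a dimension.
conn.execute("""
    SELECT c.region, SUM(f.revenue)
    FROM fact_sales f JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY c.region
""")
```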

Snowflake Schemas store data in a more complex, branching system, often visualised as a tree of tables stemming from a central table - these tables are also called the “fact table” and the “dimension tables”.

Due to the simplicity of the star schema structure, it is much easier to implement and configure, and searching/querying the schema can often be very efficient due to the small number of joins between tables. The key issue with the star schema, however, is that its data is denormalised: because the dimension tables are smaller and less specific, a lot of redundant data is stored alongside unique data. While denormalised data leads to faster querying, updating tables and troubleshooting is often a lot harder than when using normalised data.
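
To make the trade-off concrete, the short sketch below (same hypothetical retail model) contrasts a denormalised, star-style product dimension with its normalised, snowflaked equivalent:

```python
# Star vs snowflake: the same product dimension, denormalised and normalised.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star (denormalised): category details are repeated on every product row
    CREATE TABLE dim_product_star (
        product_id INTEGER PRIMARY KEY, name TEXT,
        category_name TEXT, category_manager TEXT
    );

    -- Snowflake (normalised): category details are stored once and referenced
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY, category_name TEXT, category_manager TEXT
    );
    CREATE TABLE dim_product_snow (
        product_id INTEGER PRIMARY KEY, name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
""")
```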

Apache Iceberg: Building on Data Lakes

In order to define what Apache Iceberg is, the best place to look is Apache’s official documentation pages - “Iceberg is a high-performance format for huge analytic tables” which “brings the reliability and simplicity of SQL tables to big data” - it is designed to simplify data processing of datasets stored in data lakes, through efficient and reliable storage and integrations with stream/data processing frameworks, of which Apache Flink is the most relevant. Open-sourced in 2018, it was originally developed by Netflix because the table formats they were using struggled with efficiency, reliability and scalability once their tables reached petabyte scale.

Typically, an organisation making use of a data lake uses a Metadata Catalogue, which is used to define the tables within the data lake - this is key to ensuring that all users across an organisation have a common, predefined view of all data stored in the data lake. The official Apache Iceberg website states “the first step when using an Iceberg client is almost always initialising and configuring a catalogue” which then allows the user to perform tasks like “creating, dropping, and renaming tables”.
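
As a rough sketch of what that first step looks like, the example below assumes the PyIceberg client library and a hypothetical REST catalogue endpoint; the exact properties depend on the catalogue being used:

```python
# Sketch only: initialise a catalogue, then create, rename and drop a table.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType, TimestampType

# Initialise and configure the catalogue (hypothetical endpoint and warehouse)
catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "https://iceberg-catalog.example.com",
        "warehouse": "s3://example-bucket/warehouse",
    },
)

# Define a table schema and create the table in a namespace
orders_schema = Schema(
    NestedField(field_id=1, name="order_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="customer_id", field_type=LongType(), required=False),
    NestedField(field_id=3, name="status", field_type=StringType(), required=False),
    NestedField(field_id=4, name="created_at", field_type=TimestampType(), required=False),
)
catalog.create_table("sales.orders", schema=orders_schema)

# Renaming and dropping tables also go through the catalogue
catalog.rename_table("sales.orders", "sales.orders_v2")
catalog.drop_table("sales.orders_v2")
```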

The best way to visualise the way Apache Iceberg works is by thinking about how typical tables/data warehouses function: we have a branching list of different folders and topics, and within those folders/topics (named according to what we expect to find in the folder) we’ll find a certain type of data. Iceberg gets past the potential issues with this form of data tracking by keeping a complete list of all files in a given table, a level deeper than folders. This addresses an array of problems seen in pre-existing huge table infrastructure by allowing the data engineer to: query the table much faster, because expensive list operations are no longer needed; reduce the chance of data “appear[ing] missing when file list operations are performed on an eventually consistent object store” (as Christine Mathiesen explains in her blog); ensure a consistent view of the data by using snapshots and atomic writes; and perform a variety of other actions through Iceberg’s Java/Python API.
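
As a hedged sketch of what that file-level tracking looks like from a client, again assuming PyIceberg and the hypothetical catalogue and table from the previous example:

```python
# Sketch only: inspect the snapshots and file list Iceberg maintains for a table.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")              # assumes catalogue properties are already configured
table = catalog.load_table("sales.orders")  # hypothetical table from the earlier sketch

# Each snapshot is an atomic, point-in-time view of the table's complete file list
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Planning a scan uses Iceberg's metadata rather than object-store list operations
for task in table.scan().plan_files():
    print(task.file.file_path, task.file.record_count)
```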

While this was a relatively quick and simple explanation of what Iceberg is and how it functions, for a more detailed view, Christine Mathiesen (as mentioned above) has an excellent blog that dives a little further into what Apache Iceberg can offer its adopters, covering topics such as Iceberg’s in-built schema evolution tools and how it also manages partitions, etc.

Confluent TableFlow: How Does This Link Together?

With all of the background knowledge and additional context covered, we now need to understand how this links back to Confluent’s offering of their new technology, TableFlow. Confluent TableFlow is currently in private early access, but their vision is to make it a “push-button” simple feature that allows all Confluent customers to “take Apache Kafka data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg tables” - so let’s dive right into what this means, and cover a little more background info required to see the whole vision.

Data in organisations is typically stored in two different estates, which Confluent defines as:

Operational Estate: where SaaS (Software-as-a-Service) apps, ERPs (Enterprise Resource Planning systems), and custom-built apps “serve the need of applications to transact with customers in real-time”
Analytical Estate: where data is stored in large quantities ready for business analysis and reporting; examples include data lakes, data warehouses, data lakehouses and AI/ML platforms

The Operational and Analytical Estates typically need some of the same data to be accessible in both, but usually not in the same way. This can lead to a lot of unnecessary complexity when linking different databases/structures together. In simpler terms than in the source article, the following steps are necessary to get a single stream into a raw, usable state in a data lakehouse (a rough sketch of this hand-rolled pipeline follows the list):

1) Configure the infrastructure to consume data from Apache Kafka, by setting up Consumers/Connectors and sizing them accurately for the intended topic that needs to be streamed to Iceberg
2) Feed the data through a system that ensures it is in a single format and is consistent across all formatted files (using the Schema Registry)
3) Compact and clean up any additional/smaller files to maintain “acceptable read performance”
4) Edit and apply changes based on the type of data being sent
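
As referenced above, the sketch below (assuming the confluent-kafka and pyarrow libraries, plus hypothetical broker and topic names) shows roughly what this hand-rolled pipeline looks like before any compaction or maintenance is even considered:

```python
# Sketch only: consume a Kafka topic, normalise the records into one format,
# and write a Parquet file destined for the data lake.
import json

import pyarrow as pa
import pyarrow.parquet as pq
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker.example.com:9092",  # hypothetical cluster
    "group.id": "lakehouse-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])  # hypothetical topic

records = []
while len(records) < 1_000:                  # step 1: consume a batch of events
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    records.append(json.loads(msg.value()))  # step 2: coerce into one common format

# Steps 3 and 4 still remain: this naive loop produces many small files that
# need compacting, and schema changes have to be handled by hand.
batch = pa.Table.from_pylist(records)
pq.write_table(batch, "orders-part-0001.parquet")  # would land in object storage

consumer.close()
```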

Confluent believe they can do a lot better, and this is where TableFlow comes in.

TableFlow: Unification of the Operational and Analytical Estates

TableFlow is a new way to continuously optimise read performance with file compaction, maintain efficient data storage and retrieval through file sizing, and manage your data flow in Confluent. It allows users to “easily materialise” their Kafka topics, and associated schemas, into Apache Iceberg tables in a simple, efficient manner.

TableFlow utilises Confluent’s Kora engine and its storage layer to convert Kafka segments into other formats, as well as using “Confluent’s Schema Registry to generate Apache Iceberg metadata while handling schema mapping, schema evolution, and type conversions”.

This means that your data can flow from Confluent, where it can be enriched and filtered, directly into a lakehouse, ready for analysis and reporting. One of the minor issues some users have with Confluent is visualising the data flowing through a topic; by converting topics into Apache Iceberg tables, TableFlow introduces a new way for users to see and manage their data in a more accessible, user-friendly way.
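
As a final hedged sketch, assuming PyIceberg and a catalogue that exposes the materialised table under a hypothetical name, an analyst could then pull the topic’s data straight into a dataframe for reporting:

```python
# Sketch only: read a materialised Iceberg table for analysis.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse")          # hypothetical catalogue name
orders = catalog.load_table("kafka.orders")  # hypothetical materialised topic

# Filter and project using Iceberg metadata, then load the result into pandas
df = orders.scan(
    row_filter="status = 'COMPLETED'",
    selected_fields=("order_id", "status"),
).to_pandas()
print(df.head())
```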

More Resources like this one:

The Somerford Podcast: Complete Confluent Mini-Series

Confluent Short Video | The Rise of Data in Motion

Interested in TableFlow?

Get in touch and we'll be happy to provide more detail!