<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adi Polak</title>
    <description>The latest articles on Forem by Adi Polak (@adipolak).</description>
    <link>https://forem.com/adipolak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F121127%2F0af4a894-56eb-4e81-890a-12e5ba022027.jpg</url>
      <title>Forem: Adi Polak</title>
      <link>https://forem.com/adipolak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adipolak"/>
    <language>en</language>
    <item>
      <title>Delta Lake essential Fundamentals: Part 2 - The DeltaLog</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Mon, 08 Mar 2021 07:54:50 +0000</pubDate>
      <link>https://forem.com/adipolak/delta-lake-essential-fundamentals-part-2-the-deltalog-5b7h</link>
      <guid>https://forem.com/adipolak/delta-lake-essential-fundamentals-part-2-the-deltalog-5b7h</guid>
      <description>&lt;p&gt;&lt;em&gt;this blog post was originally posted on &lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals-the-deltalog/"&gt;blog.adipolak.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the previous part, you learned what &lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals"&gt;ACID transactions&lt;/a&gt; are.&lt;br&gt;&lt;br&gt;
In this part, you will understand how the Delta transaction log, named the DeltaLog, achieves ACID.&lt;/p&gt;
&lt;h2&gt;
  
  
  Transaction Log
&lt;/h2&gt;

&lt;p&gt;A transaction log is a history of actions executed by a (TaDa 💡) database management system, with the goal of guaranteeing &lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals/"&gt;ACID properties&lt;/a&gt; across crashes.&lt;/p&gt;
&lt;h2&gt;
  
  
  DeltaLake transaction log - DeltaLog
&lt;/h2&gt;

&lt;p&gt;DeltaLog is a transaction log directory that holds an &lt;strong&gt;ordered&lt;/strong&gt; record of every transaction committed on a Delta Lake table since it was created.&lt;br&gt;
The goal of the DeltaLog is to be the &lt;strong&gt;single&lt;/strong&gt; source of truth for readers who read from the same table at the same time; that means parallel readers read the &lt;strong&gt;exact&lt;/strong&gt; same data.&lt;br&gt;
This is achieved by tracking all the changes that users make (add, delete, update, etc.) in the DeltaLog.&lt;/p&gt;

&lt;p&gt;The DeltaLog can also contain statistics on the data; depending on the type of the data/field/column, each column can have min/max values. Having this extra metadata can help with faster querying. The DeltaTable read mechanism uses a simplified form of &lt;a href="https://medium.com/microsoftazure/data-at-scale-learn-how-predicate-pushdown-will-save-you-money-7063b80878d7"&gt;predicate pushdown&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is a simplified view of the DeltaLog on the file system, from the Databricks site: &lt;br&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mCziKBNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/deltalake-deltalog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mCziKBNI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/deltalake-deltalog.png" alt="drawing" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The DeltaLog itself is a folder that consists of multiple JSON files. When it reaches 10 files, DeltaTable performs checkpoint and compaction operations (we will dive into these in the next chapter).&lt;/p&gt;

&lt;p&gt;Here is an example of a DeltaLog JSON file from the source code's test resources; each entry in the file is one JSON object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"remove"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"part-00001-f1cb1cf9-7a73-439c-b0ea-dcba5c2280a6-c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"remove"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"part-00000-f4aeebd0-a689-4e1b-bc7a-bbb0ec59dce5-c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A total of two commits were captured in this file:&lt;br&gt;
&lt;em&gt;remove&lt;/em&gt; - it can be a delete operation on a whole column or only on specific values in it. In this operation, the metadata field &lt;em&gt;dataChange&lt;/em&gt; is set to true.&lt;/p&gt;

&lt;p&gt;Here is a more complex JSON file example; each entry in the file is one JSON object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"metaData"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2edf2c02-bb63-44e9-a84c-517fad0db296"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:{}},&lt;/span&gt;&lt;span class="nl"&gt;"schemaString"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;struct&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;fields&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:[{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;integer&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;nullable&lt;/span&gt;&lt;span 
class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:true,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;metadata&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{}},{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;nullable&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:true,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;metadata&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{}}]}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"partitionColumns"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"configuration"&lt;/span&gt;&lt;span class="p"&gt;:{}}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"remove"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"part-00001-6d252218-2632-416e-9e46-f32316ec314a-c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"remove"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"part-00000-348d7f43-38f6-4778-88c7-45f379471c49-c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"add"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"id=5/part-00000-f1e0b560-ca00-409e-a274-f1ab264bc412.c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"partitionValues"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;362&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"modificationTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1501109076000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"add"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"id=6/part-00000-adb59f54-6b8f-4bfd-9915-ae26bd0f0e2c.c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"partitionValues"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"6"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;362&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"modificationTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1501109076000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"add"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"id=4/part-00001-36c738bf-7836-479b-9cc1-7a4934207856.c000.snappy.parquet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"partitionValues"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;362&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"modificationTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1501109076000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"dataChange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, there is a &lt;em&gt;metaData&lt;/em&gt; object entry - it represents a change in the table columns: either an update to the table schema or the creation of a new table.&lt;br&gt;
Later we see two &lt;em&gt;remove&lt;/em&gt; operations, followed by three &lt;em&gt;add&lt;/em&gt; operations. These operation objects can have a &lt;em&gt;stats&lt;/em&gt; field, which contains statistical information such as the number of records, minValues, maxValues, and more.&lt;/p&gt;
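&lt;p&gt;To make the replay idea concrete, here is a minimal Python sketch (not Delta's actual implementation; the file paths and entries are made up for illustration) that replays a commit file's JSON lines in order to compute which data files are currently "live" in the table:&lt;/p&gt;

```python
import json

# Illustrative commit-file lines; each line is one JSON action, as in the
# examples above. Paths are hypothetical.
commit_lines = [
    '{"add":{"path":"part-00000-old.snappy.parquet","dataChange":true}}',
    '{"remove":{"path":"part-00000-old.snappy.parquet","dataChange":true}}',
    '{"add":{"path":"id=5/part-00000-new.snappy.parquet","partitionValues":{"id":"5"},"size":362,"dataChange":true}}',
]

def replay(lines):
    """Replay add/remove actions in order to get the set of live data files."""
    live = set()
    for line in lines:
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live

live_files = replay(commit_lines)
```

&lt;p&gt;Because the log is &lt;strong&gt;ordered&lt;/strong&gt;, every reader replaying the same lines reaches the same set of live files - which is exactly the "single source of truth" property.&lt;/p&gt;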

&lt;p&gt;These JSON files might also contain operation objects with fields such as "STREAMING UPDATE", "NOTEBOOK" (if the operation took place from a notebook), isolationLevel, etc.&lt;/p&gt;

&lt;p&gt;This information is valuable for managing the table and avoiding redundant full scans of the storage.&lt;/p&gt;

&lt;p&gt;To simplify the connection between DeltaTable and DeltaLog, it's easiest to think of a DeltaTable as the direct result of the set of actions recorded by the DeltaLog.&lt;/p&gt;
&lt;h2&gt;
  
  
  DeltaLog and Atomicity
&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals"&gt;part one&lt;/a&gt;, you already know that atomicity means that a transaction either happened in full or did not happen at all. The DeltaLog itself consists of atomic operations; each line in the log (like the ones you saw above) represents an action, which is an atomic unit. These are called commits.&lt;br&gt;
The transactions that took place on the data can be broken into multiple components, each of which individually represents a commit in the DeltaLog. Breaking complex operations into small transactions helps ensure atomicity.&lt;/p&gt;
&lt;h2&gt;
  
  
  DeltaLog and Isolation
&lt;/h2&gt;

&lt;p&gt;Operations such as update, delete, and add can harm isolation; hence, since we want to guarantee isolation with DeltaTable, readers only get access to a table snapshot. This guarantees that all parallel readers read the exact same data. For handling deletion operations, Delta postpones the actual deletion of the files: it first tags the files as deleted and removes them later, when considered safe (similar to Cassandra and Elasticsearch delete operations with a tombstone).&lt;/p&gt;
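&lt;p&gt;The tombstone idea can be sketched in a few lines of Python (an illustrative model, not Delta's actual code; the file name and the function names here are assumptions):&lt;/p&gt;

```python
# Hypothetical sketch of tombstone-style deferred deletion: a "remove" action
# only marks a file as deleted (with a timestamp); physical deletion happens
# later, once the retention window has passed.
RETENTION_SECONDS = 14 * 24 * 3600  # e.g. two weeks, as recommended

tombstones = {}  # path -> deletion timestamp (seconds)

def mark_removed(path, now):
    # readers of older snapshots can still use the file in the meantime
    tombstones[path] = now

def files_safe_to_vacuum(now):
    # only files whose tombstone is older than the retention window
    return {p for p, t in tombstones.items() if now - t > RETENTION_SECONDS}

mark_removed("part-00001-old.snappy.parquet", now=0)
# just after deletion, the file must not be physically removed yet
early = files_safe_to_vacuum(now=60)
# well past the retention window, it may be removed
late = files_safe_to_vacuum(now=RETENTION_SECONDS + 1)
```

&lt;p&gt;This is why a too-short retention window is dangerous: a reader holding an old snapshot could lose a file out from under it.&lt;/p&gt;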

&lt;p&gt;In the Delta Lake 0.8.1 source code, there is a comment saying that it's recommended to set the delete retention to at least 2 weeks, or longer than the duration of a job. &lt;br&gt;&lt;br&gt;
&lt;em&gt;Note:&lt;/em&gt; This impacts streaming workloads as well, because the actual files will need to be deleted at some point, which might block the stream.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cUb-XFm8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-tombston-retention.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cUb-XFm8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-tombston-retention.png" alt="drawing" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  DeltaLog and Consistency
&lt;/h2&gt;

&lt;p&gt;Delta Lake solves the problem of consistency by resolving conflicts with an optimistic concurrency algorithm.&lt;br&gt;
The class in charge of this algorithm is the OptimisticTransaction class. It achieves this by using a &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantLock.html"&gt;Java ReentrantLock&lt;/a&gt; that is controlled from a DeltaLog instance. &lt;br&gt;&lt;br&gt;
Here is the code snippet: &lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JcsRXOmQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-log-optimistic-concurrency-algo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JcsRXOmQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-log-optimistic-concurrency-algo.png" alt="drawing" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A DeltaTable instance actively uses the ReentrantLock in the OptimisticTransaction class, in the &lt;code&gt;doCommitRetryIteratively&lt;/code&gt; function.&lt;br&gt;
The optimistic approach was chosen here because in the big data world there is a tendency to add new data far more often than to update existing records.&lt;br&gt;
It's rare to find and update a specific record; it is usually done only when there was some data corruption in necessary data.&lt;/p&gt;

&lt;p&gt;Here is the code snippet for the optimistic algorithm:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WN4pH_X5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-log-OptimisticTransaction.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WN4pH_X5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delta-log-OptimisticTransaction.png" alt="drawing" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that in line 572, the program records the attempted version in the &lt;code&gt;commitVersion&lt;/code&gt; variable, which is declared as a &lt;code&gt;var&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;var&lt;/code&gt; in Scala declares a mutable variable, which means we should expect its value to change.&lt;/p&gt;

&lt;p&gt;In line 575, we start the algorithm:&lt;br&gt;
 it starts the &lt;code&gt;while(true)&lt;/code&gt; loop and maintains an &lt;code&gt;attemptNumber&lt;/code&gt; counter; if it's &lt;code&gt;==0&lt;/code&gt;, it will try to commit. If it fails here, that means a file with this &lt;code&gt;commitVersion&lt;/code&gt; was already written/committed to the table, and it will throw an exception. That exception is caught in lines 592+593. From there, with each failure, the algorithm increments attemptNumber by 1.&lt;br&gt;
After the first failure, the program won't go into the first if statement on line 577; it will go straight into the &lt;code&gt;else if&lt;/code&gt; on line 579.&lt;br&gt;
If the program reaches a state where &lt;code&gt;attemptNumber&lt;/code&gt; is greater than the maximum allowed/configured, it will throw a &lt;code&gt;DeltaErrors.maxCommitRetriesExceededException&lt;/code&gt; exception.&lt;br&gt;
This exception provides information about the commit version, the first commit version attempted, the number of attempted commits, and the total time spent attempting this commit in milliseconds.&lt;br&gt;
Otherwise, it will try to record this update with the checkForConflict functionality in line 588.&lt;br&gt;
Multiple scenarios can bring us to this state.&lt;/p&gt;

&lt;p&gt;High-level pseudo-code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while(tryCommit)
    if first attempt:
        do commit
    else if: attempt number &amp;gt; max retries
            throw an exception - exit loop
        else:
            record retry operation
            try fixing logical conflicts - return valid commit version or throw an exception
            do commit
    retry on exceptions and attempt version +1
    if no exception - end loop
end     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
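&lt;p&gt;The retry loop can also be modeled as a short runnable Python sketch (an illustrative model loosely following &lt;code&gt;doCommitRetryIteratively&lt;/code&gt;; the names &lt;code&gt;check_for_conflicts&lt;/code&gt;, &lt;code&gt;FakeLog&lt;/code&gt;, etc. are made up and are not Delta's actual API):&lt;/p&gt;

```python
MAX_RETRIES = 3  # stand-in for the configured maximum number of commit retries

class MaxCommitRetriesExceeded(Exception):
    pass

def commit_with_retries(log, actions, attempted_version):
    """Optimistically try to commit; on conflict, resolve and retry."""
    version = attempted_version
    for attempt in range(MAX_RETRIES + 1):
        if attempt > 0:
            # a concurrent writer won the previous version; resolve the
            # logical conflict and move to the next candidate version
            version = log.check_for_conflicts(version)
        if log.try_commit(version, actions):
            return version
    raise MaxCommitRetriesExceeded(f"gave up after {MAX_RETRIES} retries")

class FakeLog:
    """Toy stand-in for the DeltaLog, for demonstration only."""
    def __init__(self, taken):
        self.taken = set(taken)  # versions already committed by other writers
    def try_commit(self, version, actions):
        if version in self.taken:
            return False  # simulates the file-already-exists failure
        self.taken.add(version)
        return True
    def check_for_conflicts(self, version):
        return version + 1  # pretend the conflict is benign

log = FakeLog(taken={0, 1})
committed = commit_with_retries(log, actions=[], attempted_version=0)
```

&lt;p&gt;Here, two other writers already took versions 0 and 1, so the commit lands at version 2 on the third attempt; with more contention than the retry budget allows, the exception is raised instead.&lt;/p&gt;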



&lt;p&gt;To support the users, DeltaLake introduces a set of conflict exceptions that provide more information about the data and the conflicts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bLgba9VW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delte-concurrent-exceptions.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bLgba9VW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/delte-concurrent-exceptions.png" alt="drawing" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at some of the conflict scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two Writers:
&lt;/h3&gt;

&lt;p&gt;This is the case of two writers who append data to the same table simultaneously, without reading anything. In this scenario, one writer will commit, and the second writer will read the first one's updates before adding its own. Suppose it was only an append operation, like a counter that both are incrementing; in that case, there is no need to redo all computations, and it will commit automatically. If that's not the case, writer number two will need to redo the computation given the new information from writer one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delete and Read:
&lt;/h3&gt;

&lt;p&gt;In a more complex scenario like this one, there is no automated solution. For a concurrent delete-read, there is a dedicated &lt;code&gt;ConcurrentDeleteReadException&lt;/code&gt;.&lt;br&gt;
That means that if there is a request to delete a file that is being used for a read at the same time, the program throws an exception.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H6iQkt-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/ConcurrentDeleteReadException.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H6iQkt-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/Detla/ConcurrentDeleteReadException.png" alt="drawing" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Delete and Delete:
&lt;/h3&gt;

&lt;p&gt;When two operations delete the same file, which might happen due to the compaction mechanism or another operation, here too an exception will occur.&lt;/p&gt;
&lt;h2&gt;
  
  
  DeltaLog and Durability
&lt;/h2&gt;

&lt;p&gt;Since all transactions made on a DeltaTable are stored directly on the disk/file system, durability is a given. All commits are &lt;em&gt;persisted&lt;/em&gt; to disk. In case of a system failure, they can be restored from the disk&lt;br&gt;
(unless there is a true disaster, like a fire, that damages the actual disks holding the information).&lt;/p&gt;



&lt;p&gt;For exploring and learning about Delta, I did a deep dive into the source code itself. If you are interested in joining me, I captured it in videos; let me know if that is useful for you.&lt;/p&gt;
&lt;h1&gt;
  
  
  What's next?
&lt;/h1&gt;

&lt;p&gt;Next, we will see more examples, scenarios, and use cases for Delta Lake! We will learn about the compaction mechanism, schema enforcement, and how Delta can enforce exactly-once operations.&lt;/p&gt;

&lt;p&gt;As always, I would love to get your comments and feedback on &lt;a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;amp;ref_src=twsrc%5Etfw&amp;amp;region=follow_link&amp;amp;screen_name=AdiPolak&amp;amp;tw_p=followbutton"&gt;Adi Polak&lt;/a&gt; 🐦.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/i24ZA6mmvDI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you would like to get monthly updates, consider &lt;a href="https://sub.adipolak.com/subscribe"&gt;subscribing&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Delta Lake essential Fundamentals: Part 1 - ACID</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Mon, 22 Feb 2021 07:49:53 +0000</pubDate>
      <link>https://forem.com/adipolak/delta-lake-essential-fundamentals-part-1-acid-1be0</link>
      <guid>https://forem.com/adipolak/delta-lake-essential-fundamentals-part-1-acid-1be0</guid>
      <description>&lt;p&gt;Multi-part series that will take you from beginner to expert in Delta Lake.&lt;/p&gt;

&lt;p&gt;🎉 Welcome to the first part of Delta Lake essential fundamentals! 🎉&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Delta Lake ?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Delta Lake is an open-source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DeltaLake open source consists of 3 projects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/delta-io/delta"&gt;detla&lt;/a&gt; - Delta Lake core, written in Scala.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/delta-io/delta-rs"&gt;delta-rs&lt;/a&gt; - Rust library for binding with Python and Ruby.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/delta-io/connectors"&gt;connectors&lt;/a&gt; - Connectors to popular big data engines outside Spark, written mostly in Scala.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Delta gives us the ability to &lt;u&gt;"travel back in time"&lt;/u&gt; to previous versions of our data, and &lt;u&gt;scalable metadata&lt;/u&gt; - meaning that if we have a large set of raw data stored in a data lake, having metadata gives us the flexibility needed for analytics and exploration of the data. It also provides a mechanism to &lt;u&gt;unify streaming and batch data&lt;/u&gt;.&lt;br&gt;&lt;br&gt;
&lt;u&gt;Schema enforcement&lt;/u&gt; - handling schema variations to prevent the insertion of bad/non-compliant records - and &lt;u&gt;ACID transactions&lt;/u&gt; ensure that users/readers never see inconsistent data.&lt;/p&gt;



&lt;p&gt;It's important to remember that Delta Lake is not a database (DB); yes, just like Apache Kafka is not a DB.&lt;br&gt;
It might 'feel' like one due to the support of ACID transactions, schema enforcements, etc.&lt;br&gt;
But it's not.&lt;/p&gt;



&lt;p&gt;Part 1 focuses on ACID Fundamentals:&lt;/p&gt;

&lt;h2&gt;
  
  
  ACID Fundamentals in Delta Lake:
&lt;/h2&gt;

&lt;p&gt;Let's break it down to understand what each means and how it translates in Delta:&lt;/p&gt;

&lt;h4&gt;
  
  
  Atomicity
&lt;/h4&gt;

&lt;p&gt;A transaction either succeeded or it did not: all changes, updates, deletes, and other operations either happened as a single unit or not at all. Think binary: there is only yes or no, 1 or 0. In Delta, this means that when a commit of a transaction happens, a new transaction log file is written. Transaction log file name example: &lt;code&gt;000001.json&lt;/code&gt;, where the number represents the commit number.&lt;/p&gt;
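&lt;p&gt;As a tiny illustration of the naming scheme (a sketch only; this example uses six-digit zero padding to match &lt;code&gt;000001.json&lt;/code&gt; above, while Delta's actual log files use a longer zero-padded version number):&lt;/p&gt;

```python
# Map a commit (version) number to its zero-padded log file name.
# The atomicity trick is that writing this single file IS the commit:
# either the file exists (committed) or it doesn't (not committed).
def commit_file_name(version, width=6):
    return f"{version:0{width}d}.json"

name = commit_file_name(1)
```

&lt;p&gt;Because creating the file is a single atomic file-system operation, there is no in-between state a reader can observe.&lt;/p&gt;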

&lt;h4&gt;
  
  
  Consistency
&lt;/h4&gt;

&lt;p&gt;A transaction can only bring the DB from one valid state to another; the data is valid according to all the rules, constraints, triggers, etc. The transaction itself can be consistent but incorrect. To achieve consistency, Delta Lake relies on commit timestamps that come from the storage system's modification timestamps. If you are using cloud provider storage such as &lt;a href="https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction?WT.mc_id=delta-13569-adpolak"&gt;Azure Blob&lt;/a&gt; or AWS S3, the timestamp will come from the storage server.&lt;/p&gt;

&lt;h4&gt;
  
  
  Isolation
&lt;/h4&gt;

&lt;p&gt;Transactions taking place concurrently result in the same state as if the transactions had been executed sequentially. This is the primary goal of concurrency control strategies. In Delta, after 10 commits, a merging mechanism merges these commits into a checkpoint file. The checkpoint file has a timestamp; 1 second is added to the modification timestamp to avoid flakiness. This is how it looks in the code base of Delta: &lt;/p&gt;
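&lt;p&gt;A small Python sketch of these two rules (illustrative only; the function names are made up, not Delta's): a checkpoint after every 10 commits, and a 1-second bump to keep commit timestamps strictly increasing:&lt;/p&gt;

```python
CHECKPOINT_INTERVAL = 10  # merge recent commits into a checkpoint every 10 commits

def needs_checkpoint(version):
    # a checkpoint is written after every CHECKPOINT_INTERVAL commits
    return version > 0 and version % CHECKPOINT_INTERVAL == 0

def commit_timestamp(prev_ts_millis, file_ts_millis):
    # if the storage system reports a modification timestamp that is not
    # strictly newer than the previous commit's, add 1 second so the
    # commit timestamps stay monotonic (avoiding flaky orderings)
    if file_ts_millis <= prev_ts_millis:
        return prev_ts_millis + 1000
    return file_ts_millis
```

&lt;p&gt;Monotonic timestamps matter because features like time travel order commits by them; two commits sharing a timestamp would make the order ambiguous.&lt;/p&gt;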

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VyFkynTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/delta-lake-avoid-flakiness-commit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VyFkynTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.adipolak.com/images/delta-lake-avoid-flakiness-commit.png" alt="drawing" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Durability
&lt;/h4&gt;

&lt;p&gt;Once a transaction has been committed, it will remain committed even if the system fails. Think about writing to disk vs. writing to RAM. A machine can fail, but if the commit data was written to disk, it can be restored. Delta writes all the commits in JSON files directly to storage; the data is not left floating in RAM for too long.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;After understanding ACID basics and a bit about the transaction log (aka DeltaLog), you are ready for the next chapter, &lt;br&gt; diving deeper into the DeltaLog: how it looks on disk, and the open-source code you need to be familiar with. &lt;/p&gt;

&lt;p&gt;More post from the series:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals-the-deltalog/"&gt;Delta Lake essential Fundamentals: Part 2 - The DeltaLog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.adipolak.com/post/delta-lake-essential-fundamentals-part-3/"&gt;Delta Lake essential Fundamentals: Part 3 - Compaction and Checkpoint&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;As always, I would love to get your comments and feedback on &lt;a href="https://twitter.com/intent/follow?original_referer=http%3A%2F%2Flocalhost%3A1313%2F&amp;amp;ref_src=twsrc%5Etfw&amp;amp;region=follow_link&amp;amp;screen_name=AdiPolak&amp;amp;tw_p=followbutton"&gt;Adi Polak&lt;/a&gt; 🐦.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Tech Exceptions Show: Accelerating Data Engineering with Azure</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 02 Feb 2021 07:49:13 +0000</pubDate>
      <link>https://forem.com/azure/tech-exceptions-show-accelerating-data-engineering-with-azure-fe5</link>
      <guid>https://forem.com/azure/tech-exceptions-show-accelerating-data-engineering-with-azure-fe5</guid>
      <description>&lt;p&gt;Data engineering is the new hotness; many developers have already been working with distributed data technologies such as &lt;a href="https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction?WT.mc_id=techexceptions-10917-adpolak"&gt;Apache Kafka&lt;/a&gt;, &lt;a href="https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview?WT.mc_id=techexceptions-10917-adpolak"&gt;Apache Spark&lt;/a&gt;, and &lt;a href="https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction?WT.mc_id=techexceptions-10917-adpolak"&gt;Apache Cassandra&lt;/a&gt; as backend developers, building infrastructure for analytics and enabling a healthy flow of data in the organization. &lt;/p&gt;

&lt;p&gt;The reason we see more media coverage for this topic is the maturity of Data Science and Machine Learning. ML used to be the most hyped field. Following the hype, companies realized that they needed to build smarter products. So what did they do? They started hiring Data Scientists, even though they didn't have the infrastructure to support them.&lt;/p&gt;

&lt;p&gt;Due to that, many companies are now focused on building in-house ML platforms to enable their Data Scientists to get more value out of the data.&lt;br&gt;
Yes, you read correctly, more value out of the data.&lt;br&gt;
Let's take a look at the Data Science needs pyramid: &lt;/p&gt;

&lt;p&gt;Data science layers towards AI by Monica Rogati: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---MtptPW5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/9znpas8a0kplmw58pvp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---MtptPW5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/9znpas8a0kplmw58pvp5.png" alt="Alt Text" width="700" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the clear need for Data Infrastructure Engineers and Data Engineers: they are at the base of the pyramid, which means that without them, data scientists can't do their jobs efficiently. Think of it as similar to Maslow's hierarchy of needs: at the base are the physiological needs, and without them, we can't exist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ci2Jp0bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z5qhbb6wsa1c8vsam4m2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ci2Jp0bH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/z5qhbb6wsa1c8vsam4m2.jpg" alt="Alt Text" width="800" height="595"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You are probably curious why I share this with you. Well, I had a wonderful conversation with Sheel Choksi, Solution Architect at &lt;a href="//ascend.io"&gt;Ascend&lt;/a&gt;, and we talked about exactly that: how we can help Data Engineers do more and accelerate development and ML in organizations. &lt;a href="https://aka.ms/AAb08x8"&gt;Watch Now&lt;/a&gt; 📺! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aka.ms/AAb08x8"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bpk-gy7P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8arn7bljqccx7q5wjs7p.png" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>cloudskills</category>
      <category>techtalks</category>
      <category>startup</category>
    </item>
    <item>
      <title>Tech Exceptions new Episode - The Importance of Testing data Applications</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 26 Jan 2021 14:02:37 +0000</pubDate>
      <link>https://forem.com/azure/tech-exceptions-new-episode-the-importance-of-testing-data-applications-3e7h</link>
      <guid>https://forem.com/azure/tech-exceptions-new-episode-the-importance-of-testing-data-applications-3e7h</guid>
      <description>&lt;p&gt;It was a true pleasure to have Angie Jones on the Microsoft Tech Exceptions show.&lt;/p&gt;

&lt;p&gt;Angie shared the Tech University she built, the importance of software testing, the role of QA, and how you can leverage AI and data to help you with testing.&lt;br&gt;
And last but not least, how you can automate this whole process and make it an integral part of your CI/CD process using &lt;a href="https://docs.microsoft.com/en-us/azure/developer/github/github-actions?WT.mc_id=techexceptions-10917-adpolak"&gt;GitHub Actions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A dedicated GitHub Actions &lt;em&gt;workflow&lt;/em&gt; helps you automate the build, test, package, release, or deployment of any project on GitHub.&lt;/p&gt;

&lt;p&gt;Each &lt;em&gt;workflow&lt;/em&gt; is made up of individual actions that run after a specific event (a trigger), for example a git commit or a pull request. Each action is a packaged script that performs one part of the overall automated software development tasks.&lt;/p&gt;

&lt;p&gt;Many Microsoft services have support for GitHub Actions.&lt;/p&gt;

&lt;p&gt;For example, if you work with AKS (Azure Kubernetes Service), you can &lt;a href="https://azure.microsoft.com/en-us/blog/github-actions-for-azure-is-now-generally-available/?WT.mc_id=techexceptions-10917-adpolak"&gt;customize and automate the deployment&lt;/a&gt; process to your cluster. The same goes for &lt;a href="https://docs.microsoft.com/en-us/learn/paths/azure-sql-fundamentals/?WT.mc_id=techexceptions-10917-adpolak"&gt;Azure SQL&lt;/a&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Watch NOW! 📺
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://bit.ly/applitools-techexceptions"&gt;Tech Exceptions - The Importance of Testing data Applications&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bit.ly/applitools-techexceptions"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--flA2iM6r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jiuq3efhub1ent36o2kk.png" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Curious to learn more? Check out these free online courses:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Implement a coding workflow in your build pipeline by using Git and GitHub
&lt;/h3&gt;

&lt;p&gt;Collaborate with others and merge only the highest quality code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/learn/modules/implement-code-workflow/?WT.mc_id=techexceptions-10917-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OW6BmWwK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/sm4hh4km3ml68m3abbjz.png" width="567" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrate your repository by using GitHub best practices
&lt;/h3&gt;

&lt;p&gt;Learn to move your existing project to GitHub from a legacy version control system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/learn/modules/migrate-repository-github/?WT.mc_id=techexceptions-10917-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cRC0UMUk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/zkgotx9wwdcltfc4xw12.png" width="567" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitor GitHub events by using a webhook with Azure Functions
&lt;/h3&gt;

&lt;p&gt;Webhooks offer a lightweight mechanism for your app to be notified by another service when something of interest happens. In this module, you'll learn how to trigger an Azure Function with a GitHub webhook and parse the payload for insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/learn/modules/monitor-github-events-with-a-function-triggered-by-a-webhook/?WT.mc_id=techexceptions-10917-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sfEM_8F3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/q3wtpialcetf7tpueim0.png" width="567" height="432"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I hope you gained value and learned something new. As always, I'm happy to take your career and startup questions at &lt;a href="https://twitter.com/AdiPolak"&gt;@adipolak&lt;/a&gt; and &lt;a href="https://twitter.com/TechExceptions"&gt;@TechExceptions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>techtalks</category>
      <category>startup</category>
      <category>career</category>
    </item>
    <item>
      <title>Tech Exceptions new Episode - Data Management and External Organization Collaboration</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 19 Jan 2021 08:04:40 +0000</pubDate>
      <link>https://forem.com/azure/tech-exceptions-new-episode-data-management-and-external-organization-collaboration-1j29</link>
      <guid>https://forem.com/azure/tech-exceptions-new-episode-data-management-and-external-organization-collaboration-1j29</guid>
      <description>&lt;p&gt;Remote work increases the chances of data breaches for organizations. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;data breach&lt;/strong&gt; is the intentional or unintentional release of secure or private/confidential information to an untrusted environment. Other terms for this phenomenon include unintentional information disclosure, data leak, information leakage, and data spill.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Microsoft is investing heavily in this space to give every customer maximum protection for their data. &lt;br&gt;
For example, &lt;a href="https://docs.microsoft.com/en-us/compliance/regulatory/gdpr-breach-notification?WT.mc_id=techexceptions-13005-adpolak"&gt;Microsoft GDPR Breach Notification&lt;/a&gt; helps you set up notifications and better track and audit your data so you don't break compliance.&lt;/p&gt;

&lt;p&gt;Microsoft also provides dedicated services for tracking and adhering to multiple compliance standards. These services help your organization comply with national, regional, and industry-specific requirements governing the collection and use of data across various industries and regions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/compliance/regulatory/offering-home?WT.mc_id=techexceptions-13005-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FGgwpU7C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/5qmsnx4harg616ixv10o.png" width="800" height="760"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/microsoft-365/admin/admin-overview/get-started-with-office-365?view=o365-worldwide?WT.mc_id=techexceptions-13005-adpolak"&gt;Microsoft 365&lt;/a&gt; can be used together with &lt;a href="https://docs.microsoft.com/en-us/azure/purview/?WT.mc_id=techexceptions-13005-adpolak"&gt;Azure Purview&lt;/a&gt;. &lt;br&gt;
Using them together will give you better control of your data, will help you understand who has control of your data, the various iterations it went through, including data lineage tracking.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Data Lineage View Graph example:
&lt;/h3&gt;

&lt;p&gt;You can see how the information is represented in a graph, showing you exactly what happened, including data extraction, transformation, analytics, visualization, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/purview/media/concept-lineage/lineage-end-end.png?WT.mc_id=techexceptions-13005-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bpFlzNBK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/zo2hcck8jf6g9urhafj2.png" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn more&lt;/strong&gt; about it &lt;a href="https://docs.microsoft.com/en-us/azure/purview/concept-data-lineage?WT.mc_id=techexceptions-13005-adpolak"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Breaches when collaborating with People outside of your Organization
&lt;/h2&gt;

&lt;p&gt;Managing and avoiding data breaches gets more complicated when you want to share your business-sensitive data with a stakeholder outside your organization. You share this data as part of a collaboration or partnership, but you can't control it anymore.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The external sharing capabilities in Microsoft 365 provide an opportunity for people in your organization to collaborate with partners, vendors, customers, and others who don't have an account in your directory. You can share entire teams or sites with people outside your organization or just individual files.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why is External Data Management Important? Watch the Tech Exceptions Episode:
&lt;/h2&gt;

&lt;p&gt;Grab your favorite ☕ and &lt;a href="https://channel9.msdn.com/Series/Tech-Exceptions/SpecterX--Data-Management-and-External-Organization-Collaboration?WT.mc_id=techexceptions-13005-adpolak"&gt;Learn More&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://channel9.msdn.com/Series/Tech-Exceptions/SpecterX--Data-Management-and-External-Organization-Collaboration?WT.mc_id=techexceptions-13005-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tX4O3Hit--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/15jj0iy7nd21jbyf68ic.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I hope you enjoy it. I'm happy to take your career and startup questions at &lt;a href="https://twitter.com/AdiPolak"&gt;@adipolak&lt;/a&gt; and &lt;a href="https://twitter.com/TechExceptions"&gt;@TechExceptions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>cloudskills</category>
      <category>techtalks</category>
      <category>startup</category>
    </item>
    <item>
      <title>Apache Spark Ecosystem, Jan 2021 Highlights</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Thu, 14 Jan 2021 09:39:49 +0000</pubDate>
      <link>https://forem.com/adipolak/apache-spark-ecosystem-jan-2021-highlights-4i9o</link>
      <guid>https://forem.com/adipolak/apache-spark-ecosystem-jan-2021-highlights-4i9o</guid>
      <description>&lt;p&gt;If you've been reading here for a while, you know that I'm a big fan of Apache Spark and have been using it for many years.&lt;br&gt;
Apache Spark is continually growing. It started as part of the Hadoop family, but&lt;br&gt;
with &lt;a href="https://medium.com/@acmurthy/hadoop-is-dead-long-live-hadoop-f22069b264ac" rel="noopener noreferrer"&gt;the slow death of Hadoop&lt;/a&gt; and the fast growth of Kubernetes, many new tools, connectors, and open-source projects have emerged.&lt;br&gt;
These highlights focus on the data and ML pieces that will help you build your platform on your preferred resource manager. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Ray:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fray-project%2Fray%2Fraw%2Fmaster%2Fdoc%2Fsource%2Fimages%2Fray_header_logo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fray-project%2Fray%2Fraw%2Fmaster%2Fdoc%2Fsource%2Fimages%2Fray_header_logo.png" alt="drawing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ray is an open-source, Python-based framework for building distributed applications.&lt;br&gt;
 Its main audience is ML developers and Data Scientists who would like to accelerate their machine learning workloads using distributed computing.&lt;br&gt;
Ray was open-sourced by UC Berkeley's &lt;a href="https://rise.cs.berkeley.edu/" rel="noopener noreferrer"&gt;RISELab&lt;/a&gt;, the successor of the &lt;a href="https://amplab.cs.berkeley.edu/" rel="noopener noreferrer"&gt;AMPLab&lt;/a&gt;, the lab where Apache Spark was created.&lt;br&gt;
If you are curious, their next big five-year project is all about &lt;strong&gt;Real-time Intelligence with Secure Explainable decisions&lt;/strong&gt;.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt; RayOnSpark &lt;/span&gt; is a feature that was recently added to &lt;a href="https://github.com/intel-analytics/analytics-zoo" rel="noopener noreferrer"&gt;Analytics Zoo&lt;/a&gt;, an end-to-end data analytics + AI open-source platform that helps you unify multiple analytics workloads, such as recommendation, time series, computer vision, and NLP, into one platform running on Spark, YARN, or K8s.&lt;br&gt;
 &lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt; "RayOnSpark allows users to run Ray programs on Apache Hadoop*/YARN directly so that users can easily try various emerging AI applications on their existing Big Data clusters in a distributed fashion. Instead of running big data applications and AI applications on two separate systems, which often introduces expensive data transfer and long end-to-end learning latency, RayOnSpark allows Ray applications to seamlessly integrate into Apache Spark* data processing pipeline and directly run on in-memory Spark RDDs or DataFrames." Jason Dai. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F728%2F1%2AJv085PlSKouE9RRuvFNlDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F728%2F1%2AJv085PlSKouE9RRuvFNlDQ.png" alt="drawing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about Ray and RayOnSpark, check out &lt;a href="https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a" rel="noopener noreferrer"&gt;Jason Dai's article from the RISELab publication&lt;/a&gt;.&lt;/p&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Koalas:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fdatabricks%2Fkoalas%2Fmaster%2Ficons%2Fkoalas-logo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fdatabricks%2Fkoalas%2Fmaster%2Ficons%2Fkoalas-logo.png" alt="drawing"&gt;&lt;/a&gt; &lt;br&gt;
&lt;br&gt;&lt;br&gt;
 &lt;span&gt; Koalas &lt;/span&gt; is pandas' scalable sibling:&lt;/p&gt;

&lt;p&gt;From the &lt;a href="https://pandas.pydata.org/docs/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; docs: &lt;em&gt;" pandas is an open-source, BSD-licensed library providing high-performance,&lt;br&gt;
 easy-to-use data structures and data analysis tools for the Python programming language."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From the &lt;a href="https://koalas.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Koalas&lt;/a&gt; docs: &lt;em&gt;"The Koalas project makes data scientists more productive when interacting with big data,&lt;br&gt;
 by implementing the pandas DataFrame API on top of Apache Spark."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are familiar with exploring and running analytics on data with &lt;em&gt;pandas&lt;/em&gt;,&lt;br&gt;
 &lt;em&gt;Koalas&lt;/em&gt; provides a similar API for running the same analytics on Apache Spark DataFrames,&lt;br&gt;
 which makes it easier for pandas users to run their workloads at scale.&lt;br&gt;
When using it, notice the different Koalas versions; many new versions are NOT available with Spark 2.4 and require a Spark 3.0 cluster.&lt;/p&gt;

&lt;p&gt;Koalas is built with an internal frame that holds indexes and metadata on top of a Spark DataFrame.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ytimg.com%2Fvi%2FNpAMbzerAp0%2Fmaxresdefault.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ytimg.com%2Fvi%2FNpAMbzerAp0%2Fmaxresdefault.jpg" alt="drawing"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;br&gt;&lt;/p&gt;

&lt;p&gt;To learn more about it, check out &lt;a href="https://databricks.com/session_eu19/koalas-pandas-on-apache-spark" rel="noopener noreferrer"&gt;Tim Hunter talk on Koalas&lt;/a&gt; from Spark Summit 2019.&lt;/p&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Delta Lake:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/5535944a613e60c9be4d3a96e3d9bd34e5aba5cddc1aa6c6153123a958698289/68747470733a2f2f646f63732e64656c74612e696f2f6c61746573742f5f7374617469632f64656c74612d6c616b652d77686974652e706e67" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/5535944a613e60c9be4d3a96e3d9bd34e5aba5cddc1aa6c6153123a958698289/68747470733a2f2f646f63732e64656c74612e696f2f6c61746573742f5f7374617469632f64656c74612d6c616b652d77686974652e706e67" alt="drawing"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; is nothing new with the Spark ecosystem, but still, many confuse Delta Lake to be a ... DataBase! (DB) well.. delta lake is NOT a database.&lt;br&gt;
Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency,&lt;br&gt;
 isolation, and durability) transactions to Apache Spark and Big data workloads but is not a DB! Just like &lt;a href="https://docs.microsoft.com/en-us/learn/paths/store-data-in-azure/?WT.mc_id=blog-00000-adpolak" rel="noopener noreferrer"&gt;Azure Blog storage&lt;/a&gt; and &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;AWS S3&lt;/a&gt; are not acting as databases, they are defined as storage.&lt;br&gt;
Delta helps with ACID that is hard to achieve and a great pain point with distributed storage.&lt;br&gt;
It provides scalable metadata handling on the data itself.&lt;br&gt;&lt;br&gt;
When combined with Spark, this is highly useful due to the nature of Spark SQL engine&lt;br&gt;
the catalyst which uses this metadata to better plan and executed big data queries.&lt;/p&gt;

&lt;p&gt;There is also data versioning through storage snapshots, a capability named the Time Travel feature.&lt;br&gt;
I recommend being mindful when using this feature, as saving snapshots and later reading them might add overhead to the storage size and compute of your data.&lt;/p&gt;

&lt;p&gt;If you are curious to learn more about it, read &lt;a href="https://databricks.com/blog/2020/06/18/time-traveling-with-delta-lake-a-retrospective-of-the-last-year.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  That's it.
&lt;/h2&gt;

&lt;p&gt;I hope you enjoyed reading this short recap of open-source projects for January 2021.&lt;br&gt;
If you are interested in learning more and getting updates, follow &lt;a href="https://twitter.com/AdiPolak" rel="noopener noreferrer"&gt;Adi Polak on Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post originally appeared on &lt;a href="https://blog.adipolak.com/post/apache-spark-ecosystem/" rel="noopener noreferrer"&gt;Adi Polak's Personal Blog&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Tech Exceptions new Episode -The Data Behind MLOps</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 12 Jan 2021 09:55:08 +0000</pubDate>
      <link>https://forem.com/azure/tech-exceptions-new-episode-the-data-behind-mlops-3in6</link>
      <guid>https://forem.com/azure/tech-exceptions-new-episode-the-data-behind-mlops-3in6</guid>
      <description>&lt;p&gt;Many times, when an organization is asked to add more ML-based features or products, they immediately start with hiring ML experts or experienced data scientists.&lt;br&gt;
After finding the best talent in the market, the new data scientists are facing the challenge of access to data and no existing infra to support their needs.&lt;br&gt;
The new data science team is now facing a data and MLOps challenge rather than a pure data science one. They spend 70% of their time collecting, cleaning, and preparing the data for ingestion into machine learning algorithms. &lt;br&gt;
Later they face the challenge of validating the models and communicating the outcomes to the engineering teams to code the model and deploy to production.&lt;/p&gt;

&lt;p&gt;This full cycle can take months, and by the time the ML model is deployed to production, the competitors have already developed a better one, and the company loses its technical edge. &lt;/p&gt;

&lt;p&gt;This is why data scientists need a platform to create and build their state-of-the-art machine learning models.&lt;/p&gt;

&lt;p&gt;Microsoft Azure offers Azure Machine Learning to help with this task.&lt;/p&gt;

&lt;p&gt;Here is a free 1-hour course for you to learn about it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/learn/modules/intro-to-azure-machine-learning-service/?WT.mc_id=techexceptions-10917-adpolak"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JE2y0LzI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/5b9kofg7r2hvbr3yt9gh.png" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand the importance of such a platform, we've met with Yaron Haviv, Co-founder and CTO of Iguazio for a chat on how they work with Microsoft and the evolving world of productionizing Machine Learning.  &lt;/p&gt;

&lt;p&gt;Watch here:&lt;br&gt;
&lt;a href="https://aka.ms/AAakfzh"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qPwwwS3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/d4xle49t7fe9wky3t6kx.png" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We hope you enjoy it. We're happy to take your career and startup questions at &lt;a href="https://twitter.com/AdiPolak"&gt;@adipolak&lt;/a&gt; and &lt;a href="https://twitter.com/TechExceptions"&gt;@TechExceptions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>career</category>
      <category>azure</category>
      <category>startup</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Tech Exceptions new Episode - From Open Source to Multi B$$$ IPO</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 12 Jan 2021 09:21:44 +0000</pubDate>
      <link>https://forem.com/adipolak/tech-exceptions-new-episode-from-open-source-to-multi-b-ipo-4a4j</link>
      <guid>https://forem.com/adipolak/tech-exceptions-new-episode-from-open-source-to-multi-b-ipo-4a4j</guid>
      <description>&lt;p&gt;When JFrog started 12 years ago, they were built on the founders' open-source project. They were among the first startups to believe that a company can be built and make money while sharing free code. Some years later, they IPOed at a valuation of $6.36 billion, which is an incredible number for the startup community.&lt;/p&gt;

&lt;p&gt;JFrog is collaborating closely with Microsoft, on both the business side and the technical side.&lt;br&gt;
JFrog Artifactory hosted on Microsoft Azure is a solution for developers and DevOps engineers that provides complete control, insight, and binary management throughout the software development lifecycle. DevOps teams have transparency and control of their entire build and release process, all with the power of cloud-based development.&lt;/p&gt;

&lt;p&gt;Read here to learn more about &lt;a href="https://docs.microsoft.com/en-us/azure/devops/artifacts/overview?view=azure-devops&amp;amp;WT.mc_id=techexceptions-10917-adpolak"&gt;Azure Artifacts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Recently we interviewed Casey O'Mara, who is an active board member of the CNCF (Cloud Native Computing Foundation) and previously worked for AWS and Microsoft as an architect; today, he is JFrog's VP of Business Development.&lt;/p&gt;

&lt;p&gt;Curious what Casey had to say about Open Source? Watch the interview here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aka.ms/AAaeqb8"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SEaA0zv6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/6tn6oeffz8mh2txkmj09.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We hope you enjoy it. We're happy to take your career and startup questions at &lt;a href="https://twitter.com/AdiPolak"&gt;@adipolak&lt;/a&gt; and &lt;a href="https://twitter.com/TechExceptions"&gt;@TechExceptions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>career</category>
      <category>startup</category>
      <category>techtalks</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Tech Exceptions new Episode - Stop wasting your time with Old-School Logging</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Wed, 09 Dec 2020 09:45:40 +0000</pubDate>
      <link>https://forem.com/adipolak/tech-exceptions-new-episode-stop-wasting-your-time-with-old-school-logging-idg</link>
      <guid>https://forem.com/adipolak/tech-exceptions-new-episode-stop-wasting-your-time-with-old-school-logging-idg</guid>
      <description>&lt;p&gt;🎙️ Jonah Kowall, CTO of &lt;a href="https://logz.io/freetrial/?utm_source=microsoft_youtube&amp;amp;utm_medium=referral&amp;amp;utm_campaign=tech_exceptions%C2%A0"&gt;Logz.io&lt;/a&gt;, joins &lt;a href="https://twitter.com/AdiPolak"&gt;Adi Polak&lt;/a&gt; to discuss his career path: from engineering, through being an analyst at Gartner, all the way to becoming CTO of Logz.io. &lt;/p&gt;

&lt;p&gt;Logz.io recently closed its latest funding round, which brings the total capital raised to a whopping $120+ million 🤑.&lt;br&gt;
That includes $74 million raised over the last 18 months. That amount of available capital allows the company to accelerate product development initiatives around AI and Machine Learning, expand headcount, and propel go-to-market activities around the globe. &lt;/p&gt;

&lt;p&gt;📊 Fun fact: Logz.io is built on open source alone, and they built the platform to be completely cloud-agnostic. &lt;br&gt;
However, for their machine learning purposes, they use the &lt;a href="https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/?WT.mc_id=techexceptions-11139-adpolak"&gt;Bing Search API&lt;/a&gt; to enrich the data, which gives you 1,000 free transactions per month to search for images, news, videos, visuals, and the general web.&lt;/p&gt;

&lt;p&gt;If you are interested in learning how to deploy the ELK (Elasticsearch, Logstash, and Kibana) stack on Azure, check the link &lt;a href="https://azure.microsoft.com/en-us/overview/linux-on-azure/elastic/?WT.mc_id=techexceptions-11139-adpolak"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Alright,
&lt;/h2&gt;

&lt;p&gt;Grab your favorite drink and get ready to be enlightened ✨&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://aka.ms/AAadtr0"&gt;Watch Now!&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/3oxHQoNY93SgETHh2E/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/3oxHQoNY93SgETHh2E/giphy.gif" alt="" width="490" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We hope you enjoy it. We're happy to take your career and startup questions at &lt;a href="https://twitter.com/AdiPolak"&gt;@adipolak&lt;/a&gt; and &lt;a href="https://twitter.com/TechExceptions"&gt;@TechExceptions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>career</category>
      <category>azure</category>
      <category>startup</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Panel Discussion: Data for Good [Create: Data]</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 01 Dec 2020 11:21:56 +0000</pubDate>
      <link>https://forem.com/azure/panel-discussion-data-for-good-create-data-5e41</link>
      <guid>https://forem.com/azure/panel-discussion-data-for-good-create-data-5e41</guid>
      <description>&lt;p&gt;Panel: Data for Good&lt;/p&gt;

&lt;h1&gt;
  
  
  Update!
&lt;/h1&gt;

&lt;p&gt;Phew, what an experience!&lt;br&gt;
Want to catch up with the panel session? You can watch it on demand -&amp;gt; &lt;a href="https://channel9.msdn.com/Events/MS-Create/Data/Panel-Data-for-Good?WT.mc_id=data-11733-adpolak"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To watch more sessions from Create: Data on demand, click &lt;a href="https://channel9.msdn.com/Events/MS-Create/Data?WT.mc_id=data-11733-adpolak"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Probably one of the most exciting topics of the year for everyone who works with data: how can we leverage data to do good for our communities and improve people's lives?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/qEA9njXuIMz6w01GaC/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/qEA9njXuIMz6w01GaC/giphy.gif" alt="" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join four inspiring and accomplished leaders in the data space for this Data for Good panel. They share their perspectives on what good data is and how it relates to actually doing good, and offer lessons on how data tools can help us keep our data healthy.&lt;/p&gt;




&lt;p&gt;Register now for &lt;a href="https://aka.ms/createdata"&gt;free&lt;/a&gt;, and please share your questions, ideas, and thoughts in the comment section. I'm happy to bring them to the panel experts to make sure all your questions are answered! &lt;/p&gt;

</description>
      <category>news</category>
      <category>datascience</category>
      <category>azure</category>
    </item>
    <item>
      <title>Driving a Data culture in a world of Remote Everything [Create: Data]</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 01 Dec 2020 11:17:30 +0000</pubDate>
      <link>https://forem.com/azure/driving-a-data-culture-in-a-world-of-remote-everything-create-data-2lnb</link>
      <guid>https://forem.com/azure/driving-a-data-culture-in-a-world-of-remote-everything-create-data-2lnb</guid>
      <description>&lt;p&gt;You know data is important to create a data-driven decision that will create value for you and your colleagues.&lt;br&gt;
But, what if no one from the team? or you, never collected any data? how do you drive that data culture especially now when many in the tech sector shifted to work remotely? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/2H67VmB5UEBmU/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/2H67VmB5UEBmU/giphy.gif" alt="" width="460" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join Arun Ulag and Heather Newman to learn how to build an organizational data culture with a remote workforce, how to integrate BI into the fabric of the organization, and how individuals can quickly adapt to working remotely and embrace a data-driven culture that leads to better, more suitable data-driven decisions.&lt;/p&gt;

&lt;h1&gt;
  
  
  All this goodness, in the &lt;a href="https://aka.ms/createdata"&gt;free Create: Data&lt;/a&gt; event.
&lt;/h1&gt;

&lt;p&gt;See you there! &lt;/p&gt;

</description>
      <category>azure</category>
      <category>news</category>
      <category>analytics</category>
    </item>
    <item>
      <title>The Journey to Modern Warehousing [Create: Data]</title>
      <dc:creator>Adi Polak</dc:creator>
      <pubDate>Tue, 01 Dec 2020 11:06:55 +0000</pubDate>
      <link>https://forem.com/azure/the-journey-to-modern-warehousing-create-data-4k91</link>
      <guid>https://forem.com/azure/the-journey-to-modern-warehousing-create-data-4k91</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo1817olyiydm016kgj0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fo1817olyiydm016kgj0g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is so very true, and just right on time for you to learn about Modern Data Warehousing.&lt;/p&gt;

&lt;p&gt;Please join us at &lt;a href="https://aka.ms/createdata" rel="noopener noreferrer"&gt;Create: Data&lt;/a&gt; for a unique session about the journey of Big Data.&lt;/p&gt;

&lt;p&gt;In this session, Saveen Reddy and Simon Whiteley will walk us through Microsoft's wild Big Data journey toward the technology shift that is Azure Synapse Analytics. Join Simon and Saveen as they talk through the path to modern analytics and how the barriers to big data adoption have been lowered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2q806361g3asyg97oux.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2q806361g3asyg97oux.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Remember!
&lt;/h1&gt;

&lt;p&gt;Register &lt;a href="https://aka.ms/createdata" rel="noopener noreferrer"&gt;here&lt;/a&gt; for free.&lt;/p&gt;




</description>
      <category>database</category>
      <category>analytics</category>
      <category>azure</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
