Brian_169
Jan 03, 2025 · Copper Contributor
Data archiving of Delta tables in Azure Databricks
Hi all,
I am currently researching data archiving for Delta table data on the Azure platform, as my company has a data retention policy.
I have studied the official Databricks documentation on archival support (https://docs.databricks.com/en/optimizations/archive-delta.html). It says: "If you enable this setting without having lifecycle policies set for your cloud object storage, Databricks still ignores files based on this specified threshold, but no data is archived."
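If I understand that page correctly, the Databricks side is enabled with the delta.timeUntilArchived table property, with an interval that should line up with the storage lifecycle rule. A minimal sketch in a Databricks notebook (the table name "sales" is just a placeholder for illustration):

# Tell Databricks to ignore data files older than the threshold,
# matching the interval of the cloud storage lifecycle rule.
# "sales" is a placeholder table name.
spark.sql(
    "ALTER TABLE sales "
    "SET TBLPROPERTIES ('delta.timeUntilArchived' = '1825 days')"
)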
Therefore, I am trying to figure out how to configure the lifecycle policy on the Azure storage account. I have read the Microsoft documentation on lifecycle management (https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview).
Say the Delta table data is stored in "test-container/sales" and there are lots of "part-xxxx.snappy.parquet" data files in that folder. Should I simply specify "tierToArchive" with "daysAfterCreationGreaterThan": 1825 and "prefixMatch": ["test-container/sales"]?
However, I am worried: will this archiving mechanism impact normal Delta table operations?
Also, what if a Parquet data file moved to the archive tier contains both data created more than 5 years ago and data created within the last 5 years? Is that possible? Could the policy end up moving some data to the archive tier before it is actually 5 years old?
I would highly appreciate it if someone could help me out with the questions above. Thanks in advance.
Try the lifecycle policy configuration below in Azure Storage. Note that "prefixMatch" belongs under "filters" (which also requires "blobTypes"), and "tierToArchive" sits under "actions" > "baseBlob":
{ "rules": [ { "enabled": true, "name": "ArchiveOldData", "type": "Lifecycle", "definition": { "actions": { "tierToArchive": { "daysAfterCreationGreaterThan": 1825, "prefixMatch": ["test-container/sales"] } } } } ] }