Key Concepts
The Orbitra Lake Environment
The SDK operates within a pre-configured ecosystem of Azure Storage Accounts and Databricks Workspaces. By using the SDK, you don’t need to worry about JDBC strings or storage keys; the library handles authentication and protocol translation automatically.

Data lake basics
A data lake is a centralized storage layer for data in many formats (raw files, semi-structured data, and fully curated datasets). Unlike traditional warehouses, data lakes don’t require every dataset to fit a rigid, pre-defined schema, so teams can iterate while still enforcing governance where it matters.

Important: Data lakes are designed for immutability. You cannot update individual rows or cells in a table as you would in a traditional database; instead, when you need to change data, you overwrite the relevant partition or the entire table. This approach ensures consistency, supports large-scale analytics, and aligns with how modern open table formats (like Iceberg) manage data.

Iceberg Tables
Apache Iceberg is an open table format designed for large-scale analytics on data lakes. It brings many of the reliability and performance guarantees of data warehouses to object storage, without locking you into a specific engine or vendor.

One of Iceberg’s most powerful features is table partitioning. Partitioning splits large tables into smaller, logical segments (such as by date, region, or data source) while still presenting them as a single table to your applications. This makes queries much faster and enables efficient overwrites: instead of rewriting an entire table, you can overwrite just the affected partitions. Iceberg manages all the metadata, tracking schemas, partitions, and snapshots, so you can evolve your data model and scale analytics with confidence.

To learn more about Iceberg tables, visit: https://iceberg.apache.org

Use cases for table partitioning
- Partitioning by date: For time-series data such as logs, transactions, or daily reports, partitioning tables by date (e.g., year, month, day) allows for efficient querying and overwriting of specific time periods without scanning the entire dataset.
- Partitioning by region or business unit: When data is naturally segmented by geography (e.g., country, state) or organizational unit, partitioning by these fields enables teams to manage, update, or analyze data for specific segments independently.
- Partitioning by data source or type: In scenarios where data comes from multiple sources or represents different categories (e.g., device type, product line), partitioning by source or type helps isolate and efficiently process relevant subsets of data.
Example: partitioning a table by date
In Orbitra Lake, you mark partition columns by setting `kind="partition"` in `ColumnSchema`.
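The sketch below shows what a date-partitioned table definition could look like. Only `ColumnSchema` and `kind="partition"` come from the documentation above; the `TableSchema` wrapper, the `name`/`dtype` parameters, and the import path are illustrative assumptions, not the exact SDK signature.

```python
# Hypothetical sketch: TableSchema, the dtype values, and the import
# path are assumptions; only ColumnSchema and kind="partition" are
# taken from the documentation above.
from orbitra_lake import ColumnSchema, TableSchema  # assumed import path

sales_schema = TableSchema(
    name="daily_sales",
    columns=[
        # Regular data columns.
        ColumnSchema(name="order_id", dtype="string"),
        ColumnSchema(name="amount", dtype="double"),
        # Marking a column with kind="partition" makes it a partition
        # column: overwrites can then target individual dates instead
        # of rewriting the whole table.
        ColumnSchema(name="order_date", dtype="date", kind="partition"),
    ],
)
```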
Orbitra Lake SDK in 2 minutes
The SDK handles the “plumbing” (configuration + authentication) so you can start from a client and focus on data.

1) Create a client
environment="dev".
2) Create a table
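Continuing the sketch, a table could be created from the schema defined earlier. The method name `create_table` is an assumption, not a confirmed SDK signature:

```python
# Hypothetical method name: create_table is an assumption. It registers
# the Iceberg table using the schema (including its partition column)
# defined in the partitioning example above.
client.create_table(schema=sales_schema)
```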
3) Overwrite data
- Note: The overwrite method automatically identifies partition columns in the dataframe and overwrites only those partitions. If the dataframe contains no partition columns, the entire table is overwritten.
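A sketch of a partition-level overwrite using a pandas DataFrame. The method name `overwrite_table` and its parameters are assumptions; the partition-matching behavior is described in the note above:

```python
import pandas as pd

# New data for a single day. Because order_date is a partition column,
# only the 2024-06-01 partition is replaced; other dates are untouched.
df = pd.DataFrame(
    {
        "order_id": ["A-1", "A-2"],
        "amount": [19.99, 5.50],
        "order_date": [pd.Timestamp("2024-06-01").date()] * 2,
    }
)

# Hypothetical method name: overwrite_table is an assumption.
client.overwrite_table(table="daily_sales", df=df)
```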
4) Read data
- Note: In most cases, you’ll use `engine="local"` for the SQL method, as it runs queries on your local machine at no extra cost. If you need more powerful hardware, you can choose `engine="remote"` to run on Databricks hardware, but this may incur additional usage costs. The `local` engine is only available on Linux systems or within Dev Container environments.
Working with raw files (blobs)
Orbitra Lake SDK also supports saving and reading raw data blobs (for example, landing-zone files before they’re curated into tables). Some of the methods available are `save_raw_df_to_blob`, `save_raw_bytes_to_blob`, `read_raw_df_from_blob`, and `read_raw_bytes_from_blob`.
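A sketch of the blob helpers; the method names come from the list above, while the parameter names and blob paths are illustrative assumptions:

```python
import pandas as pd

# Parameter names and blob paths are assumptions; only the method
# names come from the documentation above.
raw_bytes = b'{"device": "sensor-7", "reading": 21.4}'
client.save_raw_bytes_to_blob(path="landing/sensor-7.json", data=raw_bytes)

# Read the same blob back as bytes.
payload = client.read_raw_bytes_from_blob(path="landing/sensor-7.json")

# The DataFrame variants cover tabular landing files.
df = pd.DataFrame({"device": ["sensor-7"], "reading": [21.4]})
client.save_raw_df_to_blob(path="landing/sensor-7.parquet", df=df)
df_back = client.read_raw_df_from_blob(path="landing/sensor-7.parquet")
```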