- Practice these notebooks thoroughly, along with the PDF below
- Incremental processing
- ETL patterns
- Data Privacy
- Performance Optimization
- DLT practices
- Automate prod workflows
- Knowledge check
- Udemy practice tests (I personally opted for these, which is a must; use Udemy for Business for free if your employer provides it)
Repo link
These are the topics I was able to note down from memory:
- How can you read parameters using `dbutils.widgets.text` and retrieve their values?
- Hint : Focus on using `dbutils.widgets.get` to retrieve the values.
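A minimal sketch, assuming a Databricks notebook where `dbutils` is available (the widget name and default value are illustrative):

```python
# Create a text widget with a default value, then read it back as a string.
dbutils.widgets.text("env", "dev", "Environment")

env = dbutils.widgets.get("env")
print(f"Running against: {env}")
```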
- How do you provide read access for a production notebook to a new data engineer for review?
- Hint : The answer involves setting the notebook's permissions to "Can Read."
- When attaching a notebook to a cluster, which permission allows you to run the notebook?
- Hint : The user needs "Can Restart" permission.
- Should production DLT pipelines be run on a job cluster or an all-purpose cluster?
- Does a CTAS (CREATE TABLE AS SELECT) operation execute the load every time or only during table creation?
- How can you control access to read production secrets using scope access control?
- Hint : The answer involves setting "Read" permissions on the scope or secret.
- Where does the `%sh` magic command run in Databricks?
- Hint : It runs on the driver node.
- If a query contains a filter, how does Databricks use file statistics in the transaction log?
- What happens when you run a `VACUUM` command on a shallow clone table?
- Hint : Running `VACUUM` on a shallow clone table will result in an error.
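A sketch of the scenario (table names are illustrative):

```python
# A shallow clone copies only metadata; its data files still belong to the source table.
spark.sql("CREATE TABLE events_clone SHALLOW CLONE events")

# VACUUM on the shallow clone raises an error rather than risk deleting
# data files that the source table still references.
spark.sql("VACUUM events_clone")
```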
- Which type of join (left, inner, right) is not possible when performing a join between a static DataFrame and a streaming DataFrame?
- Hint : Consider the limitations of streaming joins.
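A sketch of a stream-static join (table names are illustrative): with the streaming side on the left, inner and left outer joins are supported, but a right or full outer join against the static side is not.

```python
static_df = spark.read.table("dim_products")         # static dimension table
stream_df = spark.readStream.table("orders_stream")  # streaming source

# Supported: inner (and left outer) join with the stream on the left
joined = stream_df.join(static_df, "product_id", "inner")

# Unsupported: stream_df.join(static_df, "product_id", "right")
```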
- When the source is a CDC (Change Data Capture) feed, should you use `MERGE INTO` or leverage the Change Data Feed (CDF) feature?
- How can you find the difference between the previous and present commit in a Delta table?
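A sketch of comparing commits (the table name and version numbers are illustrative):

```python
# Time travel: diff two versions of a Delta table
v_prev = spark.read.option("versionAsOf", 5).table("orders")
v_curr = spark.read.option("versionAsOf", 6).table("orders")
v_curr.exceptAll(v_prev).show()

# Or, with CDF enabled, read only the row-level changes between commits
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 6)
           .option("endingVersion", 6)
           .table("orders"))
```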
- What is the best approach for nightly jobs that overwrite a table for the business team with the least latency?
- Hint : Should you write to the table nightly or create a view?
- What does the `OPTIMIZE` command do, and what is the target file size?
- Hint : Focus on the target file size of 1 GB.
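A quick example (the table and column names are illustrative):

```python
# Compact small files toward the target size; ZORDER co-locates data
# on the given column to improve file skipping.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```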
- In a streaming scenario, what does the `.withWatermark` function do with a delay of 10 minutes?
- How does aggregating on the source and then overwriting/appending to the target impact the data load?
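A sketch of a watermarked aggregation (table and column names are illustrative):

```python
from pyspark.sql import functions as F

# The 10-minute watermark tells the engine how long to wait for late events
# before finalizing a window and dropping its state.
agg = (spark.readStream.table("events")
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"))
       .count())
```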
- Why did you receive three email notifications when a job was set to trigger an email if `mean(temp) > 120`?
- Hint : Investigate multiple triggers for the email alert.
- Why should the checkpoint directory be unique for each stream in a streaming job?
- How would you set up an Autoloader scenario to load data into a bronze table with history and update the target table?
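A minimal Auto Loader sketch (paths and the table name are illustrative):

```python
# Incrementally ingest new files into a bronze table, keeping full history.
(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.schemaLocation", "/mnt/schemas/bronze_events")
 .load("/mnt/raw/events")
 .writeStream
 .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
 .outputMode("append")
 .toTable("bronze_events"))
```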
- How can you handle streaming deduplication based on a given code scenario?
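A sketch of streaming deduplication (column names are illustrative):

```python
# The watermark bounds the state kept for deduplication; dropDuplicates then
# discards repeated (event_id, event_time) pairs within that horizon.
deduped = (spark.readStream.table("bronze_events")
           .withWatermark("event_time", "30 minutes")
           .dropDuplicates(["event_id", "event_time"]))
```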
- For batch loading, what happens if the load is set to overwrite or append?
- Hint : Consider the impact on the target table.
- In a Change Data Feed (CDF) scenario, if `readChangeFeed` starts at version 0 and append is used, will there be deduplication?
- How can you identify whether a table is SCD Type 1 or SCD Type 2 based on an upsert operation?
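For the SCD question: a Type 1 upsert overwrites matched rows in place (no history), while Type 2 closes out the old row and inserts a new one. A Type 1 sketch (table names are illustrative):

```python
# SCD Type 1: matched rows are overwritten, so no history is retained.
spark.sql("""
  MERGE INTO dim_customer t
  USING customer_updates s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```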
- To avoid performance issues, should you decrease the trigger interval or not?
- How does Delta Lake decide on file skipping based on columns in a query, and what are the implications for nested columns?
- What does granting "Usage" and "Select" permissions on a Delta table allow a user to do?
- How do you create an unmanaged table in Databricks?
- What makes a date column a good candidate for partitioning in a Delta table?
- What happens in the transaction log when you rename a Delta table using `ALTER TABLE xx RENAME TO xx`?
- How would you handle an error with a CHECK constraint, and what would you recommend?
- When using `DESCRIBE` commands, how can you retrieve table properties, comments, and partition details?
- Hint : Use `DESCRIBE HISTORY`, `DESCRIBE EXTENDED`, or `DESCRIBE DETAIL`.
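A quick reference (the table name is illustrative):

```python
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)  # columns plus table metadata and comments
spark.sql("DESCRIBE DETAIL sales").show(truncate=False)    # format, location, size, partition columns
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)   # commit history of the Delta table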
- How are file statistics used in Delta Lake, and why are they important?
- In the Ganglia UI, how can you detect a spill during query execution?
- If a repo branch is missing locally, how can you retrieve that branch with the latest code changes?
- After deleting records with a query like `DELETE FROM A WHERE id IN (SELECT id FROM B)`, can you time travel to see the deleted records, and how can you prevent their permanent deletion?
- What are the differences between DBFS and mounts in Databricks?
- If the API `2.0/jobs/create` is executed three times with the same JSON, what will happen? Will it execute once or create three jobs?
- What is DBFS in Databricks?
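A sketch of the scenario, assuming `host`, `token`, and `job_spec` are already defined; `2.0/jobs/create` is not idempotent, so each call creates a new job:

```python
import requests

# Each POST to /api/2.0/jobs/create returns a new job_id,
# even when the JSON payload is identical.
resp = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json()["job_id"])
```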
- How do you install a Python library using `%pip` in a Databricks notebook?
- If Task 1 has downstream Task 2 and Task 3 running in parallel, and Task 1 and Task 2 succeed while Task 3 fails, what will be the final job status?
- Hint : The job may show as partially completed.
- How do you handle streaming job retries in production, specifically with job clusters, unlimited retries, and a maximum of one concurrent run?
- How can you clone an existing job and version it using the Databricks CLI?
- When converting a large JSON file (1TB) to Parquet with a partition size of 512 MB, what is the correct order of steps? Should you read, perform narrow transformations, repartition (2048 partitions), then convert to Parquet?
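A sketch of the ordering and the arithmetic (paths and columns are illustrative): 1 TB / 512 MB ≈ 2048 partitions.

```python
(spark.read.json("/mnt/raw/huge.json")      # read the source
 .select("id", "payload")                   # narrow transformations first
 .repartition(2048)                         # 1 TB / 512 MB ≈ 2048 partitions
 .write.parquet("/mnt/out/huge_parquet"))   # then write as Parquet
```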
- What happens in the target table when duplicates are dropped during a batch read and append operation?
- If a column was missed during profiling from Kafka, how can you ensure that the data is fully replayable in the future?
- Hint : Consider writing to a bronze table.
- How do you handle access control for users in Databricks?
- What is the use of the `pyspark.sql.functions.broadcast` function in a Spark job?
- Hint : It distributes the data to all worker nodes.
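A quick example (the DataFrame names are illustrative):

```python
from pyspark.sql.functions import broadcast

# Ship the small dimension table to every executor so the join avoids a shuffle.
result = orders.join(broadcast(dim_products), "product_id")
```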
- What happens when performing a merge on `orders_id` with the clause `WHEN NOT MATCHED THEN INSERT *`?
- Hint : The operation will insert records that don't have a match.
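A sketch of the clause in context (table names are illustrative):

```python
# Only source rows with no matching orders_id in the target are inserted;
# matched rows are left untouched because no WHEN MATCHED clause is given.
spark.sql("""
  MERGE INTO orders t
  USING orders_updates s
  ON t.orders_id = s.orders_id
  WHEN NOT MATCHED THEN INSERT *
""")
```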
- Given a function definition for loading bronze data, how would you write a silver load function to transform and update downstream tables?
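A hedged sketch of what such a silver load function might look like (the function name, columns, and paths are all illustrative):

```python
def load_silver():
    # Read bronze incrementally, apply transformations, write to the silver table.
    (spark.readStream.table("bronze_events")
     .select("event_id", "event_time", "payload")   # illustrative projection/cleanup
     .writeStream
     .option("checkpointLocation", "/mnt/checkpoints/silver_events")
     .outputMode("append")
     .toTable("silver_events"))
```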
- If the code includes `CASE WHEN is_member("group") THEN email ELSE 'redacted' END AS email`, what will be the output if the user is not a member of the group?
- How can you use the Ganglia UI to view logs and troubleshoot a Databricks job?
- When working with multi-task jobs, how do you list or get the tasks using the API `2.0/jobs/list` or `2.0/jobs/runs/list`?
- What is unit testing, and how is it applied in a Databricks environment?
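A minimal pytest-style sketch, assuming a `spark` session fixture and an illustrative transformation function:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def add_revenue(df: DataFrame) -> DataFrame:
    # Transformation under test: revenue = price * quantity
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))

def test_add_revenue(spark):
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_revenue(df).first()["revenue"] == 6.0
```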
- What happens when multiple `display()` commands are executed repeatedly in development, and what is the impact in production?
- Will the `option("readChangeFeed")` work on a source Delta table with no CDC enabled?
- How can you identify whether a tumbling or sliding window is being used based on the code provided?
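For the window question, the giveaway is the number of arguments to `window()`. A sketch, assuming an existing DataFrame `df` with an `event_time` column:

```python
from pyspark.sql import functions as F

# Tumbling: one duration argument -> non-overlapping 10-minute buckets
tumbling = df.groupBy(F.window("event_time", "10 minutes")).count()

# Sliding: duration + slide -> overlapping 10-minute windows starting every 5 minutes
sliding = df.groupBy(F.window("event_time", "10 minutes", "5 minutes")).count()
```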
- What performance tuning considerations are involved with `spark.sql.files.maxPartitionBytes` and `spark.sql.shuffle.partitions`?
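For reference, the two knobs (the values here are illustrative defaults, not recommendations):

```python
# Upper bound on bytes packed into a single input partition when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")

# Number of partitions used when shuffling data for joins and aggregations
spark.conf.set("spark.sql.shuffle.partitions", "200")
```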
No matter what, please read these Databricks docs. Note the "Important" callouts on these pages and the questions at the end of some pages.
- Data skipping with Z-order indexes for Delta Lake
- Clone a table on Databricks
- Delta table streaming reads and writes
- Structured Streaming Programming Guide - Spark 3.5.0 Documentation
- Configure Structured Streaming trigger intervals
- Configure Delta Lake to control data file size
- Introducing Stream-Stream Joins in Apache Spark 2.3
- Best Practices for Using Structured Streaming in Production - The Databricks Blog
- What is Auto Loader?
- Upsert into a Delta Lake table using merge
- Use Delta Lake change data feed on Databricks
- Apply watermarks to control data processing thresholds
- Use foreachBatch to write to arbitrary data sinks
- How to Simplify CDC With Delta Lake's Change Data Feed
- VACUUM
- Jobs access control
- Cluster access control
- Secret access control
- Hive metastore privileges and securable objects (legacy)
- Data objects in the Databricks lakehouse
- Constraints on Databricks
- When to partition tables on Databricks
- Manage clusters
- Export and import Databricks notebooks
- Unit testing for notebooks
- Databricks SQL Statement Execution API – Announcing the Public Preview
- Transform data with Delta Live Tables
- Manage data quality with Delta Live Tables
- Simplified change data capture with the APPLY CHANGES API in Delta Live Tables
- Monitor Delta Live Tables pipelines
- Load data with Delta Live Tables
- What is Delta Live Tables?
- Solved: Re: What is the difference between Streaming live ... - Databricks - 17121
- What are all the Delta things in Databricks?
- Parameterized queries with PySpark
- Recover from Structured Streaming query failures with workflows
- Jobs API 2.0
- OPTIMIZE
- Adding and Deleting Partitions in Delta Lake tables
- What is the Databricks File System (DBFS)?
- Mounting cloud object storage on Databricks
- Databricks widgets
- Performance Tuning