- Practice these notebooks thoroughly, along with the PDF below
- Incremental processing
- ETL patterns
- Data Privacy
- Performance Optimization
- DLT practices
- Automate prod workflows
- Knowledge check
- Udemy practice tests (I personally opted for these, which is a must; use Udemy for Business for free if your employer provides it)
Repo link
These are the topics I was able to note down from memory:
- How can you read parameters using `dbutils.widgets.text` and retrieve their values?
- Hint : Focus on using `dbutils.widgets.get` to retrieve the values.
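A minimal sketch, assuming a Databricks notebook where `dbutils` is available (the widget name and default value are illustrative):

```python
# Create a text widget with a default value, then read it back as a string.
dbutils.widgets.text("env", "dev", "Environment")

env = dbutils.widgets.get("env")
print(f"Running against: {env}")
```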
- How do you provide read access for a production notebook to a new data engineer for review?
- Hint : The answer involves setting the notebook's permissions to "Can Read."
- When attaching a notebook to a cluster, which permission allows you to run the notebook?
- Hint : The user needs "Can Restart" permission.
- Should production DLT pipelines be run on a job cluster or an all-purpose cluster?
- Does a CTAS (CREATE TABLE AS SELECT) operation execute the load every time or only during table creation?
- How can you control access to read production secrets using scope access control?
- Hint : The answer involves setting "Read" permissions on the scope or secret.
- Where does the `%sh` magic command run in Databricks?
- Hint : It runs on the driver node.
- If a query contains a filter, how does Databricks use file statistics in the transaction log?
- What happens when you run a `VACUUM` command on a shallow clone table?
- Hint : Running `VACUUM` on a shallow clone table will result in an error.
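A sketch of the scenario (table names are illustrative):

```python
# A shallow clone copies only metadata; its data files still belong to the source table.
spark.sql("CREATE TABLE events_clone SHALLOW CLONE events")

# VACUUM on the shallow clone raises an error rather than risk deleting
# data files that the source table still references.
spark.sql("VACUUM events_clone")
```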
- Which type of join (left, inner, right) is not possible when performing a join between a static DataFrame and a streaming DataFrame?
- Hint : Consider the limitations of streaming joins.
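A sketch of a stream-static join (table names are illustrative): with the streaming side on the left, inner and left outer joins are supported, but a right or full outer join against the static side is not.

```python
static_df = spark.read.table("dim_products")         # static dimension table
stream_df = spark.readStream.table("orders_stream")  # streaming source

# Supported: inner (and left outer) join with the stream on the left
joined = stream_df.join(static_df, "product_id", "inner")

# Unsupported: stream_df.join(static_df, "product_id", "right")
```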
- When the source is a CDC (Change Data Capture) feed, should you use `MERGE INTO` or leverage the Change Data Feed (CDF) feature?
- How can you find the difference between the previous and present commit in a Delta table?
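A sketch of comparing commits (the table name and version numbers are illustrative):

```python
# Time travel: diff two versions of a Delta table
v_prev = spark.read.option("versionAsOf", 5).table("orders")
v_curr = spark.read.option("versionAsOf", 6).table("orders")
v_curr.exceptAll(v_prev).show()

# Or, with CDF enabled, read only the row-level changes between commits
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 6)
           .option("endingVersion", 6)
           .table("orders"))
```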
- What is the best approach for nightly jobs that overwrite a table for the business team with the least latency?
- Hint : Should you write to the table nightly or create a view?
- What does the `OPTIMIZE` command do, and what is the target file size?
- Hint : Focus on the target file size of 1 GB.
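A quick example (the table and column names are illustrative):

```python
# Compact small files toward the target size; ZORDER co-locates data
# on the given column to improve file skipping.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```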
- In a streaming scenario, what does the `.withWatermark` function do with a delay of 10 minutes?
- How does aggregating on the source and then overwriting/appending to the target impact the data load?
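A sketch of a watermarked aggregation (table and column names are illustrative):

```python
from pyspark.sql import functions as F

# The 10-minute watermark tells the engine how long to wait for late events
# before finalizing a window and dropping its state.
agg = (spark.readStream.table("events")
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"))
       .count())
```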
- Why did you receive three email notifications when a job was set to trigger an email if `mean(temp) > 120`?
- Hint : Investigate multiple triggers for the email alert.
- Why should the checkpoint directory be unique for each stream in a streaming job?
- How would you set up an Autoloader scenario to load data into a bronze table with history and update the target table?
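A minimal Auto Loader sketch (paths and the table name are illustrative):

```python
# Incrementally ingest new files into a bronze table, keeping full history.
(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.schemaLocation", "/mnt/schemas/bronze_events")
 .load("/mnt/raw/events")
 .writeStream
 .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
 .outputMode("append")
 .toTable("bronze_events"))
```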
- How can you handle streaming deduplication based on a given code scenario?
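A sketch of streaming deduplication (column names are illustrative):

```python
# The watermark bounds the state kept for deduplication; dropDuplicates then
# discards repeated (event_id, event_time) pairs within that horizon.
deduped = (spark.readStream.table("bronze_events")
           .withWatermark("event_time", "30 minutes")
           .dropDuplicates(["event_id", "event_time"]))
```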
- For batch loading, what happens if the load is set to overwrite or append?
- Hint : Consider the impact on the target table.
- In a Change Data Feed (CDF) scenario, if `readChangeFeed` starts at version 0 and append is used, will there be deduplication?
- How can you identify whether a table is SCD Type 1 or SCD Type 2 based on an upsert operation?
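For the SCD question: a Type 1 upsert overwrites matched rows in place (no history), while Type 2 closes out the old row and inserts a new one. A Type 1 sketch (table names are illustrative):

```python
# SCD Type 1: matched rows are overwritten, so no history is retained.
spark.sql("""
  MERGE INTO dim_customer t
  USING customer_updates s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```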
- To avoid performance issues, should you decrease the trigger interval or not?
- How does Delta Lake decide on file skipping based on columns in a query, and what are the implications for nested columns?
- What does granting "Usage" and "Select" permissions on a Delta table allow a user to do?
- How do you create an unmanaged table in Databricks?
- What makes a date column a good candidate for partitioning in a Delta table?
- What happens in the transaction log when you rename a Delta table using `ALTER TABLE xx RENAME TO xx`?
- How would you handle an error with a CHECK constraint, and what would you recommend?
- When using `DESCRIBE` commands, how can you retrieve table properties, comments, and partition details?
- Hint : Use `DESCRIBE HISTORY`, `DESCRIBE EXTENDED`, or `DESCRIBE DETAIL`.
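A quick reference (the table name is illustrative):

```python
spark.sql("DESCRIBE EXTENDED sales").show(truncate=False)  # columns plus table metadata and comments
spark.sql("DESCRIBE DETAIL sales").show(truncate=False)    # format, location, size, partition columns
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)   # commit history of the Delta table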
- How are file statistics used in Delta Lake, and why are they important?
- In the Ganglia UI, how can you detect a spill during query execution?
- If a repo branch is missing locally, how can you retrieve that branch with the latest code changes?
- After deleting records with a query like `DELETE FROM A WHERE id IN (SELECT id FROM B)`, can you time travel to see the deleted records, and how can you prevent their permanent deletion?
- What are the differences between DBFS and mounts in Databricks?
- If the API `2.0/jobs/create` is executed three times with the same JSON, what will happen? Will it execute once or create three jobs?
- What is DBFS in Databricks?
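A sketch of the scenario, assuming `host`, `token`, and `job_spec` are already defined; `2.0/jobs/create` is not idempotent, so each call creates a new job:

```python
import requests

# Each POST to /api/2.0/jobs/create returns a new job_id,
# even when the JSON payload is identical.
resp = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json()["job_id"])
```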
- How do you install a Python library using `%pip` in a Databricks notebook?
- If Task 1 has downstream Task 2 and Task 3 running in parallel, and Task 1 and Task 2 succeed while Task 3 fails, what will be the final job status?
- Hint : The job may show as partially completed.
- How do you handle streaming job retries in production, specifically with job clusters, unlimited retries, and a maximum of one concurrent run?
- How can you clone an existing job and version it using the Databricks CLI?
- When converting a large JSON file (1TB) to Parquet with a partition size of 512 MB, what is the correct order of steps? Should you read, perform narrow transformations, repartition (2048 partitions), then convert to Parquet?
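A sketch of the ordering and the arithmetic (paths and columns are illustrative): 1 TB / 512 MB ≈ 2048 partitions.

```python
(spark.read.json("/mnt/raw/huge.json")      # read the source
 .select("id", "payload")                   # narrow transformations first
 .repartition(2048)                         # 1 TB / 512 MB ≈ 2048 partitions
 .write.parquet("/mnt/out/huge_parquet"))   # then write as Parquet
```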
- What happens in the target table when duplicates are dropped during a batch read and append operation?
- If a column was missed during profiling from Kafka, how can you ensure that the data is fully replayable in the future?
- Hint : Consider writing to a bronze table.
- How do you handle access control for users in Databricks?
- What is the use of the `pyspark.sql.functions.broadcast` function in a Spark job?
- Hint : It distributes the data to all worker nodes.
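A quick example (the DataFrame names are illustrative):

```python
from pyspark.sql.functions import broadcast

# Ship the small dimension table to every executor so the join avoids a shuffle.
result = orders.join(broadcast(dim_products), "product_id")
```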
- What happens when performing a merge on `orders_id` with the clause `WHEN NOT MATCHED THEN INSERT *`?
- Hint : The operation will insert records that don't have a match.
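A sketch of the clause in context (table names are illustrative):

```python
# Only source rows with no matching orders_id in the target are inserted;
# matched rows are left untouched because no WHEN MATCHED clause is given.
spark.sql("""
  MERGE INTO orders t
  USING orders_updates s
  ON t.orders_id = s.orders_id
  WHEN NOT MATCHED THEN INSERT *
""")
```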
- Given a function definition for loading bronze data, how would you write a silver load function to transform and update downstream tables?
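A hedged sketch of what such a silver load function might look like (the function name, columns, and paths are all illustrative):

```python
def load_silver():
    # Read bronze incrementally, apply transformations, write to the silver table.
    (spark.readStream.table("bronze_events")
     .select("event_id", "event_time", "payload")   # illustrative projection/cleanup
     .writeStream
     .option("checkpointLocation", "/mnt/checkpoints/silver_events")
     .outputMode("append")
     .toTable("silver_events"))
```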
- If the code includes `CASE WHEN is_member("group") THEN email ELSE 'redacted' END AS email`, what will be the output if the user is not a member of the group?
- How can you use the Ganglia UI to view logs and troubleshoot a Databricks job?
- When working with multi-task jobs, how do you list or get the tasks using the API `2.0/jobs/list` or `2.0/jobs/runs/list`?
- What is unit testing, and how is it applied in a Databricks environment?
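A minimal pytest-style sketch, assuming a `spark` session fixture and an illustrative transformation function:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def add_revenue(df: DataFrame) -> DataFrame:
    # Transformation under test: revenue = price * quantity
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))

def test_add_revenue(spark):
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_revenue(df).first()["revenue"] == 6.0
```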
- What happens when multiple `display()` commands are executed repeatedly in development, and what is the impact in production?
- Will the `option("readChangeFeed")` work on a source Delta table with no CDC enabled?
- How can you identify whether a tumbling or sliding window is being used based on the code provided?
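For the window question, the giveaway is the number of arguments to `window()`. A sketch, assuming an existing DataFrame `df` with an `event_time` column:

```python
from pyspark.sql import functions as F

# Tumbling: one duration argument -> non-overlapping 10-minute buckets
tumbling = df.groupBy(F.window("event_time", "10 minutes")).count()

# Sliding: duration + slide -> overlapping 10-minute windows starting every 5 minutes
sliding = df.groupBy(F.window("event_time", "10 minutes", "5 minutes")).count()
```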
- What performance tuning considerations are involved with `spark.sql.files.maxPartitionBytes` and `spark.sql.shuffle.partitions`?
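For reference, the two knobs (the values here are illustrative defaults, not recommendations):

```python
# Upper bound on bytes packed into a single input partition when reading files
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")

# Number of partitions used when shuffling data for joins and aggregations
spark.conf.set("spark.sql.shuffle.partitions", "200")
```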
No matter what, please read these Databricks docs. Note the "Important" callouts on these pages and the questions at the end of some pages.
- Data skipping with Z-order indexes for Delta Lake
- Clone a table on Databricks
- Delta table streaming reads and writes
- Structured Streaming Programming Guide - Spark 3.5.0 Documentation
- Configure Structured Streaming trigger intervals
- Configure Delta Lake to control data file size
- Introducing Stream-Stream Joins in Apache Spark 2.3
- Best Practices for Using Structured Streaming in Production - The Databricks Blog
- What is Auto Loader?
- Upsert into a Delta Lake table using merge
- Use Delta Lake change data feed on Databricks
- Apply watermarks to control data processing thresholds
- Use foreachBatch to write to arbitrary data sinks
- How to Simplify CDC With Delta Lake's Change Data Feed
- VACUUM
- Jobs access control
- Cluster access control
- Secret access control
- Hive metastore privileges and securable objects (legacy)
- Data objects in the Databricks lakehouse
- Constraints on Databricks
- When to partition tables on Databricks
- Manage clusters
- Export and import Databricks notebooks
- Unit testing for notebooks
- Databricks SQL Statement Execution API – Announcing the Public Preview
- Transform data with Delta Live Tables
- Manage data quality with Delta Live Tables
- Simplified change data capture with the APPLY CHANGES API in Delta Live Tables
- Monitor Delta Live Tables pipelines
- Load data with Delta Live Tables
- What is Delta Live Tables?
- Solved: Re: What is the difference between Streaming live ... - Databricks - 17121
- What are all the Delta things in Databricks?
- Parameterized queries with PySpark
- Recover from Structured Streaming query failures with workflows
- Jobs API 2.0
- OPTIMIZE
- Adding and Deleting Partitions in Delta Lake tables
- What is the Databricks File System (DBFS)?
- Mounting cloud object storage on Databricks
- Databricks widgets
- Performance Tuning