Exam: Certified Data Engineer Professional — Topic 1, Question #30 discussion

Actual exam question from Databricks' Certified Data Engineer Professional
Question #: 30
Topic #: 1
[All Certified Data Engineer Professional Questions]

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():

  • A. return spark.readStream.table("bronze")
  • B. return spark.readStream.load("bronze")
  • C.
  • D. return spark.read.option("readChangeFeed", "true").table("bronze")
  • E.
Suggested Answer: D

Comments

AzureDE2522
Highly Voted 7 months, 4 weeks ago
Selected Answer: D
# not providing a starting version/timestamp will result in the latest snapshot being fetched first
spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .table("myDeltaTable")
Please refer: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 5 times
...
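The comment above describes how Delta's change data feed works: each commit creates a new table version that records row-level changes, and a reader can request all changes from a starting version onward. A minimal conceptual sketch in plain Python (not PySpark; the `ChangeFeed` class and its data are hypothetical illustrations, not Delta Lake APIs):

```python
# Conceptual sketch of a change data feed: each table version records
# its change rows, and readers ask for changes from a starting version.
class ChangeFeed:
    def __init__(self):
        self.versions = []  # list of (version, [change rows])

    def commit(self, rows):
        """Append a new table version containing the given change rows."""
        version = len(self.versions)
        self.versions.append((version, rows))
        return version

    def read_changes(self, starting_version=0):
        """Return change rows decorated with CDF-style metadata columns."""
        out = []
        for version, rows in self.versions:
            if version >= starting_version:
                for row in rows:
                    out.append({**row, "_change_type": "insert",
                                "_commit_version": version})
        return out

feed = ChangeFeed()
feed.commit([{"id": 1}])       # version 0: initial load
v1 = feed.commit([{"id": 2}])  # version 1: the nightly batch

# A downstream step reads only the changes it has not processed yet.
new_rows = feed.read_changes(starting_version=v1)
print([r["id"] for r in new_rows])  # → [2]
```

This mirrors why the `startingVersion` option matters in the real CDF read: without tracking which version was last processed, a reader would see the full history again.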
Laraujo2022
Highly Voted 7 months, 3 weeks ago
In my opinion E is not correct because no parameters (year, month, and day) are passed into the function... the definition is simply def new_records():
upvoted 5 times
...
zhiva
Most Recent 1 week, 3 days ago
Selected Answer: A
Both E and A could be correct, but the function definition has no input parameters, so we can't use them in the return statement given only the information in the question. This is why I vote for A.
upvoted 1 times
...
imatheushenrique
1 month ago
The E option makes more sense because the whole partition would be filtered. It can't be one of the options that use CDF, because there is no readChangeFeed option in a DataFrame read.
upvoted 1 times
...
arik90
3 months, 1 week ago
Selected Answer: E
Since the ingest_daily_batch function writes to the "bronze" table in batch mode using spark.read and write operations, we should not use readStream to read from it in the subsequent function.
upvoted 1 times
...
alexvno
3 months, 3 weeks ago
Selected Answer: E
Probably E, but the filename is still not specified, only the folder path.
upvoted 1 times
...
vikram12apr
4 months ago
Selected Answer: E
Please read the question again: it asks to move data from the bronze table to some downstream table. Since it is an append-only nightly job, the filter on the file name will give the new data available in the bronze table that has not yet flowed down the pipeline.
upvoted 2 times
...
agreddy
4 months, 2 weeks ago
D is correct. https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/ CDF can be enabled on a non-streaming Delta table; "delta" is the default table format.
upvoted 1 times
...
ojudz08
4 months, 3 weeks ago
Selected Answer: D
The question here is how to manipulate new records that have not yet been processed to the next table. Since the data has been ingested into the bronze table, you need to check whether the daily ingested data is already in the silver table, so I think the answer is D. Enabling change data feed allows you to track row-level changes between Delta table versions: https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 1 times
...
guillesd
5 months ago
The problem here is that both A and E are correct. E just follows the previous filtering logic, while A uses the readStream method, which will have to maintain a checkpoint. But both can work.
upvoted 1 times
...
DAN_H
5 months, 1 week ago
Selected Answer: A
A, as Structured Streaming incrementally reads Delta tables. While a streaming query is active against a Delta table, new records are processed idempotently as new table versions commit to the source table.
upvoted 3 times
...
adenis
5 months, 1 week ago
Selected Answer: A
A is Correct
upvoted 1 times
...
Jay_98_11
5 months, 3 weeks ago
Selected Answer: E
Can't be D since there is no such read option in CDF. https://docs.databricks.com/en/delta/delta-change-data-feed.html
upvoted 1 times
mht3336
5 months, 2 weeks ago
spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .option("endingVersion", 10) \
    .table("myDeltaTable")
upvoted 1 times
...
...
RafaelCFC
6 months ago
Selected Answer: E
E addresses the desired filtering while keeping with the logic of the first step being a batch job, and it has no code errors.
upvoted 1 times
...
alaverdi
6 months, 3 weeks ago
Selected Answer: A
In my opinion A is the correct answer. You read the Delta table as a stream and process only newly arrived records; this is maintained when writing the stream, with state stored in the checkpoint location.
spark.readStream.table("bronze") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoints/") \
    .toTable("silver")
upvoted 3 times
...
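The checkpoint behavior described in the comment above is what lets a streaming read process each record exactly once: the checkpoint persists the last committed offset, so a later run resumes from where the previous one stopped. A minimal sketch in plain Python (not Structured Streaming; the function, table, and checkpoint here are hypothetical illustrations):

```python
# Conceptual sketch of checkpointed incremental reads: the checkpoint
# stores how far we have read, so each row is processed only once.
def process_new_records(table, checkpoint):
    """Return rows appended since the checkpointed offset, then advance it."""
    offset = checkpoint.get("offset", 0)
    new_rows = table[offset:]          # only records not yet processed
    checkpoint["offset"] = len(table)  # persist progress for the next run
    return new_rows

bronze = [{"id": 1}, {"id": 2}]
ckpt = {}

first = process_new_records(bronze, ckpt)   # initial run sees everything
bronze.append({"id": 3})                    # nightly batch lands
second = process_new_records(bronze, ckpt)  # next run sees only the new row

print([r["id"] for r in first], [r["id"] for r in second])  # → [1, 2] [3]
```

This is the contrast with option E: a partition filter recomputes "what is new" from column values on every run, while a checkpointed stream tracks progress itself.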
chokthewa
8 months, 3 weeks ago
E is correct. D uses an invalid option; refer to the samples in https://docs.databricks.com/en/delta/delta-change-data-feed.html. A and B don't filter, so they will gather the whole table's data. E uses the known values to filter.
upvoted 1 times
...
Community vote distribution
A (35%)
C (25%)
B (20%)
Other
Most Voted
