Exam Professional Data Engineer topic 1 question 296 discussion

Actual exam question from Google's Professional Data Engineer
Question #: 296
Topic #: 1

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?

  • A. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub and write it to BigQuery.
  • B. Use a proxy host in the VPC in Google Cloud to connect to Kafka. Write a Dataflow pipeline that reads data from the proxy host and writes it to BigQuery.
  • C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
  • D. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline that reads the data from Pub/Sub and writes it to BigQuery.
Suggested Answer: B

Comments

Anudeep58
2 days, 1 hour ago
Selected Answer: C
Latency: Option C, with direct integration between Kafka and Dataflow, offers lower latency by eliminating intermediate steps. Flexibility: Custom Dataflow pipelines (Option C) provide more control over data processing and optimization compared to using a pre-built template.
upvoted 1 times
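For concreteness, a direct Kafka-to-BigQuery pipeline along the lines of option C might look like the following Apache Beam (Python) sketch. The broker address, topic, row mapping, and table name are placeholders; ReadFromKafka is Beam's cross-language Kafka transform and needs a Java environment available to expand it.

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming pipeline: on-prem Kafka (reachable over the interconnect)
    # straight into BigQuery, with no Pub/Sub hop in between.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadKafka" >> ReadFromKafka(
                consumer_config={"bootstrap.servers": "onprem-kafka:9092"},  # placeholder broker
                topics=["events"],  # placeholder topic
            )
            # Elements arrive as (key, value) byte tuples; map them to rows
            # matching the target table's schema (placeholder mapping).
            | "ToRow" >> beam.Map(lambda kv: {"payload": kv[1].decode("utf-8")})
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",  # placeholder table
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )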
anushree09
2 months, 3 weeks ago
per the text below at https://cloud.google.com/dataflow/docs/kafka-dataflow - "Alternatively, you might have an existing Kafka cluster that resides outside of Google Cloud. For example, you might have an existing workload that is deployed on-premises or in another public cloud."
upvoted 1 times
Moss2011
4 months, 1 week ago
Selected Answer: C
From my point of view, the best option is C, taking into account this doc: https://cloud.google.com/dataflow/docs/kafka-dataflow
upvoted 2 times
MaxNRG
4 months, 1 week ago
Selected Answer: D
Based on the key requirements highlighted:
  • Interconnect link between GCP and on-prem Kafka
  • High-throughput streaming pipeline
  • Minimal latency
  • Data to be stored in BigQuery
D is the answer. The key reasons it meets the requirements:
  • Kafka Connect provides a reliable bridge to Pub/Sub over the interconnect
  • Reading from Pub/Sub minimizes latency vs. reading directly from Kafka
  • Dataflow provides a high-throughput streaming engine
  • Writing to BigQuery gives scalable data storage
By leveraging these fully managed GCP services over the dedicated interconnect, a low-latency streaming pipeline from on-prem Kafka into BigQuery can be implemented rapidly. Options A/B/C have higher latencies or custom-code requirements, so they do not meet the minimal-latency criterion as well as option D does.
upvoted 2 times
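For contrast, the Dataflow leg of option D, once a Kafka Connect bridge is already mirroring topics into Pub/Sub, could be sketched like this in Beam Python. The subscription, parsing logic, and table name are placeholders.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # The Kafka Connect bridge has already pushed messages into
            # Pub/Sub; the pipeline only drains the subscription.
            | "ReadPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/kafka-events"  # placeholder
            )
            # Assumes JSON-encoded Kafka values (placeholder parsing).
            | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.events",  # placeholder table
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )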
MaxNRG
4 months, 1 week ago
Why choose option D over A? The key advantage with option D is that by writing a custom Dataflow pipeline rather than using a Google-provided template, there is more flexibility to customize performance tuning and optimization for the lowest latency. Some potential optimizations:
  • Fine-tuning the number of workers and machine types to meet specific throughput targets
  • Custom data parsing/processing logic if applicable
  • Experimenting with autoscaling parameters or triggers
upvoted 1 times
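As a rough illustration, the knobs mentioned above map to standard Dataflow pipeline options; the values below are placeholders, not recommendations.

    from apache_beam.options.pipeline_options import PipelineOptions

    # Worker sizing and autoscaling options for a Dataflow streaming job.
    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="my-project",                 # placeholder project
        region="us-central1",
        temp_location="gs://my-bucket/tmp",   # placeholder bucket
        num_workers=5,                        # initial worker count
        max_num_workers=20,                   # autoscaling ceiling
        machine_type="n2-standard-4",         # per-worker machine type
        autoscaling_algorithm="THROUGHPUT_BASED",
    )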
MaxNRG
4 months, 1 week ago
The Google template may be easier to set up initially, but a custom pipeline provides more control over optimizations for the low-latency requirement stated in the question. That being said, option A would still work reasonably well; option D just allows squeezing out that extra bit of performance through precise tuning if low-millisecond latency is absolutely critical. In summary, option A is easier to implement, but option D provides more optimization flexibility for ultra-low-latency streaming requirements.
upvoted 1 times
MaxNRG
4 months, 1 week ago
Why not C: at first, option C (using a Dataflow pipeline to directly read from Kafka and write to BigQuery) seems reasonable. However, the key requirement stated in the question is minimal latency for the streaming pipeline. Reading directly from Kafka within Dataflow can add latency and processing overhead compared to reading from Pub/Sub, for a few reasons:
1. Pub/Sub acts as a buffer and handles scaling/reliability of streaming data automatically. This reduces the processing burden on the pipeline.
2. Network latency can be lower by leveraging Pub/Sub instead of making constant pull requests for data from Kafka within the streaming pipeline.
3. Any failures have to be handled within the pipeline code itself when reading directly from Kafka. With Pub/Sub, reliability is built in.
upvoted 2 times
SanjeevRoy91
3 months, 1 week ago
You are adding an intermediate hop (Pub/Sub) between on-prem Kafka and Dataflow. Why wouldn't this add additional latency?
upvoted 2 times
MaxNRG
4 months, 1 week ago
So in summary, while option C is technically possible, option D introduces Pub/Sub as a streaming buffer, which reduces overall latency for the pipeline and better satisfies the key requirement of minimal latency.
upvoted 1 times
JyoGCP
4 months, 2 weeks ago
A vs. C: not sure which one would have lower latency. Points related to option C: "Yes, Dataflow can read events from Kafka. Dataflow is a fully managed, serverless streaming analytics service that supports both batch and stream processing. It can read events from Kafka, process them, and write the results to a BigQuery table for further analysis." "Dataflow supports Kafka; support was added to Apache Beam in 2016. Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."
upvoted 1 times
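For reference, the Google-provided Kafka-to-BigQuery flex template quoted above is launched along these lines. The gcloud flags are standard, but the template parameter names vary across template versions, so treat them as indicative and check the template documentation; the broker, topic, and table values are placeholders.

    gcloud dataflow flex-template run kafka-to-bq \
        --region=us-central1 \
        --template-file-gcs-location=gs://dataflow-templates/latest/flex/Kafka_to_BigQuery \
        --parameters=bootstrapServers=onprem-kafka:9092,inputTopics=events,outputTableSpec=my-project:my_dataset.events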
JyoGCP
4 months, 2 weeks ago
Going with C
upvoted 1 times
DarkLord2104
4 months, 2 weeks ago
Final???
upvoted 1 times
T2Clubber
5 months ago
Selected Answer: C
Option C makes more sense to me because of the "as little latency as possible" requirement. I would have chosen option A if it were "as little CODING as possible".
upvoted 3 times
Matt_108
5 months, 3 weeks ago
Selected Answer: A
Option A: leverage the Dataflow template for Kafka. https://cloud.google.com/dataflow/docs/kafka-dataflow
upvoted 3 times
AllenChen123
5 months, 2 weeks ago
Agree. "Google provides a Dataflow template that configures a Kafka-to-BigQuery pipeline. The template uses the BigQueryIO connector provided in the Apache Beam SDK."
upvoted 1 times
ML6
4 months, 2 weeks ago
But it includes setting up a Kafka Connect bridge even though an interconnect link has already been set up. https://cloud.google.com/dataflow/docs/kafka-dataflow#connect_to_an_external_cluster
upvoted 1 times
scaenruy
6 months ago
Selected Answer: C
C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
upvoted 4 times
rahulvin
6 months, 1 week ago
Selected Answer: C
Dataflow has templates to read from Kafka. The other options are too complicated. https://cloud.google.com/dataflow/docs/kafka-dataflow
upvoted 3 times
Sofiia98
5 months, 4 weeks ago
So is the answer A? Why C?
upvoted 2 times
Matt_108
5 months, 3 weeks ago
Yeah, the answer is A. C requires you to develop the pipeline yourself and ensure minimal latency, which means you would have to perform better than a pre-built template from Google... not really the case most of the time :)
upvoted 1 times
saschak94
5 months, 1 week ago
But option A introduces additional replication into Pub/Sub, and the question says "with minimal latency". In my opinion, subscribing to Kafka via Dataflow has lower latency than first replicating the messages to Pub/Sub and subscribing to it with Dataflow.
upvoted 6 times
Community vote distribution: A (35%), C (25%), B (20%), Other
