
Persistent Data Engineering Interview Q&A

✍️ By MONU SINGH | 11/18/2025


1. Explain broadcast join in PySpark.

 

Answer:

A broadcast join is used when one of the DataFrames is small enough to fit in the memory of each executor. Spark sends a copy of the smaller DataFrame to every executor, so the larger DataFrame does not need to be shuffled across the cluster.

Benefits:

• Reduces data movement

• Faster than shuffle joins

• Ideal when joining a large dataset with a small lookup table

Example:

from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "id")
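Spark can also broadcast a small table automatically when its estimated size falls below the spark.sql.autoBroadcastJoinThreshold setting. A minimal sketch (the 50 MB value is only an illustrative choice, not a recommendation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Tables whose estimated size is below this threshold are broadcast
# automatically (Spark's default is 10 MB); setting -1 disables auto-broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)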

 

 

 

2. How to create a rank column using the Window function in PySpark?

 

Answer:

Use a Window specification together with the rank() or dense_rank() function to create a rank column.

Example:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
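If tied salaries should not create gaps in the numbering, dense_rank() can be used with the same window specification. A minimal sketch reusing the df and windowSpec from the example above:

from pyspark.sql.functions import dense_rank

# dense_rank() assigns consecutive ranks with no gaps after ties,
# unlike rank(), which skips numbers following tied rows.
df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()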

 

 

 

3. What is the binary copy method in ADF, and when is it used?

 

Answer:

Binary copy in ADF transfers files as-is, without parsing or transformation. It’s useful when:

• File formats are unknown or unsupported

• You want to move images, zip files, videos, etc.

• You need fast point-to-point transfer between file-based systems

You can enable binary copy by using the Binary format for the source and sink datasets in the Copy activity.

 

 

 

4. How do you monitor and optimize performance in Azure Synapse?

 

Answer:

Monitoring tools:

• Monitor Hub in Synapse Studio

• DMVs (Dynamic Management Views) for query insights

• SQL activity logs

Optimization tips:

• Use result set caching

• Choose proper distribution methods (hash, round-robin)

• Use materialized views

• Avoid excessive shuffling

• Partition large tables appropriately (see the PySpark sketch below)
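The SQL-pool tips above are configuration choices rather than code, but the last one also applies when working in Synapse Spark pools. A minimal PySpark sketch of a partitioned write, assuming a DataFrame df with year and month columns and a hypothetical ADLS Gen2 path:

# Writing the table partitioned by year and month keeps file pruning
# effective for queries that filter on those columns.
# The abfss:// path below is a hypothetical placeholder.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("abfss://data@myaccount.dfs.core.windows.net/sales/"))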

 

 

 

5. Write Python code to identify duplicates in a list and count their occurrences.

Answer:

from collections import Counter

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)

duplicates = {item: count for item, count in counter.items() if count > 1}
print(duplicates)

Output:

{'apple': 3, 'banana': 2}

 

 

 

6. What are the key features of Azure DevOps?

 

Answer:

• Version control with Git (Azure Repos)

• Pipelines for CI/CD

• Boards for Agile project tracking

• Artifacts for package management

• Test Plans for automated/manual testing

• Integration with tools like GitHub, Slack, VS Code

 

 

 

7. How do you handle schema drift in ADF?

 

Answer:

Schema drift refers to changes in the source schema over time (e.g., columns added, removed, or changed).

How to handle:

• Enable “Allow schema drift” in mapping data flows

• Use auto-mapping or dynamic column handling

• Use parameterized datasets to pass schema info

• Combine with schema projection for flexibility

 

 

 

8. Explain the concept of denormalization and when it should be used.

 

Answer:

Denormalization is the process of combining normalized tables into fewer tables to improve read performance.

When to use:

• In OLAP systems or data warehouses

• When query speed is more important than storage efficiency

• For simplifying complex joins

Pros: Faster queries, easier reporting

Cons: Data redundancy, maintenance overhead 
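As a concrete illustration, denormalization often amounts to pre-joining a dimension into a fact table so reports read one wide table instead of joining at query time. A minimal PySpark sketch with hypothetical orders and customers tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalization-demo").getOrCreate()

# Hypothetical normalized tables: an orders fact table and a customer dimension.
orders_df = spark.createDataFrame(
    [(1, 101, 250.0), (2, 102, 99.0)],
    ["order_id", "customer_id", "amount"],
)
customers_df = spark.createDataFrame(
    [(101, "Alice", "NY"), (102, "Bob", "CA")],
    ["customer_id", "name", "state"],
)

# Denormalize: fold the customer attributes into the fact table.
denormalized_df = orders_df.join(customers_df, "customer_id", "left")
denormalized_df.show()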
