Persistent Data Engineering Interview Q&A
✍️ By MONU SINGH | 11/18/2025
1. Explain broadcast join in PySpark.
Answer:
A broadcast join is used when one of the DataFrames is small enough to fit in memory. Spark broadcasts the smaller DataFrame to all executors, avoiding a full shuffle operation.
Benefits:
• Reduces data movement
• Faster than shuffle joins
• Ideal when joining a large dataset with a small lookup table
Example:
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "id")
2. How do you create a rank column using the Window function in PySpark?
Answer:
Use Window with the rank() or dense_rank() function to create a rank column.
Example:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
3. What is the binary copy method in ADF, and when is it used?
Answer:
Binary copy in ADF transfers files as-is, without parsing or transformation. It's useful when:
• File formats are unknown or unsupported
• You want to move images, zip files, videos, etc.
• You need fast point-to-point transfer between file-based systems
You can enable binary copy by selecting the "Binary copy" option in the Copy Data tool, or by using Binary-format datasets as the source and sink of a Copy activity.
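As a rough sketch of what this looks like in ADF's JSON authoring view, a Binary-format dataset can be defined along these lines (the dataset, linked service, and container names here are made-up placeholders):

```json
{
  "name": "BinarySourceDataset",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "SourceBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input"
      }
    }
  }
}
```

A Copy activity that uses Binary datasets on both source and sink moves the bytes unchanged, which is why no schema or column mapping is configured on it.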
4. How do you monitor and optimize performance in Azure Synapse?
Answer:
Monitoring tools:
• Monitor Hub in Synapse Studio
• DMVs (Dynamic Management Views) for query insights
• SQL activity logs
Optimization tips:
• Use result set caching
• Choose proper distribution methods (hash, round-robin)
• Use materialized views
• Avoid excessive shuffling
• Partition large tables appropriately
5. Write Python code to identify duplicates in a list and count their occurrences.
Answer:
from collections import Counter

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
duplicates = {item: count for item, count in counter.items() if count > 1}
print(duplicates)

Output:
{'apple': 3, 'banana': 2}
6. What are the key features of Azure DevOps?
Answer:
• Version control with Git
• Pipelines for CI/CD
• Boards for Agile project tracking
• Artifacts for package management
• Test Plans for automated/manual testing
• Integration with tools like GitHub, Slack, VS Code
7. How do you handle schema drift in ADF?
Answer:
Schema drift = changes in the source schema over time (e.g., added/removed columns).
How to handle:
• Enable "Allow schema drift" in mapping data flows
• Use auto-mapping or dynamic column handling
• Use parameterized datasets to pass schema info
• Combine with schema projection for flexibility
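In the data flow script behind a mapping data flow, schema drift is a per-source flag; a minimal sketch (the output stream name is illustrative):

```
source(
    allowSchemaDrift: true,
    validateSchema: false
) ~> DriftTolerantSource
```

With allowSchemaDrift enabled, columns that appear in the source but are not defined in the projection still flow through, and downstream transformations can address them dynamically instead of by a fixed column list.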
8. Explain the concept of denormalization and when it should be used.
Answer:
Denormalization is the process of combining normalized tables into fewer tables to improve read performance.
When to use:
• In OLAP systems or data warehouses
• When query speed is more important than storage efficiency
• For simplifying complex joins
Pros: faster queries, easier reporting
Cons: data redundancy, maintenance overhead