Persistent Data Engineering Interview Q&A
✍️ By MONU SINGH | 11/18/2025
1. Explain broadcast join in PySpark.
Answer:
A broadcast join is used when one of the DataFrames is small enough to fit in memory. Spark broadcasts the smaller DataFrame to all executors, avoiding a full shuffle operation.
Benefits:
• Reduces data movement
• Faster than shuffle joins
• Ideal when joining a large dataset with a small lookup table
Example:
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "id")
2. How do you create a rank column using the Window function in PySpark?
Answer:
Use Window with the rank() or dense_rank() function to create a rank column.
Example:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
3. What is the binary copy method in ADF, and when is it used?
Answer:
Binary copy in ADF transfers files as-is, without parsing or transformation. It's useful when:
• File formats are unknown or unsupported
• You want to move images, zip files, videos, etc.
• You need fast point-to-point transfer between file-based systems
You can enable binary copy by selecting the "Binary copy" option in the Copy Data tool, or by using Binary-format datasets as the source and sink of a Copy activity.
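As a rough sketch of what this looks like in ADF's JSON authoring view, a Binary-format dataset can be defined along these lines (the dataset, linked service, and container names here are made-up placeholders):

```json
{
  "name": "BinarySourceDataset",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "SourceBlobStorage",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input"
      }
    }
  }
}
```

A Copy activity that uses Binary datasets on both source and sink moves the bytes unchanged, which is why no schema or column mapping is configured on it.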
4. How do you monitor and optimize performance in Azure Synapse?
Answer:
Monitoring tools:
• Monitor Hub in Synapse Studio
• DMVs (Dynamic Management Views) for query insights
• SQL activity logs
Optimization tips:
• Use result set caching
• Choose proper distribution methods (hash, round-robin)
• Use materialized views
• Avoid excessive shuffling
• Partition large tables appropriately
5. Write Python code to identify duplicates in a list and count their occurrences.
Answer:
from collections import Counter

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
duplicates = {item: count for item, count in counter.items() if count > 1}
print(duplicates)

Output:
{'apple': 3, 'banana': 2}
6. What are the key features of Azure DevOps?
Answer:
• Version control with Git
• Pipelines for CI/CD
• Boards for Agile project tracking
• Artifacts for package management
• Test Plans for automated/manual testing
• Integration with tools like GitHub, Slack, VS Code
7. How do you handle schema drift in ADF?
Answer:
Schema drift = changes in the source schema over time (e.g., added/removed columns).
How to handle:
• Enable "Allow schema drift" in mapping data flows
• Use auto-mapping or dynamic column handling
• Use parameterized datasets to pass schema info
• Combine with schema projection for flexibility
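In the data flow script behind a mapping data flow, schema drift is a per-source flag; a minimal sketch (the output stream name is illustrative):

```
source(
    allowSchemaDrift: true,
    validateSchema: false
) ~> DriftTolerantSource
```

With allowSchemaDrift enabled, columns that appear in the source but are not defined in the projection still flow through, and downstream transformations can address them dynamically instead of by a fixed column list.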
8. Explain the concept of denormalization and when it should be used.
Answer:
Denormalization is the process of combining normalized tables into fewer tables to improve read performance.
When to use:
• In OLAP systems or data warehouses
• When query speed is more important than storage efficiency
• For simplifying complex joins
Pros: faster queries, easier reporting
Cons: data redundancy, maintenance overhead