PwC Data Engineering Interview Q & A
✍️ By MONU SINGH | 11/18/2025
Here are 8 important questions and answers asked in the PwC Data Engineering interview.
1. Write PySpark code to join two DataFrames and perform aggregation.
Answer:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName("JoinAgg").getOrCreate()

# Sample data
df1 = spark.createDataFrame([
    (1, "Sales", 1000),
    (2, "Marketing", 1500),
    (3, "Sales", 2000)
], ["id", "department", "salary"])

df2 = spark.createDataFrame([
    (1, "John"),
    (2, "Alice"),
    (3, "Bob")
], ["id", "name"])

# Join on the common "id" column, then aggregate salary by department
joined_df = df1.join(df2, "id")
agg_df = joined_df.groupBy("department").agg(sum("salary").alias("total_salary"))
agg_df.show()
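When one side of the join is small (like df2 here), a broadcast join avoids shuffling the larger DataFrame. A minimal sketch, reusing the DataFrames above:

from pyspark.sql.functions import broadcast

# Broadcast the small lookup DataFrame so the join happens map-side,
# avoiding a shuffle of the larger df1
joined_broadcast = df1.join(broadcast(df2), "id")
joined_broadcast.groupBy("department").agg(sum("salary").alias("total_salary")).show()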
2. What is the difference between wide and narrow transformations in Spark?
Answer:
• Narrow transformation:
  • Each input partition contributes to only one output partition.
  • Examples: map(), filter(), union()
  • No shuffle, so it is faster.
• Wide transformation:
  • Data must be shuffled across partitions.
  • Examples: groupBy(), join(), reduceByKey()
  • Involves a shuffle, so it is more expensive in terms of performance.
The sketch below shows one of each.
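A minimal sketch, reusing df1 from question 1: filter() is narrow and runs entirely within each partition, while groupBy() is wide and triggers a shuffle.

from pyspark.sql.functions import sum

# Narrow: filter() -- each output partition depends on one input partition
high_paid = df1.filter(df1.salary > 1200)

# Wide: groupBy() -- rows with the same key must move to the same partition
by_dept = df1.groupBy("department").agg(sum("salary").alias("total_salary"))

# Only the wide transformation's physical plan contains an Exchange (shuffle)
high_paid.explain()
by_dept.explain()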
3. How do you integrate Azure Synapse Analytics with other Azure services?
Answer:
Azure Synapse integrates with:
• ADLS Gen2: store and access data via linked services and workspaces (a read example follows this list).
• Azure Data Factory: use ADF pipelines to load data into Synapse.
• Power BI: real-time reporting and dashboards directly on Synapse data.
• Azure Key Vault: manage credentials securely.
• Azure ML: run ML models on Synapse using Spark pools or SQL analytics.
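A minimal sketch of the ADLS Gen2 path, run inside a Synapse Spark pool where a spark session already exists; the storage account, container, and folder names are hypothetical placeholders:

# Read Parquet data directly from ADLS Gen2 over the abfss:// scheme.
# "sales" (container), "contosoadls" (account), and the path are placeholders.
df = spark.read.parquet(
    "abfss://sales@contosoadls.dfs.core.windows.net/raw/orders/"
)
df.show(5)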
4. How do you monitor and troubleshoot data pipelines in Azure Data Factory?
Answer:
• Use the Monitoring tab in the ADF UI for real-time run history.
• Check activity run details (duration, errors, input/output).
• Enable diagnostic logging to Log Analytics.
• Set up alerts and metrics in Azure Monitor.
• Implement custom logging in pipelines (store status/errors in log tables).
• Use integration runtime logs to analyze data movement issues.
Pipeline runs can also be queried programmatically, as sketched below.
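A hedged sketch using the azure-mgmt-datafactory and azure-identity packages; the subscription, resource group, and factory names are placeholders, and model names may vary slightly by SDK version:

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers -- replace with your own
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Query pipeline runs from the last 24 hours
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = client.pipeline_runs.query_by_factory(resource_group, factory_name, filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_id)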
5. Write Python code to generate prime numbers.
Answer:
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

# Generate the first 10 primes
primes = []
num = 2
while len(primes) < 10:
    if is_prime(num):
        primes.append(num)
    num += 1

print(primes)
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
6. How do you optimize Python code for better performance?
Answer:
• Use built-in functions and list comprehensions.
• Avoid unnecessary loops; prefer vectorized operations using libraries like NumPy or Pandas.
• Use generators instead of lists for memory efficiency.
• Profile code using cProfile or line_profiler.
• Use multi-threading for I/O-bound tasks and multi-processing for CPU-bound tasks.
• Use efficient data structures (set or dict over list when appropriate).
A short sketch illustrating a few of these points follows.
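A minimal sketch contrasting a generator expression with an intermediate list, and a list lookup with a set lookup:

import timeit

data = list(range(100_000))

# Built-in sum with a generator expression: no intermediate list is built
total = sum(x * x for x in data)

# Membership tests: O(n) on a list vs. O(1) average on a set
data_set = set(data)
list_time = timeit.timeit(lambda: 99_999 in data, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in data_set, number=1_000)
print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")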
7. Explain the concept of list comprehensions and provide an example.
Answer:
A list comprehension is a concise way to create lists.

# Example: create a list of squares from 1 to 5
squares = [x**2 for x in range(1, 6)]
print(squares)
# Output: [1, 4, 9, 16, 25]

It's equivalent to:

squares = []
for x in range(1, 6):
    squares.append(x**2)

List comprehensions are faster and more readable than the equivalent loop.
8. How do you implement disaster recovery and backup strategies for data in Azure?
Answer:
• Use geo-redundant storage (GRS) for automatic replication across regions.
• Enable Azure Backup and a Recovery Services vault.
• Schedule regular backups for Azure SQL, Synapse, and Blob/ADLS.
• Use soft delete and point-in-time restore features (a sketch of enabling soft delete follows this list).
• Implement RA-GRS (read-access geo-redundant storage) for high availability.
• Document and test a DR plan using Azure Site Recovery.
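A hedged sketch of enabling blob soft delete with the azure-storage-blob SDK; the connection string is a placeholder and the 14-day retention window is an example choice, not a recommendation from the source:

from azure.storage.blob import BlobServiceClient, RetentionPolicy

# Placeholder connection string -- replace with your own
service = BlobServiceClient.from_connection_string("<connection-string>")

# Keep deleted blobs recoverable for 14 days (retention period is an assumption)
service.set_service_properties(
    delete_retention_policy=RetentionPolicy(enabled=True, days=14)
)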