PwC Data Engineering Interview Q & A
✍️ By MONU SINGH | 11/18/2025
Here are 8 important questions and answers asked in the PwC Data Engineering interview.
1. Write PySpark code to join two DataFrames and perform aggregation.
Answer:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName("JoinAgg").getOrCreate()

# Sample data
df1 = spark.createDataFrame([
    (1, "Sales", 1000),
    (2, "Marketing", 1500),
    (3, "Sales", 2000)
], ["id", "department", "salary"])

df2 = spark.createDataFrame([
    (1, "John"),
    (2, "Alice"),
    (3, "Bob")
], ["id", "name"])

# Join on the common "id" column, then aggregate salary by department
joined_df = df1.join(df2, "id")
agg_df = joined_df.groupBy("department").agg(sum("salary").alias("total_salary"))
agg_df.show()
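When one side of the join is small (like df2 here), a broadcast join avoids shuffling the larger DataFrame. A minimal sketch, reusing the DataFrames above:

from pyspark.sql.functions import broadcast

# Broadcast the small lookup DataFrame so the join happens map-side,
# avoiding a shuffle of the larger df1
joined_broadcast = df1.join(broadcast(df2), "id")
joined_broadcast.groupBy("department").agg(sum("salary").alias("total_salary")).show()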
2. What is the difference between wide and narrow transformations in Spark?
Answer:
• Narrow transformation:
  • Each input partition contributes to only one output partition.
  • Examples: map(), filter(), union()
  • No shuffle, so it is faster.
• Wide transformation:
  • Data must be shuffled across partitions.
  • Examples: groupBy(), join(), reduceByKey()
  • Involves a shuffle, so it is more expensive in terms of performance.
The sketch below shows one of each.
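A minimal sketch, reusing df1 from question 1: filter() is narrow and runs entirely within each partition, while groupBy() is wide and triggers a shuffle.

from pyspark.sql.functions import sum

# Narrow: filter() -- each output partition depends on one input partition
high_paid = df1.filter(df1.salary > 1200)

# Wide: groupBy() -- rows with the same key must move to the same partition
by_dept = df1.groupBy("department").agg(sum("salary").alias("total_salary"))

# Only the wide transformation's physical plan contains an Exchange (shuffle)
high_paid.explain()
by_dept.explain()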
3. How do you integrate Azure Synapse Analytics with other Azure services?
Answer:
Azure Synapse integrates with:
• ADLS Gen2: store and access data via linked services and workspaces (a read example follows this list).
• Azure Data Factory: use ADF pipelines to load data into Synapse.
• Power BI: real-time reporting and dashboards directly on Synapse data.
• Azure Key Vault: manage credentials securely.
• Azure ML: run ML models on Synapse using Spark pools or SQL analytics.
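A minimal sketch of the ADLS Gen2 path, run inside a Synapse Spark pool where a spark session already exists; the storage account, container, and folder names are hypothetical placeholders:

# Read Parquet data directly from ADLS Gen2 over the abfss:// scheme.
# "sales" (container), "contosoadls" (account), and the path are placeholders.
df = spark.read.parquet(
    "abfss://sales@contosoadls.dfs.core.windows.net/raw/orders/"
)
df.show(5)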
4. How do you monitor and troubleshoot data pipelines in Azure Data Factory?
Answer:
• Use the Monitoring tab in the ADF UI for real-time run history.
• Check activity run details (duration, errors, input/output).
• Enable diagnostic logging to Log Analytics.
• Set up alerts and metrics in Azure Monitor.
• Implement custom logging in pipelines (store status/errors in log tables).
• Use integration runtime logs to analyze data movement issues.
Pipeline runs can also be queried programmatically, as sketched below.
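A hedged sketch using the azure-mgmt-datafactory and azure-identity packages; the subscription, resource group, and factory names are placeholders, and model names may vary slightly by SDK version:

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# Placeholder identifiers -- replace with your own
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Query pipeline runs from the last 24 hours
filters = RunFilterParameters(
    last_updated_after=datetime.now(timezone.utc) - timedelta(days=1),
    last_updated_before=datetime.now(timezone.utc),
)
runs = client.pipeline_runs.query_by_factory(resource_group, factory_name, filters)
for run in runs.value:
    print(run.pipeline_name, run.status, run.run_id)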
5. Write Python code to generate prime numbers.
Answer:
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

# Generate the first 10 primes
primes = []
num = 2
while len(primes) < 10:
    if is_prime(num):
        primes.append(num)
    num += 1

print(primes)
# Output: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
6. How do you optimize Python code for better performance?
Answer:
• Use built-in functions and list comprehensions.
• Avoid unnecessary loops; prefer vectorized operations using libraries like NumPy or Pandas.
• Use generators instead of lists for memory efficiency.
• Profile code using cProfile or line_profiler.
• Use multi-threading for I/O-bound tasks and multi-processing for CPU-bound tasks.
• Use efficient data structures (set or dict over list when appropriate).
A short sketch illustrating a few of these points follows.
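A minimal sketch contrasting a generator expression with an intermediate list, and a list lookup with a set lookup:

import timeit

data = list(range(100_000))

# Built-in sum with a generator expression: no intermediate list is built
total = sum(x * x for x in data)

# Membership tests: O(n) on a list vs. O(1) average on a set
data_set = set(data)
list_time = timeit.timeit(lambda: 99_999 in data, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in data_set, number=1_000)
print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")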
7. Explain the concept of list comprehensions and provide an example.
Answer:
A list comprehension is a concise way to create lists.

# Example: create a list of squares from 1 to 5
squares = [x**2 for x in range(1, 6)]
print(squares)
# Output: [1, 4, 9, 16, 25]

It's equivalent to:

squares = []
for x in range(1, 6):
    squares.append(x**2)

List comprehensions are faster and more readable than the equivalent loop.
8. How do you implement disaster recovery and backup strategies for data in Azure?
Answer:
• Use geo-redundant storage (GRS) for automatic replication across regions.
• Enable Azure Backup and a Recovery Services vault.
• Schedule regular backups for Azure SQL, Synapse, and Blob/ADLS.
• Use soft delete and point-in-time restore features (a sketch of enabling soft delete follows this list).
• Implement RA-GRS (read-access geo-redundant storage) for high availability.
• Document and test a DR plan using Azure Site Recovery.
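A hedged sketch of enabling blob soft delete with the azure-storage-blob SDK; the connection string is a placeholder and the 14-day retention window is an example choice, not a recommendation from the source:

from azure.storage.blob import BlobServiceClient, RetentionPolicy

# Placeholder connection string -- replace with your own
service = BlobServiceClient.from_connection_string("<connection-string>")

# Keep deleted blobs recoverable for 14 days (retention period is an assumption)
service.set_service_properties(
    delete_retention_policy=RetentionPolicy(enabled=True, days=14)
)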