Tiger Analytics Q & A

✍️ By MONU SINGH | 11/18/2025

  

These are all the questions asked in Tiger Analytics' Data Engineering Interview.



1. Explain lazy evaluation in PySpark.

ANS:-

Lazy evaluation means transformations (like map, filter) are not executed immediately. Instead, they’re only evaluated when an action (like collect, count) is triggered. This approach optimizes execution by minimizing data passes and enabling the Spark engine to build an efficient execution plan.
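For illustration, a minimal sketch (assuming an existing SparkSession named spark and a hypothetical CSV path):

df = spark.read.csv("/data/sales.csv", header=True)   # hypothetical path
filtered = df.filter(df["amount"] > 100)               # transformation: nothing runs yet
selected = filtered.select("product", "amount")        # still only building the plan
print(selected.count())                                # action: Spark now executes the optimized plan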

 


2. How does caching work in PySpark?


ANS:-

When you cache a DataFrame using .cache() or .persist(), Spark stores it in memory (spilling to disk if needed) so repeated actions can reuse the same data instead of recomputing it. It's useful when the same dataset is used multiple times.
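A minimal sketch, assuming a SparkSession named spark and a hypothetical Parquet path:

df = spark.read.parquet("/data/events")      # hypothetical path
df.cache()                                   # marks df for caching; nothing is stored yet
df.count()                                   # first action materializes the cache
df.filter(df["type"] == "click").count()     # reuses the cached data instead of re-reading
df.unpersist()                               # release the cache when no longer needed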

 


3. What is the difference between wide and narrow transformations?

ANS:-

Narrow transformations (e.g., map, filter) operate within a single partition; no shuffling is required.

Wide transformations (e.g., reduceByKey, join) shuffle data across nodes and are therefore more expensive.
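For example, assuming a DataFrame df with amount and region columns:

narrow = df.filter(df["amount"] > 0)           # narrow: each partition is processed independently
wide = df.groupBy("region").sum("amount")      # wide: data is shuffled across the cluster by key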

 

 

4. How do you optimize query performance in Azure SQL Database?

ANS:-

• Create proper indexes (clustered, non-clustered)

• Use query hints and execution plans

• Avoid SELECT *

• Optimize with partitioning and statistics updates

• Monitor via Query Performance Insight and DMVs

 

5. Describe the process of integrating ADF with Azure Synapse Analytics.

ANS:-

• Use Linked Services in ADF to connect to Synapse

• Create pipelines to move or transform data

• Run stored procedures, execute Spark notebooks, or use the Copy activity to load data

• Monitor with ADF's Monitor tab



6. How do you handle schema evolution in Azure Data Lake?

ANS:-

Use formats like Delta Lake or Parquet that support schema evolution. During ingestion, tools like ADF or Databricks Auto Loader can be configured to merge schema changes (the mergeSchema option).
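A minimal sketch of a schema-merging write to Delta Lake (the DataFrame name and path are placeholders):

(new_df.write
    .format("delta")
    .option("mergeSchema", "true")    # allow new columns to be added to the target schema
    .mode("append")
    .save("/mnt/datalake/sales_delta"))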


7. Write a SQL query to find the nth highest salary in a table.

ANS:-

-- Replace N with the desired rank (e.g., N = 2 for the second-highest salary)
SELECT DISTINCT salary
FROM employees e1
WHERE N - 1 = (
    SELECT COUNT(DISTINCT salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);

 

 

8. How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?

ANS:-
• Use Azure DevOps/GitHub for source control

• Integrate ADF with Git repository

• Use ARM templates for ADF deployment

• Use Databricks Repos, notebook export/import, and databricks-cli

• Use release pipelines for deployment automation


 

9. Write PySpark code to calculate the total sales for each product category.

ANS:-
from pyspark.sql import functions as F

df.groupBy("category").agg(F.sum("sales").alias("total_sales")).show()


 

10. Explain how broadcast joins improve performance in PySpark.

ANS:-

Broadcast joins send a small dataset to all worker nodes, avoiding costly shuffles. Use broadcast() when one of the tables is small enough to fit in memory:

from pyspark.sql.functions import broadcast

df.join(broadcast(small_df), "id")

 

 

11. Describe the role of the driver and executor in Spark architecture.

ANS:-

• Driver: coordinates the Spark application, maintains metadata, and builds the DAG of stages and tasks

• Executors: run tasks on worker nodes, perform the computations, and return results to the driver



12. How do you manage and monitor ADF pipeline performance?



ANS:-

• Use Monitor tab for activity runs, trigger runs

• Enable logging via Log Analytics

• Use Activity Duration and Output metrics

• Implement retry policies, alerts, and timeouts



13. Write a SQL query to find employees with salaries greater than the department average.

 

ANS:-

SELECT e.*
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;

 

 

 

14. Explain the concept of Delta Lake and its advantages.

ANS:-

Delta Lake is an open-source storage layer, built on Parquet files, that brings ACID transactions, time travel, schema evolution, and safe concurrent writes to data lakes.
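For example, time travel lets you read an earlier version of a table (hypothetical path):

df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/orders")   # version 0 of the table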



15. How do you implement schema drift handling in ADF?


ANS:-

Enable “Auto Mapping” and check “Allow schema drift” in Copy Activity. Use dynamic column mapping when the schema can vary over time.



16. Write Python code to check if a number is a palindrome.


ANS:-

def is_palindrome(n):
    return str(n) == str(n)[::-1]

print(is_palindrome(121))  # True




 

17. What is the significance of Z-ordering in Delta tables?

ANS:-

Z-ordering organizes data to improve query performance by clustering related data together. It reduces I/O during filtering and improves data skipping.
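On Databricks, Z-ordering is applied with the OPTIMIZE command; the table and column names here are placeholders:

spark.sql("OPTIMIZE sales_delta ZORDER BY (customer_id)")   # cluster files by customer_id for better data skipping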



18. How do you handle incremental data load in Databricks?

ANS:-

• Use watermarking or timestamp columns

• Filter records where last_updated > last_processed

• Use merge (upsert) logic with Delta Lake
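For the merge (upsert) step, a minimal sketch using the Delta Lake Python API (the path, key column, and updates_df DataFrame are assumptions):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/customers")   # hypothetical target table
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")             # match on the business key
    .whenMatchedUpdateAll()                                   # update existing rows
    .whenNotMatchedInsertAll()                                # insert new rows
    .execute())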

 

 

 

19. Explain Adaptive Query Execution (AQE) in Spark.

 

 

ANS:-

AQE dynamically optimizes query plans based on runtime stats, enabling:

• Dynamically switching join strategies

• Re-optimizing skewed partitions

• Coalescing shuffle partitions
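These behaviors are controlled through Spark configuration (property names from Spark 3.x):

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed partitions in joins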


 

20. How do you optimize data partitioning in ADLS?


ANS:-

• Partition by frequently queried columns (e.g., date, region)

• Avoid too many small files (“file size tuning”)

• Use tools like Azure Data Explorer, Databricks, or Partition Discovery
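A minimal sketch of a partitioned write to ADLS (the DataFrame, container, and account names are placeholders):

(df.write
    .partitionBy("year", "month")     # partition by commonly filtered date columns
    .mode("overwrite")
    .parquet("abfss://container@account.dfs.core.windows.net/sales"))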



21. Describe the process of creating a data pipeline for real-time analytics.


ANS:-
 
• Use Event Hubs / IoT Hub for ingestion

• Use Stream Analytics or Structured Streaming (Databricks)

• Process and write to Delta Lake / Cosmos DB / Synapse

• Visualize using Power BI
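A minimal Structured Streaming sketch on Databricks; it assumes the Azure Event Hubs Spark connector and an eh_conf connection dictionary, and all paths are placeholders:

stream = (spark.readStream
    .format("eventhubs")              # requires the Event Hubs connector library
    .options(**eh_conf)               # assumed connection configuration
    .load())

query = (stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")   # checkpoint for fault tolerance
    .start("/mnt/delta/events"))                               # Delta sink path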


22. Write PySpark code to perform a left join between two DataFrames.

  

ANS:-

df1.join(df2, on="id", how="left").show()

 

 

 

 

23. What are the security best practices for Azure Data Lake?

ANS:-

• Use RBAC + ACLs

• Enable Data Encryption (at rest and in transit)

• Use Managed Identities for secure access

• Monitor with Azure Defender and Log Analytics



24. Explain the use of Integration Runtime (IR) in ADF.


ANS:-

Integration Runtime (IR) is the compute infrastructure in ADF. It handles:

• Data movement

• Data transformation

• It is available in three types: Azure (cloud), self-hosted, and Azure-SSIS



25. How do you design a fault-tolerant architecture for big data processing?


ANS:-
• Use retry logic and checkpointing

• Design for idempotency

• Use Delta Lake for ACID compliance

• Implement monitoring, alerting, and disaster recovery strategies
