Deloitte interview Q & A
✍️ By MONU SINGH | 11/18/2025
These are the questions asked in the Deloitte Data Engineering interview (2022).
1. What is Z-ordering in Spark?
Answer:
Z-ordering is a technique used in Delta Lake (on Databricks) to optimize the layout of data on disk. It helps improve the performance of queries, especially when you're filtering on multiple columns. Imagine you're organizing your bookshelf so that you can find a book faster — that's what Z-ordering does for your data.
It works by co-locating related information close together, so Spark doesn't have to scan the whole dataset. You typically use it during OPTIMIZE with ZORDER BY like this:
OPTIMIZE my_table ZORDER BY (customer_id, order_date)
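If you're working from a Python notebook, newer Delta Lake releases expose the same operation through the DeltaTable API; a minimal sketch, assuming my_table is the registered Delta table from the example above:
from delta.tables import DeltaTable

# Assumes a registered Delta table named my_table (as in the SQL example above)
delta_table = DeltaTable.forName(spark, "my_table")

# Compact small files and co-locate rows on the two filter columns
delta_table.optimize().executeZOrderBy("customer_id", "order_date")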
2. Explain the difference between Spark SQL and PySpark DataFrame APIs.
Answer:
Both are used to work with structured data in Spark, but they offer different interfaces:
• Spark SQL: Lets you write SQL queries as strings. It’s helpful if you're familiar with SQL and want to run traditional queries.
• spark.sql("SELECT * FROM orders WHERE total > 1000")
• PySpark DataFrame API: Uses Python methods to manipulate data, which is more programmatic and integrates better with Python code.
• orders.filter(orders.total > 1000).show()
Under the hood, they both use the same execution engine — so performance-wise, they're similar.
3. How to implement incremental load in ADF?
Answer:
Incremental load means only loading new or changed data instead of reloading everything. You can implement it in ADF using:
• Watermark columns like LastModifiedDate
• Use the Lookup or Stored Procedure activity to get the last load time.
• In your source dataset, use a filter with dynamic content:
SELECT * FROM source_table
WHERE LastModifiedDate > '@{pipeline().parameters.lastLoadTime}'
• Then update the watermark value after the load is done using a Stored Procedure or sink logic.
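Outside ADF, the same watermark pattern can be sketched in PySpark; the table and column names here (source_table, target_table, watermark_table, LastModifiedDate) are placeholders, not part of any real pipeline:
from pyspark.sql import functions as F

# Last watermark recorded by the previous run (first-run handling omitted)
last_load = spark.table("watermark_table").agg(F.max("last_load_time")).collect()[0][0]

# Pull only rows modified since the last load
incremental = spark.table("source_table").filter(F.col("LastModifiedDate") > F.lit(last_load))

# Land the delta, then advance the watermark
incremental.write.mode("append").saveAsTable("target_table")
new_watermark = incremental.agg(F.max("LastModifiedDate")).collect()[0][0]
if new_watermark is not None:
    spark.createDataFrame([(new_watermark,)], ["last_load_time"]) \
        .write.mode("append").saveAsTable("watermark_table")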
4. How do you handle large-scale data ingestion into ADLS?
Answer:
For massive data loads into ADLS, best practices include:
• Use parallelism in ADF: set up partitioning or multiple concurrent source queries.
• Use compressed formats like Parquet or Avro for better performance and lower cost.
• Ingest data using ADF Copy Activity, Data Flows, or Azure Databricks for heavy lifting.
• For real-time ingestion, use Azure Event Hub or IoT Hub with Stream Analytics.
5. Write Python code to split a name column into firstname and lastname.
Answer:
Here’s a simple snippet using Python and Pandas:
import pandas as pd
df = pd.DataFrame({'name': ['John Smith', 'Alice Johnson']})
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
print(df)
Output:
            name first_name last_name
0     John Smith       John     Smith
1  Alice Johnson      Alice   Johnson
6. What are fact and dimension tables in data modeling?
Answer:
Think of fact and dimension tables as the heart of a data warehouse design:
• Fact Table:
Stores measurable data, such as sales, revenue, and quantity. It usually has foreign keys pointing to dimension tables. Example: Sales_Fact with fields like product_id, store_id, sales_amount.
• Dimension Table:
Stores descriptive data — like product name, store location, customer details. They help give context to the facts. Example: Product_Dim with product_id, product_name, category.
• Think of it like this: Fact = "What happened?" and Dimension = "Who/What/Where?"
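A tiny PySpark sketch of the two shapes and how they join (the rows are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dimension: descriptive attributes, one row per product
product_dim = spark.createDataFrame(
    [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")],
    ["product_id", "product_name", "category"],
)

# Fact: measurable events, with a foreign key back to the dimension
sales_fact = spark.createDataFrame(
    [(101, 1, 1200.0), (102, 2, 300.0), (103, 1, 1150.0)],
    ["sale_id", "product_id", "sales_amount"],
)

# Joining gives the facts their context
sales_fact.join(product_dim, "product_id").groupBy("category").sum("sales_amount").show()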
7. How do you design and implement data pipelines using Azure Data Factory?
Answer:
Designing a data pipeline in ADF involves a few key steps:
1. Source Dataset – Define where your data is coming from (e.g., SQL, Blob, API).
2. Activities – Use Copy activity, Data Flow, Stored Proc, etc.
3. Transformations – Apply mapping, filtering, joins in Mapping Data Flows if needed.
4. Sink Dataset – Where your data lands (Azure SQL, ADLS, Synapse, etc.).
5. Triggers – To run the pipeline on schedule or event-based.
6. Monitoring – Track pipeline runs using the Monitor tab in ADF.
Bonus: Use parameters, variables, and metadata-driven pipelines to make it dynamic and reusable!
8. Explain the concept of PolyBase in Azure SQL Data Warehouse.
Answer:
PolyBase lets you run SQL queries on external data stored in files like CSV or Parquet on Azure Blob or ADLS, as if it were in a table.
Example use case: query a huge CSV file stored in ADLS directly from Azure Synapse using T-SQL, without importing it first:
SELECT * FROM ExternalTable
This is super useful when:
• Ingesting external data into Synapse
• Running ELT workloads
• Avoiding extra copy operations
9. Write a SQL query to calculate the cumulative sum of a column.
Answer:
Here’s a SQL example to calculate cumulative (running) total of salaries:
SELECT
employee_id,
salary,
SUM(salary) OVER (ORDER BY employee_id) AS cumulative_salary
FROM employees;
You can also partition it (e.g., by department):
SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id)
10. How do you manage partitioning in PySpark?
Answer:
Partitioning in PySpark helps Spark distribute data and parallelize processing efficiently. How to manage it:
• Check current partitions:
• df.rdd.getNumPartitions()
• Repartitioning (increases or balances partitions):
• df = df.repartition(8)
• Coalescing (reduces partitions, more efficient than repartition):
• df = df.coalesce(4)
Tip: Use repartition() when increasing partitions for parallelism, and coalesce() when writing to storage and reducing file count.
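A quick sketch of that write-side tip; the partition column and output path are placeholders:
# Reduce to a handful of partitions before writing to avoid thousands of tiny files
(df.coalesce(4)
   .write
   .mode("overwrite")
   .partitionBy("order_date")          # directory-level partitioning on disk (placeholder column)
   .parquet("/mnt/datalake/orders"))   # placeholder output path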
11. Explain the use of Delta Lake for data versioning.
Answer:
Delta Lake brings ACID transactions to data lakes. One of its coolest features? Time travel and versioning.
Every time you write data (append, overwrite, merge), Delta logs the change — allowing you to:
• Query old versions like:
• SELECT * FROM table_name VERSION AS OF 5
• Or use a timestamp:
• SELECT * FROM table_name TIMESTAMP AS OF '2024-04-01T00:00:00'
Great for rollback, audits, debugging, and historical analysis.
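The same time travel is available from PySpark through reader options; a small sketch, with a placeholder table path:
# Read an older snapshot by version number
v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/delta/table_name")

# Or by timestamp
old = (spark.read.format("delta")
       .option("timestampAsOf", "2024-04-01T00:00:00")
       .load("/mnt/delta/table_name"))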
12. How do you monitor and troubleshoot Spark jobs?
Answer:
You can monitor Spark jobs using:
• Spark UI
Accessed via Databricks or the Spark History Server. It shows:
o Stages and tasks
o Shuffle reads/writes
o Execution DAGs
• Cluster Metrics
Check CPU, memory usage, and executor health.
• Logs
Application logs can be viewed in the Spark UI or exported to Azure Log Analytics.
Common issues to watch for:
• Skewed joins
• Out-of-memory errors
• Long GC times
Pro tip: Enable Adaptive Query Execution (AQE) for dynamic optimizations.
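For reference, AQE is driven by Spark configuration; a minimal sketch (it is enabled by default on recent Spark and Databricks versions):
# Turn on Adaptive Query Execution and its automatic skew-join handling
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")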
13. Write a SQL query to find employees with the highest salary in each department.
Answer:
Using window functions:
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
) AS ranked
WHERE salary_rank = 1;
This handles ties too (e.g., if two employees share the top salary in a department).
14. How do you optimize joins in PySpark for large datasets?
Answer:
You’ve got a few smart options:
1. Broadcast Join
Broadcast the smaller dataset:
from pyspark.sql.functions import broadcast
df.join(broadcast(small_df), "id")
2. Partitioning
Ensure both DataFrames are partitioned properly on the join keys.
3. Skew Handling
Use salting techniques if one key has far more rows than the others (see the sketch after this list).
4. Join Type
Choose wisely and avoid cross joins unless necessary.
5. Caching
Cache reused datasets if memory allows.
Rule of thumb: avoid shuffles where possible!
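Since salting comes up a lot in interviews, here is a hedged sketch of the idea; big_df, small_df, the join key id, and the salt factor are all illustrative:
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # illustrative; tune to the observed skew

# Spread the hot keys on the big side across several salt values
big_salted = big_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every row still finds its match
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small_df.crossJoin(salts)

# Join on the original key plus the salt, then drop the helper column
result = big_salted.join(small_salted, ["id", "salt"]).drop("salt")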
15. Describe the process of setting up CI/CD for Azure Data Factory.
Answer:
CI/CD in ADF typically uses Azure DevOps (or GitHub Actions):
1. Connect ADF to Git repo (Azure Repos or GitHub).
2. Develop in Git mode (not live).
3. Publish from the collaboration branch (e.g., main or dev) to the adf_publish branch.
4. CI Pipeline – Triggers on commit to ‘adf_publish’, creates ARM template artifacts.
5. CD Pipeline – Deploys those templates to QA/Prod using az deployment.
Bonus: Use parameterized Linked Services and datasets for environment flexibility.
16. Write Python code to reverse a string.
Answer:
You can reverse a string in a single line using Python slicing:
text = "Hello Deloitte"
reversed_text = text[::-1]
print(reversed_text)  # Output: ettioleD olleH
Or, using a loop (for interviews that ask for it manually):
def reverse_string(s):
    result = ""
    for char in s:
        result = char + result
    return result
17. What are the key features of Databricks notebooks?
Answer:
Databricks notebooks are like your coding cockpit — and they come packed with features:
• Multi-language support (%python, %sql, %scala, %bash)
• Rich visualizations: Line charts, bar charts, maps, etc.
• Job scheduling and parameterization
• Collaboration: Multiple users can edit the same notebook simultaneously
• MLflow integration for tracking machine learning models
• Widgets: Create input boxes and dropdowns for dynamic notebook interaction
• Role-based access control (RBAC) for security
Super helpful for prototyping, debugging, and presenting work!
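For example, widgets are created and read through dbutils; a small sketch you could paste into a Databricks notebook (the orders table and order_date column are placeholders):
# Create a text input at the top of the notebook, with a default value
dbutils.widgets.text("run_date", "2024-01-01", "Run date")

# Read the current value and use it in a query
run_date = dbutils.widgets.get("run_date")
display(spark.table("orders").filter(f"order_date = '{run_date}'"))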
18. How do you handle late-arriving data in ADF?
Answer:
Late-arriving data = data that arrives after the scheduled pipeline run. Here’s how you can handle it in Azure Data Factory:
• Watermarking: Maintain a watermark (e.g., max modified date) to track what's already loaded.
• Reprocessing window: Load data for a range like last 7 days to catch any laggards.
• Trigger re-runs: Use tumbling window triggers with "retry" and "dependency" settings.
• Delta loads: Make sure your sink supports upserts (like Delta Lake or SQL with MERGE); a Delta MERGE sketch follows this list.
• Logging and alerts: Track and notify when data doesn’t arrive as expected.
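For the Delta upsert piece mentioned above, a hedged sketch of merging late rows into a target table; the table and column names are placeholders:
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "orders")   # placeholder Delta table
late_rows = spark.table("staging_orders")      # placeholder staging data

(target.alias("t")
   .merge(late_rows.alias("s"), "t.order_id = s.order_id")
   .whenMatchedUpdateAll()       # refresh rows that already exist
   .whenNotMatchedInsertAll()    # insert the genuinely new late arrivals
   .execute())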
19. Explain the concept of Data Lakehouse.
Answer:
The Data Lakehouse combines the flexibility of a Data Lake with the reliability of a Data Warehouse. Key traits:
• Uses open file formats (like Parquet, Delta)
• Supports ACID transactions (thanks to Delta Lake or Apache Iceberg)
• Enables BI and ML workloads on the same data
• Reduces data duplication and movement
• Supports schema enforcement and governance
In Databricks, Delta Lake powers the Lakehouse model — enabling fast, reliable analytics directly on raw data.
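As a small illustration of schema enforcement, Delta rejects an append whose schema doesn't match the table unless you explicitly opt in to evolution; a sketch with placeholder names and paths:
# Appending a DataFrame with an unexpected column raises an error by default (schema enforcement)
# new_data.write.format("delta").mode("append").save("/mnt/delta/events")

# Opting in to schema evolution lets compatible new columns through
(new_data.write.format("delta").mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/events"))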
20. How do you implement disaster recovery for ADLS?
Answer:
Disaster recovery for Azure Data Lake Storage (Gen2) focuses on resilience and data availability:
1. Geo-Redundant Storage (GRS)
Stores copies of your data in a secondary region automatically.
2. Snapshots
Capture point-in-time versions of files/folders for recovery.
3. Versioning (for hierarchical namespaces)
Keeps historical versions of files in case of accidental deletion or overwrite.
4. Soft delete
Enables you to recover deleted blobs within a retention period.
5. Replication strategy
For critical workloads, use Azure Data Factory to mirror data across regions.
6. Automated backup via Azure Backup or third-party tools like Veeam or Rubrik.
And of course, test your recovery plan regularly!