EY Data Engineering Interview Q & A 2022
✍️ By MONU SINGH | 11/18/2025
Here are the questions asked in the Ernst & Young Global Limited interview (2020).
1. How to handle null values in PySpark (drop/fill)?
Answer:
In PySpark, you can handle nulls using the dropna() and fillna() methods:

# Drop rows with null values
df_cleaned = df.dropna()

# Fill null values
df_filled = df.fillna({'column1': 'default', 'column2': 0})

You can specify subsets of columns and a threshold of non-null values as needed, as shown below.
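A minimal sketch of the subset and thresh options on dropna(); the column names here are illustrative placeholders:

# Drop rows only when 'column1' or 'column2' is null
df_subset = df.dropna(subset=['column1', 'column2'])

# Keep only rows that have at least 2 non-null values
df_thresh = df.dropna(thresh=2)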
2. What is AQE (Adaptive Query Execution) in Databricks?
Answer:
AQE is a Spark optimization that adjusts query execution plans at runtime. It helps with:
• Dynamically switching join strategies (e.g., shuffle to broadcast)
• Optimizing skewed joins
• Coalescing shuffle partitions
AQE improves performance for unpredictable data patterns. The relevant configuration flags are shown below.
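A minimal sketch of enabling AQE explicitly via Spark configuration (it is already on by default in recent Databricks runtimes):

spark.conf.set("spark.sql.adaptive.enabled", "true")                     # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins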
3. How do you implement error handling in ADF using retry, try-catch logic, and failover mechanisms?
Answer:
• Retry: Configure retry policies in activity settings (number of retries and intervals).
• Try-Catch: Use If Condition, Switch, or Until activities with custom logic.
• Failover: Use global parameters or alternative execution paths to redirect processing when a failure occurs.
Logging and alerts via Log Analytics or Azure Monitor are also key.
4. How to track file names in the output table while performing copy operations in ADF?
Answer:
Use the @dataset().path or @item().name expressions in a Copy Data activity's sink mapping. You can also use the Get Metadata activity to fetch file names beforehand and pass them through a pipeline variable into the sink (e.g., a SQL column).
5. Write a SQL query to display the cumulative sum of a column.
Answer:
SELECT
    employee_id,
    department,
    salary,
    SUM(salary) OVER (PARTITION BY department ORDER BY employee_id) AS cumulative_salary
FROM employees;

This gives a running total of salary per department.
6. Explain the role of Azure Key Vault in securing sensitive data.
Answer:
Azure Key Vault:
• Manages secrets (like passwords, keys, and connection strings)
• Provides secure access through managed identities
• Ensures encryption at rest and in transit
• Integrates with ADF, Databricks, and other Azure services to eliminate hardcoded secrets in code (see the sketch after this list)
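For example, a hedged sketch of reading a Key Vault-backed secret inside a Databricks notebook; the scope and key names are placeholders:

# Retrieve a secret from a Key Vault-backed secret scope instead of hardcoding it
sql_password = dbutils.secrets.get(scope="kv-scope", key="sqlPassword")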
7. How do you manage and automate ETL workflows using Databricks Workflows?
Answer:
• Use Databricks Workflows to define job clusters, tasks, and dependencies.
• Chain tasks using notebooks, Python scripts, or SQL commands.
• Schedule jobs using cron expressions or triggers.
• Monitor runs with built-in logging and alerts.
• Integrate with CI/CD pipelines via REST APIs or GitHub Actions.
8. Describe the process of setting up disaster recovery for ADLS.
Answer:
• Geo-redundant storage (GRS) ensures data is replicated across regions.
• Soft delete and versioning help recover from accidental deletes.
• Implement automated backups via Data Factory or third-party tools.
• Monitor with Azure Monitor and ensure access control via RBAC/ACLs.
9. Explain the difference between narrow and wide transformations in PySpark.
Answer:
• Narrow: Each output partition depends on a single input partition (e.g., map, filter).
• Wide: Involves shuffling data between partitions (e.g., groupByKey, join, reduceByKey).
• Wide transformations are more expensive and often trigger shuffles. A short illustration follows.
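A minimal sketch contrasting the two; the DataFrame and column names are illustrative:

# Narrow transformation: filter works within each partition, no shuffle
active = df.filter(df.status == "active")

# Wide transformation: groupBy requires a shuffle to bring matching keys together
totals = df.groupBy("department").sum("salary")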
10. How do you optimize PySpark jobs for large datasets?
Answer:
• Use cache/persist smartly
• Avoid shuffles when possible
• Use broadcast joins for small lookup tables (see the sketch after this list)
• Tune partition size and memory usage
• Prefer the DataFrame API over RDDs
• Enable AQE for dynamic optimizations
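For example, a hedged sketch of a broadcast join with a small lookup table; the DataFrame and column names are placeholders:

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small lookup table to every executor,
# avoiding a shuffle of the large fact DataFrame
joined = facts_df.join(broadcast(lookup_df), on="country_code", how="left")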
11. Write a SQL query to find the average salary for each department.
Answer:
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

This returns the average salary grouped by department.
12. What is the role of Delta Lake in modern data architectures?
Answer:
Delta Lake brings ACID transactions, schema enforcement, time travel, and unified batch & streaming processing to data lakes. It bridges the gap between data lakes and data warehouses, forming the foundation for the Lakehouse architecture.
Key benefits:
• Reliable data pipelines
• Easier data governance
• Simplified ETL and analytics
• Scalable with an open format (Parquet-based)
A small example follows.
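A minimal sketch of writing a Delta table and reading an older version with time travel; the paths and version number are placeholders:

# Write a DataFrame as a Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")

# Time travel: read the table as it was at version 3
old_df = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/datalake/events")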
13. How do you monitor and debug ADF pipelines?
Answer:
You can monitor and debug using:
• Monitor tab: Check activity run status, duration, and errors
• Activity output logs: Inspect input/output/error messages
• Azure Monitor and Log Analytics: Track performance and trigger alerts
• Integration Runtime metrics: View resource utilization
• Diagnostic settings: Log to storage, Event Hubs, or Log Analytics for centralized monitoring
14. Write PySpark code to perform an inner join between two DataFrames.
Answer:
df_joined = df1.join(df2, df1.id == df2.id, "inner")
df_joined.show()

This joins df1 and df2 where the id columns match, keeping only matched rows.
15. How do you manage schema drift in ADF?
Answer:
Schema drift refers to changes in the source schema (like new columns). To handle it in ADF:
• In the Copy Data activity, enable "Auto Mapping" to allow dynamic schema mapping.
• Use dataset parameters for dynamic datasets.
• Set "Allow schema drift" in the source/sink settings.
• Use data flows to dynamically map columns with expressions like byName().
16. Describe the concept of fault tolerance in Spark.
Answer:
Spark achieves fault tolerance by:
• Using RDD lineage to recompute lost partitions
• Storing intermediate data in memory/disk
• Writing checkpoints for long lineage chains (see the sketch after this list)
• Using speculative execution to run slow tasks on alternate nodes
• Ensuring executors can recover from node failures
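A hedged sketch of checkpointing to truncate a long lineage chain; the checkpoint directory is a placeholder:

# Set a reliable location for checkpoint data, then checkpoint the DataFrame
spark.sparkContext.setCheckpointDir("/mnt/checkpoints")
df_checkpointed = df.checkpoint()  # materializes df and truncates its lineage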
17. Write Python code to calculate the factorial of a number.
Answer:
def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # Output: 120

Or iteratively:

def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
18. How do you handle incremental data loads in ADLS?
Answer:
Approaches:
• Use watermarking or lastModifiedDate filters in source queries
• Implement delta loads with the Copy activity and a metadata table
• Track changes using Change Data Capture (CDC) or file naming conventions
• Maintain audit columns (e.g., updated_at) for identifying new/changed records
A watermark-based sketch follows.
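A hedged sketch of a watermark-based incremental read in PySpark; the table path, audit column, and watermark value are illustrative assumptions:

# Last successfully loaded watermark, normally read from a metadata/control table
last_watermark = "2022-01-01 00:00:00"

# Load only records changed since the last run, based on an audit column
incremental_df = (
    spark.read.format("delta")
    .load("/mnt/datalake/orders")
    .filter(f"updated_at > '{last_watermark}'")
)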
19. What are the security features in Azure Synapse Analytics?
Answer:
Key features:
• Managed Identity for secure resource access
• Azure Active Directory (AAD) integration
• Role-Based Access Control (RBAC)
• Network isolation with Private Endpoints and firewalls
• Data encryption (in transit and at rest)
• Integration with Azure Key Vault
• Auditing and threat detection via Defender for Cloud
20. Explain the concept of partitioning in PySpark.
Answer:
Partitioning in PySpark refers to how data is distributed across worker nodes. Why it's important:
• Impacts parallelism and performance
• Helps reduce shuffles
Use:
• repartition(n) → reshuffles data into equal-sized partitions
• coalesce(n) → reduces the number of partitions (faster, avoids a full shuffle)
• Partitioning columns are critical for operations like joins and aggregations
A short example follows.
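A minimal sketch of repartition vs coalesce; the partition counts and column name are arbitrary:

# Repartition by a column: full shuffle, useful before a join or aggregation on that key
df_by_dept = df.repartition(200, "department")

# Coalesce: reduce partition count without a full shuffle, e.g., before writing fewer files
df_small = df.coalesce(10)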
21. How do you implement real-time data processing in Databricks?
Answer:
You can implement real-time processing using Structured Streaming in Databricks:
• Read streaming data from sources like Kafka, Event Hubs, or Azure IoT Hub
• Use the Structured Streaming APIs in PySpark
• Write results to sinks like Delta Lake, Synapse, or ADLS
• Use trigger options for micro-batching or continuous processing
Example:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "<broker>") \
    .option("subscribe", "topic") \
    .load()

processed = df.selectExpr("CAST(value AS STRING)")

query = processed.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/") \
    .start("/output/path/")
22. Write a SQL query to find duplicate records in a table.
Answer:
SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;

This finds duplicates based on combinations of columns.
23. How do you integrate Azure Key Vault with ADF pipelines?
Answer:
To use Key Vault secrets in ADF:
1. Create a Linked Service for Key Vault.
2. Access secrets in linked services or parameters using @Microsoft.KeyVault(...).
3. Set a Managed Identity for ADF and grant it Key Vault access with "Get" secret permissions.
4. Example expression:
   "@Microsoft.KeyVault(SecretName='sqlPassword')"
24. What are the best practices for optimizing storage costs in ADLS?
Answer:
• Lifecycle policies to auto-delete or move old/unused data
• Use Hierarchical Namespace (HNS) for directory-level management
• Choose appropriate storage tiers (Hot, Cool, Archive)
• Compress files (e.g., Parquet, Avro) to reduce size
• Avoid storing small files; consolidate using compaction (see the sketch after this list)
• Monitor with Azure Cost Management
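A hedged sketch of compacting small files, assuming the data is stored as a Delta table on Databricks; the paths are placeholders, and for plain Parquet you could instead rewrite the folder with fewer partitions:

# Delta Lake: compact small files into larger ones
spark.sql("OPTIMIZE delta.`/mnt/datalake/events`")

# Plain Parquet alternative: rewrite the folder with fewer, larger files
df = spark.read.parquet("/mnt/datalake/raw_events")
df.coalesce(16).write.mode("overwrite").parquet("/mnt/datalake/raw_events_compacted")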
25. How do you implement CI/CD for Azure Synapse Analytics?
Answer:
Steps:
1. Source control: Integrate the Synapse workspace with Git (Azure DevOps or GitHub).
2. Develop in Git mode: Code notebooks, pipelines, SQL scripts, etc.
3. Use ARM templates or Synapse deployment scripts for infrastructure as code.
4. Set up a CI/CD pipeline using Azure DevOps:
   o Validate templates
   o Use tasks like "Azure Resource Group Deployment"
   o Deploy notebooks and artifacts using the Synapse REST API or Synapse CLI
26. Explain the role of Integration Runtime in ADF.
Answer:
Integration Runtime (IR) is the compute infrastructure used by ADF for:
• Data movement (copy data between sources)
• Activity dispatching (e.g., data flows, stored procedures)
• SSIS execution (for lifted SSIS packages)
Types:
• Azure IR: For cloud-native data movement
• Self-hosted IR: For on-prem or VNet-restricted sources
• Azure-SSIS IR: For running SSIS packages in ADF
27. How do you secure sensitive data in Azure?
Answer:
• Use Azure Key Vault for managing secrets, keys, and certificates
• Apply encryption at rest (default) and in transit (TLS/HTTPS)
• Leverage RBAC and AAD integration for access control
• Configure Private Endpoints to isolate traffic
• Use network security groups, firewalls, and VNet rules
• Enable auditing, logging, and threat detection
28. Describe the process of creating a data pipeline for real-time analytics.
Answer:
Typical steps:
1. Ingest data using Kafka/Event Hubs/IoT Hub into Databricks
2. Process it using Structured Streaming with low-latency logic
3. Write to a Delta Lake table (streaming sink)
4. Query the table with Power BI or Synapse Serverless SQL
5. Use monitoring and alerting for latency and freshness
6. Ensure checkpointing, idempotency, and scalability
A short trigger/checkpoint sketch follows.
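Building on the streaming example in question 21, a hedged sketch of setting an explicit micro-batch trigger and checkpoint for the Delta sink; the paths and interval are illustrative:

query = (
    processed.writeStream
    .format("delta")
    .trigger(processingTime="1 minute")                       # micro-batch every minute
    .option("checkpointLocation", "/checkpoints/realtime/")   # enables recovery after failure
    .start("/delta/realtime_events")
)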