EY Data Engineering Interview Q & A 2022
✍️ By MONU SINGH | 11/18/2025
Here are the questions asked in the Ernst & Young Global Limited interview (2020).
1. How to handle null values in PySpark (drop/fill)?
Answer:
In PySpark, you can handle nulls using the dropna() and fillna() methods:

# Drop rows with null values
df_cleaned = df.dropna()

# Fill null values
df_filled = df.fillna({'column1': 'default', 'column2': 0})

You can specify subsets of columns and a threshold of non-null values as needed, as shown below.
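A minimal sketch of the subset and thresh options on dropna(); the column names here are illustrative placeholders:

# Drop rows only when 'column1' or 'column2' is null
df_subset = df.dropna(subset=['column1', 'column2'])

# Keep only rows that have at least 2 non-null values
df_thresh = df.dropna(thresh=2)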
2. What is AQE (Adaptive Query Execution) in Databricks?
Answer:
AQE is a Spark optimization that adjusts query execution plans at runtime. It helps with:
• Dynamically switching join strategies (e.g., shuffle to broadcast)
• Optimizing skewed joins
• Coalescing shuffle partitions
AQE improves performance for unpredictable data patterns. The relevant configuration flags are shown below.
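A minimal sketch of enabling AQE explicitly via Spark configuration (it is already on by default in recent Databricks runtimes):

spark.conf.set("spark.sql.adaptive.enabled", "true")                     # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins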
3. How do you implement error handling in ADF using retry, try-catch logic, and failover mechanisms?
Answer:
• Retry: Configure retry policies in activity settings (number of retries and intervals).
• Try-Catch: Use If Condition, Switch, or Until activities with custom logic.
• Failover: Use global parameters or alternative execution paths to redirect processing when a failure occurs.
Logging and alerts via Log Analytics or Azure Monitor are also key.
4. How to track file names in the output table while performing copy operations in ADF?
Answer:
Use the @dataset().path or @item().name expressions in a Copy Data activity's sink mapping. You can also use the Get Metadata activity to fetch file names beforehand and pass them through a pipeline variable into the sink (e.g., a SQL column).
5. Write a SQL query to display the cumulative sum of a column.
Answer:
SELECT
    employee_id,
    department,
    salary,
    SUM(salary) OVER (PARTITION BY department ORDER BY employee_id) AS cumulative_salary
FROM employees;

This gives a running total of salary per department.
6. Explain the role of Azure Key Vault in securing sensitive data.
Answer:
Azure Key Vault:
• Manages secrets (like passwords, keys, and connection strings)
• Provides secure access through managed identities
• Ensures encryption at rest and in transit
• Integrates with ADF, Databricks, and other Azure services to eliminate hardcoded secrets in code (see the sketch after this list)
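For example, a hedged sketch of reading a Key Vault-backed secret inside a Databricks notebook; the scope and key names are placeholders:

# Retrieve a secret from a Key Vault-backed secret scope instead of hardcoding it
sql_password = dbutils.secrets.get(scope="kv-scope", key="sqlPassword")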
7. How do you manage and automate ETL workflows using Databricks Workflows?
Answer:
• Use Databricks Workflows to define job clusters, tasks, and dependencies.
• Chain tasks using notebooks, Python scripts, or SQL commands.
• Schedule jobs using cron expressions or triggers.
• Monitor runs with built-in logging and alerts.
• Integrate with CI/CD pipelines via REST APIs or GitHub Actions.
8. Describe the process of setting up disaster recovery for ADLS.
Answer:
• Geo-redundant storage (GRS) ensures data is replicated across regions.
• Soft delete and versioning help recover from accidental deletes.
• Implement automated backups via Data Factory or third-party tools.
• Monitor with Azure Monitor and ensure access control via RBAC/ACLs.
9. Explain the difference between narrow and wide transformations in PySpark.
Answer:
• Narrow: Each output partition depends on a single input partition (e.g., map, filter).
• Wide: Involves shuffling data between partitions (e.g., groupByKey, join, reduceByKey).
• Wide transformations are more expensive and often trigger shuffles. A short illustration follows.
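A minimal sketch contrasting the two; the DataFrame and column names are illustrative:

# Narrow transformation: filter works within each partition, no shuffle
active = df.filter(df.status == "active")

# Wide transformation: groupBy requires a shuffle to bring matching keys together
totals = df.groupBy("department").sum("salary")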
10. How do you optimize PySpark jobs for large datasets?
Answer:
• Use cache/persist smartly
• Avoid shuffles when possible
• Use broadcast joins for small lookup tables (see the sketch after this list)
• Tune partition size and memory usage
• Prefer the DataFrame API over RDDs
• Enable AQE for dynamic optimizations
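For example, a hedged sketch of a broadcast join with a small lookup table; the DataFrame and column names are placeholders:

from pyspark.sql.functions import broadcast

# Hint Spark to broadcast the small lookup table to every executor,
# avoiding a shuffle of the large fact DataFrame
joined = facts_df.join(broadcast(lookup_df), on="country_code", how="left")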
11. Write a SQL query to find the average salary for each department.
Answer:
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

This returns the average salary grouped by department.
12. What is the role of Delta Lake in modern data architectures?
Answer:
Delta Lake brings ACID transactions, schema enforcement, time travel, and unified batch & streaming processing to data lakes. It bridges the gap between data lakes and data warehouses, forming the foundation for the Lakehouse architecture.
Key benefits:
• Reliable data pipelines
• Easier data governance
• Simplified ETL and analytics
• Scalable with an open format (Parquet-based)
A small example follows.
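A minimal sketch of writing a Delta table and reading an older version with time travel; the paths and version number are placeholders:

# Write a DataFrame as a Delta table
df.write.format("delta").mode("overwrite").save("/mnt/datalake/events")

# Time travel: read the table as it was at version 3
old_df = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/datalake/events")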
13. How do you monitor and debug ADF pipelines?
Answer:
You can monitor and debug using:
• Monitor tab: Check activity run status, duration, and errors
• Activity output logs: Inspect input/output/error messages
• Azure Monitor and Log Analytics: Track performance and trigger alerts
• Integration Runtime metrics: View resource utilization
• Diagnostic settings: Log to storage, Event Hubs, or Log Analytics for centralized monitoring
14. Write PySpark code to perform an inner join between two DataFrames.
Answer:
df_joined = df1.join(df2, df1.id == df2.id, "inner")
df_joined.show()

This joins df1 and df2 where the id columns match, keeping only matched rows.
15. How do you manage schema drift in ADF?
Answer:
Schema drift refers to changes in the source schema (like new columns). To handle it in ADF:
• In the Copy Data activity, enable "Auto Mapping" to allow dynamic schema mapping.
• Use dataset parameters for dynamic datasets.
• Set "Allow schema drift" in the source/sink settings.
• Use data flows to dynamically map columns with expressions like byName().
16. Describe the concept of fault tolerance in Spark.
Answer:
Spark achieves fault tolerance by:
• Using RDD lineage to recompute lost partitions
• Storing intermediate data in memory/disk
• Writing checkpoints for long lineage chains (see the sketch after this list)
• Using speculative execution to run slow tasks on alternate nodes
• Ensuring executors can recover from node failures
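A hedged sketch of checkpointing to truncate a long lineage chain; the checkpoint directory is a placeholder:

# Set a reliable location for checkpoint data, then checkpoint the DataFrame
spark.sparkContext.setCheckpointDir("/mnt/checkpoints")
df_checkpointed = df.checkpoint()  # materializes df and truncates its lineage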
17. Write Python code to calculate the factorial of a number.
Answer:
def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # Output: 120

Or iteratively:

def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
18. How do you handle incremental data loads in ADLS?
Answer:
Approaches:
• Use watermarking or lastModifiedDate filters in source queries
• Implement delta loads with the Copy activity and a metadata table
• Track changes using Change Data Capture (CDC) or file naming conventions
• Maintain audit columns (e.g., updated_at) for identifying new/changed records
A watermark-based sketch follows.
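A hedged sketch of a watermark-based incremental read in PySpark; the table path, audit column, and watermark value are illustrative assumptions:

# Last successfully loaded watermark, normally read from a metadata/control table
last_watermark = "2022-01-01 00:00:00"

# Load only records changed since the last run, based on an audit column
incremental_df = (
    spark.read.format("delta")
    .load("/mnt/datalake/orders")
    .filter(f"updated_at > '{last_watermark}'")
)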
19. What are the security features in Azure Synapse Analytics?
Answer:
Key features:
• Managed Identity for secure resource access
• Azure Active Directory (AAD) integration
• Role-Based Access Control (RBAC)
• Network isolation with Private Endpoints and firewalls
• Data encryption (in transit and at rest)
• Integration with Azure Key Vault
• Auditing and threat detection via Defender for Cloud
20. Explain the concept of partitioning in PySpark.
Answer:
Partitioning in PySpark refers to how data is distributed across worker nodes. Why it's important:
• Impacts parallelism and performance
• Helps reduce shuffles
Use:
• repartition(n) → reshuffles data into equal-sized partitions
• coalesce(n) → reduces the number of partitions (faster, avoids a full shuffle)
• Partitioning columns are critical for operations like joins and aggregations
A short example follows.
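A minimal sketch of repartition vs coalesce; the partition counts and column name are arbitrary:

# Repartition by a column: full shuffle, useful before a join or aggregation on that key
df_by_dept = df.repartition(200, "department")

# Coalesce: reduce partition count without a full shuffle, e.g., before writing fewer files
df_small = df.coalesce(10)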
21. How do you implement real-time data processing in Databricks?
Answer:
You can implement real-time processing using Structured Streaming in Databricks:
• Read streaming data from sources like Kafka, Event Hubs, or Azure IoT Hub
• Use the Structured Streaming APIs in PySpark
• Write results to sinks like Delta Lake, Synapse, or ADLS
• Use trigger options for micro-batching or continuous processing
Example:
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "<broker>") \
    .option("subscribe", "topic") \
    .load()

processed = df.selectExpr("CAST(value AS STRING)")

query = processed.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/") \
    .start("/output/path/")
22. Write a SQL query to find duplicate records in a table.
Answer:
SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;

This finds duplicates based on combinations of columns.
23. How do you integrate Azure Key Vault with ADF pipelines?
Answer:
To use Key Vault secrets in ADF:
1. Create a Linked Service for Key Vault.
2. Access secrets in linked services or parameters using @Microsoft.KeyVault(...).
3. Set a Managed Identity for ADF and grant it Key Vault access with "Get" secret permissions.
4. Example expression:
   "@Microsoft.KeyVault(SecretName='sqlPassword')"
24. What are the best practices for optimizing storage costs in ADLS?
Answer:
• Lifecycle policies to auto-delete or move old/unused data
• Use Hierarchical Namespace (HNS) for directory-level management
• Choose appropriate storage tiers (Hot, Cool, Archive)
• Compress files (e.g., Parquet, Avro) to reduce size
• Avoid storing small files; consolidate using compaction (see the sketch after this list)
• Monitor with Azure Cost Management
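A hedged sketch of compacting small files, assuming the data is stored as a Delta table on Databricks; the paths are placeholders, and for plain Parquet you could instead rewrite the folder with fewer partitions:

# Delta Lake: compact small files into larger ones
spark.sql("OPTIMIZE delta.`/mnt/datalake/events`")

# Plain Parquet alternative: rewrite the folder with fewer, larger files
df = spark.read.parquet("/mnt/datalake/raw_events")
df.coalesce(16).write.mode("overwrite").parquet("/mnt/datalake/raw_events_compacted")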
25. How do you implement CI/CD for Azure Synapse Analytics?
Answer:
Steps:
1. Source control: Integrate the Synapse workspace with Git (Azure DevOps or GitHub).
2. Develop in Git mode: Code notebooks, pipelines, SQL scripts, etc.
3. Use ARM templates or Synapse deployment scripts for infrastructure as code.
4. Set up a CI/CD pipeline using Azure DevOps:
   o Validate templates
   o Use tasks like "Azure Resource Group Deployment"
   o Deploy notebooks and artifacts using the Synapse REST API or Synapse CLI
26. Explain the role of Integration Runtime in ADF.
Answer:
Integration Runtime (IR) is the compute infrastructure used by ADF for:
• Data movement (copy data between sources)
• Activity dispatching (e.g., data flows, stored procedures)
• SSIS execution (for lifted SSIS packages)
Types:
• Azure IR: For cloud-native data movement
• Self-hosted IR: For on-prem or VNet-restricted sources
• Azure-SSIS IR: For running SSIS packages in ADF
27. How do you secure sensitive data in Azure?
Answer:
• Use Azure Key Vault for managing secrets, keys, and certificates
• Apply encryption at rest (default) and in transit (TLS/HTTPS)
• Leverage RBAC and AAD integration for access control
• Configure Private Endpoints to isolate traffic
• Use network security groups, firewalls, and VNet rules
• Enable auditing, logging, and threat detection
28. Describe the process of creating a data pipeline for real-time analytics.
Answer:
Typical steps:
1. Ingest data using Kafka/Event Hubs/IoT Hub into Databricks
2. Process it using Structured Streaming with low-latency logic
3. Write to a Delta Lake table (streaming sink)
4. Query the table with Power BI or Synapse Serverless SQL
5. Use monitoring and alerting for latency and freshness
6. Ensure checkpointing, idempotency, and scalability
A short trigger/checkpoint sketch follows.
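Building on the streaming example in question 21, a hedged sketch of setting an explicit micro-batch trigger and checkpoint for the Delta sink; the paths and interval are illustrative:

query = (
    processed.writeStream
    .format("delta")
    .trigger(processingTime="1 minute")                       # micro-batch every minute
    .option("checkpointLocation", "/checkpoints/realtime/")   # enables recovery after failure
    .start("/delta/realtime_events")
)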