Infosys data engineering interview Q & A 2022
✍️ By MONU SINGH | 11/18/2025
These are the questions asked in the Infosys Data Engineering interview (2022).
1. What is the difference between a job cluster and an interactive cluster in Databricks?
Answer:
• Job Cluster: Created temporarily for the duration of a job and terminated afterward. It is ideal for production workloads where resources are not needed all the time.
• Interactive Cluster: Stays alive until manually terminated. Used for development and exploration where you need repeated access for testing and debugging.
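To make the distinction concrete (this example is not from the original answer), a job cluster is typically declared inline in a Databricks Jobs API payload, while an interactive (all-purpose) cluster is referenced by its ID. A minimal Python sketch, assuming Jobs API 2.1 field names; the job name, notebook path, and cluster ID are hypothetical:

# Job cluster: defined inline under "new_cluster", created per run and auto-terminated.
job_with_job_cluster = {
    "name": "nightly-etl",  # hypothetical job name
    "tasks": [{
        "task_key": "load_sales",
        "notebook_task": {"notebook_path": "/Repos/etl/load_sales"},  # hypothetical path
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}

# Interactive cluster: the task points at an existing all-purpose cluster by ID,
# which keeps running until someone terminates it manually.
job_on_interactive_cluster = {
    "name": "dev-etl",
    "tasks": [{
        "task_key": "load_sales",
        "notebook_task": {"notebook_path": "/Repos/etl/load_sales"},
        "existing_cluster_id": "0801-123456-abcdefgh",  # hypothetical cluster ID
    }],
}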
2. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
Answer:
Use a metadata-driven approach:
• Store metadata (source and destination table names, schema, etc.) in a control table or config file.
• Loop through the metadata using a ForEach activity in ADF.
• Use Lookup to fetch metadata, then a Copy Activity to perform the actual data movement using parameters from the metadata.
• Dynamic datasets and parameterized linked services enable flexibility for source and sink. (A sketch of what the control metadata might look like follows below.)
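To make the control-table idea concrete, here is a minimal Python sketch (not ADF code) of the rows a Lookup activity might return and the iteration a ForEach activity performs; all table and column names are hypothetical.

# Hypothetical control metadata, as returned by a Lookup activity
control_metadata = [
    {"source_schema": "dbo", "source_table": "customers",
     "sink_schema": "stg", "sink_table": "customers"},
    {"source_schema": "dbo", "source_table": "orders",
     "sink_schema": "stg", "sink_table": "orders"},
]

# The ForEach activity plays the role of this loop: each row feeds the
# parameterized source/sink datasets of a single Copy Activity.
for row in control_metadata:
    print(f"Copy {row['source_schema']}.{row['source_table']} "
          f"-> {row['sink_schema']}.{row['sink_table']}")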
3. How do you implement data encryption in Azure SQL Database?
Answer:
• At rest: Azure SQL provides Transparent Data Encryption (TDE) by default to protect data stored on disk.
• In transit: Uses TLS (Transport Layer Security) for securing data being transferred over the network.
• You can manage keys using Azure Key Vault to rotate and manage encryption keys securely.
4. Write Python code to generate Fibonacci numbers.
Answer:
def generate_fibonacci(n):
    # Seed with the first two Fibonacci numbers, then build the rest iteratively
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib[:n]  # slicing handles n < 2

print(generate_fibonacci(10))
# Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
5. What are the best practices for managing and optimizing storage costs in ADLS?
Answer:
• Use lifecycle policies to automatically move infrequently used data to cool/archive tiers (see the tiering sketch below).
• Partition data efficiently to reduce scanning costs.
• Store compressed formats like Parquet or Avro.
• Delete obsolete or temporary files regularly.
• Monitor and audit usage to detect cost anomalies early.
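Lifecycle policies are normally configured on the storage account itself, but as a small illustration of tiering (an addition, not part of the original answer), a single blob's access tier can also be changed with the azure-storage-blob SDK; the connection string, container, and blob name below are hypothetical.

from azure.storage.blob import BlobClient, StandardBlobTier

# Hypothetical connection details
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
blob = BlobClient.from_connection_string(conn_str, container_name="raw", blob_name="2021/old_export.parquet")

# Move an infrequently used blob to the Archive tier to cut storage cost
blob.set_standard_blob_tier(StandardBlobTier.ARCHIVE)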
6. How do you implement security measures for data in transit and at rest in Azure?
Answer:
At Rest:
• Use Azure Storage encryption (enabled by default) with Microsoft-managed or customer-managed keys.
• For SQL databases, use Transparent Data Encryption (TDE).
• Store secrets (like connection strings, passwords) in Azure Key Vault.
In Transit:
• Always enable HTTPS for secure communication.
• Use Private Endpoints or VPN/ExpressRoute for secure access to Azure services.
• For services like ADF, enable managed identity and secure connections between services using role-based access control (RBAC).
7. Describe the role of triggers and schedules in Azure Data Factory.
Answer:
• Triggers are used to start pipelines based on events or schedules.
o Schedule Trigger: Executes pipelines on a time-based schedule (e.g., every night at 1 AM).
o Event Trigger: Starts a pipeline when a file arrives in blob storage.
o Manual Trigger: Triggered on-demand or from a REST API.
• Triggers help automate data pipelines, reducing the need for manual intervention.
8. How do you optimize data storage and retrieval in Azure Data Lake Storage?
Answer:
• Store data in columnar formats like Parquet or Delta for efficient querying.
• Partition data by frequently queried columns (e.g., date, region) to speed up read operations.
• Use Z-ordering (in Delta Lake) for better data skipping.
• Enable hierarchical namespace to organize data logically.
• Compress data to reduce storage size and I/O costs. (A PySpark sketch of partitioned, compressed writes follows below.)
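To illustrate the partitioning and compression points, here is a minimal PySpark sketch (an addition to the original answer); the paths and column names are hypothetical, and the commented Z-order step assumes a Databricks/Delta Lake environment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-layout").getOrCreate()

# Hypothetical raw source
df = spark.read.csv("path/to/sales.csv", header=True, inferSchema=True)

# Write columnar, compressed, and partitioned by a frequently filtered column
(df.write
    .mode("overwrite")
    .partitionBy("sale_date")                  # hypothetical partition column
    .option("compression", "snappy")
    .parquet("/mnt/datalake/curated/sales"))   # hypothetical ADLS mount path

# With Delta Lake, Z-ordering improves data skipping on high-cardinality columns:
# spark.sql("OPTIMIZE delta.`/mnt/datalake/curated/sales_delta` ZORDER BY (customer_id)")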
9. Write a SQL query to find employees with no manager assigned.
Answer:
Assuming the manager_id column refers to the employee’s manager:
SELECT *
FROM employees
WHERE manager_id IS NULL;
This query pulls all employees who don't have a manager assigned (possibly the top-level execs!).
10. How do you implement data deduplication in PySpark?
Answer:
You can use the dropDuplicates() method:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Deduplication").getOrCreate()
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Drop duplicate rows based on all columns
dedup_df = df.dropDuplicates()

# Or drop duplicates based on specific columns
dedup_df = df.dropDuplicates(["employee_id"])

dedup_df.show()
This helps you eliminate repeated records efficiently before further transformations.
11. Explain the concept of Delta Lake compaction.
Answer:
Delta Lake compaction is the process of optimizing the number and size of small files in your Delta table.
• In streaming or frequent batch writes, Delta tables can accumulate lots of small files, hurting performance.
• Compaction helps by combining many small files into larger ones, improving:
o Read performance
o Query planning efficiency
o Data skipping
How to perform compaction:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/delta/sales")
deltaTable.optimize().executeCompaction()
Or using SQL in Databricks:
OPTIMIZE delta.`/mnt/delta/sales`
12. How do you monitor ADF pipeline performance?
Answer:
You can monitor performance using:
• Monitor tab in ADF Studio: Check pipeline and activity run status, execution time, and errors.
• Azure Monitor + Log Analytics: Enable diagnostic logging to push logs to a Log Analytics workspace.
• Alerts: Set up alerts based on failure, duration, or custom metrics.
• Activity run metrics: Track time taken for data movement, transformation, or lookups.
This helps you identify bottlenecks, long-running activities, and troubleshoot issues faster.
13. Write a SQL query to find the second-highest salary in a table.
Answer:
SELECT MAX(salary) AS SecondHighestSalary
FROM employees
WHERE salary < (
    SELECT MAX(salary) FROM employees
);
This works by getting the max salary that is less than the highest one.
Alternate (using DENSE_RANK):
SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;
(The alias salary_rank avoids clashing with RANK, which is a reserved keyword in some SQL dialects.)
14. How do you implement incremental load in Databricks?
Answer:
Incremental load means only processing new or updated records. Common approaches:
1. Using a watermark (timestamp column):
df = spark.read.format("delta").load("/mnt/delta/source")
new_data = df.filter("last_updated > '2024-01-01 00:00:00'")
2. Using merge for upserts:
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/delta/target")
delta_table.alias("tgt").merge(
    new_data.alias("src"),
    "tgt.id = src.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
This keeps your target table in sync without reprocessing all records.
15. Describe the role of Azure Key Vault in securing sensitive data.
Answer:
Azure Key Vault is a centralized cloud service to manage secrets, keys, and certificates. In data engineering, it's used to:
• Store connection strings, passwords, API keys, etc.
• Integrate with ADF, Databricks, Synapse securely (via linked services).
• Support access control via Azure RBAC and Managed Identity.
• Enable audit logs for secret access.
Using Key Vault removes hardcoded secrets and improves overall security and compliance.
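As a small illustration (an addition, not part of the original answer), this Python sketch reads a secret with the azure-identity and azure-keyvault-secrets SDKs; the vault URL and secret name are hypothetical, and it assumes the caller's identity has been granted access to the vault.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up a managed identity, Azure CLI login, etc.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://my-vault.vault.azure.net", credential=credential)  # hypothetical vault

# Fetch the secret at runtime instead of hardcoding the connection string
sql_conn_str = client.get_secret("sql-connection-string").value  # hypothetical secret name

In Databricks, the same pattern is usually achieved through a Key Vault-backed secret scope, read with dbutils.secrets.get(scope=..., key=...).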
16. Write Python code to sort a list of dictionaries by a key.
Answer:
Let’s say you have a list of dictionaries with employee info, and you want to sort them by salary:
employees = [
    {'name': 'Alice', 'salary': 70000},
    {'name': 'Bob', 'salary': 50000},
    {'name': 'Charlie', 'salary': 60000}
]

# Sort by salary in ascending order
sorted_employees = sorted(employees, key=lambda x: x['salary'])

for emp in sorted_employees:
    print(emp)
To sort in descending order, just add reverse=True.
17. How do you handle schema evolution in ADF?
Answer:
Schema evolution in ADF refers to handling changes in source data structure, like added columns, changed datatypes, etc.
Ways to handle it:
1. Auto Mapping in Copy Data Activity:
o Use “Auto Mapping” in the mapping tab.
o It adapts to new columns if the source changes.
2. Flexible Schema Handling in Data Flows:
o Use “Allow schema drift” in data flows.
o This lets ADF handle columns not explicitly defined in your transformation.
3. Parameterization & Metadata-driven Pipelines:
o Dynamically read schema info using metadata (from SQL or config files).
o Loop through columns and build dynamic transformations.
4. Using select * cautiously:
o Helps in quickly adopting new columns but can break downstream if not handled with care.
18. Explain the concept of shuffling in Spark.
Answer:
Shuffling is the process where Spark redistributes data across partitions, typically triggered by operations like:
• groupByKey()
• join()
• repartition()
• distinct()
It’s expensive because:
• It involves disk I/O, network transfer, and serialization.
• It can lead to performance bottlenecks and OOM errors if not managed properly.
Optimization Tips:
• Prefer reduceByKey() over groupByKey().
• Use broadcast joins when joining a small dataset with a large one (see the sketch below).
• Minimize wide transformations.
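To make the broadcast-join tip concrete, here is a minimal PySpark sketch (an addition to the original answer); the paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

large_df = spark.read.parquet("/mnt/datalake/transactions")   # hypothetical large fact table
small_df = spark.read.parquet("/mnt/datalake/dim_country")    # hypothetical small dimension

# broadcast() ships the small table to every executor, so the large table
# is joined in place without shuffling its partitions across the network.
joined = large_df.join(broadcast(small_df), on="country_code", how="left")
joined.show()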
19. How do you manage metadata in Azure Data Lake?
Answer:
Managing metadata in ADLS ensures discoverability, governance, and data lineage.
Ways to manage metadata:
1. Azure Purview (Microsoft Purview):
o Automatically scans ADLS, builds a data catalog, tracks schema, and provides lineage.
2. Directory Naming Conventions:
o Use folder structures to store partitioned metadata like /year=2024/month=04/.
3. Metadata Tables:
o Store metadata (file paths, schema info, modified timestamps) in SQL/Delta tables (see the sketch below).
4. Schema Registry (e.g., Azure Schema Registry or Confluent if Kafka is used):
o Maintains schema versions for streaming data.
5. Tagging and Classification:
o Classify sensitive data (e.g., PII, financial) and add custom tags.
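As a small illustration of the metadata-table approach (an addition, not part of the original answer), here is a PySpark sketch that appends file-level metadata to a Delta table; the columns, paths, and table location are hypothetical.

from datetime import datetime
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("metadata-table").getOrCreate()

# Hypothetical metadata describing files that landed in the lake
metadata_rows = [
    Row(file_path="/raw/sales/2024/04/01/sales_001.parquet",
        source_system="pos",
        schema_version="v2",
        modified_ts=datetime(2024, 4, 1, 2, 15)),
]

metadata_df = spark.createDataFrame(metadata_rows)

# Append to a Delta metadata table that downstream pipelines can query
(metadata_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/datalake/metadata/file_inventory"))  # hypothetical location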
20. What are the key considerations for designing scalable pipelines in ADF?
Answer:
To build scalable and efficient pipelines in Azure Data Factory:
Parameterization
• Use parameters in datasets and linked services to reuse pipelines.
Use of Mapping Data Flows
• Design transformations within ADF’s scalable Spark-based data flow engine.
Optimize Parallelism
• Enable "Degree of copy parallelism".
• Use ForEach activities with batch count.
Fault Tolerance
• Add retry policies, timeouts, and failure paths (via If Condition or Until).
Scalable Linked Services
• Use Azure Integration Runtime for cloud-to-cloud, and Self-hosted IR for on-prem connectivity.
Incremental Loads
• Load only the delta (new/changed records) to reduce load and speed up pipelines.
Monitoring and Alerting
• Integrate with Azure Monitor for performance and failure alerts.