Infosys data engineering interview Q & A 2022
✍️ By MONU SINGH | 11/18/2025
These are the questions asked in the Infosys Data Engineering interview (2022).
1. What is the difference between a job cluster and an interactive cluster in Databricks?
Answer:
• Job Cluster: Created temporarily for the duration of a job and terminated afterward. It is ideal for production workloads where resources are not needed all the time.
• Interactive Cluster: Stays alive until manually terminated. Used for development and exploration where you need repeated access for testing and debugging.
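To make the distinction concrete (this example is not from the original answer), a job cluster is typically declared inline in a Databricks Jobs API payload, while an interactive (all-purpose) cluster is referenced by its ID. A minimal Python sketch, assuming Jobs API 2.1 field names; the job name, notebook path, and cluster ID are hypothetical:

# Job cluster: defined inline under "new_cluster", created per run and auto-terminated.
job_with_job_cluster = {
    "name": "nightly-etl",  # hypothetical job name
    "tasks": [{
        "task_key": "load_sales",
        "notebook_task": {"notebook_path": "/Repos/etl/load_sales"},  # hypothetical path
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}

# Interactive cluster: the task points at an existing all-purpose cluster by ID,
# which keeps running until someone terminates it manually.
job_on_interactive_cluster = {
    "name": "dev-etl",
    "tasks": [{
        "task_key": "load_sales",
        "notebook_task": {"notebook_path": "/Repos/etl/load_sales"},
        "existing_cluster_id": "0801-123456-abcdefgh",  # hypothetical cluster ID
    }],
}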
2. How to copy all tables from one source to the target using metadata-driven pipelines in ADF?
Answer:
Use a metadata-driven approach:
• Store metadata (source and destination table names, schema, etc.) in a control table or config file.
• Loop through the metadata using a ForEach activity in ADF.
• Use Lookup to fetch metadata, then a Copy Activity to perform the actual data movement using parameters from the metadata.
• Dynamic datasets and parameterized linked services enable flexibility for source and sink. (A sketch of what the control metadata might look like follows below.)
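To make the control-table idea concrete, here is a minimal Python sketch (not ADF code) of the rows a Lookup activity might return and the iteration a ForEach activity performs; all table and column names are hypothetical.

# Hypothetical control metadata, as returned by a Lookup activity
control_metadata = [
    {"source_schema": "dbo", "source_table": "customers",
     "sink_schema": "stg", "sink_table": "customers"},
    {"source_schema": "dbo", "source_table": "orders",
     "sink_schema": "stg", "sink_table": "orders"},
]

# The ForEach activity plays the role of this loop: each row feeds the
# parameterized source/sink datasets of a single Copy Activity.
for row in control_metadata:
    print(f"Copy {row['source_schema']}.{row['source_table']} "
          f"-> {row['sink_schema']}.{row['sink_table']}")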
3. How do you implement data encryption in Azure SQL Database?
Answer:
• At rest: Azure SQL provides Transparent Data Encryption (TDE) by default to protect data stored on disk.
• In transit: Uses TLS (Transport Layer Security) for securing data being transferred over the network.
• You can manage keys using Azure Key Vault to rotate and manage encryption keys securely.
4. Write Python code to generate Fibonacci numbers.
Answer:
def generate_fibonacci(n):
    # Seed with the first two Fibonacci numbers, then build the rest iteratively
    fib = [0, 1]
    for i in range(2, n):
        fib.append(fib[i-1] + fib[i-2])
    return fib[:n]  # slicing handles n < 2

print(generate_fibonacci(10))
# Output: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
5. What are the best practices for managing and optimizing storage costs in ADLS?
Answer:
• Use lifecycle policies to automatically move infrequently used data to cool/archive tiers (see the tiering sketch below).
• Partition data efficiently to reduce scanning costs.
• Store compressed formats like Parquet or Avro.
• Delete obsolete or temporary files regularly.
• Monitor and audit usage to detect cost anomalies early.
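Lifecycle policies are normally configured on the storage account itself, but as a small illustration of tiering (an addition, not part of the original answer), a single blob's access tier can also be changed with the azure-storage-blob SDK; the connection string, container, and blob name below are hypothetical.

from azure.storage.blob import BlobClient, StandardBlobTier

# Hypothetical connection details
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
blob = BlobClient.from_connection_string(conn_str, container_name="raw", blob_name="2021/old_export.parquet")

# Move an infrequently used blob to the Archive tier to cut storage cost
blob.set_standard_blob_tier(StandardBlobTier.ARCHIVE)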
6. How do you implement security measures for data in transit and at rest in Azure?
Answer:
At Rest:
• Use Azure Storage encryption (enabled by default) with Microsoft-managed or customer-managed keys.
• For SQL databases, use Transparent Data Encryption (TDE).
• Store secrets (like connection strings, passwords) in Azure Key Vault.
In Transit:
• Always enable HTTPS for secure communication.
• Use Private Endpoints or VPN/ExpressRoute for secure access to Azure services.
• For services like ADF, enable managed identity and secure connections between services using role-based access control (RBAC).
7. Describe the role of triggers and schedules in Azure Data Factory.
Answer:
• Triggers are used to start pipelines based on events or schedules.
o Schedule Trigger: Executes pipelines on a time-based schedule (e.g., every night at 1 AM).
o Event Trigger: Starts a pipeline when a file arrives in blob storage.
o Manual Trigger: Triggered on-demand or from a REST API.
• Triggers help automate data pipelines, reducing the need for manual intervention.
8. How do you optimize data storage and retrieval in Azure Data Lake Storage?
Answer:
• Store data in columnar formats like Parquet or Delta for efficient querying.
• Partition data by frequently queried columns (e.g., date, region) to speed up read operations.
• Use Z-ordering (in Delta Lake) for better data skipping.
• Enable hierarchical namespace to organize data logically.
• Compress data to reduce storage size and I/O costs. (A PySpark sketch of partitioned, compressed writes follows below.)
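To illustrate the partitioning and compression points, here is a minimal PySpark sketch (an addition to the original answer); the paths and column names are hypothetical, and the commented Z-order step assumes a Databricks/Delta Lake environment.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-layout").getOrCreate()

# Hypothetical raw source
df = spark.read.csv("path/to/sales.csv", header=True, inferSchema=True)

# Write columnar, compressed, and partitioned by a frequently filtered column
(df.write
    .mode("overwrite")
    .partitionBy("sale_date")                  # hypothetical partition column
    .option("compression", "snappy")
    .parquet("/mnt/datalake/curated/sales"))   # hypothetical ADLS mount path

# With Delta Lake, Z-ordering improves data skipping on high-cardinality columns:
# spark.sql("OPTIMIZE delta.`/mnt/datalake/curated/sales_delta` ZORDER BY (customer_id)")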
9. Write a SQL query to find employees with no manager assigned.
Answer:
Assuming the manager_id column refers to the employee’s manager:
SELECT *
FROM employees
WHERE manager_id IS NULL;
This query pulls all employees who don't have a manager assigned (possibly the top-level execs!).
10. How do you implement data deduplication in PySpark?
Answer:
You can use the dropDuplicates() method:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Deduplication").getOrCreate()
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Drop duplicate rows based on all columns
dedup_df = df.dropDuplicates()

# Or drop duplicates based on specific columns
dedup_df = df.dropDuplicates(["employee_id"])

dedup_df.show()
This helps you eliminate repeated records efficiently before further transformations.
11. Explain the concept of Delta Lake compaction.
Answer:
Delta Lake compaction is the process of optimizing the number and size of small files in your Delta table.
• In streaming or frequent batch writes, Delta tables can accumulate lots of small files, hurting performance.
• Compaction helps by combining many small files into larger ones, improving:
o Read performance
o Query planning efficiency
o Data skipping
How to perform compaction:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/delta/sales")
deltaTable.optimize().executeCompaction()
Or using SQL in Databricks:
OPTIMIZE delta.`/mnt/delta/sales`
12. How do you monitor ADF pipeline performance?
Answer:
You can monitor performance using:
• Monitor tab in ADF Studio: Check pipeline and activity run status, execution time, and errors.
• Azure Monitor + Log Analytics: Enable diagnostic logging to push logs to a Log Analytics workspace.
• Alerts: Set up alerts based on failure, duration, or custom metrics.
• Activity run metrics: Track time taken for data movement, transformation, or lookups.
This helps you identify bottlenecks, long-running activities, and troubleshoot issues faster.
13. Write a SQL query to find the second-highest salary in a table.
Answer:
SELECT MAX(salary) AS SecondHighestSalary
FROM employees
WHERE salary < (
    SELECT MAX(salary) FROM employees
);
This works by getting the max salary that is less than the highest one.
Alternate (using DENSE_RANK):
SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;
(The alias salary_rank avoids clashing with RANK, which is a reserved keyword in some SQL dialects.)
14. How do you implement incremental load in Databricks?
Answer:
Incremental load means only processing new or updated records. Common approaches:
1. Using a watermark (timestamp column):
df = spark.read.format("delta").load("/mnt/delta/source")
new_data = df.filter("last_updated > '2024-01-01 00:00:00'")
2. Using merge for upserts:
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/delta/target")
delta_table.alias("tgt").merge(
    new_data.alias("src"),
    "tgt.id = src.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
This keeps your target table in sync without reprocessing all records.
15. Describe the role of Azure Key Vault in securing sensitive data.
Answer:
Azure Key Vault is a centralized cloud service to manage secrets, keys, and certificates. In data engineering, it's used to:
• Store connection strings, passwords, API keys, etc.
• Integrate with ADF, Databricks, Synapse securely (via linked services).
• Support access control via Azure RBAC and Managed Identity.
• Enable audit logs for secret access.
Using Key Vault removes hardcoded secrets and improves overall security and compliance.
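As a small illustration (an addition, not part of the original answer), this Python sketch reads a secret with the azure-identity and azure-keyvault-secrets SDKs; the vault URL and secret name are hypothetical, and it assumes the caller's identity has been granted access to the vault.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up a managed identity, Azure CLI login, etc.
credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://my-vault.vault.azure.net", credential=credential)  # hypothetical vault

# Fetch the secret at runtime instead of hardcoding the connection string
sql_conn_str = client.get_secret("sql-connection-string").value  # hypothetical secret name

In Databricks, the same pattern is usually achieved through a Key Vault-backed secret scope, read with dbutils.secrets.get(scope=..., key=...).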
16. Write Python code to sort a list of dictionaries by a key.
Answer:
Let’s say you have a list of dictionaries with employee info, and you want to sort them by salary:
employees = [
    {'name': 'Alice', 'salary': 70000},
    {'name': 'Bob', 'salary': 50000},
    {'name': 'Charlie', 'salary': 60000}
]

# Sort by salary in ascending order
sorted_employees = sorted(employees, key=lambda x: x['salary'])

for emp in sorted_employees:
    print(emp)
To sort in descending order, just add reverse=True.
17. How do you handle schema evolution in ADF?
Answer:
Schema evolution in ADF refers to handling changes in source data structure, like added columns, changed datatypes, etc.
Ways to handle it:
1. Auto Mapping in Copy Data Activity:
o Use “Auto Mapping” in the mapping tab.
o It adapts to new columns if the source changes.
2. Flexible Schema Handling in Data Flows:
o Use “Allow schema drift” in data flows.
o This lets ADF handle columns not explicitly defined in your transformation.
3. Parameterization & Metadata-driven Pipelines:
o Dynamically read schema info using metadata (from SQL or config files).
o Loop through columns and build dynamic transformations.
4. Using select * cautiously:
o Helps in quickly adopting new columns but can break downstream if not handled with care.
18. Explain the concept of shuffling in Spark.
Answer:
Shuffling is the process where Spark redistributes data across partitions, typically triggered by operations like:
• groupByKey()
• join()
• repartition()
• distinct()
It’s expensive because:
• It involves disk I/O, network transfer, and serialization.
• It can lead to performance bottlenecks and OOM errors if not managed properly.
Optimization Tips:
• Prefer reduceByKey() over groupByKey().
• Use broadcast joins when joining a small dataset with a large one (see the sketch below).
• Minimize wide transformations.
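To make the broadcast-join tip concrete, here is a minimal PySpark sketch (an addition to the original answer); the paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

large_df = spark.read.parquet("/mnt/datalake/transactions")   # hypothetical large fact table
small_df = spark.read.parquet("/mnt/datalake/dim_country")    # hypothetical small dimension

# broadcast() ships the small table to every executor, so the large table
# is joined in place without shuffling its partitions across the network.
joined = large_df.join(broadcast(small_df), on="country_code", how="left")
joined.show()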
19. How do you manage metadata in Azure Data Lake?
Answer:
Managing metadata in ADLS ensures discoverability, governance, and data lineage.
Ways to manage metadata:
1. Azure Purview (Microsoft Purview):
o Automatically scans ADLS, builds a data catalog, tracks schema, and provides lineage.
2. Directory Naming Conventions:
o Use folder structures to store partitioned metadata like /year=2024/month=04/.
3. Metadata Tables:
o Store metadata (file paths, schema info, modified timestamps) in SQL/Delta tables (see the sketch below).
4. Schema Registry (e.g., Azure Schema Registry or Confluent if Kafka is used):
o Maintains schema versions for streaming data.
5. Tagging and Classification:
o Classify sensitive data (e.g., PII, financial) and add custom tags.
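As a small illustration of the metadata-table approach (an addition, not part of the original answer), here is a PySpark sketch that appends file-level metadata to a Delta table; the columns, paths, and table location are hypothetical.

from datetime import datetime
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("metadata-table").getOrCreate()

# Hypothetical metadata describing files that landed in the lake
metadata_rows = [
    Row(file_path="/raw/sales/2024/04/01/sales_001.parquet",
        source_system="pos",
        schema_version="v2",
        modified_ts=datetime(2024, 4, 1, 2, 15)),
]

metadata_df = spark.createDataFrame(metadata_rows)

# Append to a Delta metadata table that downstream pipelines can query
(metadata_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/datalake/metadata/file_inventory"))  # hypothetical location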
20. What are the key considerations for designing scalable pipelines in ADF?
Answer:
To build scalable and efficient pipelines in Azure Data Factory:
Parameterization
• Use parameters in datasets and linked services to reuse pipelines.
Use of Mapping Data Flows
• Design transformations within ADF’s scalable Spark-based data flow engine.
Optimize Parallelism
• Enable "Degree of copy parallelism".
• Use ForEach activities with batch count.
Fault Tolerance
• Add retry policies, timeouts, and failure paths (via If Condition or Until).
Scalable Linked Services
• Use Azure Integration Runtime for cloud-to-cloud, and Self-hosted IR for on-prem connectivity.
Incremental Loads
• Load only the delta (new/changed records) to reduce load and speed up pipelines.
Monitoring and Alerting
• Integrate with Azure Monitor for performance and failure alerts.