
KPMG Data Engineering Interview Q & A

✍️ By MONU SINGH | 11/18/2025

Here are eight important questions and answers asked in the KPMG Data Engineering Interview, 2019.


1. How do you create and deploy notebooks in Databricks?

 

Answer:

Creating a notebook:

1. Log in to Azure Databricks.

2. Click on Workspace > Users > your email.

3. Click Create > Notebook.

4. Name the notebook, choose a default language (Python, SQL, Scala, etc.), and attach a cluster.

Deploying notebooks:

Manual execution: Run the notebook interactively.

Scheduled job: Convert the notebook into a job and set a schedule.

CI/CD deployment: Store notebooks in Git (e.g., Azure DevOps) and deploy them from pipelines using the Databricks CLI or REST API; a sketch follows below.
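
For illustration, here is a minimal sketch of the REST API route, assuming the workspace URL and a personal access token are available as environment variables (the variable names, local file name, and target workspace path are assumptions, not a prescribed setup):

# Push a local notebook source file into the Databricks workspace from a
# CI/CD step using the Workspace Import REST API.
import base64
import os

import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

with open("etl_notebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/etl_notebook",  # destination in the workspace
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()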

 

 

 

2. What are the best practices for data archiving and retention in Azure?

 

Answer:

• Use Azure Data Lake Storage or Blob Storage for long-term archival.

• Apply lifecycle management policies (a sketch follows this list):

o Move older data to the cool or archive tier based on age.

o Automatically delete data after the retention period ends.

• Tag data with metadata for classification (e.g., creation date).

• Encrypt and secure archived data (use customer-managed keys, CMK, if needed).

• Monitor access and compliance using Azure Monitor and Azure Purview.

• Document and automate retention policies in your data governance strategy.
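
As a sketch of the lifecycle-policy point above, the snippet below builds a policy that tiers blobs to cool after 30 days, to archive after 90 days, and deletes them after roughly seven years. The rule name, prefix, and day counts are illustrative assumptions; the resulting policy.json could then be applied with the Azure CLI (az storage account management-policy create --account-name <account> --resource-group <rg> --policy @policy.json).

# Write an illustrative Blob lifecycle management policy to policy.json.
import json

policy = {
    "rules": [
        {
            "enabled": True,
            "name": "archive-old-data",
            "type": "Lifecycle",
            "definition": {
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 2555},
                    }
                },
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
            },
        }
    ]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)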

 

 

 

3. How do you connect ADLS (Azure Data Lake Storage) to Databricks?


Answer:

You can connect using OAuth (service principal) or a storage account access key:

Mounting ADLS Gen2 to Databricks

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    # In practice, read the secret from a secret scope: dbutils.secrets.get("<scope>", "<key>")
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs=configs,
)
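
Once mounted, the lake can be browsed and read like any other path. A short usage sketch follows; the folder and file names are assumptions:

# Verify the mount and read a file from it.
display(dbutils.fs.ls("/mnt/mydata"))

df = spark.read.option("header", "true").csv("/mnt/mydata/raw/sales.csv")
display(df)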

 

 

 

4. Write a SQL query to list all employees who joined in the last 6 months.

 

Answer:

SELECT *

FROM employees

WHERE join_date >= DATEADD(MONTH, -6, GETDATE());

The query above uses SQL Server syntax (DATEADD, GETDATE()); in MySQL, use join_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH), and in PostgreSQL, join_date >= CURRENT_DATE - INTERVAL '6 months'.

 

 

 

5. How do you implement data validation and quality checks in ADF?

 

Answer:

• Use Data Flow or Stored Procedure activities to run the checks.

• Perform null checks, data type checks, range checks, etc. (a PySpark sketch of such checks follows this list).

• Create Validation activities with expressions (e.g., row count > 0).

• Add If Condition or Until activities for conditional logic.

• Log validation results to a control table or send alerts.

• Use a custom logging framework in ADF for monitoring.
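
As an illustration of the kinds of checks involved (not an ADF-specific API), here is a PySpark sketch that a Databricks notebook activity called from an ADF pipeline might run; the table and column names are assumptions:

# Basic null, range, and row-count checks; raising an exception fails the
# notebook activity and therefore the calling pipeline.
from pyspark.sql import functions as F

df = spark.read.table("staging.orders")

row_count = df.count()
null_keys = df.filter(F.col("order_id").isNull()).count()
bad_amounts = df.filter((F.col("amount") < 0) | (F.col("amount") > 1_000_000)).count()

if row_count == 0 or null_keys > 0 or bad_amounts > 0:
    raise ValueError(
        f"Validation failed: rows={row_count}, null_keys={null_keys}, bad_amounts={bad_amounts}"
    )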

 

 

 

6. Explain the concept of Azure Data Lake and its integration with SQL-based systems.

Answer:

Azure Data Lake Storage (ADLS) is scalable, secure storage for big data.

Integration with SQL systems:

• Use PolyBase or OPENROWSET in Azure Synapse to query files in ADLS directly.

• Create external tables in Synapse that map to files in ADLS.

• Use ADF to move and transform data between ADLS and SQL databases.

• Integrate Databricks with Azure SQL Database for both ETL and analytics workloads (a sketch follows below).

This hybrid integration enables a flexible, scalable, and cost-effective data architecture.
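
A minimal Databricks sketch of that last point, reading curated data from ADLS and writing it to Azure SQL Database over JDBC; the server, database, table, path, and secret-scope names are assumptions:

# Read Parquet from the mounted lake and write the result to Azure SQL over JDBC.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<database>;encrypt=true;loginTimeout=30"
)

df = spark.read.parquet("/mnt/mydata/curated/sales/")

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.sales_summary")
   .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
   .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
   .mode("overwrite")
   .save())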

 

 

 

7. How do you handle exceptions and errors in Python?

 

Answer:

Use try-except-finally blocks:

try:
    # risky code
    x = 10 / 0
except ZeroDivisionError as e:
    print(f"Error occurred: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
finally:
    print("Cleanup actions if needed.")

• Use the logging module instead of print for error reporting.

• You can also define and raise custom exceptions with raise; a short sketch follows.
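
The sketch below combines both points; the exception name and messages are illustrative:

# Log errors with the logging module and raise a custom exception.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DataValidationError(Exception):
    """Raised when an input record fails validation."""

def process(value):
    if value < 0:
        raise DataValidationError(f"Negative value not allowed: {value}")
    return value * 2

try:
    process(-5)
except DataValidationError:
    logger.exception("Validation failed")  # logs the message plus the traceback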

 

 

 

8. What is the process of normalization, and why is it required?

 

Answer:

Normalization is the process of organizing data to reduce redundancy and improve integrity. Steps (normal forms):

1NF: Remove repeating groups so every column holds atomic values.

2NF: Remove partial dependencies (non-key columns must depend on the whole key).

3NF: Remove transitive dependencies (non-key columns must depend only on the key).

Why it's required:

• Reduces data redundancy.

• Improves consistency and integrity.

• Makes data easier to maintain.

However, in analytics workloads such as data warehouses, denormalization is often preferred for query performance. A short sketch of a normalized schema follows.
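
For illustration, here is a minimal 3NF-style schema using Python's built-in sqlite3: customer details are stored once in customers and referenced by key from orders, instead of being repeated on every order row. The table and column names are assumptions:

# Create a small normalized schema in an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")
conn.commit()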
