

LatentView Data Engineering Interview Questions 2021

✍️ By MONU SINGH | 11/18/2025


Here are 8 important questions and answers asked in the LatentView Data Engineering interview (2021).

 

1. Explain the purpose of SparkContext and SparkSession.

 

Answer:

• SparkContext:

  ◦ The entry point to Spark Core functionality. It sets up internal services and establishes a connection to a Spark execution environment.

  ◦ Used in older Spark versions (< 2.0).

  ◦ Example: sc = SparkContext(appName="MyApp")

• SparkSession:

  ◦ Introduced in Spark 2.0 as the unified entry point for all Spark functionality, including Spark SQL and the DataFrame API.

  ◦ Internally, it creates a SparkContext.

  ◦ Recommended over SparkContext for all modern use.

  ◦ Example: spark = SparkSession.builder.appName("MyApp").getOrCreate()
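
A minimal sketch showing the relationship between the two (the DataFrame contents here are illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the modern entry point
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# The underlying SparkContext is created internally and stays accessible
sc = spark.sparkContext
print(sc.appName)  # MyApp

# The DataFrame API goes through the SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()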


 

 

2. How to handle incremental load in PySpark when the table lacks a last_modified column?

 

 If there's no last_modified or created_at column:

Option 1: Use checksum/hash comparison: generate a hash of the relevant columns and compare it with the previously stored hash snapshot (see the sketch after this list).

Option 2: Use change tracking tables or CDC (Change Data Capture) from the source if supported.

Option 3: If the table is small, do a full load and deduplicate on the destination using surrogate or primary keys.

Option 4: Use record versioning if available via audit logs.
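
A minimal PySpark sketch of the hash-comparison approach from Option 1; the table, column names, and snapshot paths are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("IncrementalLoad").getOrCreate()

# Current source extract (hypothetical path and columns)
source_df = spark.read.parquet("/data/source/customers")

# Hash the business columns that define a row's content
hashed_df = source_df.withColumn(
    "row_hash", F.sha2(F.concat_ws("||", "customer_id", "name", "email"), 256)
)

# Hash snapshot stored by the previous run
prev_hashes = spark.read.parquet("/data/snapshots/customers_hashes")

# Keep only rows whose hash is new or changed since the last snapshot
changed_df = hashed_df.join(prev_hashes, on="row_hash", how="left_anti")

# Load the changed rows, then overwrite the snapshot for the next run
changed_df.drop("row_hash").write.mode("append").parquet("/data/target/customers")
hashed_df.select("row_hash").write.mode("overwrite").parquet("/data/snapshots/customers_hashes")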

 

3. How do you use Azure Stream Analytics for real-time data processing?

 

Answer:

• Azure Stream Analytics (ASA) is a real-time event processing engine.

• Steps:

1. Define input: IoT Hub, Event Hub, or Blob Storage.

2. Define query: use SQL-like syntax to process data (aggregations, joins, filters).

3. Define output: Azure SQL, Power BI, ADLS, Blob, etc.

• Example query:

 

SELECT deviceId, AVG(temperature) AS avgTemp

INTO outputAlias

FROM inputAlias

GROUP BY deviceId, TumblingWindow(minute, 1)

 

 

 

4. What are the security features available in ADLS (e.g., access control lists, role-based access)?

Answer:

Security features in Azure Data Lake Storage Gen2:

• Role-Based Access Control (RBAC):

  ◦ Assigns permissions at the subscription, resource group, or container level via Azure AD roles.

• Access Control Lists (ACLs):

  ◦ Fine-grained permissions at the folder and file level.

  ◦ Supports POSIX-style permissioning.

• Network Security:

  ◦ Virtual Network (VNet) service endpoints, firewalls, and private endpoints.

• Encryption:

  ◦ Data is encrypted at rest with Microsoft-managed or customer-managed keys (CMK).

  ◦ Supports HTTPS for encryption in transit.
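
A minimal sketch of setting a POSIX-style ACL on a directory with the azure-storage-file-datalake SDK; the storage account, file system, and directory names are hypothetical:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with Azure AD (e.g., a managed identity or developer login)
credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=credential,
)

# Point at a directory inside a file system (container) -- names are hypothetical
directory = service.get_file_system_client("raw").get_directory_client("sales/2021")

# POSIX-style ACL: owner rwx, owning group r-x, everyone else no access
directory.set_access_control(acl="user::rwx,group::r-x,other::---")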

 

 

 

5. Write a SQL query to remove duplicate rows from a table.

Answer:

DELETE FROM my_table
WHERE id NOT IN (
    SELECT MIN(id)
    FROM my_table
    GROUP BY column1, column2, column3
);

This assumes id is a unique identifier. Adjust columns based on what defines a duplicate.
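
For comparison, the PySpark equivalent is a one-liner with dropDuplicates; the table and column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dedup").getOrCreate()

df = spark.read.table("my_table")

# Keep one row per combination of the columns that define a duplicate
deduped = df.dropDuplicates(["column1", "column2", "column3"])
deduped.write.mode("overwrite").saveAsTable("my_table_deduped")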

 

 

 

6. How do you manage data lifecycle policies in ADLS?

 

Answer:

Use Azure Blob Lifecycle Management policies:

• Define rules for automatic tiering or deletion based on blob age or last-modified date.

Examples (expressed as a JSON policy in the sketch after the steps below):

• Move blobs to the cool tier after 30 days.

• Delete blobs older than 180 days.

Steps:

1. Go to storage account > Data Management > Lifecycle Management.

2. Add rule: Choose conditions and actions.

3. Apply to selected containers or blob prefixes.
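
Behind the portal UI, a lifecycle rule is stored as a JSON policy. A minimal sketch matching the examples above; the rule name and blob prefix are hypothetical:

{
  "rules": [
    {
      "enabled": true,
      "name": "tier-and-expire-logs",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "delete": { "daysAfterModificationGreaterThan": 180 }
          }
        }
      }
    }
  ]
}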

 

 

 

7. What are the key considerations for designing a scalable data architecture in Azure?

Answer:

• Modular design: decouple ingestion, processing, and serving layers.

• Use of scalable services: Databricks, Synapse, ADF, ADLS Gen2.

• Partitioning: effective partitioning for parallelism (see the sketch after this list).

• Security & governance: use Purview, Key Vault, RBAC/ACLs.

• Monitoring: implement telemetry/logging with Azure Monitor and Log Analytics.

• Automation: use CI/CD and infrastructure as code (ARM/Bicep/Terraform).

• Cost optimization: use proper storage tiers, autoscaling, and data lifecycle rules.
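
A minimal PySpark sketch of the partitioning point: writing a dataset partitioned by date so downstream jobs can prune partitions and read in parallel (the paths and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

events = spark.read.json("/raw/events")

# Partition the output by event_date: each date lands in its own folder,
# so queries filtering on event_date read only the matching partitions
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("abfss://curated@mylake.dfs.core.windows.net/events"))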

 

 

 

8. How do you integrate Azure Key Vault with other Azure services?

 

Answer:

Azure Key Vault is used to store secrets, keys, and certificates securely.

Integration Examples:

• ADF: create a linked service and reference secrets using the @Microsoft.KeyVault() syntax.

• Databricks: use secret scopes backed by Key Vault (dbutils.secrets.get()).

• App Services / Functions: reference secrets in environment variables using @Microsoft.KeyVault() in configuration.

• Synapse: secure credentials for linked services.

Steps:

1. Assign Managed Identity to the service.

2. Grant access policy in Key Vault for the identity.

3. Reference the secret in service configuration.
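
A minimal sketch of the Databricks integration: reading a secret from a Key Vault-backed secret scope; the scope, key, and connection details are hypothetical:

# In a Databricks notebook, dbutils is available without an import.
# "kv-scope" must be a secret scope backed by the Key Vault, and
# "sql-password" a secret stored in that vault (both hypothetical).
password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

# Use the secret in a JDBC connection instead of hard-coding it
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.orders")
      .option("user", "etl_user")
      .option("password", password)
      .load())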

 
