

LatentView Data Engineering Interview Questions 2021

✍️ By MONU SINGH | 11/18/2025


Here are 8 important questions and answers asked in the LatentView Data Engineering interview (2021).

 

1. Explain the purpose of SparkContext and SparkSession.

 

Answer:

• SparkContext:

  ◦ The entry point to Spark Core functionality. It sets up internal services and establishes a connection to a Spark execution environment.

  ◦ Used in older Spark versions (< 2.0).

  ◦ Example: sc = SparkContext(appName="MyApp")

• SparkSession:

  ◦ Introduced in Spark 2.0 as the unified entry point for all Spark functionality, including Spark SQL and the DataFrame API.

  ◦ Internally, it creates a SparkContext.

  ◦ Recommended over SparkContext for all modern use.

  ◦ Example: spark = SparkSession.builder.appName("MyApp").getOrCreate()
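
A minimal sketch showing the relationship between the two (the DataFrame contents here are illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession -- the modern entry point
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# The underlying SparkContext is created internally and stays accessible
sc = spark.sparkContext
print(sc.appName)  # MyApp

# The DataFrame API goes through the SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()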


 

 

2. How to handle incremental load in PySpark when the table lacks a last_modified column?

 

 If there's no last_modified or created_at column:

Option 1: Use checksum/hash comparison: generate a hash of the relevant columns and compare it with the previously stored hash snapshot (see the sketch after this list).

Option 2: Use change tracking tables or CDC (Change Data Capture) from the source if supported.

Option 3: If the table is small, do a full load and deduplicate on the destination using surrogate or primary keys.

Option 4: Use record versioning if available via audit logs.
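
A minimal PySpark sketch of the hash-comparison approach from Option 1; the table, column names, and snapshot paths are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("IncrementalLoad").getOrCreate()

# Current source extract (hypothetical path and columns)
source_df = spark.read.parquet("/data/source/customers")

# Hash the business columns that define a row's content
hashed_df = source_df.withColumn(
    "row_hash", F.sha2(F.concat_ws("||", "customer_id", "name", "email"), 256)
)

# Hash snapshot stored by the previous run
prev_hashes = spark.read.parquet("/data/snapshots/customers_hashes")

# Keep only rows whose hash is new or changed since the last snapshot
changed_df = hashed_df.join(prev_hashes, on="row_hash", how="left_anti")

# Load the changed rows, then overwrite the snapshot for the next run
changed_df.drop("row_hash").write.mode("append").parquet("/data/target/customers")
hashed_df.select("row_hash").write.mode("overwrite").parquet("/data/snapshots/customers_hashes")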

 

3. How do you use Azure Stream Analytics for real-time data processing?

 

Answer:

• Azure Stream Analytics (ASA) is a real-time event processing engine.

• Steps:

1. Define input: IoT Hub, Event Hub, or Blob Storage.

2. Define query: use SQL-like syntax to process data (aggregations, joins, filters).

3. Define output: Azure SQL, Power BI, ADLS, Blob, etc.

• Example query:

 

SELECT deviceId, AVG(temperature) AS avgTemp

INTO outputAlias

FROM inputAlias

GROUP BY deviceId, TumblingWindow(minute, 1)

 

 

 

4. What are the security features available in ADLS (e.g., access control lists, role-based access)?

Answer:

Security features in Azure Data Lake Storage Gen2:

• Role-Based Access Control (RBAC):

  ◦ Assigns permissions at the subscription, resource group, or container level via Azure AD roles.

• Access Control Lists (ACLs):

  ◦ Fine-grained permissions at the folder and file level.

  ◦ Supports POSIX-style permissioning.

• Network Security:

  ◦ Virtual Network (VNet) service endpoints, firewalls, and private endpoints.

• Encryption:

  ◦ Data is encrypted at rest with Microsoft-managed or customer-managed keys (CMK).

  ◦ Supports HTTPS for encryption in transit.
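
A minimal sketch of setting a POSIX-style ACL on a directory with the azure-storage-file-datalake SDK; the storage account, file system, and directory names are hypothetical:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with Azure AD (e.g., a managed identity or developer login)
credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=credential,
)

# Point at a directory inside a file system (container) -- names are hypothetical
directory = service.get_file_system_client("raw").get_directory_client("sales/2021")

# POSIX-style ACL: owner rwx, owning group r-x, everyone else no access
directory.set_access_control(acl="user::rwx,group::r-x,other::---")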

 

 

 

5. Write a SQL query to remove duplicate rows from a table.

Answer:

DELETE FROM my_table
WHERE id NOT IN (
    SELECT MIN(id)
    FROM my_table
    GROUP BY column1, column2, column3
);

This assumes id is a unique identifier. Adjust columns based on what defines a duplicate.
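
For comparison, the PySpark equivalent is a one-liner with dropDuplicates; the table and column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dedup").getOrCreate()

df = spark.read.table("my_table")

# Keep one row per combination of the columns that define a duplicate
deduped = df.dropDuplicates(["column1", "column2", "column3"])
deduped.write.mode("overwrite").saveAsTable("my_table_deduped")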

 

 

 

6. How do you manage data lifecycle policies in ADLS?

 

Answer:

Use Azure Blob Lifecycle Management policies:

• Define rules for automatic tiering or deletion based on blob age or last-modified date.

Examples (expressed as a JSON policy in the sketch after the steps below):

• Move blobs to the cool tier after 30 days.

• Delete blobs older than 180 days.

Steps:

1. Go to storage account > Data Management > Lifecycle Management.

2. Add rule: Choose conditions and actions.

3. Apply to selected containers or blob prefixes.
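
Behind the portal UI, a lifecycle rule is stored as a JSON policy. A minimal sketch matching the examples above; the rule name and blob prefix are hypothetical:

{
  "rules": [
    {
      "enabled": true,
      "name": "tier-and-expire-logs",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["logs/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "delete": { "daysAfterModificationGreaterThan": 180 }
          }
        }
      }
    }
  ]
}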

 

 

 

7. What are the key considerations for designing a scalable data architecture in Azure?

Answer:

• Modular design: decouple ingestion, processing, and serving layers.

• Use of scalable services: Databricks, Synapse, ADF, ADLS Gen2.

• Partitioning: effective partitioning for parallelism (see the sketch after this list).

• Security & governance: use Purview, Key Vault, RBAC/ACLs.

• Monitoring: implement telemetry/logging with Azure Monitor and Log Analytics.

• Automation: use CI/CD and infrastructure as code (ARM/Bicep/Terraform).

• Cost optimization: use proper storage tiers, autoscaling, and data lifecycle rules.
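
A minimal PySpark sketch of the partitioning point: writing a dataset partitioned by date so downstream jobs can prune partitions and read in parallel (the paths and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

events = spark.read.json("/raw/events")

# Partition the output by event_date: each date lands in its own folder,
# so queries filtering on event_date read only the matching partitions
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("abfss://curated@mylake.dfs.core.windows.net/events"))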

 

 

 

8. How do you integrate Azure Key Vault with other Azure services?

 

Answer:

Azure Key Vault is used to store secrets, keys, and certificates securely.

Integration Examples:

• ADF: create a linked service and reference secrets using the @Microsoft.KeyVault() syntax.

• Databricks: use secret scopes backed by Key Vault (dbutils.secrets.get()).

• App Services / Functions: reference secrets in environment variables using @Microsoft.KeyVault() in configuration.

• Synapse: secure credentials for linked services.

Steps:

1. Assign Managed Identity to the service.

2. Grant access policy in Key Vault for the identity.

3. Reference the secret in service configuration.
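
A minimal sketch of the Databricks integration: reading a secret from a Key Vault-backed secret scope; the scope, key, and connection details are hypothetical:

# In a Databricks notebook, dbutils is available without an import.
# "kv-scope" must be a secret scope backed by the Key Vault, and
# "sql-password" a secret stored in that vault (both hypothetical).
password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

# Use the secret in a JDBC connection instead of hard-coding it
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.orders")
      .option("user", "etl_user")
      .option("password", password)
      .load())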

 
