
Tredence Interview Q & A

✍️ By MONU SINGH | 11/18/2025

1. Difference between groupByKey and reduceByKey in PySpark.


In PySpark, groupByKey groups all values with the same key into a single collection. Every value is shuffled across the network, which makes it memory-intensive; use it only when you genuinely need all of a key's values together.

reduceByKey, on the other hand, merges values for each key with an associative reduce function and performs that merge locally on each partition (a map-side combine) before shuffling. This reduces data movement and is generally more efficient for aggregations like sum, count, etc., as in the sketch below.
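A minimal RDD sketch contrasting the two (keys and values are made up for illustration; output order may vary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("group-vs-reduce").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey: combines values per key within each partition first, then shuffles partial results
sums = pairs.reduceByKey(lambda x, y: x + y)
print(sums.collect())       # e.g. [('a', 4), ('b', 6)]

# groupByKey: ships every value across the network, then collects them per key
grouped = pairs.groupByKey().mapValues(list)
print(grouped.collect())    # e.g. [('a', [1, 3]), ('b', [2, 4])]
```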



2. How to register a User Defined Function (UDF) in PySpark?

You can define a regular Python function and wrap it as a UDF using `pyspark.sql.functions.udf`. For example:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x * x

square_udf = udf(square, IntegerType())

df = df.withColumn("squared", square_udf(df["value"]))
```



3. What are Delta logs, and how to track data versioning in Delta tables?

Delta logs are stored in the `_delta_log` directory inside a Delta Lake table folder. These logs record every change (add, remove, update) as JSON commit files plus periodic Parquet checkpoint files.

You can run `DESCRIBE HISTORY table_name` in Databricks or Spark SQL to view the full version history of a Delta table.
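A short sketch of inspecting the history and time-travelling to an older version (the table name `events` and the path are assumptions):

```python
# Full commit history of the Delta table
spark.sql("DESCRIBE HISTORY events").show(truncate=False)

# Time travel with SQL: query the table as of a specific version
spark.sql("SELECT * FROM events VERSION AS OF 0").show()

# Time travel with the DataFrame reader against the table path
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
```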



4. How do you monitor and troubleshoot ADF pipeline failures?

You can monitor pipelines in the Azure Data Factory Monitor tab. It shows activity runs, duration, errors, and status. You can also set up alerts via Azure Monitor, and use Log Analytics or custom logging to capture detailed error information.



5. What is the use of Delta Lake, and how does it support ACID transactions?

Delta Lake adds ACID transaction capabilities to data lakes. It ensures consistency through a transaction log and optimistic concurrency control, so even in distributed environments reads and writes remain reliable. It also supports time travel, schema enforcement, and rollback, as sketched below.
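A small sketch of the rollback and schema-enforcement behaviour mentioned above, assuming a Delta table named `events` and a recent Delta Lake/Databricks runtime:

```python
# Rollback: restore the table to an earlier version recorded in the transaction log
spark.sql("RESTORE TABLE events TO VERSION AS OF 3")

# Schema enforcement: an append whose schema does not match the table schema
# is rejected unless schema evolution (e.g., mergeSchema) is explicitly enabled.
# mismatched_df is a hypothetical DataFrame with an incompatible schema.
mismatched_df.write.format("delta").mode("append").saveAsTable("events")
```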


 

6. Explain the concept of Managed Identity in Azure and its use in data engineering.

Managed Identity provides an automatically managed identity in Azure Active Directory. It allows ADF, Databricks, or Azure Functions to authenticate to Azure services like ADLS or Key Vault securely without needing credentials in code.



7. Write a SQL query to find employees earning more than their manager.

```sql
SELECT e.name
FROM Employees e
JOIN Employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
```

 


 

8. Describe the process of migrating on-premises databases to Azure SQL Database.

You typically use the Data Migration Assistant (DMA) or Azure Database Migration Service (DMS). First, assess compatibility using DMA, then provision your Azure SQL DB, create the schema, and migrate data using DMS with minimal downtime.


9. Write PySpark code to filter records based on a condition.


```python
filtered_df = df.filter(df["age"] > 30)
filtered_df.show()
```

 

10. How do you implement error handling in ADF pipelines?

You can build Try/Catch-style logic using activity failure dependencies ('Upon Failure' paths), together with 'If Condition' and 'Until' activities and parameterized logging activities to capture failure details. ADF also supports sending failure alerts via Logic Apps or Azure Monitor.

 

 

11. Explain the role of SparkSession in PySpark.

SparkSession is the entry point to use Spark functionality. It replaces older contexts like SQLContext and HiveContext. You use it to read/write data, execute SQL queries, and configure settings.
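A minimal sketch of creating a SparkSession and using it for reads and SQL (the data path is a placeholder):

```python
from pyspark.sql import SparkSession

# Single entry point for DataFrame, SQL, streaming, and configuration APIs
spark = (
    SparkSession.builder
    .appName("example-app")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/data")   # placeholder path
df.createOrReplaceTempView("data")
spark.sql("SELECT COUNT(*) AS cnt FROM data").show()
```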


12. How do you optimize storage costs in ADLS?

Use lifecycle management rules to move older data to cooler storage tiers. You can also use compressed columnar formats (e.g., Parquet with Snappy compression), partition intelligently, and avoid small files through batching or compaction/merge strategies.


13. Write a SQL query to find duplicate records in a table.

```sql
SELECT name, COUNT(*)
FROM Employees
GROUP BY name
HAVING COUNT(*) > 1;
```


 

14. What is the purpose of caching in PySpark, and how is it implemented?

Caching speeds up repeated access to the same data. You can use .cache() or .persist() to keep data in memory or on disk.

```python
df.cache()
df.count()  # materializes the cache
```

 

 

 

15. Describe the process of integrating ADF with Databricks for ETL workflows.

In ADF, create a linked service to your Azure Databricks workspace, then use the Databricks Notebook (or Jar/Python) activity to run notebooks or jobs, passing parameters as needed. You can also orchestrate complex workflows that combine multiple activities (e.g., copying, transforming, loading).



16. Write Python code to count the frequency of words in a string.

```python
from collections import Counter

text = "hello world hello"
word_freq = Counter(text.split())
print(word_freq)
```



17. How do you handle schema evolution in Delta Lake?

Delta Lake supports schema evolution using the mergeSchema option during write operations. This lets you add new columns without rewriting the full dataset.

```python
df.write.option("mergeSchema", "true").format("delta").mode("append").save(path)
```

 

 

 

18. Explain the difference between streaming and batch processing in Spark.

Batch processing handles fixed-size data at intervals, while streaming ingests data in real-time. Spark Structured Streaming provides a micro-batch model, making stream processing feel like continuous batch execution.
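A small Structured Streaming sketch using the built-in rate source, showing how the micro-batch model reuses the batch DataFrame API (the source and sink choices here are illustrative):

```python
from pyspark.sql import functions as F

# Streaming source: the built-in "rate" source emits rows continuously
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Same DataFrame API as batch, applied incrementally per micro-batch
counts = stream_df.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```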




19. How do you secure data pipelines in Azure?


Use Managed Identities, RBAC, firewall rules, and private endpoints. Secure data in transit (HTTPS) and at rest (encryption). Also, monitor access via Azure Monitor and audit logs.



20. What are the best practices for managing large datasets in Databricks?

 

Partition data wisely, avoid small files, cache interim results, use Delta format, prune columns/rows, and leverage Z-ordering and indexing where possible. Monitor with Spark UI and optimize jobs. 
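For Delta tables on Databricks, small-file compaction and Z-ordering are typically applied like this (the table and column names are hypothetical):

```python
# Compact small files and co-locate data on frequently filtered columns
spark.sql("OPTIMIZE sales ZORDER BY (customer_id, order_date)")

# Clean up files no longer referenced by the table (default retention applies)
spark.sql("VACUUM sales")
```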
