Tredence Interview Q & A
✍️ By MONU SINGH | 11/18/2025
1. Difference between
groupByKey and reduceByKey in PySpark.
In PySpark, groupByKey groups all values with the same key into a single
collection, which can be memory-intensive and causes a full shuffle of the data.
It is less efficient and should be used only when you truly need every value for each key.
On the other hand, reduceByKey merges values for each key
using an associative reduce function, performing the merge locally before
shuffling data. This reduces data movement and is generally more efficient for
aggregations like sum, count, etc.
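The difference is easiest to see on a small pair RDD. This is a minimal sketch; the sample data and the sum aggregation are illustrative only, not from the original answer:
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupBy-vs-reduceBy").getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey: every value for a key is shuffled across the network before aggregation
grouped = rdd.groupByKey().mapValues(sum).collect()

# reduceByKey: values are partially summed within each partition, then only the
# partial results are shuffled
reduced = rdd.reduceByKey(lambda x, y: x + y).collect()

print(sorted(grouped))  # [('a', 4), ('b', 2)]
print(sorted(reduced))  # [('a', 4), ('b', 2)]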
2. How to register a User Defined Function (UDF) in PySpark?
You can define a regular Python function and register it as a UDF using pyspark.sql.functions.udf. For example:
python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def square(x):
    return x * x

square_udf = udf(square, IntegerType())
df = df.withColumn("squared", square_udf(df["value"]))
3. What are Delta logs, and how to track data versioning in
Delta tables?
Delta logs are stored in the _delta_log directory inside a Delta Lake table folder.
These logs track every change (add, remove, update) as JSON commit files and Parquet checkpoint files.
You can use DESCRIBE HISTORY table_name in Databricks or
Spark SQL to view the full version history of a Delta table.
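As a hedged sketch, the history can also be read from PySpark; the table name sales is just a placeholder:
python
# Assumes an existing Delta table registered in the metastore as "sales"
history_df = spark.sql("DESCRIBE HISTORY sales")

# Each row records a version, its timestamp, the operation, and operation metrics
history_df.select("version", "timestamp", "operation").show()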
4. How do you monitor and troubleshoot ADF pipeline failures?
You can monitor pipelines in the Azure Data Factory
Monitoring tab. It shows activity runs, duration, errors, and status. You can
also set up alerts via Azure Monitor, and use log analytics or custom logging
to capture detailed error info.
5. What is the use of Delta Lake, and how does it support
ACID transactions?
Delta Lake adds ACID transaction capabilities to data lakes. It ensures consistency
by using a transaction log and optimistic concurrency control, so even in distributed
environments, reads and writes remain reliable. It also supports time travel,
schema enforcement, and rollback.
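Time travel can then be used to read an earlier snapshot. This is a minimal sketch; the path /mnt/delta/events is a placeholder:
python
# Read an older snapshot of the table by version number (time travel)
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/delta/events")
)

# Or by timestamp
old_df_ts = (
    spark.read.format("delta")
    .option("timestampAsOf", "2025-01-01")
    .load("/mnt/delta/events")
)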
6. Explain the concept of Managed Identity in Azure and its
use in data engineering.
Managed Identity provides an automatically managed identity
in Azure Active Directory. It allows ADF, Databricks, or Azure Functions to
authenticate to Azure services like ADLS or Key Vault securely without needing
credentials in code.
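As a hedged illustration of what "no credentials in code" looks like, Python code running on an Azure resource with a managed identity enabled can pick up that identity via the azure-identity library; the vault URL and secret name below are placeholders:
python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential falls back to the managed identity when running on Azure
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://my-vault.vault.azure.net",
    credential=credential,
)

# No connection string or password appears anywhere in the code
secret = client.get_secret("storage-account-key")
print(secret.name)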
7. Write a SQL query to find employees earning more than
their manager.
sql
SELECT e.name
FROM Employees e
JOIN Employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;
8. Describe the process of migrating on-premises databases to
Azure SQL Database.
You typically use the Data Migration Assistant (DMA) or Azure
Database Migration Service (DMS). First, assess compatibility using DMA, then
provision your Azure SQL DB, create the schema, and migrate data using DMS with
minimal downtime.
9. Write PySpark code to filter records based on a condition.
python
filtered_df = df.filter(df['age'] > 30)
filtered_df.show()
10. How do you implement error handling in ADF pipelines?
You can build 'Try-Catch'-like logic using activity dependency conditions such as
'Failure', combined with 'If Condition' and 'Until' activities and a custom activity
with parameters to log failures. ADF also supports sending failure alerts via Logic
Apps or Azure Monitor.
11. Explain the role of SparkSession in PySpark.
SparkSession is the entry point to use Spark functionality.
It replaces older contexts like SQLContext and HiveContext. You use it to
read/write data, execute SQL queries, and configure settings.
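A minimal sketch of creating and using a SparkSession; the app name, config value, and file path are arbitrary:
python
from pyspark.sql import SparkSession

# Entry point for DataFrame, SQL, and configuration APIs
spark = (
    SparkSession.builder
    .appName("example-app")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

df = spark.read.json("/path/to/data.json")  # illustrative path
df.createOrReplaceTempView("data")
spark.sql("SELECT COUNT(*) FROM data").show()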
12. How do you optimize storage costs in ADLS?
Use lifecycle management rules to move older data to cooler
storage tiers. You can also compress files (e.g., parquet, snappy), partition
intelligently, and avoid small files by using batching or merge strategies.
13. Write a SQL query to find duplicate records in a table.
sql
SELECT name, COUNT(*)
FROM Employees
GROUP BY name
HAVING COUNT(*) > 1;
14. What is the purpose of caching in PySpark, and how is it
implemented?
Caching helps speed up repeated
access to data. You can use .cache() or .persist() to keep data in memory or on disk.
python
df.cache()
df.count() # Materializes the cache
15. Describe the process of integrating ADF with Databricks
for ETL workflows.
In ADF, use the 'Azure Databricks' activity to run notebooks
or jobs. Pass parameters if needed. You can also link to a Databricks cluster
and orchestrate complex workflows combining multiple activities (e.g., copying,
transforming, loading).
16. Write Python code to count the frequency of words in a
string.
python
from collections import Counter
text = "hello world hello" word_freq
= Counter(text.split())
print(word_freq)
17. How do you handle schema evolution in Delta Lake?
Delta Lake supports schema evolution using the mergeSchema
option during write operations. This lets you add new columns without rewriting
the full dataset.
python
df.write.option("mergeSchema",
"true").format("delta").mode("append").save(path)
18. Explain the difference between streaming and batch
processing in Spark.
Batch processing handles fixed-size data at intervals, while
streaming ingests data in real-time. Spark Structured Streaming provides a
micro-batch model, making stream processing feel like continuous batch
execution.
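A minimal Structured Streaming sketch using the built-in rate source, chosen here only for illustration since the original answer names no source:
python
# Streaming read: an unbounded DataFrame that grows as new rows arrive
stream_df = (
    spark.readStream
    .format("rate")               # built-in test source emitting rows per second
    .option("rowsPerSecond", 5)
    .load()
)

# Micro-batch model: each trigger processes the newly arrived rows
query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="10 seconds")
    .start()
)

query.awaitTermination()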
19. How do you secure data pipelines in Azure?
Use Managed Identities, RBAC, firewall rules, and private
endpoints. Secure data in transit (HTTPS) and at rest (encryption). Also,
monitor access via Azure Monitor and audit logs.
20. What are the best practices for managing large datasets in
Databricks?
Partition data wisely, avoid small files, cache interim
results, use Delta format, prune columns/rows, and leverage Z-ordering and
indexing where possible. Monitor with Spark UI and optimize jobs.
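For example, file compaction and Z-ordering on a Delta table can be run from SQL; this is a sketch and the table and column names are placeholders:
python
# Compact small files and cluster data on a commonly filtered column
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Remove data files no longer referenced by the table (default retention applies)
spark.sql("VACUUM sales")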