Tiger Analytics Q & A
✍️ By MONU SINGH | 11/18/2025
These are all the questions asked in Tiger Analytics' Data Engineering Interview.
1. Explain lazy evaluation in PySpark.
ANS:-
Lazy evaluation means transformations (like map, filter) are not executed immediately. Instead, they’re only evaluated when an action (like collect, count) is triggered. This approach optimizes execution by minimizing data passes and enabling the Spark engine to build an efficient execution plan.
2. How does caching work in PySpark?
ANS:-
When you cache a DataFrame using .cache() or .persist(), Spark stores it in memory (or on disk if needed) so repeated actions can reuse the same data instead of recomputing it. It’s useful when you use the same dataset multiple times.
3. What is the difference between wide and narrow transformations?
ANS:-
• Narrow transformations (e.g., map, filter) operate on a single partition; no shuffling required.
• Wide transformations (e.g., reduceByKey, join) involve data shuffling across nodes and are more expensive.
4. How do you optimize query performance in Azure SQL Database?
ANS:-
• Create proper indexes (clustered, non-clustered)
• Use query hints and execution plans
• Avoid SELECT *
• Optimize with partitioning and statistics updates
• Monitor via Query Performance Insight and DMVs
5. Describe the process of integrating ADF with Azure Synapse Analytics.
ANS:-
• Use Linked Services in ADF to connect to Synapse
• Create pipelines to move or transform data
• Run stored procedures, execute Spark notebooks, or use the Copy activity to load data
• Monitor with ADF's Monitor tab
6. How do you handle schema evolution in Azure Data Lake?
ANS:-
Use formats like Delta Lake or Parquet that support schema evolution. During ingestion, tools like ADF or Databricks Auto Loader can be configured to merge schema changes (mergeSchema option).
7. Write a SQL query to find the nth highest salary in a table.
ANS:-
SELECT DISTINCT salary
FROM employees e1
WHERE N - 1 = (
    SELECT COUNT(DISTINCT salary)
    FROM employees e2
    WHERE e2.salary > e1.salary
);
-- Replace N with the desired rank (e.g., 2 for the second highest).
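An equivalent and often cleaner formulation uses the DENSE_RANK window function; it is sketched here against an in-memory SQLite table purely so the query is runnable, but the same SQL works in most modern engines:

```python
import sqlite3

# Throwaway in-memory table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)", [(100,), (200,), (200,), (300,)])

# DENSE_RANK numbers distinct salaries 1, 2, 3, ... from highest to lowest,
# so picking rank n gives the nth highest salary even with duplicates.
n = 2
row = conn.execute(
    """
    SELECT DISTINCT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) WHERE rnk = ?
    """,
    (n,),
).fetchone()
print(row[0])  # 200
```

Unlike the correlated-subquery version, the window-function version scans the table once and makes the ranking logic explicit.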
8. How do you implement CI/CD pipelines for deploying ADF and Databricks solutions?
ANS:-
• Use Azure DevOps/GitHub for source control
• Integrate ADF with a Git repository
• Use ARM templates for ADF deployment
• Use Databricks Repos, notebook export/import, and databricks-cli
• Use release pipelines for deployment automation
9. Write PySpark code to calculate the total sales for each product category.
ANS:-
from pyspark.sql.functions import sum as _sum
df.groupBy("category").agg(_sum("sales").alias("total_sales")).show()
10. Explain how broadcast joins improve performance in PySpark.
ANS:-
Broadcast joins send a small dataset to all worker nodes, avoiding costly shuffles. Use broadcast() when one of the tables is small enough to fit in memory:
from pyspark.sql.functions import broadcast
df.join(broadcast(small_df), "id")
11. Describe the role of the driver and executor in Spark architecture.
ANS:-
• Driver: Coordinates the Spark application; maintains metadata and the DAG
• Executors: Run tasks on worker nodes, perform computations, and return results
12. How do you manage and monitor ADF pipeline performance?
ANS:-
• Use the Monitor tab for activity runs and trigger runs
• Enable logging via Log Analytics
• Use Activity Duration and Output metrics
• Implement retry policies, alerts, and timeouts
13. Write a SQL query to find employees with salaries greater than the department average.
ANS:-
SELECT e.*
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
14. Explain the concept of Delta Lake and its advantages.
ANS:-
Delta Lake is an open-source storage layer that brings ACID transactions, time travel, schema evolution, and concurrent writes to data lakes, using formats like Parquet.
15. How do you implement schema drift handling in ADF?
ANS:-
Enable “Auto Mapping” and check “Allow schema drift” in the Copy Activity. Use dynamic column mapping when the schema can vary over time.
16. Write Python code to check if a number is a palindrome.
ANS:-
def is_palindrome(n):
    return str(n) == str(n)[::-1]

print(is_palindrome(121))  # True
17. What is the significance of Z-ordering in Delta tables?
ANS:-
Z-ordering organizes data to improve query performance by clustering related data together. It reduces I/O during filtering and improves data skipping.
18. How do you handle incremental data load in Databricks?
ANS:-
• Use watermarking or timestamp columns
• Filter records where last_updated > last_processed
• Use merge (upsert) logic with Delta Lake
19. Explain Adaptive Query Execution (AQE) in Spark.
ANS:-
AQE dynamically optimizes query plans based on runtime statistics, enabling:
• Switching join strategies at runtime
• Re-optimizing skewed partitions (skew join handling)
• Coalescing shuffle partitions
20. How do you optimize data partitioning in ADLS?
ANS:-
• Partition by frequently queried columns (e.g., date, region)
• Avoid too many small files (“file size tuning”)
• Use tools like Azure Data Explorer, Databricks, or Partition Discovery
21. Describe the process of creating a data pipeline for real-time analytics.
ANS:-
• Use Event Hubs / IoT Hub for ingestion
• Use Stream Analytics or Structured Streaming (Databricks)
• Process and write to Delta Lake / Cosmos DB / Synapse
• Visualize using Power BI
22. Write PySpark code to perform a left join between two DataFrames.
ANS:-
df1.join(df2, on="id", how="left").show()
23. What are the security best practices for Azure Data Lake?
ANS:-
• Use RBAC + ACLs
• Enable data encryption (at rest and in transit)
• Use Managed Identities for secure access
• Monitor with Azure Defender and Log Analytics
24. Explain the use of Integration Runtime (IR) in ADF.
ANS:-
Integration Runtime (IR) is the compute infrastructure in ADF. It handles:
• Data movement
• Data transformation and activity dispatch
It comes in three types: Azure (cloud), self-hosted, and Azure-SSIS.
25. How do you design a fault-tolerant architecture for big data processing?
ANS:-
• Use retry logic and checkpointing
• Design for idempotency
• Use Delta Lake for ACID compliance