Fractal Analytics interview Q & A
✍️ By MONU SINGH | 11/18/2025
These are the questions asked in the Fractal Analytics interview (2022).
1. How do autoscaling clusters work in Databricks?
Answer:
Autoscaling in Databricks automatically adjusts the number of worker nodes in a cluster based on workload. When the load increases (e.g., more tasks or larger data), it adds nodes; when the load decreases, it removes nodes. This helps save costs and ensures performance.
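For illustration, a minimal sketch of what an autoscaling cluster definition can look like when created through the Databricks Clusters REST API. The cluster name, runtime version, and node type below are placeholder values, not part of the original answer:

# Hypothetical cluster spec with autoscaling enabled (placeholder values).
cluster_spec = {
    "cluster_name": "autoscaling-demo",      # illustrative name
    "spark_version": "13.3.x-scala2.12",     # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",       # placeholder Azure VM size
    "autoscale": {
        "min_workers": 2,   # cluster never shrinks below this
        "max_workers": 8    # cluster never grows beyond this
    }
}
# This spec would be POSTed to the /api/2.0/clusters/create endpoint,
# e.g. via the requests library or the Databricks SDK.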
2. What are the types of Integration Runtimes (IR) in ADF?
Answer:
There are three types:
• Azure IR – Used for data movement between cloud sources.
• Self-hosted IR – Installed on-premises to connect on-prem data sources.
• Azure-SSIS IR – For running SSIS packages in Azure.
3. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
Answer:
• Blob Storage is general-purpose storage for unstructured data.
• ADLS Gen2 is built on Blob Storage but optimized for analytics. It supports hierarchical namespace, ACLs, and better performance for big data processing.
4. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
Answer:
• Use the Databricks Repos feature to sync code with Azure Repos or GitHub.
• Use Azure DevOps pipelines to automate deployment using the Databricks CLI or REST APIs (a sketch follows below).
• Scripts can include notebook deployment, cluster setup, and job scheduling.
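As a rough illustration of the REST-API approach, a deployment step might push a notebook into the target workspace. The workspace URL, token, file name, and target path below are hypothetical, not from the original answer:

import base64
import requests

# Hypothetical values - in a real pipeline these come from Azure DevOps variables/secrets.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token-or-AAD-token>"

with open("etl_notebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

# Import (deploy) the notebook into the target workspace folder.
resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Repos/prod/etl_notebook",  # hypothetical target path
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()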
5. Write a SQL query to convert row-level data to column-level using pivot.
Answer:
SELECT department,
    MAX(CASE WHEN gender = 'Male' THEN salary END) AS Male_Salary,
    MAX(CASE WHEN gender = 'Female' THEN salary END) AS Female_Salary
FROM employee
GROUP BY department;
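The same reshaping can be done in PySpark with the DataFrame pivot API. This is a sketch assuming a DataFrame df with department, gender, and salary columns (not part of the original SQL answer):

# Pivot gender values into columns, taking the max salary per department
# (listing the expected values avoids an extra pass over the data).
pivoted = df.groupBy("department").pivot("gender", ["Male", "Female"]).max("salary")
pivoted.show()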
6. How do you ensure data quality and validation in ADLS?
Answer:
• Use ADF Data Flows or Databricks for validations.
• Implement checks such as null values, range validation, and data types (see the sketch after this list).
• Create logs or alerts if data fails the rules.
• Store validation results separately.
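A minimal PySpark sketch of the kind of checks described above, assuming a DataFrame df with id and amount columns and a hypothetical ADLS output path:

from pyspark.sql.functions import col

# Rule 1: no null ids.
null_ids = df.filter(col("id").isNull()).count()

# Rule 2: amount must fall in an expected range (placeholder bounds).
out_of_range = df.filter((col("amount") < 0) | (col("amount") > 1_000_000)).count()

# Rule 3: expected data type (placeholder expectation).
type_ok = dict(df.dtypes).get("amount") == "double"

# Store the validation results separately, e.g. as a small Delta table in ADLS.
results = spark.createDataFrame(
    [("null_ids", null_ids), ("out_of_range", out_of_range), ("type_ok", int(type_ok))],
    ["check", "value"],
)
results.write.mode("append").format("delta").save(
    "abfss://quality@account.dfs.core.windows.net/validation_results"  # placeholder path
)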
7. Explain the use of hierarchical namespaces in ADLS.
Answer:
Hierarchical namespaces allow directories and subdirectories, like a traditional file system. This makes operations such as move, delete, and rename more efficient and enables file-level ACLs.
8. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
Answer:
• Create the Synapse workspace and link it to ADLS.
• Create pools (SQL and Spark).
• Ingest data using pipelines or linked services.
• Use Synapse Studio to run queries, manage notebooks, monitor jobs, and secure access via RBAC.
9. Write PySpark code to calculate the average salary by department.
Answer:
from pyspark.sql.functions import avg

df.groupBy("department").agg(avg("salary").alias("avg_salary")).show()
10. How do you implement streaming pipelines in Databricks?
Answer:
• Use readStream to read from sources like Kafka, Event Hubs, etc.
• Apply transformations.
• Write to sinks using writeStream with checkpointing enabled.
Example:
df = spark.readStream.format("delta").load("input_path")
df.writeStream.format("delta").option("checkpointLocation", "chkpt_path").start("output_path")
11. Explain the purpose of Delta Lake checkpoints.
Answer:
Checkpoints store the current state of the Delta table to speed up the recovery process. They are created every 10 commits by default and help avoid reading all the log files when querying a table.
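One way to see this in practice on Databricks (a sketch assuming a Delta table stored at a placeholder path) is to list the table's _delta_log folder, where parquet checkpoint files appear alongside the JSON commit files:

# List the transaction log of a Delta table (path is a placeholder).
for f in dbutils.fs.ls("/mnt/data/events/_delta_log"):
    print(f.name)
# Typical output mixes JSON commits (00000000000000000009.json)
# with periodic parquet checkpoints (00000000000000000010.checkpoint.parquet).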
12. How do you handle data encryption in ADLS?
Answer:
• Data at rest is encrypted using Microsoft-managed or customer-managed keys (CMK).
• Data in transit is protected by enforcing HTTPS.
• Azure Key Vault can be used to manage CMKs securely.
13. Write a SQL query to find the top 3 customers by sales.
Answer:
SELECT customer_id, SUM(sales) AS total_sales
FROM orders
GROUP BY customer_id
ORDER BY total_sales DESC
LIMIT 3;
14. How do you optimize Spark jobs for better performance?
Answer:
• Use cache/persist when reusing data.
• Use broadcast joins for small tables (see the sketch after this list).
• Tune partitioning and shuffle operations.
• Enable Adaptive Query Execution (AQE).
• Avoid wide transformations when possible.
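A short sketch of a few of these techniques in PySpark, assuming hypothetical fact_df and dim_df DataFrames and a product_id join key:

from pyspark.sql.functions import broadcast

# Enable Adaptive Query Execution (already on by default in recent Spark versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Broadcast the small dimension table so the join avoids a full shuffle.
joined = fact_df.join(broadcast(dim_df), on="product_id", how="left")

# Cache a DataFrame that is reused by several downstream actions.
joined.cache()
joined.count()   # materializes the cache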
15. Describe the role of triggers in ADF pipelines.
Answer:
Triggers control when a pipeline runs. Types include:
• Schedule Trigger – Based on time.
• Tumbling Window Trigger – For periodic runs with dependency tracking (a sketch of the definition shape follows below).
• Event Trigger – Runs when a blob is created or deleted in storage.
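For illustration, the general shape of a tumbling window trigger definition, shown here as a Python dict; the names and values are placeholders, and in practice the JSON is authored in ADF Studio or deployed via ARM/Bicep templates:

# Hypothetical tumbling window trigger definition (placeholder names and values).
tumbling_window_trigger = {
    "name": "HourlyLoadTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",          # run once per hour
            "interval": 1,
            "startTime": "2022-01-01T00:00:00Z",
            "maxConcurrency": 1,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "pl_load_sales", "type": "PipelineReference"}
        },
    },
}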
16. Write Python code to find the largest number in a list.
Answer:
numbers = [4, 9, 1, 23, 6]
print(max(numbers))
17. How do you implement parallel processing in PySpark?
Answer:
Spark automatically parallelizes tasks across nodes. To control parallelism manually, you can:
• Use repartition() or coalesce() to control the number of partitions.
• Use transformations like mapPartitions() to process each partition in parallel (see the sketch after this list).
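A small sketch of both points, assuming a hypothetical DataFrame df:

# Control the number of partitions (and therefore the degree of parallelism).
df_repart = df.repartition(16)     # full shuffle into 16 partitions
df_fewer = df_repart.coalesce(4)   # merge down without a full shuffle

# mapPartitions runs the function once per partition, in parallel across executors.
def count_rows(rows):
    yield sum(1 for _ in rows)

partition_counts = df_repart.rdd.mapPartitions(count_rows).collect()
print(partition_counts)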
18. Explain the concept of lineage in data pipelines.
Answer:
Lineage tracks where data came from, how it changed, and where it goes. It helps with debugging, auditing, and compliance. Tools like Purview or ADF monitoring show lineage visually.
19. How do you manage access control in Azure Data Lake?
Answer:
• Use Access Control Lists (ACLs) at the file/folder level.
• Use RBAC to control access at the resource level.
• Integrate with Azure AD for authentication.
• Combine both for fine-grained control.
20. What are the challenges in integrating on-premises data with Azure services?
Answer:
• Latency and bandwidth issues.
• Firewall and VPN configurations.
• Authentication between on-prem and cloud.
• Keeping data in sync during migration.
• Choosing the right Integration Runtime in ADF.