Fractal Analytics interview Q & A
✍️ By MONU SINGH | 11/18/2025
These are the questions asked in the Fractal Analytics interview (2022).
1. How do autoscaling clusters work in Databricks?
Answer:
Autoscaling in Databricks automatically adjusts the number of worker nodes in a cluster based on workload. When the load increases (e.g., more tasks or larger data), it adds nodes; when the load decreases, it removes nodes. This helps save costs and ensures performance.
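For illustration, a minimal sketch of what an autoscaling cluster definition can look like when created through the Databricks Clusters REST API. The cluster name, runtime version, and node type below are placeholder values, not part of the original answer:

# Hypothetical cluster spec with autoscaling enabled (placeholder values).
cluster_spec = {
    "cluster_name": "autoscaling-demo",      # illustrative name
    "spark_version": "13.3.x-scala2.12",     # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",       # placeholder Azure VM size
    "autoscale": {
        "min_workers": 2,   # cluster never shrinks below this
        "max_workers": 8    # cluster never grows beyond this
    }
}
# This spec would be POSTed to the /api/2.0/clusters/create endpoint,
# e.g. via the requests library or the Databricks SDK.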
2. What are the types of Integration Runtimes (IR) in ADF?
Answer:
There are three types:
• Azure IR – Used for data movement between cloud sources.
• Self-hosted IR – Installed on-premises to connect on-prem data sources.
• Azure-SSIS IR – For running SSIS packages in Azure.
3. Difference between Blob Storage and Azure Data Lake Storage (ADLS).
Answer:
• Blob Storage is general-purpose storage for unstructured data.
• ADLS Gen2 is built on Blob Storage but optimized for analytics. It supports hierarchical namespace, ACLs, and better performance for big data processing.
4. How do you integrate Databricks with Azure DevOps for CI/CD pipelines?
Answer:
• Use the Databricks Repos feature to sync code with Azure Repos or GitHub.
• Use Azure DevOps pipelines to automate deployment using the Databricks CLI or REST APIs (a sketch follows below).
• Scripts can include notebook deployment, cluster setup, and job scheduling.
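As a rough illustration of the REST-API approach, a deployment step might push a notebook into the target workspace. The workspace URL, token, file name, and target path below are hypothetical, not from the original answer:

import base64
import requests

# Hypothetical values - in a real pipeline these come from Azure DevOps variables/secrets.
host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token-or-AAD-token>"

with open("etl_notebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

# Import (deploy) the notebook into the target workspace folder.
resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Repos/prod/etl_notebook",  # hypothetical target path
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()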
5. Write a SQL query to convert row-level data to column-level using pivot.
Answer:
SELECT department,
    MAX(CASE WHEN gender = 'Male' THEN salary END) AS Male_Salary,
    MAX(CASE WHEN gender = 'Female' THEN salary END) AS Female_Salary
FROM employee
GROUP BY department;
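The same reshaping can be done in PySpark with the DataFrame pivot API. This is a sketch assuming a DataFrame df with department, gender, and salary columns (not part of the original SQL answer):

# Pivot gender values into columns, taking the max salary per department
# (listing the expected values avoids an extra pass over the data).
pivoted = df.groupBy("department").pivot("gender", ["Male", "Female"]).max("salary")
pivoted.show()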
6. How do you ensure data quality and validation in ADLS?
Answer:
• Use ADF Data Flows or Databricks for validations.
• Implement checks such as null values, range validation, and data types (see the sketch after this list).
• Create logs or alerts if data fails the rules.
• Store validation results separately.
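A minimal PySpark sketch of the kind of checks described above, assuming a DataFrame df with id and amount columns and a hypothetical ADLS output path:

from pyspark.sql.functions import col

# Rule 1: no null ids.
null_ids = df.filter(col("id").isNull()).count()

# Rule 2: amount must fall in an expected range (placeholder bounds).
out_of_range = df.filter((col("amount") < 0) | (col("amount") > 1_000_000)).count()

# Rule 3: expected data type (placeholder expectation).
type_ok = dict(df.dtypes).get("amount") == "double"

# Store the validation results separately, e.g. as a small Delta table in ADLS.
results = spark.createDataFrame(
    [("null_ids", null_ids), ("out_of_range", out_of_range), ("type_ok", int(type_ok))],
    ["check", "value"],
)
results.write.mode("append").format("delta").save(
    "abfss://quality@account.dfs.core.windows.net/validation_results"  # placeholder path
)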
7. Explain the use of hierarchical namespaces in ADLS.
Answer:
Hierarchical namespaces allow directories and subdirectories, like a traditional file system. This makes operations such as move, delete, and rename more efficient and enables file-level ACLs.
8. Describe the process of setting up and managing an Azure Synapse Analytics workspace.
Answer:
• Create the Synapse workspace and link it to ADLS.
• Create pools (SQL and Spark).
• Ingest data using pipelines or linked services.
• Use Synapse Studio to run queries, manage notebooks, monitor jobs, and secure access via RBAC.
9. Write PySpark code to calculate the average salary by department.
Answer:
from pyspark.sql.functions import avg

df.groupBy("department").agg(avg("salary").alias("avg_salary")).show()
10. How do you implement streaming pipelines in Databricks?
Answer:
• Use readStream to read from sources like Kafka, Event Hubs, etc.
• Apply transformations.
• Write to sinks using writeStream with checkpointing enabled.
Example:
df = spark.readStream.format("delta").load("input_path")
df.writeStream.format("delta").option("checkpointLocation", "chkpt_path").start("output_path")
11. Explain the purpose of Delta Lake checkpoints.
Answer:
Checkpoints store the current state of the Delta table to speed up the recovery process. They are created every 10 commits by default and help avoid reading all the log files when querying a table.
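One way to see this in practice on Databricks (a sketch assuming a Delta table stored at a placeholder path) is to list the table's _delta_log folder, where parquet checkpoint files appear alongside the JSON commit files:

# List the transaction log of a Delta table (path is a placeholder).
for f in dbutils.fs.ls("/mnt/data/events/_delta_log"):
    print(f.name)
# Typical output mixes JSON commits (00000000000000000009.json)
# with periodic parquet checkpoints (00000000000000000010.checkpoint.parquet).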
12. How do you handle data encryption in ADLS?
Answer:
• Data at rest is encrypted using Microsoft-managed or customer-managed keys (CMK).
• Data in transit is protected by enforcing HTTPS.
• Azure Key Vault can be used to manage CMKs securely.
13. Write a SQL query to find the top 3 customers by sales.
Answer:
SELECT customer_id, SUM(sales) AS total_sales
FROM orders
GROUP BY customer_id
ORDER BY total_sales DESC
LIMIT 3;
14. How do you optimize Spark jobs for better performance?
Answer:
• Use cache/persist when reusing data.
• Use broadcast joins for small tables (see the sketch after this list).
• Tune partitioning and shuffle operations.
• Enable Adaptive Query Execution (AQE).
• Avoid wide transformations when possible.
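A short sketch of a few of these techniques in PySpark, assuming hypothetical fact_df and dim_df DataFrames and a product_id join key:

from pyspark.sql.functions import broadcast

# Enable Adaptive Query Execution (already on by default in recent Spark versions).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Broadcast the small dimension table so the join avoids a full shuffle.
joined = fact_df.join(broadcast(dim_df), on="product_id", how="left")

# Cache a DataFrame that is reused by several downstream actions.
joined.cache()
joined.count()   # materializes the cache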
15. Describe the role of triggers in ADF pipelines.
Answer:
Triggers control when a pipeline runs. Types include:
• Schedule Trigger – Based on time.
• Tumbling Window Trigger – For periodic runs with dependency tracking (a sketch of the definition shape follows below).
• Event Trigger – Runs when a blob is created or deleted in storage.
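For illustration, the general shape of a tumbling window trigger definition, shown here as a Python dict; the names and values are placeholders, and in practice the JSON is authored in ADF Studio or deployed via ARM/Bicep templates:

# Hypothetical tumbling window trigger definition (placeholder names and values).
tumbling_window_trigger = {
    "name": "HourlyLoadTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",          # run once per hour
            "interval": 1,
            "startTime": "2022-01-01T00:00:00Z",
            "maxConcurrency": 1,
        },
        "pipeline": {
            "pipelineReference": {"referenceName": "pl_load_sales", "type": "PipelineReference"}
        },
    },
}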
16. Write Python code to find the largest number in a list.
Answer:
numbers = [4, 9, 1, 23, 6]
print(max(numbers))
17. How do you implement parallel processing in PySpark?
Answer:
Spark automatically parallelizes tasks across nodes. To control parallelism manually, you can:
• Use repartition() or coalesce() to control the number of partitions.
• Use transformations like mapPartitions() to process each partition in parallel (see the sketch after this list).
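A small sketch of both points, assuming a hypothetical DataFrame df:

# Control the number of partitions (and therefore the degree of parallelism).
df_repart = df.repartition(16)     # full shuffle into 16 partitions
df_fewer = df_repart.coalesce(4)   # merge down without a full shuffle

# mapPartitions runs the function once per partition, in parallel across executors.
def count_rows(rows):
    yield sum(1 for _ in rows)

partition_counts = df_repart.rdd.mapPartitions(count_rows).collect()
print(partition_counts)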
18. Explain the concept of lineage in data pipelines.
Answer:
Lineage tracks where data came from, how it changed, and where it goes. It helps with debugging, auditing, and compliance. Tools like Purview or ADF monitoring show lineage visually.
19. How do you manage access control in Azure Data Lake?
Answer:
• Use Access Control Lists (ACLs) at the file/folder level.
• Use RBAC to control access at the resource level.
• Integrate with Azure AD for authentication.
• Combine both for fine-grained control.
20. What are the challenges in integrating on-premises data with Azure services?
Answer:
• Latency and bandwidth issues.
• Firewall and VPN configurations.
• Authentication between on-prem and cloud.
• Keeping data in sync during migration.
• Choosing the right Integration Runtime in ADF.