Cognizant Data Engineering Interview Q & A 2020

✍️ By MONU SINGH | 11/18/2025


  Here are 8 important questions and answers asked in the Cognizant Data Engineering Interview 2020.

 

1. Difference between repartition() and coalesce() in PySpark

 

Answer:

repartition(n): Increases or decreases partitions by reshuffling the data. More expensive due to full shuffle.

coalesce(n): Reduces the number of partitions by merging existing ones. More efficient for decreasing partitions since it avoids a full shuffle.

Use Cases:

• Use repartition() when increasing partitions or for better data distribution.

• Use coalesce() when decreasing partitions, especially before writing large data.
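For example, a minimal sketch (assuming an active SparkSession; the sample data and output path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.range(1_000_000)              # sample data for illustration

print(df.rdd.getNumPartitions())         # current number of partitions

df_even = df.repartition(200)            # full shuffle: redistributes rows evenly across 200 partitions
df_few = df_even.coalesce(10)            # no full shuffle: merges existing partitions down to 10

df_few.write.mode("overwrite").parquet("/tmp/partition_demo")  # fewer, larger output files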

 

 

 

2. How to persist and cache data in PySpark?

 

Answer:

cache(): Stores the data at the default storage level (MEMORY_AND_DISK for DataFrames; MEMORY_ONLY for RDDs).

persist(): Gives you control over storage level (e.g., memory, disk, or both). Example:

df.cache()  # uses the default storage level

from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

Use unpersist() to free memory when done.
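A slightly fuller sketch of the typical flow (assuming a SparkSession; the input path and event_type column are illustrative):

from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.read.parquet("/data/events")            # illustrative input path

df.persist(StorageLevel.MEMORY_AND_DISK)           # marked for caching; materialized on first action
print(df.count())                                  # first action populates the cache
print(df.filter("event_type = 'click'").count())   # subsequent actions reuse the cached data
df.unpersist()                                     # release memory/disk when finished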

 

3. How do you use Azure Logic Apps to automate data workflows in SQL databases?

 

Answer:

Azure Logic Apps let you create serverless workflows using connectors for SQL Server, Azure SQL, and more.

Example Workflow:

1. Trigger: Schedule or HTTP request

2. Action 1: Run stored procedure or SQL query

3. Action 2: Send an email or push results to Power BI

Use Logic Apps for:

• Periodic data sync

• Alerts based on SQL conditions

• Event-driven SQL workflows
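When the trigger is an HTTP request, the workflow can also be kicked off from code. A minimal sketch (the callback URL and payload are placeholders; the real URL comes from the Logic App's Request trigger):

import requests

# Placeholder: copy the real callback URL from the Logic App's HTTP Request trigger
LOGIC_APP_URL = "https://prod-00.eastus.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?..."

payload = {"table": "sales", "action": "refresh"}   # illustrative request body
resp = requests.post(LOGIC_APP_URL, json=payload, timeout=30)
resp.raise_for_status()
print("Workflow triggered, status:", resp.status_code)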

 

 

 

4. Differences between ADLS Gen1 and Gen2

Answer:

Feature | ADLS Gen1 | ADLS Gen2
Based on | HDFS | Azure Blob Storage
Cost | Higher | Lower (pay-as-you-go blob pricing)
Performance | Moderate | Better performance
Security | ACL support | ACL + RBAC
Integration | Limited | Broad integration with Azure tools
Hierarchical namespace | Yes | Optional (but recommended)

ADLS Gen2 is the recommended and modern option for big data storage in Azure.



5. Write a SQL query to find gaps in a sequence of numbers

 

Answer:

SELECT curr.id + 1 AS missing_id

FROM your_table curr

LEFT JOIN your_table next ON curr.id + 1 = next.id

WHERE next.id IS NULL;

This returns id + 1 for every id whose successor is missing, i.e. the start of each gap in the sequence (note that it also returns max(id) + 1 unless you filter that row out).
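The same check can be expressed in PySpark if the sequence lives in a DataFrame rather than a SQL table. A sketch, assuming a DataFrame df with an integer column id:

from pyspark.sql import functions as F

curr = df.alias("curr")
nxt = df.alias("nxt")

gaps = (
    curr.join(nxt, F.col("curr.id") + 1 == F.col("nxt.id"), "left")   # self-join on id + 1
        .where(F.col("nxt.id").isNull())                              # successor is missing
        .select((F.col("curr.id") + 1).alias("missing_id"))
)
gaps.show()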

 

 

 

6. How do you ensure high availability and disaster recovery for Azure SQL Databases?

Answer:

• High Availability:
  o Use the Premium or Business Critical tier with zone-redundant availability
  o Auto-failover groups for seamless failover between regions

• Disaster Recovery:
  o Geo-replication (readable secondary in a different region)
  o Backups: point-in-time restore and long-term retention

 

 

 

7. Explain the role of pipelines in Azure DevOps

 

Answer:

Pipelines are used to automate CI/CD processes.

CI pipeline: Builds, tests, and validates code automatically.

CD pipeline: Deploys artifacts to environments like ADF, Databricks, Azure SQL, etc.

Key Components:

• YAML or Classic pipelines

• Build agents

• Stages, jobs, and tasks

• Integration with Git repos and artifacts
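As one concrete integration point, a pipeline run can be queued programmatically through the Azure DevOps REST API. A hedged sketch (organization, project, pipeline id, the PAT environment variable, and the api-version are placeholders you may need to adjust):

import os
import requests

# Placeholders: substitute your own organization, project, and pipeline id
ORG, PROJECT, PIPELINE_ID = "my-org", "my-project", 42
url = f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/pipelines/{PIPELINE_ID}/runs?api-version=7.1"

pat = os.environ["AZURE_DEVOPS_PAT"]                 # Personal Access Token for basic auth
resp = requests.post(url, auth=("", pat), json={}, timeout=30)
resp.raise_for_status()
print(resp.json().get("state"))                      # e.g. "inProgress"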

 

 

 

8. How do you implement data masking in ADF for sensitive data?

 

Answer:

Options for data masking in ADF:

• Use the Derived Column transformation in Mapping Data Flows to mask or replace values, e.g. derive masked_ssn with the expression substring(ssn, 1, 3) + '****'

• Use Dynamic Data Masking at the Azure SQL level

• Leverage Azure Key Vault and parameterization to avoid exposing secrets

• Apply row-level or column-level filters for sensitive datasets
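If the masking step is instead pushed down to a Databricks/PySpark activity called from ADF, the equivalent transform looks like this sketch (assuming a DataFrame df with an ssn column):

from pyspark.sql import functions as F

masked = (
    df.withColumn("masked_ssn", F.concat(F.substring("ssn", 1, 3), F.lit("****")))  # keep first 3 chars, mask the rest
      .drop("ssn")                                                                  # drop the raw value
)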
