Cognizant Data Engineering Interview Q & A 2020
✍️ By MONU SINGH | 11/18/2025
Here are 8 important questions and answers asked in the Cognizant Data Engineering Interview 2020.
1. Difference between repartition() and coalesce() in PySpark
Answer:
• repartition(n): Increases or decreases the number of partitions by reshuffling the data. More expensive due to a full shuffle.
• coalesce(n): Reduces the number of partitions by merging existing ones. More efficient for decreasing partitions since it avoids a full shuffle.
Use Cases:
• Use repartition() when increasing partitions or for better data distribution.
• Use coalesce() when decreasing partitions, especially before writing large data (see the sketch below).
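As a quick illustration, here is a minimal sketch assuming a local SparkSession; the dataset size, partition counts, and output path are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(1_000_000)                  # single-column DataFrame with ids 0..999999

print(df.rdd.getNumPartitions())             # initial partition count
wide = df.repartition(200)                   # full shuffle: redistributes rows evenly across 200 partitions
narrow = wide.coalesce(10)                   # merges existing partitions down to 10 without a full shuffle
print(wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())

# Typical use before writing: fewer output files without paying for a full shuffle
narrow.write.mode("overwrite").parquet("/tmp/coalesced_output")   # output path is illustrative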
2. How to persist and cache data in PySpark?
Answer:
• cache(): Shorthand for persisting with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
• persist(): Gives you control over the storage level (e.g., memory, disk, or both).
Example:
df.cache()  # caches with the default storage level
from pyspark.storagelevel import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
Use unpersist() to free memory when done.
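A slightly fuller sketch of the same idea, again assuming a local SparkSession; note that caching is lazy, so nothing is stored until an action runs:

from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.master("local[*]").appName("cache-and-persist").getOrCreate()
df = spark.range(100_000)

cached = df.cache()          # marks the DataFrame for caching; nothing is stored yet
cached.count()               # an action materializes the cache

persisted = df.selectExpr("id * 2 AS doubled").persist(StorageLevel.MEMORY_AND_DISK)
persisted.count()            # materializes the persisted data

cached.unpersist()           # release memory/disk once the data is no longer needed
persisted.unpersist()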
3. How do you use Azure Logic Apps to automate data workflows in SQL databases?
Answer:
Azure Logic Apps let you create serverless workflows using connectors for SQL Server, Azure SQL, and more.
Example Workflow:
1. Trigger: Schedule or HTTP request
2. Action 1: Run a stored procedure or SQL query
3. Action 2: Send an email or push results to Power BI
Use Logic Apps for:
• Periodic data sync
• Alerts based on SQL conditions
• Event-driven SQL workflows
4. Differences between ADLS Gen1 and Gen2
Answer:
Feature | ADLS Gen1 | ADLS Gen2
Based on | HDFS | Azure Blob Storage
Cost | Higher | Lower (pay-as-you-go blob pricing)
Performance | Moderate | Better
Security | ACL support | ACL + RBAC
Integration | Limited | Broad integration with Azure tools
Hierarchical namespace | Yes | Optional (but recommended)
• ADLS Gen2 is the recommended and modern option for big data storage in Azure.
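One practical difference shows up in how Spark addresses the two services. The sketch below uses account-key authentication purely for illustration; the storage account, container, path, and key are placeholders, and real deployments typically use a service principal or managed identity instead:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read-demo").getOrCreate()

# ADLS Gen2 paths use the abfss:// scheme against the dfs.core.windows.net endpoint
spark.conf.set("fs.azure.account.key.<storage_account>.dfs.core.windows.net", "<account_key>")
df = spark.read.parquet("abfss://<container>@<storage_account>.dfs.core.windows.net/raw/events/")

# By contrast, ADLS Gen1 paths used the adl:// scheme, e.g.:
# spark.read.parquet("adl://<datalake_account>.azuredatalakestore.net/raw/events/")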
5. Write a SQL query to find gaps in a sequence of numbers
Answer:
SELECT curr.id + 1 AS missing_id
FROM your_table curr
LEFT JOIN your_table nxt ON curr.id + 1 = nxt.id
WHERE nxt.id IS NULL
  AND curr.id < (SELECT MAX(id) FROM your_table);
This returns the first missing number after each existing value (the start of each gap); the final condition keeps MAX(id) + 1 from being reported as a gap.
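A quick way to sanity-check the query is with Python's built-in sqlite3 module; the table name matches the query above and the sample ids are invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO your_table (id) VALUES (?)", [(1,), (2,), (3,), (5,), (6,), (9,)])

query = """
SELECT curr.id + 1 AS missing_id
FROM your_table curr
LEFT JOIN your_table nxt ON curr.id + 1 = nxt.id
WHERE nxt.id IS NULL
  AND curr.id < (SELECT MAX(id) FROM your_table)
"""
print(conn.execute(query).fetchall())   # [(4,), (7,)] -- the start of each gap (8 is skipped since 7 is not in the table)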
6. How do you ensure high availability and disaster recovery for Azure SQL Databases?
Answer:
• High Availability:
  o Use the Premium or Business Critical tier with zone-redundant availability
  o Auto-failover groups for seamless failover between regions
• Disaster Recovery:
  o Geo-replication (readable secondary in a different region)
  o Backups: point-in-time restore, long-term retention
7. Explain the role of pipelines in Azure DevOps
Answer:
• Pipelines are used to automate CI/CD processes.
• CI pipeline: Builds, tests, and validates code automatically.
• CD pipeline: Deploys artifacts to environments like ADF, Databricks, Azure SQL, etc.
Key Components:
• YAML or Classic pipelines
• Build agents
• Stages, jobs, and tasks
• Integration with Git repos and artifacts
8. How do you implement data masking in ADF for sensitive data?
Answer:
Options for data masking in ADF:
• Use a Derived Column transformation in Mapping Data Flows to mask or replace values, e.g. derive a masked_ssn column with the expression substring(ssn, 1, 3) + '****' (a PySpark equivalent is sketched below)
• Enable Dynamic Data Masking at the Azure SQL level
• Leverage Azure Key Vault and parameterization to avoid exposing secrets
• Apply row-level or column-level filters for sensitive datasets
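For comparison, the same masking logic can be expressed in PySpark rather than an ADF data flow; this is only a sketch and assumes a DataFrame with an ssn string column (the sample values are invented):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("ssn-masking").getOrCreate()
df = spark.createDataFrame([("123-45-6789",), ("987-65-4321",)], ["ssn"])

# Keep the first three characters of the SSN and replace the remainder with a fixed mask
masked = df.withColumn("masked_ssn", F.concat(F.substring("ssn", 1, 3), F.lit("****")))
masked.show()   # masked_ssn -> 123****, 987****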