We are seeking a Data Engineer to design, develop, and optimize data pipelines, storage solutions, and processing frameworks. The ideal candidate will have expertise in ETL workflows, big data processing, cloud platforms, and API integrations to ensure efficient data handling and high-performance analytics.
This role requires a strong foundation in data architecture, security, and performance optimization, supporting business intelligence, machine learning, and operational data needs.
Responsibilities:
Design and develop scalable data pipelines using Apache NiFi, Apache Airflow, and PySpark (an illustrative orchestration sketch follows this list).
Work with Apache Hudi for incremental and real-time data processing within a data lake.
Implement batch and real-time data processing solutions using Apache Flink and Apache Spark.
Optimize data querying and federation using Trino (formerly PrestoSQL) and PostgreSQL.
Ensure data security, governance, and access control using RBAC and cloud-native security best practices.
Automate data pipeline operations and monitor pipeline performance, latency, and failures.
Collaborate with data scientists, AI/ML engineers, and backend teams to optimize data availability and insights.
Implement observability and monitoring using Prometheus, OpenTelemetry, or similar tools.
Support cloud-based data lake solutions (AWS, GCP, or Azure) with best practices for storage, partitioning, and indexing.
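For context, the sketch below illustrates the kind of daily pipeline orchestration described above. It is a minimal, purely illustrative Apache Airflow DAG; the DAG id, task name, schedule, and transformation step are hypothetical placeholders rather than part of an actual stack.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def transform_daily_events(ds, **kwargs):
        # Placeholder transformation; in a real pipeline this step would
        # typically submit a PySpark or Flink job rather than run in-process.
        print(f"Transforming events for {ds}")


    with DAG(
        dag_id="daily_events_pipeline",   # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                # Airflow 2.4+ keyword
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="transform_daily_events",
            python_callable=transform_daily_events,
        )

In practice, failures and latency from such a DAG would surface through the observability tooling (Prometheus, OpenTelemetry) mentioned above.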
Requirements:
3-4 years of experience in data engineering, big data, or cloud data solutions.
Strong programming skills in Python and SQL for data transformation and automation.
Experience with Apache NiFi for data ingestion and orchestration.
Hands-on expertise with Apache Spark (PySpark) and Apache Flink for large-scale data processing.
Knowledge of Trino (formerly PrestoSQL) for federated querying and PostgreSQL for analytical workloads.
Experience with Apache Hudi for data lake versioning and incremental updates (see the upsert sketch at the end of this section).
Proficiency in Apache Airflow for workflow automation and job scheduling.
Understanding of data governance, access control (RBAC), and security best practices.
Experience with observability and monitoring tools such as Prometheus, OpenTelemetry, or equivalent.
Experience working with real-time data streaming frameworks (Kafka, Pulsar, or similar).
Exposure to cloud data lake services (AWS S3, Azure Data Lake, Google Cloud Storage).
Familiarity with Infrastructure as Code (Terraform, CloudFormation, or similar) for provisioning data lake resources.
Knowledge of containerized environments (Docker, Kubernetes).
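For context on the Apache Hudi requirement, here is a minimal, purely illustrative PySpark sketch of a Hudi upsert into a data lake table. The bucket paths, table name, and key columns are hypothetical, and the job assumes the Apache Hudi Spark bundle is available on the cluster.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hudi_upsert_sketch")
        # Kryo serialization is the commonly recommended setting for Hudi jobs.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    # Hypothetical staging location holding the latest incremental batch.
    incoming = spark.read.parquet("s3a://example-bucket/staging/orders/")

    hudi_options = {
        "hoodie.table.name": "orders",                           # hypothetical table
        "hoodie.datasource.write.recordkey.field": "order_id",   # unique record key
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }

    (
        incoming.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3a://example-bucket/lake/orders/")               # hypothetical lake path
    )

The record key and precombine field are what let Hudi apply incremental updates idempotently, which is the behavior the versioning and incremental-update requirement above refers to.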