Modern Data Engineering in Cloud Platform
Data engineering is more of a common term today for what is traditionally being called ETL, BI, and Big data developers. Modern Data Engineering platform ought to support fully developed and operationalized data pipelines which should be cloud-capable and run on modern, distributed execution platforms. It further supports machine learning, analytics, data quality, security, and adhere to securing PII information, for compliance with GDPR and other regulatory frameworks.
Big Data and AI Executive Survey 2019
NewVanage Partners has surveyed Fortune 1000 business & technology and found that an immense amount of increase in the pace of investment being spent between 50 million to 500 million and above on BigData/AI initiatives. Despite the investment in Bigdata/AI has increased by over 66% in 2019, but the confidence among organization in measuring themselves in business results has decreased to 19%. More info can be read in the following survey summary.
Who is a Data Engineer?
A professional Data Engineer ‘helps’ to collect, curate, and aggregate data from various data sources and create a consolidated data platform that empowers data-driven decisions for an organization to leverage business.
The consolidated data in data lake/data warehouse satisfies hundreds of consumer use cases, feeds the BI reports, empowers the data scientist, trains the machine algorithms, and so on.
Bottom line: The main goal of data engineers is to provide a reliable data infrastructure
How is Data Engineering related to Big data?
A typical Data engineer will have the skillset of a Software Engg + Business Intelligence (BI) engineer including Big Data capabilities.
“The data engineering discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.”
Book – The Data Gaze: Capitalism, Power and Perception | Author – David Beer
Data science vs Data engineering?
The data architect and data engineer work one behind the other. Meaning, Data architect will conceptualize & design the Data platform Framework and the Data Engineer will take care of the implementation part. The data architect & data engineer roles will concentrate on designing data platforms and consolidating the data (Model, Collect, Cleanse, Transform).
Data science is a research portion and Data engineering is a development portion that definitely works hand in hand. Data science (more kind of exploratory type of a problem) will look for patterns in data and predicts a better solution about what is the value of the data for leveraging the business. In other words, data scientists will focus on finding new insights from the curated data.
Modern Data Engineering in cloud
The only constant in the software industry is Change. But the fundamental principles remain the same. That is true in the case of Data engineering. Today data engineering always differs from the old methodologies in terms of collect, curate, and aggregate data. When I started practicing big data years back, we were using tools like Informatica, IBM Datastage, MapR, SAP tools, Pentaho ETL, Talend….etc. But today this Isn’t the choice for modern Data Engineers today.
Let’s discuss with an example with respect to Storage layers (Data lake/Datawarehouse) and Transformation layer (ETL).
A. Storage Layer
Years ago data warehouse maintained inside the organization on-premises and managing data connection to applications that are running on-premise as well. Generally, the organization used traditional ETL tools for connecting through the ERP systems/CRM systems and delivering the data to the warehouse. With this model, many organizations are not satisfied because they were not getting the value at scale in a responsive way from a data warehouse/data lake.
So naturally, Paradigm Shift happened from on-prem to cloud and that model has changed quite dramatically with the cloud today. With the cloud providers, you can manage your services easier at a lesser cost, and you can get the value of data at scale from the data warehouse in the cloud and following are storage options more frequently used in cloud – Amazon s3, Azure Data lake storage, Google cloud storage.
B. Transformation Layer – (ETL)
Traditional ETL tools are largely obsolete because the logic cannot be expressed using code and these tools have been replaced by programmatic or configuration driven tools provided by cloud providers like Google Cloud, Azure, AWS in the industry. Each cloud provider has plenty of options to perform all operations in terms of collect, curate, and aggregate data.
Also, organizations today prefer cloud-based solutions to get quicker results with less overhead and maintenance at a later point in time. Meaning, You can perform Data Extraction using custom code written in Python, Go, Java.. etc and on the other hand using managed options like Amazon Kinesis (AWS), has the capability to do the same job and you can get timely insights and react quickly to new information.
Also, you don’t need to worry about scalability and availability just configure the endpoints and get the data with less turnaround time. It also supports sudden traffic spike burst and scale in/out data pipelines automatically with ease. Those are the things that make modern data engineering today.
Today most Data Engineering job opportunities are associated with the Top 3 cloud providers and the job title would look like below
- Google Cloud Data engineer
- AWS cloud data engineer
- Azure cloud data engineer.
Data Pipeline Stages
Data Engineer will primarily work on the below-mentioned Data pipeline stages. From my perspective, more complex scenarios will be addressed in the Data Extraction and Data Transformation Stages. Below are the data pipeline stages explained at a High level.
A. Data Extraction
The goal of this stage is to extract data from internal and external sources. In a typical use case, the following would be data sources that you might be extracting the data (Relational, XML, File, CSV, Flastfiles, APIs, Legacy systems, Proprietary software, text files … etc).
The data will be extracted into a repository called Data Lake. In Industry terms, the extracted data is called as Unprocessed data (or) Raw data (or) Golden data. The extraction will fall into two categories Real-time data processing and Batch data processing.
B. Data Transformation – High level
At a high level, This stage involves the validation and transformation of the data and storing the end results in formatted output in any of the following formats – JSON, CSV, AVRO, ORC, and Parquet.
The typical transformation will be similar to performing Anomaly detection, Metadata management, Deduplication, Data Validation & Cleansing, Data Aggregation, and Change Data Capture with the collected data.
C. Data Curation
The goal of this stage is to enhance the quality of data before the data is stored into the target by applying various techniques to curate the data, some of the common techniques are
- Deduplication – Remove identical duplicates
- Missing/Null values – Rows/Columns with missing values will be eliminated from processing further or have the default values from Master data.
- Standardize values – If you columns contain values such as data123 then it has be considered as categorical or numerical depending on the business use case and similarly normalizing numbers, text values according to the business requirements.
- Failed Records for reprocessing/review
- Auditing Data Processed.
- Total Count/Processed Count/Failed Count and thereby enriching the overall quality of data.
Note: Some time based on business use case we used to convert row to columns and vice versa as part of transformation & sub transformation.
D. Data Transformation
The goal of this stage is to convert data types, define how different data fields will be mapped, aggregated, filtered, or joined to make sense out of structured and unstructured data for analysis. Add more information such as date/timestamps, geographic locations to the data to enable efficient lookups and form a better hierarchy.
These transformations activities are performed in a staging area called as Data Staging (typically another storage bucket on the cloud) before the data gets loaded into the target database/data warehouse.
E. Error handling & Retries
Errors can occur in various phases starting from data extraction, validation, and or while parsing the data, or during the transformation, or during the final insert into the database/data warehouse phase will be recorded in a separate collection.
This will help us in various purposes like backfilling delta data (CDC) at a later point in time with the right values and also finding & eliminating bad records as well.
F. Data Validation/Auditing
The goal of this stage is to ensure all the required dataset is present by performing comparisons between the source and target. In addition, schema validation/conversion is performed to ensure the overall health of the data, and changes (if any) that will be required in the source data to match the target schema.
Subsequently, a review and audit (can be done programmatically or using enterprise/open source tools) are performed to check if there are no anomalies or discrepancies. Based on the result, a decision on whether to move forward or re-run one of the previous stages will be decided.
G. Change data Capture
The CDC methods will enable you to load only the delta data or changed records from the data source rather than loading the entire data.
Let me explain with a simple scenario, Consider the ETL pipeline has processed 1 million records and out of that, you haven’t processed 100 records due to missing values in the records. At a later point in time, if you are receiving the delta records, your ETL pipeline will be designed efficiently to identify and capture the incremental data that has to be Updated/Added/Removed from appropriate database tables.
This refers to Change Data Capturing. Delta data will be further processed and used in the Data analytics stage to get more accurate insights into the data.
H. Data Storage
Once the data is Curated, it will be consolidated into a data warehouse (persistent store), which will contain processed data. Generally, the transformed data would be saved in any of the following formats JSON, CSV, AVRO, ORC, and Parquet
I. Data Visualization
Data visualization is nothing but viewing the processed data in the form of graphs and charts and maps. BI tools will provide different views like trends, outliers, and error patterns of the data. The main goal is to analyze massive amounts of data and find useful insights in the data.
When is a Data Engineer Needed?
There is no requirement to have data engineers in every organization.His/Her skills are mostly required if the company either:
- Having a large amount of data.
- An organization should need or desire to analyze the data from various kinds of sources.
- Last but not least, An org should have half a million to do all this 🙂
Skills required to become a Data Engineer
What does a data engineer need to know? What areas does she/he require to learn and improve the skills? – In the first place, Data engineers need to have good exposure in different big data technologies and they should have the capability to trade off the right toolset and technologies to get the job done, based on the business goals and objectives. Below is the typical road map to follow one to become a Data engineer
Data Engineer Roadmap
A. Learn Data Modeling Concepts
The data modeling will classify the characteristics of data and the database and all other components. Below are the highlighted areas where you need to have an exposure
- Conceptual Data Modeling(CDM): This data model includes all major entities, relationships, Meaning only high-level information of the entire model and it will not contain much detail about attributes and this will be input to the second model
- Logical Data Modeling(LDM): This is the actual implementation of a conceptual model. A logical data model will clearly show the data entities and relationships between them.
- Physical Data Modeling(PDM): This is a complete model that includes all required tables, columns, relationships, database properties for the physical implementation of the database.
B. Learn Architecting Distributed Systems.
In simple terms, it is a set of computers communicating with each other over the network and get some coherent tasks done. Just explore use case studies around storage and Big data computations and peer to peer file sharing.
The reason why this is important is a lot of critical infrastructures out there is built out of distributed systems. For more info watch this video.
C. Cloud platforms tools
- Amazon Web Services
- Google Cloud
Get yourself, master, in any one of the aforementioned cloud providers, Concentrate and focus on tools related to data pipeline operations like Data Extraction (Real-time/Streaming), Data Transformation, Workflow Scheduler, Data Storage, Data Visualization, Monitoring, Data Governance/Cataloging, Deployment, Data Security.
D. Big data Tools
It is a common requirement to have a base knowledge in the Hadoop ecosystem and apart from that, you need to gain an understanding of the frameworks below listed.
- Hadoop (HDFS, MapReduce, Hive, Hbase, Drill, etc)
- Spark (Streaming RDD, Spark SQL, Spark DStreams, Spark Syncs/Connectors, Spark Executors, Spark Scheduling, Spark Submit, Distributed Data Transformation, and Processing)
- Message Queuing Technologies (Apache Kafka, ActiveMQ, RabbitMQ)
- SQL/No-SQL system and queries (SQL/DML/DDL)
E. Programming language
Newbies can start with python, simple syntax and you can learn easily and it is great for all range of data analysis tasks. Scala is designed for parallel data processing and effective in handling large datasets.
- Python – Pyspark/ Pandas/ SciPy / NumPy
F. Workflow Orchestration
Big data workflow involved in the execution of tasks in a particular order and most of the time we do repetitive tasks. This can be addressed by doing proper workflow Orchestration.
In earlier days I have used oozie and used to write workflow definitions in XML and it was not programmatic. Due to various limitations, I’ve written my own scheduler which is in python, and used PostgreSQL (database) to save our state (job management). The state will have the details of workflow start, whether it is progressed or failed will be saved.
Now you don’t have to struggle that much 🙂 explore the below tools which should suffice your needs.
Data engineer responsibilities
The main responsibilities of a data engineer are to collecting, managing, and transforming the data from multiple sources into a format where it can be easily analyzed. A typical data engineer should possess good knowledge in below areas
- Data Collection Frameworks
- Hands-on building ‘big data’ data pipelines, architectures, and data sets.
- Working knowledge of message queuing systems (Apache Kafka and its producers and Connectors).
- Data Processing/Wrangling-Cleansing & Transformation Frameworks.
- Relational SQL and NoSQL databases Systems and Queries.
- Data Lake/Warehouse development and processing
- Experience with big data toolset – Hadoop Ecosystem – HDFS, MapReduce, Hive, Hbase, Drill, Spark 2x/3x Ecosystem and Development, Workflow orchestration tools.
- Experience with Cloud services – AWS (or) Google Cloud (or)Azure.
- Best Practices on Data Movement, Storage, Processing, Analytics, Cost Efficiency, Scaling / Resiliencies.
My two cents on Data Engineering
- Use transient clusters (Compute clusters will automatically terminate/stop when the job/steps completed) for Batch processing.
- Start testing phase as early as possible – This will help you to find the bugs in your data pipeline stages. Moreover, if you are using commercial off the shelf applications (COTS), it will be easier to mitigate the risks and limitations (if any) asap.
- Plan Disaster Recovery/High availability and get it tested and documented.
- Have a frequent Performance Load test to check the environment behavior and benchmark the results.
- Always preserve the injected (raw) data in the original form. This approach will allow you to repeat the ETL process in case of failures and other purposes.
- In terms of data storage simply store your data in the optimal tier to save cost each month. The right storage tier can be decided based on two parameters: i) its format and ii) how often it’s accessed.
- Data Security – Have the data Encrypted on rest/fly and Access all passwords/keys from a centralized mechanism.
- Data governance – It is not part of the data engineer bucket but to manage data you need to have a defined set of processes, roles, policies, and standards to access the data.
Data Engineering certifications
Data Engineering on AWS – AWS Certified Data Analytics – Specialty
The AWS Certified Big Data – Specialty certification validates individuals’ technical skills and experience in designing and implementing amazon services to derive value from data.
What does this certification validate?
This certification validates individuals’ understanding of using AWS data analytics services to design, build, secure, and maintain analytics solutions that provide insight from the data.
Data Engineering on Google Cloud Platform – Google Professional Data Engineer
A google professional data engineer certification validates individual potential to design, build, secure, and monitor data processing systems with particular importance on scalability, security and compliance and reliability, and portability.
The Professional Data Engineer exam evaluates your ability to:
- Design data processing systems
- Build and operationalize data processing systems
- Operationalize machine learning models
- Ensure solution quality
Data Engineering on Azure – Microsoft: Azure Data Engineer Associate
Azure Data Engineers design and implement the management, monitoring, security, and privacy of data using the full stack of Azure data services to satisfy business needs. Take these two certification exams and earn – Microsoft Certified: Azure Data Engineer Associate
Data engineers write code, highly analytical, and help companies make the best use of the massive amounts of data they collect by analyzing data from multiple sources and presenting it in easily digestible ways.