GCP Professional Data Engineer Guide – September 2020


I recently recalled my first experience with GCP. It was in London, shortly before the 2012 Olympics, on an online gaming project initially planned for AWS that was migrated to App Engine – the PaaS platform that would evolve into the current GCP.

My initial impression was good, although the platform imposed several development limitations, which would later be eased with the release of App Engine Flexible.

Coinciding with TensorFlow’s launch as an open-source framework in 2015, I was lucky enough to attend a workshop on neural networks – given by one of the AI scientists from Google Seattle – where I had my second experience with the platform. I was struck by the simplicity of configuration and deployment, the NoOps concept, and a Machine Learning / AI offering without competition at the time.

Do Androids Dream of Electric Sheep? Philip K. Dick would have “hallucinated” at the electric dreams of neural networks – powered by TensorFlow.

Exam

The exam structure is the usual one for GCP exams: 2 hours and 50 questions, in a format geared towards scenario-type questions, mixing some of great difficulty with simpler ones of medium-to-low difficulty.

In general, to choose the correct answer, you must apply both technical and business criteria. You therefore need deep technical knowledge of the services, plus the skill/experience to apply business criteria contextually, depending on the question, the type of environment, the sector, the application, etc …

Image #1, Data Lake, the ubiquitous architecture – Image owned by GCP

Pre-requisites and recommendations

At this level of certification, the questions do not refer, in general, to a single topic. That is, a question from the Analytics domain may require more or less advanced knowledge of Computing, Security, Networking or DevOps to solve it successfully. I’d recommend having the GCP Associate Cloud Engineer certification or having equivalent knowledge.

  • GCP experience at the architectural level – In part, the exam focuses on solution architecture, the design and deployment of data pipelines, the selection of technologies to solve business problems and, to a lesser extent, development. I’d recommend studying as many reference architectures as possible, such as those I show in this guide.
  • GCP experience at the development level – Although no direct programming questions appeared in my question set or the mock test, the exam requires technical knowledge of services and APIs: SQL, Python, REST, algorithms, MapReduce, Spark, Apache Beam (Dataflow) …
  • GCP experience at the Security level – A domain that appears transversally in all certifications – I’d recommend knowledge at the Associate Engineer level.
  • GCP experience at the Networking level – Another transversal domain – I’d recommend knowledge at the Associate Engineer level.
  • Knowledge of Data Analytics – It’s a no-brainer, but some domain knowledge is essential. Otherwise, I’d recommend studying books like “Data Analytics with Hadoop” or taking courses like the Specialized Program: Data Engineering, Big Data and ML on Google Cloud on Coursera. Likewise, practising with labs or pet projects is essential to gain some hands-on experience.
  • Knowledge of the Hadoop / Spark ecosystem – Connected with the previous point. High-level ecosystem knowledge is necessary: MapReduce, Spark, Hive, HDFS, Pig …
  • Knowledge of Machine Learning and IoT – Advanced knowledge of Data Science and Machine Learning is essential, on top of specific knowledge of the GCP products. There are questions exclusively about this domain – at the level of certifications like AWS Machine Learning or higher. IoT appears on the exam in a lighter form, but knowing the reference architecture and services is essential.
  • DevOps experience – Concepts such as CI/CD and infrastructure or configuration as code are important today and are reflected in the exam. However, they do not carry great specific weight.
We can group the relevant services according to the states (and sub-states) of the data cycle – Management, Storage, Transformation and Analysis. A minimal pipeline sketch follows the list below:

  • Ingestion Batch / Data Lake: Cloud Storage.
  • Ingestion Streaming: Kafka, Pub/Sub, Computing Services, Cloud IoT Core.
  • Migrations: Transfer Appliance, Transfer Service, Interconnect, gsutil.
  • Transformations: Dataflow, Dataproc, Cloud Dataprep, Hadoop, Apache Beam.
  • Computing: Kubernetes Engine, Compute Instances, Cloud Functions, App Engine.
  • Storage: Cloud SQL, Cloud Spanner, Datastore / Firebase, BigQuery, BigTable, HBase, MongoDB, Cassandra.
  • Cache: Cloud Memorystore, Redis.
  • Analysis / Data Operations: BigQuery, Cloud Datalab, Data Studio, DataPrep, Cloud Composer, Apache Airflow.
  • Machine Learning: AI Platform, BigQuery ML, Cloud AutoML, TensorFlow, Cloud Text-to-Speech API, Cloud Speech-to-Text, Cloud Vision API, Cloud Video AI, Translations, Recommendations API, Cloud Inference API, Natural Language, Dialogflow, Spark MLlib.
  • IoT: Cloud IoT Core, Cloud IoT Edge.
  • Security & Encryption: IAM, Roles, Encryption, KMS, Data Prevention API, Compliance …
  • Operations: Kubeflow, AI Platform, Cloud Deployment Manager …
  • Monitoring: Stackdriver Logging, Stackdriver Monitoring.
  • Optimization: Cost control, Autoscaling, Preemptible instances …
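
To make the cycle concrete, below is a minimal Apache Beam (Dataflow) sketch of a streaming pipeline covering ingestion (Pub/Sub), transformation and storage (BigQuery). The project, topic, table and field names are hypothetical placeholders, not a reference architecture.

```python
# Minimal streaming pipeline sketch: Pub/Sub -> parse/filter -> BigQuery.
# All names (project, topic, table, fields) are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner to run on GCP

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeepValid" >> beam.Filter(lambda event: "user_id" in event)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user_id:STRING,action:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

The same Beam code runs locally with the DirectRunner for testing – portability of pipelines across runners is one of the points the exam likes to probe.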

Standard questions

Questions representative of the exam’s level of difficulty.

Image property of GCP

A practical migration-scenario question that combines cloud services, the Hadoop ecosystem and concepts from the Analytics domain.

Services to study in detail

Image #2 – property of GCP

  • Cloud Storage – Core service that appears consistently in all certifications and is central to Data Lake systems. I’d recommend studying it in detail at the architectural level – see Image 1 –, its storage classes according to data temperature, and its role as an integration/storage element between the different services.
  • BigQuery – Core service in the GCP Analytics domain as a BI and storage element. It is extremely important in the exam, so it has to be studied in detail: architecture, configuration, backups, export/import, streaming, batch, security, partitioning, sharding, projects, datasets, views, integration with other services, cost, queries, and SQL optimization (legacy and standard) at the table and key level … – see the partitioning sketch after this list.
  • Pub/Sub – Core service as an ingestion and integration element. Its in-depth study is highly recommended: use cases, architecture, configuration, API, security and integration with other services (e.g. Dataflow, Cloud Storage). It is GCP’s cloud-native counterpart to Kafka – see the publish/pull sketch after this list.
  • Dataflow – Core service in the GCP Analytics domain as a processing and transformation element. Its implementation is based on Apache Beam, which you must know at a high level, along with pipeline design. Use cases, architecture, configuration, API and integration with other services – the sketch after the data-cycle list above shows a minimal pipeline.
  • Dataproc – Core service in the GCP Analytics domain as a processing and transformation element. It is a managed Hadoop-based service and is therefore the indicated choice for migrating on-premise Hadoop clusters to the cloud. Here, knowledge is required of Dataproc itself and of the underlying open-source services: Spark, HDFS, HBase, Pig … use cases, architecture, configuration, import/export, reliability, optimization, cost, API and integration with other services.
  • Cloud SQL, Cloud Spanner – Cloud-native relational databases. Use cases, architecture, configuration, security, performance, reliability, cost and optimization: clusters, transactionality, disaster recovery, backups, export/import, SQL performance and optimization, tables, queries, keys and debugging. Integration with other services.
  • Cloud Bigtable – Low-latency managed NoSQL database, suitable for time series, IoT … ideal for replacing an on-premise HBase installation. Use cases, architecture, configuration, security, performance, reliability and optimization: clusters, CAP, backups, export/import, partitioning, and the performance and optimization of tables, queries and keys. Integration with other services – see the row-key sketch after this list.
  • Machine Learning – One of the certification’s strengths is the “Operationalizing machine learning models” domain. It is much denser and more complex than it may seem at first, since it covers both the operation of the relevant GCP services and the Data Science fundamentals: algorithm selection, optimization, metrics … The difficulty of the questions is variable but comparable to specific certifications such as AWS Certified Machine Learning – Specialty. Most essential services: BigQuery ML, Cloud Vision API, Cloud Video Intelligence, Cloud AutoML, TensorFlow, Dialogflow, GPUs, TPUs … – see the BigQuery ML sketch after this list.
  • Security – Security is a transversal concern across all domains and appears consistently in all certifications. In this case, it appears as an independent technical topic, a crosscutting concern or a business requirement: KMS, IAM, Policies, Roles, Encryption, Data Loss Prevention API …
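
As a hedged illustration of the BigQuery partitioning and cost points above, this sketch uses the google-cloud-bigquery Python client to create a date-partitioned table and then query a single partition; all project, dataset and column names are hypothetical.

```python
# Sketch: date-partitioned BigQuery table + partition-pruned query.
# Filtering on the partitioning column limits the data scanned (and billed).
# Project, dataset, table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("action", "STRING"),
    bigquery.SchemaField("ts", "TIMESTAMP"),
]
table = bigquery.Table("my-project.analytics.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="ts")  # daily partitions
client.create_table(table, exists_ok=True)

query = """
SELECT action, COUNT(*) AS n
FROM `my-project.analytics.events`
WHERE DATE(ts) = '2020-09-01'   -- prunes the scan to a single partition
GROUP BY action
"""
for row in client.query(query):
    print(row.action, row.n)
```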
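
A minimal Pub/Sub sketch, assuming a topic and pull subscription that already exist (the names are hypothetical): publish a message, then pull and acknowledge it.

```python
# Sketch: publish to and pull from Pub/Sub (google-cloud-pubsub >= 2.x API).
# Project, topic and subscription names are hypothetical.
from google.cloud import pubsub_v1

project = "my-project"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "events")
future = publisher.publish(topic_path, b'{"user_id": "42", "action": "login"}')
print("published:", future.result())  # blocks until the server assigns a message ID

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project, "events-sub")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for received in response.received_messages:
    print("got:", received.message.data)
subscriber.acknowledge(
    request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    }
)
```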
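
For the Bigtable key design mentioned above, a hedged sketch of a time-series write: putting the device ID first in the row key (rather than the timestamp) spreads writes across nodes and avoids hotspotting. The instance, table and column-family names are hypothetical.

```python
# Sketch: time-series write to Bigtable with a hotspot-avoiding row key.
# Instance, table, column family and device IDs are hypothetical.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("iot-instance").table("sensor-readings")

device_id = "sensor-0042"
now = datetime.datetime.now(datetime.timezone.utc)
# Device first, then a reversed timestamp so the most recent rows sort first.
reverse_ts = 10**13 - int(now.timestamp() * 1000)
row_key = f"{device_id}#{reverse_ts}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("readings", b"temperature", b"21.5", timestamp=now)
row.commit()
```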
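
Finally, to show how BigQuery ML folds model training into plain SQL, a minimal sketch assuming a hypothetical customers table with a churned label column:

```python
# Sketch: train and evaluate a logistic regression with BigQuery ML.
# Dataset, table, model and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

create_model = """
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `analytics.customers`
"""
client.query(create_model).result()  # training runs entirely inside BigQuery

# Standard evaluation metrics (precision, recall, ROC AUC ...) via ML.EVALUATE.
for row in client.query("SELECT * FROM ML.EVALUATE(MODEL `analytics.churn_model`)"):
    print(dict(row.items()))
```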
Image #3, IoT Reference Architecture – owned by GCP

Essential services to consider

  • Networking – A cross-domain topic that can appear in the form of separate technical questions, cross-cutting concerns or business requirements: VPC, Direct Interconnect, Multi-Region / Zone, hybrid connectivity, firewall rules, Load Balancing, network security, container networking, API access (private/public) …
  • Hadoop – The exam covers ecosystems and third-party services like Hadoop, Spark, HDFS, Hive, Pig … use cases, architecture, functionality, integration and migration to GCP.
  • Apache Kafka – Alternative service to Pub / Sub, so it is advisable to study it at a high level: use cases, operational characteristics, configuration, migration and integration with GCP – plugins, connectors.
  • IoT – It can appear in various questions at the architectural level: use cases, reference architecture and integration with other services. IoT core, Edge Computing.
  • Datastore / Firebase – Document database. Use cases, configuration, performance, entity model, keys and index optimization, transactions, backups, export/import and integration with other services. It doesn’t carry as much weight as the other data repositories.
  • Cloud Memorystore / Redis – Structured-data cache repository. Use cases, architecture, configuration, performance, reliability and optimization: clusters, backups, export/import and integration with other services – see the cache-aside sketch after this list.
  • Cloud Dataprep – Use cases, console and general operation, supported formats, and Dataflow integration.
  • Cloud Stackdriver – Use cases, monitoring and logging at both the system and application level: Stackdriver Logging, Stackdriver Monitoring, the Stackdriver agent and plugins.
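
As a hedged sketch of the cache-aside pattern against a Memorystore (Redis) instance: the private IP, key scheme and TTL below are illustrative choices, and Memorystore is reached over the VPC rather than a public endpoint.

```python
# Sketch: cache-aside pattern with redis-py against Memorystore.
# The host IP, key scheme and TTL are hypothetical.
import json

import redis

r = redis.Redis(host="10.0.0.3", port=6379)  # Memorystore private IP

def get_user_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit
    profile = {"user_id": user_id}           # placeholder for a real DB lookup
    r.setex(key, 300, json.dumps(profile))   # cache for 5 minutes
    return profile
```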

Other services

  • MongoDB, Cassandra – NoSQL databases that can appear in different scenarios. Use cases, architecture and integration with other services.
  • Cloud Composer – Use cases, general operation and web console, DAG configuration, supported formats, import/export, integration with other services, and connectors.
  • Data Studio – Use cases, configuration, networking, security, general operation and environment, and integration with other services.
  • Cloud Datalab – Use cases, general operation and web console, types of diagrams, supported formats, import/export and integration with other services.
  • Kubernetes Engine – Use cases, architecture, clustering and integration with other services.
  • Kubeflow – Use cases, architecture, environment configuration, Kubernetes.
  • Apache Airflow – Use cases, architecture and general operation.
  • Cloud Functions – Use cases, architecture, configuration and integration with other services – such as Cloud Storage and Pub/Sub, in push/pull mode. See the trigger sketch after this list.
  • Compute Engine – Use cases, architecture, configuration, high availability, reliability and integration with other services.
  • App Engine – Use cases, architecture and integration with other services.
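
As a hedged example of the Pub/Sub integration mentioned in the Cloud Functions item above, a minimal background function (1st-gen Python runtime) triggered by a topic; the function and topic names are hypothetical.

```python
# Sketch: background Cloud Function (Python, 1st gen) triggered by Pub/Sub.
# Deploy with, e.g.:
#   gcloud functions deploy handle_event --runtime python38 --trigger-topic events
import base64
import json

def handle_event(event, context):
    """Entry point. `event` carries the Pub/Sub message, `context` its metadata."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    print(f"event_id={context.event_id} action={payload.get('action')}")
```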

Bibliography & essential resources

Google provides many resources for preparing this certification in the form of courses, an official guidebook, documentation and mock exams. These resources are highly recommended and, in some cases, I would say essential.

The Data Engineering Specialized Program contains the Certification Preparation Course, which includes an extra exam, lots of additional tips, materials and labs – using the external Qwiklabs tool.

As I have previously indicated, I find the Google courses on Coursera excellent. They combine short videos, reading material, labs and test questions, creating a very dynamic experience. In any case, they should only be considered a starting point; you then need to go deeper – according to your experience – into each of the domains using, for instance, the excellent GCP documentation.

But you should not limit yourself to online courses. I can’t hide the fact that I love books in general and IT books in particular. In fact, I have a vast collection of books dating back to the 80s, which at some point I will donate to a local Cervantina bookstore.

Books provide a more profound experience than videos, which can become monotonous if they are too long and are a much more passive medium – like watching TV. Ideally, you combine audiovisual and written media, creating your own learning path.

Laboratories

Image #4 – Data Lake based upon Cloud Storage – owned by GCP

Part of the job as a Data Engineer consists of creating, integrating, deploying and maintaining data pipelines, both in batch and streaming mode.

The Data Engineering Quest contains several labs that introduce different data transformation, IoT and Machine Learning pipelines; I find them excellent exercises – and not just for the certification.

Is it worth it?

The certification level is advanced and, in general, this should not be the first cloud certification you obtain. It covers a large amount of material and many domains, so tackling it without a certain level of prior knowledge can be quite a complex task.

Let’s compare it with the mirror certification on the AWS platform: the GCP exam covers almost twice as much material, mainly due to the inclusion of questions from the Machine Learning / Data Science domain – which AWS removed and placed into its own specialty certification. It is, therefore, like taking two certifications in one.

Is it worth it? Of course, but not as a first certification – depending on your prior experience.

Certifications are an excellent way to validate knowledge externally, gather up-to-date information, validate good practices, and consolidate knowledge with real (or almost real) practical cases.

Good luck to you all!