GCP Professional Data Engineer Guide – September 2020

I recently recalled my first experience with GCP. It was in London, shortly before the 2012 Olympics, on an online gaming project originally planned for AWS that was migrated to App Engine – the PaaS platform that would evolve into the current GCP.

My initial impression was good, although the platform imposed several development limitations, which were later eased with the release of App Engine Flexible.

Coinciding with the release of TensorFlow as an open-source framework in 2015, I was lucky enough to attend a workshop on neural networks – given by one of the AI scientists from Google Seattle – where I had my second experience with the platform. I was struck by the simplicity of configuration and deployment, the NoOps concept, and a Machine Learning / AI offering that was unmatched at the time.

Do Androids Dream of Electric Sheep? Philip K. Dick would have "hallucinated" at the electric dreams of neural networks – powered by TensorFlow.

Exam

The structure of the exam is the usual one for GCP exams: 2 hours and 50 questions, with a format oriented towards scenario-type questions, mixing very difficult questions with simpler ones of medium-to-low difficulty.

In general, to choose the correct answer you have to apply both technical and business criteria. You therefore need deep knowledge of the services from a technological point of view, as well as the skill and experience to apply business criteria contextually, depending on the question, the type of environment, the sector, the application, and so on.

Image #1, Data Lake, the ubiquitous architecture – Image owned by GCP

We can group the relevant services according to the stages (and sub-stages) of the data lifecycle:

Management, Storage, Transformation and Analysis.

  • Ingestion Batch / Data Lake: Cloud Storage.
  • Ingestion Streaming: Kafka, Pub/Sub, Computing Services, Cloud IoT Core.
  • Migrations: Transfer Appliance, Transfer Service, Interconnect, gsutil.
  • Transformations: Dataflow, Dataproc, Cloud Dataprep, Hadoop, Apache Beam.
  • Computing: Kubernetes Engine, Compute Instances, Cloud Functions, App Engine.
  • Storage: Cloud SQL, Cloud Spanner, Datastore / Firebase, BigQuery, BigTable, HBase, MongoDB, Cassandra.
  • Cache: Cloud Memorystore, Redis.
  • Analysis / Data Operations: BigQuery, Cloud Datalab, Data Studio, DataPrep, Cloud Composer, Apache Airflow.
  • Machine Learning: AI Platform, BigQuery ML, Cloud AutoML, TensorFlow, Cloud Text-to-Speech API, Cloud Speech-to-Text, Cloud Vision API, Cloud Video AI, Translations, Recommendations API, Cloud Inference API, Natural Language, Dialogflow, Spark MLlib.
  • IoT: Cloud IoT Core, Cloud IoT Edge.
  • Security & encryption: IAM, Roles, Encryption, KMS, Data Loss Prevention (DLP) API, Compliance …
  • Operations: Kubeflow, AI Platform, Cloud Deployment Manager …
  • Monitoring: Stackdriver Logging, Stackdriver Monitoring.
  • Optimization: Cost control, Autoscaling, Preemptible instances …
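The streaming side of this list (Pub/Sub for ingestion, Dataflow for transformation) revolves around windowing: grouping an unbounded stream of timestamped events into finite buckets that can be aggregated. As a plain-Python sketch of the fixed-window concept – the function name and event format are mine, not any GCP API:

```python
from collections import Counter

def fixed_windows(events, window_secs=60):
    """Assign each (timestamp, value) event to a fixed-size window and count
    events per window - a toy sketch of the fixed-windowing idea that Dataflow
    (Apache Beam) applies to unbounded streams."""
    counts = Counter()
    for ts, _value in events:
        window_start = (ts // window_secs) * window_secs  # floor to window boundary
        counts[window_start] += 1
    return dict(counts)

events = [(5, "a"), (59, "b"), (61, "c"), (130, "d")]
print(fixed_windows(events))  # {0: 2, 60: 1, 120: 1}
```

The real thing also has to deal with late data, watermarks and triggers, which is exactly the kind of detail the exam probes.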

Pre-requisites and recommendations

At this level of certification, the questions do not, in general, refer to a single topic. That is, a question from the Analytics domain may require more or less advanced knowledge of Computing, Security, Networking or DevOps to solve successfully. I'd recommend holding the GCP Associate Cloud Engineer certification, or having equivalent knowledge.

  • GCP experience at the architectural level. The exam focuses, in part, on solution architecture and the design and deployment of data pipelines; on selecting technologies to solve business problems; and, to a lesser extent, on development. I'd recommend studying as many reference architectures as possible, such as the ones shown in this guide.
  • GCP experience at the development level. Although no explicit programming questions appeared in my question set, or in the mock test, the exam requires technical knowledge of services and APIs: SQL, Python, REST, algorithms, MapReduce, Spark, Apache Beam (Dataflow).
  • GCP experience at the Security level. A domain that appears transversally in all certifications – I'd recommend knowledge at the Associate Cloud Engineer level.
  • GCP experience at the Networking level. Another domain that appears transversally – I'd recommend knowledge at the Associate Cloud Engineer level.
  • Knowledge of Data Analytics. It’s a no-brainer, but some domain knowledge is essential. Otherwise, I'd recommend studying books like “Data Analytics with Hadoop” or taking courses like the Specialized Program: Data Engineering, Big Data and ML on Google Cloud on Coursera. Likewise, practising with labs or pet projects is essential to gain some hands-on experience.
  • Knowledge of the Hadoop – Spark ecosystem. Connected with the previous point. High-level knowledge of the ecosystem is necessary: MapReduce, Spark, Hive, HDFS, Pig …
  • Knowledge of Machine Learning and IoT. Advanced knowledge of Data Science and Machine Learning is essential, on top of specific knowledge of GCP products. There are questions exclusively about this domain – at the level of certifications like AWS Machine Learning, or higher. IoT appears on the exam in a lighter form, but it is essential to know the reference architecture and services.
  • DevOps experience. Concepts such as CI/CD and infrastructure or configuration as code are of great importance today, and this is reflected in the exam, although they do not carry great specific weight.
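Since MapReduce comes up as assumed background, here is a minimal pure-Python sketch of the model – a word count, the canonical example. The function names are illustrative, not any framework's API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: sort/group the pairs by key, then sum each group.
    In a real cluster this grouping happens across machines."""
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(count for _, count in group)
    return counts

docs = ["big data on gcp", "data pipelines move data"]
print(reduce_phase(map_phase(docs)))
# {'big': 1, 'data': 3, 'gcp': 1, 'move': 1, 'on': 1, 'pipelines': 1}
```

The same map/shuffle/reduce shape underlies Hadoop, Spark and Beam pipelines, which is why it keeps surfacing in scenario questions.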

Standard questions

A question representative of the exam's difficulty level.

Image property of GCP

A practical migration-scenario question that covers cloud services and the Hadoop ecosystem, as well as concepts from the Analytics domain.

Services to study in detail

Image #2 – property of GCP
  • Cloud Storage – A core service that appears consistently in all certifications, and is central to Data Lake systems. I'd recommend studying it in detail at the architectural level – see Image #1 – including its configuration according to data temperature, and its role as an integration/storage element between the different services.
  • BigQuery – A core service in the GCP Analytics domain as a BI and storage element. It is extremely important in the exam, so it has to be studied in detail: architecture, configuration, backups, export/import, streaming, batch, security, partitioning, sharding, projects, datasets, views, integration with other services, cost, queries and SQL optimization (legacy and standard) at the level of tables, keys …
  • Pub/Sub – A core service as an ingestion and integration element. Its in-depth study is highly recommended: use cases, architecture, configuration, API, security and integration with other services (e.g. Dataflow, Cloud Storage) – it is the cloud-native counterpart of Kafka.
  • Dataflow – A core service in the GCP Analytics domain as a processing and transformation element. It is an implementation based on Apache Beam, which you need to know at a high level, along with pipeline design. Use cases, architecture, configuration, API and integration with other services.
  • Dataproc – A core service in the GCP Analytics domain as a processing and transformation element. It is a service based on Hadoop, and is therefore the indicated service for migrating an existing Hadoop installation to the cloud. In this case, knowledge is required not only of Dataproc but also of the native services: Spark, HDFS, HBase, Pig … use cases, architecture, configuration, import/export, reliability, optimization, cost, API and integration with other services.
  • Cloud SQL, Cloud Spanner – Cloud-native relational databases. Use cases, architecture, configuration, security, performance, reliability, cost and optimization: clusters, transactions, disaster recovery, backups, export/import, SQL performance and optimization, tables, queries, keys and debugging. Integration with other services.
  • Cloud Bigtable – A low-latency managed NoSQL database, suitable for time series, IoT … and ideal for replacing an on-premises HBase installation. Use cases, architecture, configuration, security, performance, reliability and optimization: clusters, CAP, backups, export/import, partitioning, and the performance and optimization of tables, queries and keys. Integration with other services.
  • Machine Learning – One of the strengths of the certification is the "Operationalizing machine learning models" domain. It is much denser and more complex than it may seem at first, since it covers not only the operation of the relevant GCP services but also Data Science fundamentals: algorithm selection, optimization, metrics … The difficulty of the questions varies, but is comparable to that of specific certifications such as AWS Certified Machine Learning – Specialty. Most important services: BigQuery ML, Cloud Vision API, Cloud Video Intelligence, Cloud AutoML, TensorFlow, Dialogflow, GPUs, TPUs.
  • Security – Security is a transversal concern across all domains, and appears consistently in all certifications. Here it appears as an independent technical topic, a crosscutting concern, or a business requirement: KMS, IAM, Policies, Roles, Encryption, Data Loss Prevention (DLP) API …
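To make the BigQuery partitioning vs. sharding distinction concrete: legacy date-sharding creates one physical table per day (events_20200901), queried together via wildcards and _TABLE_SUFFIX, while a partitioned table is a single table whose daily partitions are addressed with a $ decorator (events$20200901). A small sketch – the helper names are mine, for illustration only:

```python
from datetime import date

def sharded_table_name(base, day):
    """Legacy date-sharding: one physical table per day, e.g. events_20200901."""
    return f"{base}_{day:%Y%m%d}"

def partition_decorator(base, day):
    """Partitioned table: a single table; the $ decorator addresses one
    daily partition, e.g. events$20200901."""
    return f"{base}${day:%Y%m%d}"

d = date(2020, 9, 1)
print(sharded_table_name("events", d))   # events_20200901
print(partition_decorator("events", d))  # events$20200901
```

Exam scenarios often hinge on knowing that partitioned tables are the recommended replacement for sharding, with lower metadata overhead and partition pruning for cost control.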
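A Pub/Sub semantic worth internalizing for the exam is that every subscription attached to a topic receives every message, so multiple consumers (say, a Dataflow pipeline and an archiver) can process the same stream independently. A toy in-memory model of that fan-out behaviour – emphatically not the real google-cloud-pubsub API:

```python
from collections import defaultdict, deque

class FakePubSub:
    """Toy in-memory model of Pub/Sub fan-out semantics (not the real API):
    publishing to a topic enqueues the message on every attached subscription."""
    def __init__(self):
        self.subscriptions = defaultdict(dict)  # topic -> {sub_name: deque}

    def create_subscription(self, topic, name):
        self.subscriptions[topic][name] = deque()

    def publish(self, topic, message):
        # Fan-out: each subscription's queue receives the message.
        for queue in self.subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic, name):
        queue = self.subscriptions[topic][name]
        return queue.popleft() if queue else None

bus = FakePubSub()
bus.create_subscription("clicks", "dataflow-sub")
bus.create_subscription("clicks", "archive-sub")
bus.publish("clicks", "user-u1-click")
print(bus.pull("clicks", "dataflow-sub"))  # user-u1-click
print(bus.pull("clicks", "archive-sub"))   # user-u1-click
```

The real service adds acknowledgement deadlines, at-least-once delivery and no ordering guarantee by default, all of which are fair game in scenario questions.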
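For Bigtable, row-key design is a classic exam topic: monotonically increasing keys (plain timestamps) concentrate writes on a single node, so a common time-series pattern is to lead with the entity id (spreading load across entities) and use a reversed timestamp so the newest readings sort first within each entity. A sketch – the key format here is illustrative, not prescribed by the service:

```python
import sys

def reversed_timestamp_key(device_id, epoch_seconds):
    """Bigtable row keys sort lexicographically, so subtracting the timestamp
    from a large constant makes newer rows sort before older ones within a
    device prefix - handy for 'latest N readings' scans."""
    reversed_ts = sys.maxsize - epoch_seconds
    return f"{device_id}#{reversed_ts}"

newer = reversed_timestamp_key("sensor-42", 1_600_000_100)
older = reversed_timestamp_key("sensor-42", 1_600_000_000)
print(newer < older)  # True: the newer reading sorts first
```

The same reasoning (avoid hotspotting, design keys around your query pattern) covers most Bigtable schema questions.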
Image #3, IoT Reference Architecture – owned by GCP

Very important services to consider

  • Networking – A cross-cutting domain that can appear in the form of separate technical questions, crosscutting concerns, or business requirements: VPC, Direct Interconnect, Multi-Region/Zone, Hybrid connectivity, Firewall rules, Load Balancing, Network Security, Container Networking, API Access (private/public) …
  • Hadoop – The exam covers ecosystems and third-party services like Hadoop, Spark, HDFS, Hive, Pig … use cases, architecture, functionality, integration and migration to GCP.
  • Apache Kafka – An alternative service to Pub/Sub, so it is advisable to study it at a high level: use cases, operational characteristics, configuration, migration and integration with GCP – plugins, connectors.
  • IoT – It can appear in various questions at the architectural level: use cases, reference architecture and integration with other services. IoT core, Edge Computing.
  • Datastore / Firebase – A document database. Use cases, configuration, performance, entity model, keys and index optimization, transactions, backups, export/import and integration with other services. It doesn’t carry as much weight as the other data repositories.
  • Cloud Memorystore / Redis – A structured-data cache repository. Use cases, architecture, configuration, performance, reliability and optimization: clusters, backups, export/import and integration with other services.
  • Cloud Dataprep – Use cases, console and general operation, supported formats, and Dataflow integration.
  • Cloud Stackdriver – Use cases, monitoring and logging, both at the system and application level: Cloud Stackdriver Logging, Cloud Stackdriver Monitoring, Stackdriver Agent and plugins.
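Memorystore questions often reduce to the cache-aside pattern: read from the cache, fall back to the slower store on a miss, and populate the cache on the way out. A minimal sketch, with a plain dict standing in for Redis/Memorystore (a real client would also set a TTL on each entry):

```python
def make_cached_lookup(expensive_lookup, cache):
    """Cache-aside: consult the cache first; on a miss, hit the source of
    truth and store the result. `cache` is a dict standing in for a Redis
    GET/SET client in this sketch."""
    def lookup(key):
        if key in cache:
            return cache[key]          # cache hit: no source call
        value = expensive_lookup(key)  # cache miss: go to the source
        cache[key] = value             # populate for subsequent reads
        return value
    return lookup

calls = []
def query_db(key):
    calls.append(key)  # track how often the "database" is actually hit
    return key.upper()

cached = make_cached_lookup(query_db, {})
print(cached("gcp"), cached("gcp"), len(calls))  # GCP GCP 1
```

The second lookup never touches the source, which is the latency and cost argument behind putting Memorystore in front of a database.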

Other services

  • MongoDB, Cassandra – NoSQL databases that can appear in different scenarios. Use cases, architecture and integration with other services.
  • Cloud Composer – Use cases, general operation and web console, DAG configuration, supported formats, import/export, integration with other services, connectors.
  • Data Studio – Use cases, configuration, networking, security, general operation and environment, and integration with other services.
  • Cloud Datalab – Use cases, general operation and web console, chart types, supported formats, import/export and integration with other services.
  • Kubernetes Engine – Use cases, architecture, clustering and integration with other services.
  • Kubeflow – Use cases, architecture, environment configuration, Kubernetes.
  • Apache Airflow – Use cases, architecture and general operation.
  • Cloud Functions – Use cases, architecture, configuration and integration with other services – such as Cloud Storage and Pub/Sub, in push/pull mode.
  • Compute Engine – Use cases, architecture, configuration, high availability, reliability and integration with other services.
  • App Engine – Use cases, architecture and integration with other services.

Bibliography & essential resources

Google provides a large number of resources for preparing for this certification, in the form of courses, an official guide book, documentation and mock exams. These resources are highly recommended and, in some cases, I would say essential.

The Certification Preparation Course, included in the Data Engineering Specialized Program, contains an extra exam, plenty of additional tips and materials, and labs – using the external Qwiklabs tool.


A selection of the bibliography I used to prepare for the certification

As I have previously indicated, I find the Google courses on Coursera excellent, as they combine short videos, reading material, labs and test questions, creating a very dynamic experience. In any case, they should only be considered a starting point; you still need to go deeper into each of the domains – depending on your experience – using, for instance, the excellent GCP documentation.

But you should not limit yourself to online courses. I can’t hide the fact that I love books in general, and IT books in particular. In fact, I have a huge collection of books dating back to the 80s, which at some point I will donate to a local Cervantina bookstore.

Books provide a deeper experience than videos, which can become a bit monotonous if they are too long – as well as being a much more passive experience, like watching TV. The ideal is a combination of audiovisual and written media, creating your own learning path.

Laboratories

Image #4 – Data Lake based upon Cloud Storage – owned by GCP

Part of the job as a Data Engineer consists of creating, integrating, deploying and maintaining data pipelines, both in batch and streaming mode.

The Data Engineering Quest contains several labs that introduce the creation of different data transformation, IoT, and Machine Learning pipelines, so I find them excellent exercises – and not just for certification.

Is it worth it?

This is an advanced certification and, in general, it should not be the first cloud certification you attempt. It covers a large amount of material and many domains, so tackling it without a certain level of prior knowledge can be quite a complex task.

If we compare it with its counterpart certification on the AWS platform, it covers almost twice as much material, mainly due to the inclusion of questions on the Machine Learning / Data Science domain – which AWS has removed and moved into its own dedicated certification. It is therefore like taking two certifications in one.

Is it worth it? Of course – but not as a first certification, depending on your prior experience.

Certifications are a good way not only to validate knowledge externally, but also to gather up-to-date information, validate good practices and consolidate knowledge with real (or nearly real) practical cases.

Good luck to you all!
