GCP Professional Data Engineer Guide – September 2020

I have recently recalled my first experience with GCP. It was in London, shortly before the 2012 Olympics, in an online gaming project, initially thought for AWS, that was migrated to App Engine – PAAS platform that would evolve to the current GCP.

My initial impression was good, although the platform imposed a number of development limitations, which would be reduced later with the release of App Engine Flexible.

Coinciding with the launch of Tensor Flow as an Open Source framework in 2015, I was lucky enough to attend a workshop on neural networks – given by one of the AI scientists from Google Seattle – where I had my second experience with the platform. I was very surprised by the simplicity of configuration and deployment, the NoOps concept and a Machine Learning / AI offering, without competition at the time.

Do Androids Dream of Electric Sheep? Philip K. Dick would have “hallucinated” with the electrical dreams of neural networks – powered by Tensor Flow.

Exam

The structure of the exam is the usual one in GCP exams: 2 hours and 50 questions, with a format directed towards scenario-type questions, mixing questions of high difficulty with simpler ones of medium-low difficulty.

In general, to choose the correct answer, you have to apply technical and business criteria. Therefore, it is necessary a deep knowledge of the services from the technological point of view, as well as skill / experience to apply the business criteria in a contextual way, depending on the question, type of environment, sector, application, etc .. .

Image #1, Data Lake, the ubiquitous architecture – Image owned by GCP

We can group the relevant services according to the states (and substates) of the data cycle:

Management, Storage, Transformation and Analysis.

  • Ingestion Batch / Data Lake: Cloud Storage.
  • Ingestion Streaming: Kafka, Pub/Sub, Computing Services, Cloud IoT Core.
  • Migrations: Transfer Appliance, Transfer Service, Interconnect, gsutil.
  • Transformations: Dataflow, Dataproc, Cloud Dataprep, Hadoop, Apache Beam.
  • Computing: Kubernetes Engine, Compute Instances, Cloud Functions, App Engine.
  • Storage: Cloud SQL, Cloud Spanner, Datastore / Firebase, BigQuery, BigTable, HBase, MongoDB, Cassandra.
  • Cache: Cloud Memorystore, Redis.
  • Analysis / Data Operations: BigQuery, Cloud Datalab, Data Studio, DataPrep, Cloud Composer, Apache Airflow.
  • Machine Learning: AI Platform, BigQueryML, Cloud AutoML, Tensor Flow, Cloud Text-to-Speech API, Cloud Speech-to-Text, Cloud Vision API, Cloud Video AI, Translations, Recommendations API, Cloud Inference API, Natural Language, DialogFlow, Spark MLib.
  • IoT: Cloud IoT Core, Cloud IoT Edge.
  • Security & encryption: IAM, Roles, Encryption, KMS, Data Prevention API, Compliance …
  • Operations: Kubeflow, AI Platform, Cloud Deployment Manager …
  • Monitorization: Cloud Stackdriver Logging, Stackdriver Monitoring.
  • Optimization: Cost control, Autoscaling, Preemptive instances …

Pre-requisites and recommendations

At this level of certification, the questions do not refer, in general, to a single topic. That is, a question from the Analytics domain may require more or less advanced knowledge of Computing, Security, Networking or DevOps to be able to solve it successfully. I´d recommend having the GCP Associate Cloud Engineer certification or have equivalent knowledge.

  • GCP experience at the architectural level. The exam is focused, in part, on the architecture solution, design and deployment of data pipelines; selection of technologies to solve business problems, and to a lesser extent development. I´d recommend studying as many reference architectures as possible, such as the ones I show in this guide.
  • GCP experience at the development level. Although no explicit programming questions appeared in my question set, or in the mock test, the exam requires technical knowledge of services and APIS: SQL, Python, REST, algorithms, Map-Reduce, Spark, Apache Beam (Dataflow)
  • GCP experience at Security level. Domain that appears transversally in all certifications – I´d recommend knowledge at the level of Associate Engineer.
  • GCP experience at Networking level. Another domain that appears transversely – I´d recommend knowledge at the level of Associate Engineer.
  • Knowledge of Data Analytics. It’s a no-brainer, but some domain knowledge is essential. Otherwise, I´d recommend studying books like “Data Analytics with Hadoop” or taking courses like Specialized Program: Data Engineering, Big Data and ML on Google Cloud in Coursera. Likewise, practicing with laboratories or pet projects is essential to obtain some practical experience.
  • Knowledge of the Hadoop – Spark ecosystem. Connected with the previous point. High-level knowledge of the ecosystem is necessary: Map Reduce, Spark, Hive, Hdfs, Pig
  • Knowledge of Machine Learning and IoT. Advanced knowledge in Data Science and Machine Learning is essential, apart from specific knowledge of GCP products. There are questions exclusively about this domain – at the level of certifications like AWS Machine Learning or higher. IoT appears on the exam in a lighter form, but it is essential to know the architecture and services of reference.
  • DevOps experience. Concepts such as CI / CD, infrastructure or configuration as code, are of great importance today, and this is reflected in the exam, although they do not have a great specific weight.

Standard questions

Representative question of the level of difficulty of the exam.

Image property of GCP

Practical migration scenario question, that includes cloud services and the Hadoop ecosystem, as well as concepts from the Analytics domain.

Services to study in detail

Image #2 – property of GCP
  • Cloud Storage – Core service that appears consistently in all certifications, and is central in the Data Lake systems. I´d recommend its study in detail at an architectural level – see Image 1 -, configurations according to the data temperature, and as an integration / storage element between the different services
  • BigQuery – Core service in the Analytics GCP domain as a BI and storage element. Extremely important in the exam, so have to be studied in detail: architecture, configuration, backups, export / import, streaming, batch, security, partitioning, sharding, projects, datasets, views, integration with other services, cost, queries and optimization SQL (legacy and standard) at table levels, keys …
  • Pub / Sub – Core service as an element of ingestion and integration. Its in-depth study is highly recommended: use cases, architecture, configuration, API, security and integration with other services (eg Dataflow, Cloud Storage) – Kafka’s native cloud mirror service.
  • Dataflow – Core service in the Analytics GCP domain as a process and transformation element. Implementation based on Apache Beam that is necessary to know at a high level and pipeline design. Use cases, architecture, configuration, API and integration with other services.
  • Dataproc – Core service in the Analytics GCP domain as a process and transformation element. It is a service based on Hadoop, and therefore, it is the indicated service for a migration to the cloud. In this case, not only knowledge of Dataproc is required, but also in native services: Spark, HDFS, HBase, Pig … use cases, architecture, configuration, import / export, reliability, optimization, cost, API and integration with other services.
  • Cloud SQL, Cloud Spanner – Cloud native relational databases. Use cases, architecture, configuration, security, performance, reliability, cost and optimization: clusters, transactionality, disaster recovery, backups, export / import, SQL performance and optimization, tables, queries, keys and debugging. Integration with other services.
  • Cloud Bigtable – Low latency NoSQL managed database, suitable for time series, IoT… ideal to replace a HBase installation on premise. Use cases, architecture, configuration, security, performance, reliability and optimization: clusters, CAP, backups, export / import, partitioning, performance, and optimization of tables, queries, keys. Integration with other services.
  • Machine Learning – One of the strengths of the certification is the domain “Operationalizing machine learning models”. Much more dense and complex than it may seem at first, since it not only includes the operability and knowledge of the relevant GCP services; likewise, it includes the knowledge of the Data Science fundamentals: algorithm selection, optimization, metrics … The level of difficulty of the questions is variable, but comparable to that of specific certifications, such as AWS Certified Machine Learning – Specialty. Most important services: BigQuery ML, Cloud Vision API, Cloud Video Intelligence, Cloud AutoML, Tensor Flow, Dialogflow, GPU´s, TPU´s
  • Security – Security is a transversal concern across all domains, and appears consistently in all certifications. In this case, it appears as an independent technical topic, crosscutting concern or as a business requirement: KMS, IAM, Policies, Roles, Encryption, Data Prevention API …
Image #3, IoT Reference Architecture – owned by GCP

Very important services to consider

  • Networking – Cross-domain that can appear in the form of separate technical issues, cross cutting concerns, or as business requirements: VPC, Direct Interconnect, Multi Region / Zone, Hybrid connectivity, Firewall rules, Load Balancing, Network Security, Container Networking, API Access ( private / public) …
  • Hadoop – The exam covers ecosystems and third-party services like Hadoop, Spark, HDFS, Hive, Pig … use cases, architecture, functionality, integration and migration to GCP.
  • Apache Kafka – Alternative service to Pub / Sub, so it is advisable to study it at a high level: use cases, operational characteristics, configuration, migration and integration with GCP – plugins, connectors.
  • IoT – It can appear in various questions at the architectural level: use cases, reference architecture and integration with other services. IoT core, Edge Computing.
  • Datastore / Firebase – Document database. Use cases, configuration, performance, entity model, keys and index optimization, transactions, backups, export / import and integration with other services. It doesn’t carry as much weight as the other data repositories.
  • Cloud Memory Store / Redis – Structured data cache repository. Use cases, architecture, configuration, performance, reliability and optimization: clusters, backups, export / import and integration with other services.
  • Cloud Dataprep – Use cases, console and general operation, supported formats, and Dataflow integration.
  • Cloud Stackdriver – Use cases, monitoring and logging, both at the system and application level: Cloud Stackdriver Logging, Cloud Stackdriver Monitoring, Stackdriver Agent and plugins.

Other services

  • MongoDB, Cassandra – NoSQL databases that can appear in different scenarios. Use cases, architecture and integration with other services.
  • Cloud Composer – Use cases, general operation and web console, configuration of diagram types, supported formats, import / export, integration with other services, connectors.
  • Cloud Data Studio – Use cases, configuration, networking, security, general operation and environment, and integration with other services.
  • Cloud Data Lab – Use cases, general operation and web console, types of diagrams, supported formats, import / export and integration with other services.
  • Kubernetes Engine – Use cases, architecture, clustering and integration with other services.
  • Kubeflow – Use cases, architecture, environment configuration, Kubernetes.
  • Apache Airflow – Use cases, architecture and general operation.
  • Cloud Functions Use cases, architecture, configuration and integration with other services – such as Cloud Storage and Pub / Sub, in Push / Pull mode.
  • Compute Engine – Use cases, architecture, configuration, high availability, reliability and integration with other services.
  • App Engine – Use cases, architecture and integration with other services.

Bibliography & essential resources

Google provides a large number of resources for the preparation of this certification, in the form of courses, official guide book, documentation and mock exams. These resources are highly recommended, and in some cases, I would say essential.

The Certification Preparation Course, contained in the Data Engineering Specialized Program, includes an extra exam, lots of additional tips and materials and labs – using the external Qwik Labs tool.


Bibliography (selection) that I have used for the preparation of the certification

As I have previously indicated, I find the Google courses on Coursera to be excellent, as they combine a series of short videos, reading material, labs, and test questions, thus creating a very dynamic experience. In any case, they should only be considered as a starting point, being necessary the deepening – according to experience – in each one of the domains using, for instance, the excellent GCP documentation.

But you should not limit yourself to online courses. I can’t hide the fact that I love books in general, and IT books in particular. In fact, I have a huge collection of books dating back to the 80s, which at some point I will donate to a local Cervantina bookstore.

Books provide a deeper and more dynamic experience than videos, which can be a bit monotonous if they are too long – as well as being a much more passive experience – like watching TV. The ideal is the combination of audiovisual and written media, thus creating your own learning path.

Laboratories

Image #4 – Data Lake based upon Cloud Storage – owned by GCP

Part of the job as a Data Engineer consists of creating, integrating, deploying and maintaining data pipelines, both in batch and streaming mode.

The Data Engineering Quest contains several labs that introduce the creation of different data transformation, IoT, and Machine Learning pipelines, so I find them excellent exercises – and not just for certification.

Is it worth it?

The level of certification is advanced, and in general, it should not be the first cloud certification to obtain. It covers a large amount of material and domains, so tackling it without a certain level of prior knowledge can be quite a complex task.

If we compare it with the mirror certification on the AWS platform, it covers almost twice as much material, mainly due to the inclusion of questions about the Machine Learning / Data Science domain – which in the case of AWS have been eliminated, to be included in its own certification. Therefore, it is like taking two certifications in one.

Is it worth? of course, but not as a first certification – depending on the experience provided.

Certifications are a good way, not only to validate knowledge externally, but to collect updated information, validate good practices and consolidate knowledge with real practical cases (or almost).

Good luck to you all!

AWS Certified Developer Reloaded

I’m going to share my recent experience with the re-certification – June 2020AWS Developer, one of my favorites, without a doubt. An experience that has been very different from the previous one, since, if my memory serves me well, I didn’t find any repeated question.

The structure of the exam is the usual one for the associated level: 2 hours and 65 questions, with an evolved format, even more, towards scenario-type questions. I don’t recall any direct questions, and certainly not extremely easy ones. That said, it seems to me to be a much more balanced exam than the previous version, where some services had much more weight than others – API gateway, I’m looking at you.

Virtually all Core / Serverless services – important ones – are represented in the exam:

  • S3
  • In Memory Databases: Elastic Cache, Memcache, Redis
  • Databases: RDS, DynamoDB …
  • Security: KMS, Policies ..
  • CI / CD, IAC: ElasticBeanstalk, Codepipeline, Cloudformation …
  • Serverless: Lambda functions, API Gateway, Cognito …
  • Microservices: SQS, SNS, Kinesis, Containers, Step Functions …
  • Monitorización: Cloudwatch, Cloudwatch Logs, Cloudtrail, X-Ray …
  • Optimización: Cost control, Autoscaling, Spot Fleets …

Developer is the Serverless certification par excellence, although some services, such as Step Functions or Fargate Containers, are poorly represented – just one or two questions, and of high difficulty.

Serverless is a great option for IoT Sytems

Prerequisites and recommendations

I will not repeat the information that is already available on the AWS website; instead, I will give my recommendations and personal observations.

Professionals with experience in Serverless development – especially in AWS – Microservices, or experience with React-type applications, will be the most comfortable when preparing and facing this certification.

  • AWS Experience. Certification indicated for professionals with little or no experience on AWS. I´d recommend getting the AWS Certified Cloud Practitioner, though.
  • Dev Experience. It’s essential to possess a certain level, since many of the questions are eminently practical, and are the result of experience in the development field. Knowledge of programming languages like Python, Javascript or Java is something very desirable. The exam poses programming problems indirectly, through concepts, debugging and optimization. The lack of this knowledge or experience generates the impression in many professionals that this certification is of a very high level of difficulty, when in my opinion it is not.
  • Architecture experience. The exam is largely focused on the development of Cloud applications, especially Serverless – Microservices. However, some questions may require knowledge at the Cloud / Serverless / Containers architecture pattern level.
  • DevOps Experience. Concepts such as CI / CD, infrastructure or configuration as code are of great importance today, and this is reflected in the exam. Obviously, the questions focus – for the most part – on AWS products, but knowledge of other products like Docker, Jenkins, Spinaker, Git and general principles can go a long way. Let’s not forget that this certification, together with SysOps, are part of the recommended path to obtain the AWS DevOps Pro certification, and obtaining them automatically re-certifies the two previously mentioned.

Neo, knowing the path is not the same as walking it” – Morpheus. The Matrix, 1999


Imagen aws.amazon.com

AWS Technical Essentials: introductory course, low level. Live remote or in person.

Developing on AWS: course focused on developing AWS applications using the SDK. It is intermediate level, and the agenda seems quite relevant to the certification. Live remote or in person. Not free.

Advanced Developing on AWS: interesting course, but focused on AWS Architecture: migrations, re-architecture, microservices .. Live remote or face-to-face. Not free.

Exam Readiness Developer: essential. Free and digital.

AWS Certified Cloud Practitioner: Official certification, especially aimed at professionals with little knowledge of the Cloud in general, and AWS in particular.

Exam

As I have previously commented, the exam format is similar to most certifications, associated or not. That is, “scenario based”, and in this case of medium difficulty, medium-high. You are not going to find “direct” or excessively simple questions. As it is an associated level exam, each question focuses on a single topic, that is, if the question is about DynamoDB, the question will not contain cross cutting concerns, such as Security, for instance.

Let’s examine a question taken from the certification sample questionnaire:

Very representative question of the medium – high level of difficulty of the exam. We are talking about a development-oriented certification, so you will find questions about development, APIs, configuration, optimization and debugging. In this case, we are presented with a real example of configuring and designing indexes for a DynamoDB table.

DynamoDB is an integral part of the AWS Serverless offering and the flagship database – with permission from Aurora Serverless. Low latency NOSQL database ideal for IoT, events, time – series etc … Its purely Serverless nature allows its use without the need to provide and manage servers, or the need to place it within a VPC. This fact provides a great advantage when accessing it directly from Lambda functions, since it is not necessary that they would “live” within a VPC, with the added expense of resource management and possible performance problems – “enter Hyperplane”.

DynamoDB hardly appears in the new AWS Databases certification, so I´d recommend that you study it in depth for this certification, due the number of questions that may appear.

Services to study in detail

The following services are of great importance – not just to pass the certification – so I highly recommend in-depth study.

Imagen aws.amazon.com
  • AWS S3 – Core service. It appears consistently across all certifications. Use cases, security, encryption, API, development and debugging.
  • Seguridad – It appears consistently in all certifications: KMS encryption, Certificate Manager, AWS Cloud HMS, Federation, Active Directory, IAM, Policies, Roles etc….
  • AWS Lambda – Use cases, creation, configuration-sizing, deployment, optimization, debugging and monitoring (X-RAY).
  • AWS DynamoDB – Use cases, table creation, configuration, optimization, indexes, API, DAX, DynamoDB Streams.
  • AWS API Gateway – Use cases, configuration, API, deployment, security and integration with S3, Cognito and Lambda. Optimization and debugging.
  • AWS ElastiCache – Use cases, configuration-sizing, API, deployment, security, optimization and debugging. It weighs heavily on the exam – at least in my question set.
  • AWS Cognito – Use cases, configuration and integration with other Serverless and Federation services. Concepts like SAML, OAUTH, Active Directory etc … are important for the exam.
  • AWS Cloudformation – Use cases, configuration, creation of scripts, knowledge of the nomenclature / CLI commands.
  • AWS SQS – Use cases, architecture, configuration, API, security, optimization and debugging. Questions of different difficulty levels may appear.

Very important services to consider

  • AWS SNS – Knowledge of use cases at architecture level, configuration, endpoints, integration with other Serverless services.
  • AWS CLI – Average knowledge of different commands and nomenclature. In my set of questions not many appeared, but in any case, it is very positive to have some ease at the console level.
  • AWS Kinesis – Some more complex questions appear in this version of the exam than in the previous embodiment. Use cases, configuration, sizing, KPL, KCL, API, debugging and monitoring.
  • AWS CloudWatch, Events, Log – It appears consistently across all certifications. Knowledge of architecture, configuration, metrics, alarms, integration, use cases.
  • AWS X-RAYUse cases, configuration, instrumentation and installation in different environments.
  • AWS Pipeline, CodeBuild, Cloud Deploy, CodeCommit, CodeStar – High-level operation, architecture, integration and use cases. I´d recommend in depth study of CodePipeline and CodeBuild.
  • AWS ELB / Certificates – Use cases, ELB types, integration, debugging, monitoring, security – certificate installation.
  • AWS EC2, Autoscaling – Use cases, integration with ELB.
  • AWS Beanstalk – Architecture, use cases, configuration, debugging and deployment types – very important for the exam: All at Once, Rolling etc …
  • AWS RDS – One of the star services of AWS and the Databases Certification. Here it makes its appearance in a limited way: use cases, configuration, integration – caches – debugging and monitoring.

Other Services

  • AWS Networking – architecture and basic network knowledge: VPC, security groups, Regions, Zones, VPN … They appear in a general and limited way, compared to the rest of the certifications. It is one of the reasons why this certification is ideal for beginners. Network architecture on AWS can be a very complex and arid topic.
  • AWS Step FunctionsA service widely used in the business environment, but which appears circumstantially in certifications. I recommend studying architecture, use cases and nomenclature – the questions are not easy.
  • AWS SAM – Use cases, configuration and deployment. SAM CLI Commands.
  • AWS ECS / Fargate – Its appearance in the certifications is quite disappointing – and more so when compared to Google Cloud´s certifications, where Kubernetes – GKE – has a main role – logical, since it´s Google’s native technology. I´d recommend studying the architecture, use cases – microservices – configuration, integration and monitoring (X-RAY).
  • AWS Cloudfront – General operation and use cases. Integration with S3.
  • AWS Glue – General operation and use cases.
  • AWS EMR General operation and use cases.
  • AWS DataPipeline – General operation and use cases.
  • AWS Cloudtrail – General operation and use cases.
  • AWS GuardDuty – General operation and use cases.
  • AWS SecretsManager – General operation and use cases.

Essential Resources

  • AWS Certification Website.
  • Sample questions
  • Readiness course – recommended, with additional practice questions,
  • AWS Whitepapers – “Storage Services Overview“, “Hosting Static Websites on AWS“, “In Memory Processing in the Cloud with Amazon ElastiCache“, “Serverless Architectures with AWS Lambda“, “Microservices“.
  • FAQS – specially for Lambda, API Gateway, DynamoDB, Cognito, SQS and ElastiCache.
  • AWS Compute Blog
  • Practice Exam – highly recommended, level of difficulty representative of the exam.

Laboratories

I´d like to propose you an incremental practical exercise, cooked by me, that can be useful for preparing for the exam.

Serverless Web App

Imagen aws.amazon.com
  • Create a static website and host it on S3. Use AWS CLI and APIS to create a bucket and copy the contents.
  • Create a repository with CodeCommit and upload the files from the Web to it.
  • Integrate S3 and Cloudfront – creating a Web distribution.
  • Create a Serverless backend with API Gateway, Lambda and DynamoDB, or alternatively Aurora Serverless, using Cloudformation and the AWS SAM model.
  • Code the Lambdas functions with one of the supported runtimes – Python, Javascript, Java … – and use BOTO to insert and read in DynamoDB. Each Lambda will correspond to an API Gateway method, accessible from the Web.
  • Integrate X-Ray to trace Lambdas.
  • Create the Stack from the console.
  • Upload the generated YAML´s files to CodeCommit.
  • Optional: create a pipeline using CodePipeline and CodeCommit.
  • Optional: integrate Cognito with API Gateway to authenticate, manage, and restrict API usage.
  • Optional: replace DynamoDB with RDS and integrate Elasticache.
  • Optional: add a SQS queue, which will be fed from a Lambda. Create another Lambda that consumes the queue periodically.

Is it worth it?

Certifications are a good way, not only to validate knowledge externally, but to collect updated information, validate good practices and consolidate knowledge with real (or almost) practical cases.

Obtaining the AWS Certified Developer seems to me a “no brainer” in most cases, as I explained previously in another post, and in this one.

Good luck to everyone!