A pretty cool post from Google’s Blog that lets you quickly review 21 services in a series of short videos.
Some essential videos
You can find the full set of videos in the original post on Google’s Blog.
There is a “phenomenon” that I have experienced throughout my career that I like to call the “Reference Architecture Disappointment”.
Some people experience a similar effect when they go to the MD’s office with several symptoms, only to find out that they may have a common cold. No frenzy at the hospital, no frantic consultations, no House M.D. TV scenes. Just paracetamol, water and rest!
So many years of medical school just to prescribe that?
Well, yes. The MD recognised a common cold among dozens of illnesses that share the same set of symptoms and prescribed the simplest and best treatment. The question is, would you be able to do it?
The same thing happens when a Solutions Architect deals with a set of requirements. The “Architect” will select the architecture that solves the business problem as simply and efficiently as possible. Sometimes, that means using the “Reference Architecture” for that particular problem, with the necessary changes.
Those architectures emerge from practical experience and encompass patterns and best practices. Usually, reinventing the wheel is not a good idea.
Keep it simple and Rock On!
In the previous post of this series, I described a complete architecture to summarize and analyze tweets at scale.
Some of you have contacted me to ask about the ingestion process and the integration with Twitter, so I have decided to present a simplified ingestion architecture to show how it can be done more efficiently.
(1) The integration part of the architecture remains the same; we use the ECS Fargate service and a Python script, packaged as a Docker container, as the central piece of infrastructure to retrieve the raw Tweets. Additionally, we use Amazon ECR to store the Docker image privately.
Using Terraform, the ECR component is straightforward to provision:
resource "aws_ecr_repository" "aws-ecr" {
name = "${var.app_name}-${var.app_environment}-ecr"
}
Creating the ECS cluster is more involved, as it comprises a few moving parts:
Let’s have a look at the task definition, which is the most interesting part:
The task definition is required to run Docker containers in Amazon ECS, and as you can see, we have defined some parameters (see the sketch after the list):
– Docker image URL, stored in our private ECR
– CPU assigned to the task
– Memory assigned to the task
– Networking mode: we select the “awsvpc” mode – the task is allocated its own elastic network interface (ENI) and a primary private IPv4 address.
– portMappings: we are just using the HTTP port for this example
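The post provisions the task definition with Terraform (not reproduced here), but purely as an illustration, the same parameters map onto a boto3 register_task_definition call like the one below; the family, role ARN, image URL and CPU/memory values are placeholders:
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")  # region is an assumption

# Hypothetical boto3 equivalent of the Terraform task definition – all values are placeholders
ecs.register_task_definition(
    family="twitter-ingestion-task",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",                                  # the task gets its own ENI and private IPv4 address
    cpu="256",                                             # CPU assigned to the task
    memory="512",                                          # memory (MiB) assigned to the task
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "twitter-ingestion",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app-dev-ecr:latest",  # image stored in the private ECR
            "portMappings": [
                {"containerPort": 80, "protocol": "tcp"}   # just the HTTP port for this example
            ],
        }
    ],
)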
We need to deploy the task in the cluster:
data "aws_ecs_task_definition" "main" {
task_definition = aws_ecs_task_definition.twitter_task.family
}
The task is just the instantiation of a task definition within a cluster; we can define multiple containers in a task if we need to implement something like the sidecar pattern.
Now that the task is running in our cluster, the container is fetching live tweets from the service:
(2) I have simplified the ingestion part of the architecture, removing AWS Kinesis Data Analytics and AWS Kinesis Data Streams and ingesting the raw tweets directly with Amazon Kinesis Data Firehose, which allows near real-time delivery into S3. As an additional exercise, you can use a Lambda function to format the Tweets at ingestion time – Firehose can invoke it for you.
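As a minimal sketch of that optional exercise – the formatting logic here is a placeholder, not the code used in the series – a Firehose transformation Lambda receives base64-encoded records and must hand them back with a recordId, a result and the re-encoded data:
import base64
import json

def lambda_handler(event, context):
    """Hypothetical Firehose transformation: reshape each raw tweet before it lands in S3."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        # Placeholder formatting – keep only the fields we care about
        transformed = json.dumps({"text": raw.strip()}) + "\n"   # newline-delimited JSON is convenient for S3/Athena
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                                      # "Dropped" or "ProcessingFailed" are the other options
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}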
Once we have created the Firehose tweet_stream, providing the bucket ARN and a role so the service can access S3, the tweets will start being ingested in near real-time.
resource "aws_kinesis_firehose_delivery_stream" "firehose_tweet_stream" {
name = "tweet_stream"
destination = "s3"
s3_configuration {
role_arn = aws_iam_role.firehose_role.arn
bucket_arn = aws_s3_bucket.tweets_storage.arn
}
}
The Firehose service provides some valuable metrics to monitor the ingestion process:
(3) Finally, the tweets are ingested into the S3 bucket. Remember, “the frequency of data delivery to Amazon S3 is determined by the S3 buffer size and buffer interval value you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3” – from AWS Documentation.
The last part of the puzzle is integrating with Twitter through its API. For that, you need an account on Twitter’s Developer Platform and, depending on the functionality you want to use, a paid subscription.
To access the Twitter API, I recommend Tweepy, a Python library that makes the API easier to work with.
As a first step, instantiate a Firehose client; remember, don’t store the keys in the code – use a credentials store, like Secrets Manager.
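Following that advice, here is a minimal sketch of loading the credentials from Secrets Manager at startup rather than hard-coding them; the secret name and its JSON layout are assumptions for illustration:
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="eu-west-1")
# "twitter-ingestion/credentials" is a hypothetical secret holding both the AWS and Twitter keys
secret = json.loads(
    secrets.get_secret_value(SecretId="twitter-ingestion/credentials")["SecretString"]
)
access_token = secret["aws_access_key_id"]
access_token_secret = secret["aws_secret_access_key"]
bearer_token = secret["twitter_bearer_token"]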
import boto3

# Firehose client (Firehose is part of the Kinesis family); credentials come from Secrets Manager, never hard-coded
kinesis_client = boto3.client('firehose',
                              region_name='eu-west-1',                    # enter your region
                              aws_access_key_id=access_token,             # your AWS access key id
                              aws_secret_access_key=access_token_secret)  # your AWS secret access key
Then, using Tweepy, we search for the tweets that we are interested in, and we ingest them into the Firehose stream:
import logging
import tweepy

logger = logging.getLogger(__name__)

# Twitter API v2 client; the bearer token comes from your developer account (loaded from Secrets Manager above)
client = tweepy.Client(bearer_token=bearer_token)

logger.info("Searching and ingesting Tweets")
response = client.search_recent_tweets("AWS", max_results=10)
tweets = response.data

for tweet in tweets:
    response = kinesis_client.put_record(
        DeliveryStreamName="tweet_stream",
        Record={
            'Data': tweet.text.encode('utf-8')  # the Data blob must be bytes
        }
    )
Well, I hope this article helps you get started, have fun! 🙂
I’m sharing one of the production-ready architectures that I’ve been working on lately. I was keen to learn how the GPT models would fare at summarizing Tweets – not just individually, for instance in two or three words, but for groups of Tweets that share a tag and common themes – and also at analyzing and generating a response for those Tweets or summaries.
Let’s see an example.
We want to generate a summary for architects of the latest Tweets from the AWS Architecture account:
“By migrating to the cloud, we could develop our capabilities faster & focus our efforts on building solutions for our customers.” Jacek Presz, Director of IT Management Bureau, Bank Pocztowy SA Here’s the unique #AWSConfig solution 👉 go.aws/3JmtSnk #CloudCompliance”
“Building highly available, resilient apps is a *MUST* these days. Hear from Joseph Wagner, Jr., Domain Architect & Cloud Specialist about how using the #AWS #WellArchitected Tool helped Vanguard improve resilience of their apps 💪 #ResilienceOnAWS
“Aurora leads the way in autonomous trucking with the benefits of self-driving safely 🚘 Discover how #AWS contributes to this success story. It’s our latest look at the drivers of transformation in this season of #AllThingsAutomotive Watch how @IndyAChallenge is paving the way for the next generation of autonomous race cars with the help of #AWS & open innovation. ✨ Watch now: go.aws/3UCnoTH #AllThingsAutomotive”
We may get something like this:
“Here’s the summary of the tweets related to AWS Architecture:
Now we generate a response for the summary as a Tweet:
“Exciting to see how #AWS architecture is driving cutting-edge technologies like autonomous trucking and race cars while also improving app resilience and cloud compliance for businesses. It’s all about leveraging the latest tools like AWS Config and Well-Architected Framework!” #CloudInnovation #AllThingsAutomotive #ResilienceOnAWS
The architecture I’m sharing with you comprises well-known AWS components and patterns. For instance, the ingestion pipeline could be built in different ways; in this case, I’m using the well-known pattern Kinesis Data Streams –> Kinesis Data Analytics –> Kinesis Data Firehose to ingest and transform the Tweets in almost real-time. In the final version, I’ve used a different pipeline, but this one is fine and better known.
Let’s review the architecture flow and the components quickly; in the next instalments, I will go deeper if needed. Ideally, you should try to implement this yourself and learn from the experience. For instance, the first thing you could do is extract an MVP from the architecture and start building on top of it. Be advised this architecture could be $$$, so proceed with caution.
1 – We first need to connect to Twitter to collect raw tweets; for that, we can use a Python script that connects to the service to retrieve the data and ingests it into Kinesis. This part is pretty straightforward; you will need a Twitter developer account and credentials to access it. Deploying the script as a Docker container on ECS Fargate is a very convenient way to do it, as Fargate is a serverless service with all its benefits. The credentials are stored in AWS Secrets Manager.
2 – As explained previously, for ingesting and transforming the data in almost real-time, we use the pattern Kinesis Data Streams -> Kinesis Data Analytics -> Kinesis Data Firehose -> Amazon OpenSearch Service. This is a very convenient way to do it, and Data Analytics can help transform the raw Tweets, using SQL, into the format you need to create the OpenSearch index.
3 – Finally, to obtain the summaries, we have exposed a REST API with API Gateway, Cognito and Lambda. Again, you can implement this in many ways, but this approach is serverless and easy to implement. The Lambda function is the core of the API: it queries OpenSearch, fetches and groups the tweets we want to summarise, and finally invokes the OpenAI API; for this, you need credentials that you can obtain on their website.
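To make step 3 more concrete, here is a heavily simplified sketch of what that Lambda could look like. It assumes the opensearch-py client with basic auth, an index called tweets, and the pre-1.0 openai Python SDK with a gpt-3.5-turbo model; the index name, field names and prompt are placeholders rather than the exact production code.
import os
import json
import openai
from opensearchpy import OpenSearch

# Placeholder configuration – in the real architecture these come from the environment/Secrets Manager
opensearch = OpenSearch(
    hosts=[{"host": os.environ["OPENSEARCH_HOST"], "port": 443}],
    http_auth=(os.environ["OS_USER"], os.environ["OS_PASS"]),
    use_ssl=True,
)
openai.api_key = os.environ["OPENAI_API_KEY"]

def handler(event, context):
    tag = event["queryStringParameters"]["tag"]  # e.g. "AWS Architecture"

    # Fetch the latest tweets for the tag from the (assumed) "tweets" index
    result = opensearch.search(
        index="tweets",
        body={"size": 20, "query": {"match": {"text": tag}}},
    )
    tweets = [hit["_source"]["text"] for hit in result["hits"]["hits"]]

    # Ask the model for a summary of the grouped tweets
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Summarize these tweets for an architect:\n" + "\n".join(tweets)}],
    )
    summary = completion["choices"][0]["message"]["content"]

    return {"statusCode": 200, "body": json.dumps({"summary": summary})}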
I recertified in the AWS Data Analytics certification in April 2023 – for the second time, if you count the now-retired AWS Big Data certification. I got an email from AWS asking for feedback, and another survey about the competencies for the AWS Data Engineer role. Fair warning: a new version of the exam may be coming. No big secret here; they review the exams to keep them current and challenging.
Personal Experience
You can read a bit about my past experience with this certification in the post above. I got it days after it was released, and it was a challenging and fun experience. Compared with the present experience, the exam pattern remains very similar, but with more questions about services like OpenSearch, Lake Formation and MSK, to name a few. Remember that this is just an orientation based on my set of questions, so you could get a different experience. In any case, base your preparation on your personal experience and needs; as I always say, take the opportunity to create your own course and journey.
As for the value of the certification, it works for me, helping me keep up to date and proving my skills in the field up to a point. The certification is not a substitute for experience, so remember that you need some to back it up, plus experience in other complementary domains.
I will focus this post on the new features or services with more presence in the test; as for the rest, you can find more information in the older post or around the net.
Let’s have a look at a question from the sample set:
This is a representative question that includes different services and is relatively verbose. It references many essential services, including Kafka.
In many questions, you must choose the best option, filtered by some functional requirement – in this case, the overall latency. The difference between two answers could be literally one word, one service. That can be a difficult choice, although the context given can help you decide.
So, let’s review some new features that I find interesting and can be relevant for the test too:
LakeFormation & Glue
LakeFormation and Glue are two pivotal services you need to know well, especially Glue. They can now work together, so you can use AWS Lake Formation permission defined on the data lake for crawling the data, avoiding creating roles and S3 permissions for them. Additionally, this feature allows cross-account access and can be used with Athena, creating a model for centralized permission government.
More on the subject, this time with Redshift:
AWS MSK & Kafka
There is one service you need to prepare for if you don’t have previous experience with it. The growing number of questions surprised me in a way, but it makes sense. You will find many projects using Kafka, and MSK is an excellent alternative to replace it with a managed solution. The questions weren’t that easy, so prepare well. Also, I’d recommend that you get a roundup of Kafka:
The book’s second edition is relatively recent, and it’s an excellent introduction. Topics and partitions are well explained 🙂
https://aws.amazon.com/blogs/big-data/how-to-choose-the-right-amazon-msk-cluster-type-for-you/
A classic, what’s the better choice for your workload, provisioned or serverless?
https://aws.amazon.com/blogs/architecture/amazon-msk-backup-for-archival-replay-or-analytics
MSK interacts very well with other AWS services:
The ability to ingest hundreds of megabytes of data per second into Amazon Redshift materialized views and query them in seconds – without provisioning additional infrastructure like Firehose – makes it much easier to implement use cases such as live leaderboards, clickstream analysis and application monitoring.
AppFlow
https://aws.amazon.com/blogs/aws/announcing-additional-data-connectors-for-amazon-appflow/
New AppFlow integrations, like Marketing connectors (Facebook Ads…), Customer Service (MailChimp, Sendgrid … ), and Business operations (Stripe …).
But primarily, I’d like to highlight the integration with the AWS Glue Data Catalog, which allows the SaaS data to be registered in it. With this integration, there is no need to create crawlers to populate the data; it can be done with a few clicks.
AWS OpenSearch
Ah, AWS OpenSearch, an old friend of mine :). I have provisioned a few domains so far, and I’m happy to see that it appears more often in the test.
The new serverless flavour seems a good step forward for the service, following in the footsteps of Redshift, for instance.
Resources
Apart from the usual suspects – documentation, blog, FAQs, practice test and readiness course – I can recommend the official certification guide, which is very relevant and packed with additional resources to prepare:
This is a good book with a somewhat misleading title. Yes, it’s about hybrid cloud adoption, but with AWS Outposts, to be more precise.
AWS Outposts is a family of fully managed solutions that deliver AWS infrastructure and services to virtually any on-premises or edge location for a consistent hybrid experience.
And it’s good; it could appear contradictory that a cloud vendor offers to run fully managed workloads on a “restricted cloud” on-premises, but it makes sense for many use cases that need low latency, local data processing or data residency.
For instance, in the Healthcare Industry: Linear accelerators, CT Scanners and surgical devices.
Some of the benefits that Outposts provides:
Some use cases examples:
Data residency
A practical use case for regulated enterprises that need more than one region for disaster recovery, where data residency is required for compliance.
As you can see, the London Region is connected to the Outposts rack installed on-premises, using a VPN or Direct Connect connection. All the workloads and data reside in the UK, and only the control-plane traffic is sent to the Ireland Region in case a disaster impacts the London Region.
Low Latency & Local Processing
https://aws.amazon.com/blogs/aws/deploy-your-amazon-eks-clusters-locally-on-aws-outposts/
The ability to deploy an EKS cluster locally is beneficial when you need low latency, so keeping the cluster near the rest of the infrastructure helps. But that’s not the only benefit; being local means that the cluster can keep working even if there are connectivity problems, which are not uncommon when workloads are deployed at rough edge locations.
Other resources
A bit about my personal experience
The AWS Certified Database Beta certification was the last one I took at a test centre, a few weeks before Covid hit Spain very hard and the whole country went into lockdown.
It was an intense experience; the certification was at the beta stage at the time – Jan 2020 – which meant 85 questions in 4 hours, with no breaks :(, so my mind went into lockdown too after three and a half hours; I passed nonetheless 🙂
As the test was in the beta stage, there was no information available – no courses, test sets or anything – which is great because it forces you to prepare your own journey, depending on your experience with the subject matter. I created a “course” focused on architecture because that’s my interest and my bread and butter, but I was off target.
Well, the exam hasn’t changed; it’s focused on operations – backups, snapshots, migrations – optimization and troubleshooting, not that much on architecture or infrastructure, although there are a few questions. Also, RDS/Aurora and DynamoDB are the real stars of the test. Other products like Redshift, ElastiCache, DocumentDB, Athena, Neptune, Timestream, QLDB and OpenSearch are present in decreasing order of importance – at least in my set of questions. Remember that AWS holds an extensive database of questions and updates them very often, so always take this with caution.
You should always maximize your investment in these certifications; they are expensive in terms of time and money, so tailor them to fit your needs, not only to get “the badge”. I took a deep dive into the databases I didn’t have commercial experience with, and it was worthwhile; it just wasn’t reflected in the test. But that knowledge stays with me from now on.
Preparing for the Hero’s Journey
I took my notes from three years ago and created a new version of my “course”. As a base, I’ve used the guide “AWS Certified Database – Specialty Certification Guide” from Packt Publishing. It’s an excellent book to use as an intro to the subject matter and covers the test well, but you must go deeper in every section to pass. BTW, a new official guide is coming in July from Sybex.
Remember, this is a speciality certification, which means that, at the very least, you should hold one certification at the associate level or have the equivalent knowledge. And it pays off; several questions involve networking, security and IaC with CloudFormation.
I don’t really use video courses; I get bored watching hours of videos that often just rehash the official documentation – it is a very passive activity for me; I prefer to be in control of my learning and move between different activities. I watched a few videos this time, and I couldn’t take any more after a few minutes, so I returned to the real thing and the reading material. If you like this kind of learning, pick one from your favourite provider, select the ones from active professionals who can add some real experience to the material, tailor it to your needs and build on it.
Resources & Key Sites
AWS Certification Site: the primary site, which contains some good resources for preparation, including sample questions, the official practice test and the interactive Readiness Course, which also includes additional questions:
A relevant question for the test: “choose the best database” for a specific workload.
– AWS Database blog: excellent site to read about actual use cases, technical articles and new features.
Some nice posts:
https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/
– AWS Training and Certification Blog
A post from 2021 with valuable tips for preparing for the certification:
– AWS Documentation: an obvious one, but still the best source of information
Conclusion
This is a challenging certification that you should prepare for accordingly. Take your time and prepare for the journey, not just the destination.
Good Luck!
Here we are, July 2022, in the middle of a scorching summer, so I’d guess it is a perfect moment to relax near the sea and read a book or two. I’ve compiled a list of favourite books that I can personally recommend. All of them are paperbacks that I own and have read back to back; also, I’ve gone through most of the examples myself, so there is no cheating here 😉
The order in which I’m presenting the books doesn’t reflect their quality; in fact, all of them are great books.
Let me recommend you this book from Packt.
First, this is not a straightforward preparation guide for the Professional Cloud DevOps Engineer certification; it is much more than that: it’s almost a textbook on the subject matter that can be used as a guide for the certification. The edition is one of the best I’ve seen from this publisher; as I mentioned before, this book feels like a college textbook but is written in a concise style.
But don’t get me wrong; the book goes deep on many topics – for instance, Kubernetes and GKE, which is hands down one of my favourite parts. It’s an excellent introduction or refresher that helps you grasp many concepts in not many pages.
Another section that stands out is the book’s first part, which introduces the DevOps/SRE concepts in an informative way – and not tied up to GCP. The monitoring section is pretty informative, as well.
One of the best books I’ve read about #gcp – but not limited to – and #devops.
I can’t believe that 13 years have passed since I read the blue book of DDD, Eric Evans’ seminal work, and 11 since I worked on my first project using it. We were breaking a monolith into what we now call microservices – or an early incarnation of them.
I was looking for updated literature on the subject when I found this excellent book from Vlad Khononov. It provides the right amount of theory and practical examples to understand the different concepts and patterns. The style is clear and concise yet academic, which is not easy to achieve.
Some stuff you are going to find in the book:
– Architectural patterns and Heuristics
– Microservices & EDA
– Data Mesh
To summarize, this is one of the best books on the subject matter and can be used as a primer. I was expecting more examples, but that helped keep the book to a reasonable size, around 300 pages.
Note: The code listings are in C#, but they are very generic and easy to follow.
This book from Packt is an excellent introduction to the beautiful world of #analytics on #AWS. The author has done a fantastic job cramming into 440 pages many of the topics you may come across when working in this field in the #cloud.
You pay a price, though; some topics are covered at an introductory level. For instance, ingesting data with #kinesis is a topic that needs a whole book, not a few pages. But you get an intro. Other chapters, like “Transforming Data to Optimize for Analytics”, are more comprehensive.
The book covers recent services like AppFlow, Glue Studio, DataWrangler, and other third-party services.
An excellent #book that I’d recommend to data engineers who want to introduce themselves to analytics on AWS.
I was looking for a book to replace my old microservices book, “Spring Microservices in Action” (Manning, 2017) – specifically, one that covered new tools and deployment in the cloud; I wasn’t that interested in the Spring part.
Well, this book does its job very well. I was so hooked that I read the book several times and reviewed most examples.
If I had to pick a few highlights, they would be the Kubernetes deployment chapters, the service mesh coverage and the replacement of the Netflix components.
A word of advice: this book is a Packt Publishing release, so it follows their editorial style – very hands-on, with many code listings. That suits me well, but make sure it’s what you want.
A few months back, I spent a weekend in Barcelona, one of my favourite cities. It was great to see the city back to life, filled with people and tourists 🙂 We even went to the beach on the first day!
OK, back to the book!
I liked it, the author’s style is very engaging, and I enjoyed having a fresh look at the different Cloud-Native Patterns from the GO perspective.
The only concern is that the subject is too broad to be covered in just 400 pages, so you can use this book as a good introduction, but you’ll need to complement this with other literature.
Funnily enough, I’m not using this guide from Konrad Cłapa, Google Cloud Certified Fellow, to achieve the Architect certification, as I recertified in the beta; 77 intense questions in three hours, including questions from the new scenarios.
However, I have a bunch of GCP recertifications coming my way very soon, so I’m using this book as one of the stepping stones to resync and refresh my whole GCP knowledge.
This guide is very comprehensive and to the point but still clocks in at 600+ pages; don’t worry, it includes lots of pictures 🙂
Also, you can find four mock tests (20 questions each) and an analysis of each case study. The only thing I miss is review questions in every chapter.
Remember that this is a study guide, so expand each section accordingly and get practical knowledge as required. In real life, you don’t have a list of questions to choose from, and not everything is on the web; that’s where a mix of knowledge, experience and intuition kicks in.
I’ve had this book by Julien Simon on my reading list for a while – it’s a 2020 release – and when I finally made the time to read it, I couldn’t put it down. Like the previous book, it’s a Packt Publishing release packed with examples so that you can start prototyping immediately.
It’s a hands-on book focused on Sagemaker and AWS services, so it’s not a book to learn Machine Learning in-depth. The good news is that you can learn to train and deploy pre-built models using Sagemaker without that specific knowledge.
The book gives you an excellent overview of Sagemaker, but I’d guess it was written before 2020, so it doesn’t cover many new features like Sagemaker Studio, Autopilot, etc…
But not to worry :), there is already a second edition that covers all that new shiny stuff.
Happy Summer!
Moto is a library that allows you to mock AWS services when using the Python Boto3 library. It can also be used with other languages, as stated in the documentation, via its standalone server mode. I don’t have direct experience with that feature, though, as Python is my language of choice for coding PoCs and data projects in general.
I don’t need to preach about the benefits of using TDD to design and test your components. In any case, if you are still not convinced, check out books like:
Developing for the cloud comes with its own set of challenges. For instance, testing: it is not easy when you depend on services that are not available locally.
The appearance of libraries like Moto has made testing much more manageable. Like any library, it has its peculiarities, but the learning curve is not especially steep, particularly if you have previous experience with Pytest or other testing frameworks.
Prerequisites
I assume you have previous knowledge of AWS, Boto3, Python, TDD and Pytest. I’m providing some complete listings so you can learn by example – not the entire exercise, though, so you can fill in the gaps and enhance your learning experience.
Installing Moto is straightforward –
$ pip install moto[all]        # if you want to install all the mocks
$ pip install moto[dynamodb]   # or just the DynamoDB mock
To mock AWS services and test them properly, you will need a few more dependencies –
$ pip install boto3 pytest
Let’s go beyond the canonical examples and find out how to mock DynamoDB, the excellent AWS NoSQL database.
As shown in the following code listing, we’ll create the books table with the PK id. The function mocked_table returns a table with some data. Later on, this table will be mocked by Moto.
test_helper.py
import boto3

def mocked_table(data):
    dynamodb = boto3.client("dynamodb")
    dynamodb.create_table(
        TableName='books',
        KeySchema=[
            {
                'AttributeName': 'id',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'id',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        }
    )
    table = boto3.resource('dynamodb').Table("books")
    with table.batch_writer() as batch:
        for item in data:
            batch.put_item(Item=item)
    return table
Moto recommends using fake test credentials to avoid anything leaking to real environments. You can provide them in the configuration file conftest.py, using Pytest fixtures –
conftest.py
import os
import boto3
import pytest
from moto import mock_dynamodb2

os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'

@pytest.fixture(scope='function')
def aws_credentials():
    """Mocked AWS Credentials for moto."""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'
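A small note on the fixture above, in case you want the mocked credentials applied: Pytest only runs a function-scoped fixture when a test requests it by name, so you would add aws_credentials to the test signature – a minimal sketch:
@mock_dynamodb2
def test_get_book(aws_credentials):   # requesting the fixture sets the fake credentials
    ...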
Now, we are ready to start designing and developing.
We will code the function get_book, which retrieves a particular Item from the table. Following the TDD cycle, we have to code the test first, make it fail, etc. I’m not showing the entire cycle here, just a few steps to get the idea.
The test unit would look like this –
get_books_test.py
import pytest
import json
import boto3
from moto import mock_dynamodb2

book_OK = {
    "pathParameters": {
        "id": "B9B3022F98Fjvjs83AB8a80C185D",
    }
}

book_ERROR = {
    "pathParameters": {
        "id": "B9B3022F98Fjvjs83AB8a80C18",
    }
}

@mock_dynamodb2
def test_get_book():
    from get_book import get_book          # don't change the order:
    from test_helper import mocked_table   # these must be imported first,
    dynamodb = boto3.client("dynamodb")    # before getting the client
    data = [{'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}]
    mocked_table(data)
    result = get_book(book_OK)
    item = result['Item']
    assert item['id'] == 'B9B3022F98Fjvjs83AB8a80C185D'
    assert item['user'] == 'User1'
As you can see, it’s nothing very different from a regular Pytest test unit, except for the Moto decorators and the AWS-specific code.
Let’s increase the test coverage, ensuring that the functionality “Item not found” is working as expected.
@mock_dynamodb2
def test_get_book_not_found():
    from get_book import get_book          # don't change the order:
    from test_helper import mocked_table   # these must be imported first,
    dynamodb = boto3.client("dynamodb")    # before getting the client
    data = [{'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}]
    mocked_table(data)
    result = get_book(book_ERROR)
    assert 'Item' not in result  # book not found
OK, now that we have designed the test unit, we need to write the function get_book. We can start with something basic that satisfies the import and the method signature. By the way, I’ve shown the test unit fully coded, but you can do this gradually and code only the essential parts initially.
get_book.py
import json
import boto3

dynamodb = boto3.resource('dynamodb')
tableName = 'books'

def get_book(event):
    return {'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}
To execute the tests –
$ pytest
The test will fail: the book that we are returning is not in the expected format. So let’s add the DynamoDB calls, which will be mocked by Moto.
get_book.py
import json
import boto3

dynamodb = boto3.resource('dynamodb')
tableName = 'books'

def get_book(event):
    # the book id comes in the event's path parameters (as in the test events above)
    book_id = event['pathParameters']['id']
    table = dynamodb.Table(tableName)
    result = table.get_item(
        Key={
            'id': book_id,
        }
    )
    return result
Now the test unit will pass 🙂
Some important things to point out:
Testing is not easy and can be tedious; for some people, even a nuisance. Using TDD – or BDD – changes your mindset entirely because you are designing your system, not only testing. But this is something that shouldn’t be news to you. TDD and BDD have been around for a while.
Not for the cloud, though.
Testing in the cloud is not easy; it’s all about integration and cost. Libraries like Moto help to alleviate that, and I have to say Moto does it pretty well.
Last February, AWS added support for creating canaries for API Gateway in CloudWatch Synthetics, which I’ve been using lately to monitor some REST APIs successfully.
Let’s review some technical concepts first:
After some hard work, we now have our brand new serverless app deployed and ready to be tested, comprising a REST API and a micro-frontend (CloudFront + S3).
We need to monitor our endpoints for latency and resiliency; how do we do that right away? I have already given away the answer: by creating a synthetic canary.
You can find the Canaries dashboard in the AWS console:
Cloudwatch > Insights > Application monitoring > Synthetic canaries
Creating a canary for API Gateway from the console is really straightforward; you are given two options: select an API Gateway API and stage, or provide a Swagger template.
We are then presented with a series of options:
The endpoint URL should be populated automatically:
The next step, adding an HTTP request, is also straightforward. The configuration – parameters, headers – depends on the method being tested:
Now you can finalize the creation of the canary:
In the following screen, we can see the canary that has been created and has been running for a while:
Finally, a few metrics are shown: Duration, Errors (4xx), Faults (5xx).
As promised, canaries are really easy to implement.
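The console flow above is the quickest path, but the same canary can also be created programmatically through the Synthetics API. Here is a hedged boto3 sketch, where the bucket, key, role ARN, handler and runtime version are placeholders you would adapt to your account:
import boto3

synthetics = boto3.client("synthetics", region_name="eu-west-1")

# Hypothetical canary exercising the REST API – names, ARNs and the runtime version are placeholders
synthetics.create_canary(
    Name="rest-api-canary",
    Code={
        "S3Bucket": "my-canary-code-bucket",          # zip file with the canary script
        "S3Key": "canaries/api-canary.zip",
        "Handler": "apiCanary.handler",               # entry point inside the zip
    },
    ArtifactS3Location="s3://my-canary-artifacts/rest-api-canary/",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/canary-execution-role",
    Schedule={"Expression": "rate(5 minutes)"},       # run every five minutes
    RuntimeVersion="syn-nodejs-puppeteer-3.9",        # check the currently supported runtime versions
)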
DynamoDB is AWS’s fast NoSQL database at any scale, supporting key-value and document data models. I won’t delve into the basics because I’m sure I don’t need to explain them to you – you’ve arrived here, after all :). Anyway, if you need a quick introduction, please check out the following links:
If you have ever implemented a pagination component, you already know that it is not easy, especially in a clean and performant way.
Furthermore, DynamoDB adds its own set of challenges because of the way it works. When you execute a Query, the resultset is divided into groups, or pages, of data up to 1 MB in size. So you need to find out whether there are remaining results to return after that first query. Also, you’ll likely need to return a fixed number of results, which adds a few more edge cases to the mix.
If you use Scan instead of Query, things get worse, because it reads the whole table, exhausting the assigned RCUs very quickly. I produced a first quick version using Scan; it works, but it’s not optimal for pagination, especially when you have many records – and it’s expensive too.
Not all is gloom and doom, though. The Query response contains an element, LastEvaluatedKey, that points to the last processed record. We can use this element to build a cursor that we pass back and forth – in the response and the request – to drive our pagination component. When there are no elements left, this element is absent, and therefore we have reached the end of the resultset.
LastEvaluatedKey is a Map-type object that contains the PK of the table. We shouldn’t pass it around like that, as we would expose our model to the world. A standard and better way is to pass the element Base64-encoded. You can use the Python module base64:
import base64

cursor_ascii = cursor.encode("ascii")
base64_bytes = base64.b64encode(cursor_ascii)
# we convert the bytes back into a string, or whatever we need
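The snippets below reference decode_base64 and encode_base64 helpers without showing them; here is a minimal sketch of what they might look like, assuming the cursor is the LastEvaluatedKey dict serialized as JSON before Base64 encoding:
import base64
import json

def encode_base64(last_evaluated_key):
    """Turn the LastEvaluatedKey dict into an opaque cursor string for the response."""
    if last_evaluated_key is None:
        return None
    payload = json.dumps(last_evaluated_key).encode("ascii")
    return base64.b64encode(payload).decode("ascii")

def decode_base64(cursor):
    """Turn the cursor from the request back into an ExclusiveStartKey dict."""
    if not cursor:
        return None
    return json.loads(base64.b64decode(cursor.encode("ascii")))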
The first thing we have to do is retrieve the cursor from the request, if it exists, and execute a first query, assigning the cursor – the decoded LastEvaluatedKey from the previous page – to the ExclusiveStartKey field. In this example, I’m retrieving a user’s data set using the id as the key condition.
from boto3.dynamodb.conditions import Key

# get the cursor from the request, if it exists
exclusiveStartKey = decode_base64(cursor)
if exclusiveStartKey is not None:
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId),
        ExclusiveStartKey=exclusiveStartKey
    )
else:  # no cursor: this is the first page
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId)
    )
Now, we keep fetching the remaining records – remember the 1 MB limit – while the LastEvaluatedKey element is present in the result object, or until we reach our imposed page limit. Finally, we keep track of the last LastEvaluatedKey so we can encode it and pass it back in the response.
lastEvaluatedKey = None
while 'LastEvaluatedKey' in response:
    key = response['LastEvaluatedKey']
    lastEvaluatedKey = key
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId),
        ExclusiveStartKey=key
    )
    ............
# serialize and Base64-encode the last key for the response (see the helper sketch above)
cursor = encode_base64(lastEvaluatedKey)
I hope this helps to build your pagination component 🙂
What a surprise!
I was writing a piece about a component for DynamoDB – one that I’ve produced for a PoC – when the console completely changed!
The new console looks like the rest of the updated services; the experience becomes more cohesive across services.
It includes three new sections: Items, PartiQL editor and Export to S3.
So far, so good!