A pretty cool post from Google’s Blog that lets you quickly review 21 services in a series of short videos.
Some essential videos
You can find the full set of videos in the original post on Google’s Blog.
There is a “phenomenon” that I have experienced throughout my career that I like to call the “Reference Architecture Disappointment”.
Some people experience a similar effect when they go to the MD’s office with several symptoms, only to find out that they may have a common cold. No frenzy at the hospital, no frantic consultations, no House M.D. TV scenes. Just paracetamol, water and rest!
So many years of medical school just to prescribe that?
Well, yes. The MD recognised a common cold among dozens of illnesses that share the same set of symptoms and prescribed the simplest and best treatment. The question is, would you be able to do it?
The same thing happens when a Solutions Architect deals with a set of requirements. The “Architect” will select the architecture that solves the business problem as simply and efficiently as possible. Sometimes, that means using the “Reference Architecture” for that particular problem, with the necessary changes.
Those architectures emerge from practical experience and encompass patterns and best practices. Usually, reinventing the wheel is not a good idea.
Keep it simple and Rock On!
In the previous post of this series, I described a complete architecture to summarize and analyze tweets at scale.
Some of you have contacted me to ask about the ingestion process and the integration with Twitter, so I have decided to present a simplified ingestion architecture to show how it can be done more efficiently.
(1) The integration part of the architecture remains the same; we use the ECS Fargate service and a Python script, packaged as a Docker container, as the central piece of infrastructure to retrieve the raw Tweets. Additionally, we use Amazon ECR to store the Docker image privately.
Using Terraform, the ECR component is straightforward to provision:
resource "aws_ecr_repository" "aws-ecr" {
name = "${var.app_name}-${var.app_environment}-ecr"
}
Creating the ECS cluster is more involved, as it comprises a few moving parts:
Let’s have a look at the task definition, which is the most interesting part:
The task definition is required to run Docker containers in Amazon ECS, and as you can see, we have defined some parameters (see the sketch after the list):
– Docker image URL, stored in our private ECR
– CPU assigned to the task
– Memory assigned to the task
– Networking mode: we select the “awsvpc” mode – the task is allocated its own elastic network interface (ENI) and a primary private IPv4 address.
– portMappings: we are just using the HTTP port for this example
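The post provisions the task definition with Terraform (not reproduced here), but purely as an illustration, the same parameters map onto a boto3 register_task_definition call like the one below; the family, role ARN, image URL and CPU/memory values are placeholders:
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")  # region is an assumption

# Hypothetical boto3 equivalent of the Terraform task definition – all values are placeholders
ecs.register_task_definition(
    family="twitter-ingestion-task",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",                                  # the task gets its own ENI and private IPv4 address
    cpu="256",                                             # CPU assigned to the task
    memory="512",                                          # memory (MiB) assigned to the task
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "twitter-ingestion",
            "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app-dev-ecr:latest",  # image stored in the private ECR
            "portMappings": [
                {"containerPort": 80, "protocol": "tcp"}   # just the HTTP port for this example
            ],
        }
    ],
)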
We need to deploy the task in the cluster:
data "aws_ecs_task_definition" "main" {
task_definition = aws_ecs_task_definition.twitter_task.family
}
The task is just the instantiation of a task definition within a cluster; we can define multiple containers in a task if we need to implement something like the sidecar pattern.
Now that the task is running in our cluster, the container is fetching live tweets from the service:
(2) I have simplified the ingestion part of the architecture, removing AWS Kinesis Data Analytics and AWS Kinesis Data Streams and ingesting the raw tweets directly with Amazon Kinesis Data Firehose, which allows near real-time delivery into S3. As an additional exercise, you can use a Lambda function to format the Tweets at ingestion time – Firehose can invoke it for you.
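As a minimal sketch of that optional exercise – the formatting logic here is a placeholder, not the code used in the series – a Firehose transformation Lambda receives base64-encoded records and must hand them back with a recordId, a result and the re-encoded data:
import base64
import json

def lambda_handler(event, context):
    """Hypothetical Firehose transformation: reshape each raw tweet before it lands in S3."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        # Placeholder formatting – keep only the fields we care about
        transformed = json.dumps({"text": raw.strip()}) + "\n"   # newline-delimited JSON is convenient for S3/Athena
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                                      # "Dropped" or "ProcessingFailed" are the other options
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}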
Once we have created the Firehose tweet_stream, providing the bucket ARN and a role so the service can access S3, the tweets will start being ingested in near real-time.
resource "aws_kinesis_firehose_delivery_stream" "firehose_tweet_stream" {
name = "tweet_stream"
destination = "s3"
s3_configuration {
role_arn = aws_iam_role.firehose_role.arn
bucket_arn = aws_s3_bucket.tweets_storage.arn
}
}
The Firehose service provides some valuable metrics to monitor the ingestion process:
(3) Finally, the tweets are ingested into the S3 bucket. Remember, “the frequency of data delivery to Amazon S3 is determined by the S3 buffer size and buffer interval value you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before delivering it to Amazon S3. You can configure the values for S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition satisfied first triggers data delivery to Amazon S3” – from AWS Documentation.
The last part of the puzzle is integrating with Twitter through its API. For that, you need an account on Twitter’s Developer Platform and, depending on the functionality you want to use, a paid subscription.
To access the Twitter API, I recommend Tweepy, a Python library that makes the API easier to work with.
As a first step, instantiate a Firehose client; remember, don’t store the keys in the code – use a credentials store, like Secrets Manager.
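Following that advice, here is a minimal sketch of loading the credentials from Secrets Manager at startup rather than hard-coding them; the secret name and its JSON layout are assumptions for illustration:
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="eu-west-1")
# "twitter-ingestion/credentials" is a hypothetical secret holding both the AWS and Twitter keys
secret = json.loads(
    secrets.get_secret_value(SecretId="twitter-ingestion/credentials")["SecretString"]
)
access_token = secret["aws_access_key_id"]
access_token_secret = secret["aws_secret_access_key"]
bearer_token = secret["twitter_bearer_token"]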
import boto3

# Firehose client (Firehose is part of the Kinesis family); credentials come from Secrets Manager, never hard-coded
kinesis_client = boto3.client('firehose',
                              region_name='eu-west-1',                    # enter your region
                              aws_access_key_id=access_token,             # your AWS access key id
                              aws_secret_access_key=access_token_secret)  # your AWS secret access key
Then, using Tweepy, we search for the tweets that we are interested in, and we ingest them into the Firehose stream:
import logging
import tweepy

logger = logging.getLogger(__name__)

# Twitter API v2 client; the bearer token comes from your developer account (loaded from Secrets Manager above)
client = tweepy.Client(bearer_token=bearer_token)

logger.info("Searching and ingesting Tweets")
response = client.search_recent_tweets("AWS", max_results=10)
tweets = response.data

for tweet in tweets:
    response = kinesis_client.put_record(
        DeliveryStreamName="tweet_stream",
        Record={
            'Data': tweet.text.encode('utf-8')  # the Data blob must be bytes
        }
    )
Well, I hope this article helps you get started, have fun! 🙂
I’m sharing one of the production-ready architectures that I’ve been working on lately. I was keen to learn how the GPT models would fare at summarizing Tweets – not just individually, for instance in two or three words, but for groups of Tweets that share a tag and common themes – and also at analyzing and generating a response for those Tweets or summaries.
Let’s see an example.
We want to generate a summary for architects of the latest Tweets from the AWS Architecture account:
“By migrating to the cloud, we could develop our capabilities faster & focus our efforts on building solutions for our customers.” Jacek Presz, Director of IT Management Bureau, Bank Pocztowy SA Here’s the unique #AWSConfig solution 👉 go.aws/3JmtSnk #CloudCompliance”
“Building highly available, resilient apps is a *MUST* these days. Hear from Joseph Wagner, Jr., Domain Architect & Cloud Specialist about how using the #AWS #WellArchitected Tool helped Vanguard improve resilience of their apps 💪 #ResilienceOnAWS
“Aurora leads the way in autonomous trucking with the benefits of self-driving safely 🚘 Discover how #AWS contributes to this success story. It’s our latest look at the drivers of transformation in this season of #AllThingsAutomotive Watch how @IndyAChallenge is paving the way for the next generation of autonomous race cars with the help of #AWS & open innovation. ✨ Watch now: go.aws/3UCnoTH #AllThingsAutomotive”
We may get something like this:
“Here’s the summary of the tweets related to AWS Architecture:
Now we generate a response for the summary as a Tweet:
“Exciting to see how #AWS architecture is driving cutting-edge technologies like autonomous trucking and race cars while also improving app resilience and cloud compliance for businesses. It’s all about leveraging the latest tools like AWS Config and Well-Architected Framework!” #CloudInnovation #AllThingsAutomotive #ResilienceOnAWS
The architecture I’m sharing with you comprises well-known AWS components and patterns. For instance, the ingestion pipeline could be built in different ways; in this case, I’m using the well-known pattern Kinesis Data Streams –> Kinesis Data Analytics –> Kinesis Data Firehose to ingest and transform the Tweets in almost real-time. In the final version, I’ve used a different pipeline, but this one is fine and better known.
Let’s review the architecture flow and the components quickly; in the next instalments, I will go deeper if needed. Ideally, you should try to implement this yourself and learn from the experience. For instance, the first thing you could do is extract an MVP from the architecture and start building on top of it. Be advised this architecture could be $$$, so proceed with caution.
1 – We first need to connect to Twitter to collect raw tweets; for that, we can use a Python script that connects to the service to retrieve the data and ingests it into Kinesis. This part is pretty straightforward; you will need a Twitter developer account and credentials to access it. Deploying the script as a Docker container on ECS Fargate is a very convenient way to do it, as Fargate is a serverless service with all its benefits. The credentials are stored in AWS Secrets Manager.
2 – As explained previously, for ingesting and transforming the data in almost real-time, we use the pattern Kinesis Data Streams -> Kinesis Data Analytics -> Kinesis Data Firehose -> Amazon OpenSearch Service. This is a very convenient way to do it, and Data Analytics can help transform the raw Tweets, using SQL, into the format you need to create the OpenSearch index.
3 – Finally, to obtain the summaries, we have exposed a REST API with API Gateway, Cognito and Lambda. Again, you can implement this in many ways, but this approach is serverless and easy to implement. The Lambda function is the core of the API: it queries OpenSearch, fetches and groups the tweets we want to summarise, and finally invokes the OpenAI API; for this, you need credentials that you can obtain on their website.
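To make step 3 more concrete, here is a heavily simplified sketch of what that Lambda could look like. It assumes the opensearch-py client with basic auth, an index called tweets, and the pre-1.0 openai Python SDK with a gpt-3.5-turbo model; the index name, field names and prompt are placeholders rather than the exact production code.
import os
import json
import openai
from opensearchpy import OpenSearch

# Placeholder configuration – in the real architecture these come from the environment/Secrets Manager
opensearch = OpenSearch(
    hosts=[{"host": os.environ["OPENSEARCH_HOST"], "port": 443}],
    http_auth=(os.environ["OS_USER"], os.environ["OS_PASS"]),
    use_ssl=True,
)
openai.api_key = os.environ["OPENAI_API_KEY"]

def handler(event, context):
    tag = event["queryStringParameters"]["tag"]  # e.g. "AWS Architecture"

    # Fetch the latest tweets for the tag from the (assumed) "tweets" index
    result = opensearch.search(
        index="tweets",
        body={"size": 20, "query": {"match": {"text": tag}}},
    )
    tweets = [hit["_source"]["text"] for hit in result["hits"]["hits"]]

    # Ask the model for a summary of the grouped tweets
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Summarize these tweets for an architect:\n" + "\n".join(tweets)}],
    )
    summary = completion["choices"][0]["message"]["content"]

    return {"statusCode": 200, "body": json.dumps({"summary": summary})}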
I recertified in the AWS Data Analytics certification in April 2023 – for the second time, if you count the now-retired AWS Big Data certification. I got an email from AWS asking for feedback, and another survey about the competencies for the AWS Data Engineer role. Fair warning: a new version of the exam may be coming. No big secret here; they review the exams to keep them current and challenging.
Personal Experience
You can read a bit about my past experience with this certification in the post above. I got it days after it was released, and it was a challenging and fun experience. Compared with the present experience, the exam pattern remains very similar, but with more questions about services like OpenSearch, Lake Formation and MSK, to name a few. Remember that this is just an orientation based on my set of questions, so you could get a different experience. In any case, base your preparation on your personal experience and needs; as I always say, take the opportunity to create your own course and journey.
As for the value of the certification, it works for me, helping me keep up to date and proving my skills in the field up to a point. The certification is not a substitute for experience, so remember that you need some to back it up, plus experience in other complementary domains.
I will focus this post on the new features or services with more presence in the test; as for the rest, you can find more information in the older post or around the net.
Let’s have a look at a question from the sample set:
This is a representative question that includes different services and is relatively verbose. It references many essential services, including Kafka.
In many questions, you must choose the best option, filtered by some functional requirement – in this case, the overall latency. The difference between two answers could be literally one word, one service. That can be a difficult choice, although the context given can help you decide.
So, let’s review some new features that I find interesting and can be relevant for the test too:
LakeFormation & Glue
LakeFormation and Glue are two pivotal services you need to know well, especially Glue. They can now work together, so you can use AWS Lake Formation permission defined on the data lake for crawling the data, avoiding creating roles and S3 permissions for them. Additionally, this feature allows cross-account access and can be used with Athena, creating a model for centralized permission government.
More on the subject, this time with Redshift:
AWS MSK & Kafka
There is one service you need to prepare for if you don’t have previous experience with it. The growing number of questions surprised me in a way, but it makes sense. You will find many projects using Kafka, and MSK is an excellent alternative to replace it with a managed solution. The questions weren’t that easy, so prepare well. Also, I’d recommend that you get a roundup of Kafka:
The book’s second edition is relatively recent, and it’s an excellent introduction. Topics and partitions are well explained 🙂
https://aws.amazon.com/blogs/big-data/how-to-choose-the-right-amazon-msk-cluster-type-for-you/
A classic, what’s the better choice for your workload, provisioned or serverless?
https://aws.amazon.com/blogs/architecture/amazon-msk-backup-for-archival-replay-or-analytics
MSK interacts very well with other AWS services:
The ability to ingest hundreds of megabytes of data per second into Amazon Redshift materialized views and query them in seconds – without provisioning additional infrastructure like Firehose – makes it much easier to implement use cases such as live leaderboards, clickstream analysis and application monitoring.
AppFlow
https://aws.amazon.com/blogs/aws/announcing-additional-data-connectors-for-amazon-appflow/
New AppFlow integrations, like Marketing connectors (Facebook Ads…), Customer Service (MailChimp, Sendgrid … ), and Business operations (Stripe …).
But primarily, I’d like to highlight the integration with the AWS Glue Data Catalog, which allows the SaaS data to be registered in it. With this integration, there is no need to create crawlers to populate the data; it can be done with a few clicks.
AWS OpenSearch
Ah, AWS OpenSearch, an old friend of mine :). I have provisioned a few domains so far, and I’m happy to see that it appears more often in the test.
The new serverless flavour seems a good step forward for the service, following in the footsteps of Redshift, for instance.
Resources
Apart from the usual suspects – documentation, blog, FAQs, practice test and readiness course – I can recommend the official certification guide, which is very relevant and packed with additional resources to prepare:
This is a good book with a somewhat misleading title. Yes, it’s about hybrid cloud adoption, but with AWS Outposts, to be more precise.
AWS Outposts is a family of fully managed solutions that deliver AWS infrastructure and services to virtually any on-premises or edge location for a consistent hybrid experience.
And it’s good; it could appear contradictory that a cloud vendor offers to run fully managed workloads on a “restricted cloud” on-premises, but it makes sense for many use cases that need low latency, local data processing or data residency.
For instance, in the Healthcare Industry: Linear accelerators, CT Scanners and surgical devices.
Some of the benefits that Outposts provides:
Some use cases examples:
Data residency
A practical use case for regulated enterprises that need more than one region for disaster recovery, where data residency is required for compliance.
As you can see, the London Region is connected to the Outposts rack installed on-premises, using a VPN or Direct Connect connection. All the workloads and data reside in the UK, and only the control-plane traffic is sent to the Ireland Region in case a disaster impacts the London Region.
Low Latency & Local Processing
https://aws.amazon.com/blogs/aws/deploy-your-amazon-eks-clusters-locally-on-aws-outposts/
The ability to deploy an EKS cluster locally is beneficial when you need low latency, so keeping the cluster near the rest of the infrastructure helps. But that’s not the only benefit; being local means that the cluster can keep working even if there are connectivity problems, which are not uncommon when workloads are deployed at rough edge locations.
Other resources
A bit about my personal experience
The AWS Certified Database Beta certification was the last one I took at a test centre, a few weeks before Covid hit Spain very hard and the whole country went into lockdown.
It was an intense experience; the certification was at the beta stage at the time – Jan 2020 – which meant 85 questions in 4 hours, with no breaks :(, so my mind went into lockdown too after three and a half hours; I passed nonetheless 🙂
As the test was in the beta stage, there was no information available – no courses, test sets or anything – which is great because it forces you to prepare your own journey, depending on your experience with the subject matter. I created a “course” focused on architecture because that’s my interest and my bread and butter, but I was off target.
Well, the exam hasn’t changed; it’s focused on operations – backups, snapshots, migrations – optimization and troubleshooting, not that much on architecture or infrastructure, although there are a few questions. Also, RDS/Aurora and DynamoDB are the real stars of the test. Other products like Redshift, ElastiCache, DocumentDB, Athena, Neptune, Timestream, QLDB and OpenSearch are present in decreasing order of importance – at least in my set of questions. Remember that AWS holds an extensive database of questions and updates them very often, so always take this with caution.
You should always maximize your investment in these certifications; they are expensive in terms of time and money, so tailor them to fit your needs, not only to get “the badge”. I took a deep dive into the databases I didn’t have commercial experience with, and it was worthwhile; it just wasn’t reflected in the test. But that knowledge stays with me from now on.
Preparing for the Hero’s Journey
I took my notes from three years ago and created a new version of my “course”. As a base, I’ve used the guide “AWS Certified Database – Specialty Certification Guide” from Packt Publishing. It’s an excellent book to use as an intro to the subject matter and covers the test well, but you must go deeper in every section to pass. BTW, a new official guide is coming in July from Sybex.
Remember, this is a speciality certification, which means that, at the very least, you should hold one certification at the associate level or have the equivalent knowledge. And it pays off; several questions involve networking, security and IaC with CloudFormation.
I don’t really use video courses; I get bored watching hours of videos that often just rehash the official documentation – it is a very passive activity for me; I prefer to be in control of my learning and move between different activities. I watched a few videos this time, and I couldn’t take any more after a few minutes, so I returned to the real thing and the reading material. If you like this kind of learning, pick one from your favourite provider, select the ones from active professionals who can add some real experience to the material, tailor it to your needs and build on it.
Resources & Key Sites
AWS Certification Site: the primary site, which contains some good resources for preparation, including sample questions, the official practice test and the interactive Readiness Course, which also includes additional questions:
A relevant question for the test: “choose the best database” for a specific workload.
– AWS Database blog: excellent site to read about actual use cases, technical articles and new features.
Some nice posts:
https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/
– AWS Training and Certification Blog
A post from 2021 with valuable tips for preparing for the certification:
– AWS Documentation: an obvious one, but still the best source of information
Conclusion
This is a challenging certification that you should prepare for accordingly. Take your time and prepare for the journey, not just the destination.
Good Luck!
Here we are, July 2022, in the middle of a scorching summer, so I’d guess it is a perfect moment to relax near the sea and read a book or two. I’ve compiled a list of favourite books that I can personally recommend. All of them are paperbacks that I own and have read back to back; also, I’ve gone through most of the examples myself, so there is no cheating here 😉
The order in which I’m presenting the books doesn’t reflect their quality; in fact, all of them are great books.
Let me recommend you this book from Packt.
First, this is not a straightforward preparation guide for the Professional Cloud DevOps Engineer certification; it is much more than that: it’s almost a textbook on the subject matter that can be used as a guide for the certification. The edition is one of the best I’ve seen from this publisher; as I mentioned before, this book feels like a college textbook but is written in a concise style.
But don’t get me wrong; the book goes deep on many topics – for instance, Kubernetes and GKE, which is hands down one of my favourite parts. It’s an excellent introduction or refresher that helps you grasp many concepts in not many pages.
Another section that stands out is the book’s first part, which introduces the DevOps/SRE concepts in an informative way – and not tied up to GCP. The monitoring section is pretty informative, as well.
One of the best books I’ve read about #gcp – but not limited to – and #devops.
I can’t believe that 13 years have passed since I read the blue book of DDD, Eric Evans’ seminal work, and 11 since I worked on my first project using it. We were breaking a monolith into what we now call microservices – or an early incarnation of them.
I was looking for updated literature on the subject when I found this excellent book from Vlad Khononov. It provides the right amount of theory and practical examples to understand the different concepts and patterns. The style is clear and concise yet academic, which is not easy to achieve.
Some stuff you are going to find in the book:
– Architectural patterns and Heuristics
– Microservices & EDA
– Data Mesh
To summarize, this is one of the best books on the subject matter and can be used as a primer. I was expecting more examples, but that helped keep the book to a reasonable size, around 300 pages.
Note: The code listings are in C#, but they are very generic and easy to follow.
This book from Packt is an excellent introduction to the beautiful world of #analytics on #AWS. The author has done a fantastic job cramming into 440 pages many of the topics you may come across when working in this field in the #cloud.
You pay a price, though; some topics are covered at an introductory level. For instance, ingesting data with #kinesis is a topic that needs a whole book, not a few pages. But you get an intro. Other chapters, like “Transforming Data to Optimize for Analytics”, are more comprehensive.
The book covers recent services like AppFlow, Glue Studio, DataWrangler, and other third-party services.
An excellent #book that I’d recommend to data engineers who want to introduce themselves to analytics on AWS.
I was looking for a book to replace my old microservices book, “Spring Microservices in Action” (Manning, 2017) – specifically, one that covered new tools and deployment in the cloud; I wasn’t that interested in the Spring part.
Well, this book does its job very well. I was so hooked that I read the book several times and reviewed most examples.
If I had to pick a few highlights, they would be the Kubernetes deployment chapters, the service mesh coverage and the replacement of the Netflix components.
A word of advice: this book is a Packt Publishing release, so it follows their editorial style – very hands-on, with many code listings. That suits me well, but make sure it’s what you want.
A few months back, I spent a weekend in Barcelona, one of my favourite cities. It was great to see the city back to life, filled with people and tourists 🙂 We even went to the beach on the first day!
OK, back to the book!
I liked it, the author’s style is very engaging, and I enjoyed having a fresh look at the different Cloud-Native Patterns from the GO perspective.
The only concern is that the subject is too broad to be covered in just 400 pages, so you can use this book as a good introduction, but you’ll need to complement this with other literature.
Funnily enough, I’m not using this guide from Konrad Cłapa, Google Cloud Certified Fellow, to achieve the Architect certification, as I recertified in the beta; 77 intense questions in three hours, including questions from the new scenarios.
However, I have a bunch of GCP recertifications coming my way very soon, so I’m using this book as one of the stepping stones to resync and refresh my whole GCP knowledge.
This guide is very comprehensive and to the point but still clocks in at 600+ pages; don’t worry, it includes lots of pictures 🙂
Also, you can find four mock tests (20 questions each) and an analysis of each case study. The only thing I miss is review questions in every chapter.
Remember that this is a study guide, so expand each section accordingly and get practical knowledge as required. In real life, you don’t have a list of questions to choose from, and not everything is on the web; that’s where a mix of knowledge, experience and intuition kicks in.
I’ve had this book by Julien Simon on my reading list for a while – it’s a 2020 release – and when I finally made the time to read it, I couldn’t put it down. Like the previous book, it’s a Packt Publishing release packed with examples so that you can start prototyping immediately.
It’s a hands-on book focused on Sagemaker and AWS services, so it’s not a book to learn Machine Learning in-depth. The good news is that you can learn to train and deploy pre-built models using Sagemaker without that specific knowledge.
The book gives you an excellent overview of Sagemaker, but I’d guess it was written before 2020, so it doesn’t cover many new features like Sagemaker Studio, Autopilot, etc…
But not to worry :), there is already a second edition that covers all that new shiny stuff.
Happy Summer!
Moto is a library that allows you to mock AWS services when using the Python Boto3 library. It can also be used with other languages, as stated in the documentation, via its standalone server mode. I don’t have direct experience with that feature, though, as Python is my language of choice for coding PoCs and data projects in general.
I don’t need to preach about the benefits of using TDD to design and test your components. In any case, if you are still not convinced, check out books like:
Developing for the cloud comes with its own set of challenges. For instance, testing: it is not easy when you depend on services that are not available locally.
The appearance of libraries like Moto has made testing much more manageable. Like any library, it has its peculiarities, but the learning curve is not especially steep, particularly if you have previous experience with Pytest or other testing frameworks.
Prerequisites
I assume you have previous knowledge of AWS, Boto3, Python, TDD and Pytest. I’m providing some complete listings so you can learn by example – not the entire exercise, though, so you can fill in the gaps and enhance your learning experience.
Installing Moto is straightforward –
$ pip install moto[all]        # if you want to install all the mocks
$ pip install moto[dynamodb]   # or just the DynamoDB mock
To mock AWS services and test them properly, you will need a few more dependencies –
$ pip install boto3 pytest
Let’s go beyond the canonical examples and find out how to mock DynamoDB, the excellent AWS NoSQL database.
As shown in the following code listing, we’ll create the books table with the PK id. The function mocked_table returns a table with some data. Later on, this table will be mocked by Moto.
test_helper.py
import boto3

def mocked_table(data):
    dynamodb = boto3.client("dynamodb")
    dynamodb.create_table(
        TableName='books',
        KeySchema=[
            {
                'AttributeName': 'id',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'id',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        }
    )
    table = boto3.resource('dynamodb').Table("books")
    with table.batch_writer() as batch:
        for item in data:
            batch.put_item(Item=item)
    return table
Moto recommends using fake test credentials to avoid anything leaking to real environments. You can provide them in the configuration file conftest.py, using Pytest fixtures –
conftest.py
import os
import boto3
import pytest
from moto import mock_dynamodb2

os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'

@pytest.fixture(scope='function')
def aws_credentials():
    """Mocked AWS Credentials for moto."""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'
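A small note on the fixture above, in case you want the mocked credentials applied: Pytest only runs a function-scoped fixture when a test requests it by name, so you would add aws_credentials to the test signature – a minimal sketch:
@mock_dynamodb2
def test_get_book(aws_credentials):   # requesting the fixture sets the fake credentials
    ...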
Now, we are ready to start designing and developing.
We will code the function get_book, which retrieves a particular Item from the table. Following the TDD cycle, we have to code the test first, make it fail, etc. I’m not showing the entire cycle here, just a few steps to get the idea.
The test unit would look like this –
get_books_test.py
import pytest
import json
import boto3
from moto import mock_dynamodb2

book_OK = {
    "pathParameters": {
        "id": "B9B3022F98Fjvjs83AB8a80C185D",
    }
}

book_ERROR = {
    "pathParameters": {
        "id": "B9B3022F98Fjvjs83AB8a80C18",
    }
}

@mock_dynamodb2
def test_get_book():
    from get_book import get_book          # don't change the order:
    from test_helper import mocked_table   # these must be imported first,
    dynamodb = boto3.client("dynamodb")    # before getting the client
    data = [{'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}]
    mocked_table(data)
    result = get_book(book_OK)
    item = result['Item']
    assert item['id'] == 'B9B3022F98Fjvjs83AB8a80C185D'
    assert item['user'] == 'User1'
As you can see, it’s nothing very different from a regular Pytest test unit, except for the Moto decorators and the AWS-specific code.
Let’s increase the test coverage, ensuring that the functionality “Item not found” is working as expected.
@mock_dynamodb2
def test_get_book_not_found():
    from get_book import get_book          # don't change the order:
    from test_helper import mocked_table   # these must be imported first,
    dynamodb = boto3.client("dynamodb")    # before getting the client
    data = [{'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}]
    mocked_table(data)
    result = get_book(book_ERROR)
    assert 'Item' not in result  # book not found
OK, now that we have designed the test unit, we need to write the function get_book. We can start with something basic that satisfies the import and the method signature. By the way, I’ve shown the test unit fully coded, but you can do this gradually and code only the essential parts initially.
get_book.py
import json
import boto3

dynamodb = boto3.resource('dynamodb')
tableName = 'books'

def get_book(event):
    return {'id': 'B9B3022F98Fjvjs83AB8a80C185D', 'user': 'User1'}
To execute the tests –
$ pytest
The test will fail: the book that we are returning is not in the expected format. So let’s add the DynamoDB calls, which will be mocked by Moto.
get_book.py
import json
import boto3

dynamodb = boto3.resource('dynamodb')
tableName = 'books'

def get_book(event):
    # the book id comes in the event's path parameters (as in the test events above)
    book_id = event['pathParameters']['id']
    table = dynamodb.Table(tableName)
    result = table.get_item(
        Key={
            'id': book_id,
        }
    )
    return result
Now the test unit will pass 🙂
Some important things to point out:
Testing is not easy and can be tedious; for some people, even a nuisance. Using TDD – or BDD – changes your mindset entirely because you are designing your system, not only testing. But this is something that shouldn’t be news to you. TDD and BDD have been around for a while.
Not for the cloud, though.
Testing in the cloud is not easy; it’s all about integration and cost. Libraries like Moto help to alleviate that, and I have to say Moto does it pretty well.
Last February, AWS added support for creating canaries for API Gateway in CloudWatch Synthetics, which I’ve been using lately to monitor some REST APIs successfully.
Let’s review some technical concepts first:
After some hard work, we now have our brand new serverless app deployed and ready to be tested, comprising a REST API and a micro-frontend (CloudFront + S3).
We need to monitor our endpoints for latency and resiliency; how do we do that right away? I have already given away the answer: by creating a synthetic canary.
You can find the Canaries dashboard in the AWS console:
Cloudwatch > Insights > Application monitoring > Synthetic canaries
Creating a canary for API Gateway from the console is really straightforward; you are given two options: select an API Gateway API and stage, or provide a Swagger template.
We are then presented with a series of options:
The endpoint URL should be populated automatically:
The next step, adding an HTTP request, is also straightforward. The configuration – parameters, headers – depends on the method being tested:
Now you can finalize the creation of the canary:
In the following screen, we can see the canary that has been created and has been running for a while:
Finally, a few metrics are shown: Duration, Errors (4xx), Faults (5xx).
As promised, canaries are really easy to implement.
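The console flow above is the quickest path, but the same canary can also be created programmatically through the Synthetics API. Here is a hedged boto3 sketch, where the bucket, key, role ARN, handler and runtime version are placeholders you would adapt to your account:
import boto3

synthetics = boto3.client("synthetics", region_name="eu-west-1")

# Hypothetical canary exercising the REST API – names, ARNs and the runtime version are placeholders
synthetics.create_canary(
    Name="rest-api-canary",
    Code={
        "S3Bucket": "my-canary-code-bucket",          # zip file with the canary script
        "S3Key": "canaries/api-canary.zip",
        "Handler": "apiCanary.handler",               # entry point inside the zip
    },
    ArtifactS3Location="s3://my-canary-artifacts/rest-api-canary/",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/canary-execution-role",
    Schedule={"Expression": "rate(5 minutes)"},       # run every five minutes
    RuntimeVersion="syn-nodejs-puppeteer-3.9",        # check the currently supported runtime versions
)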
DynamoDB is AWS’s fast NoSQL database at any scale, supporting key-value and document data models. I won’t delve into the basics because I’m sure I don’t need to explain them to you – you’ve arrived here, after all :). Anyway, if you need a quick introduction, please check out the following links:
If you have ever implemented a pagination component, you already know that it is not easy, especially in a clean and performant way.
Furthermore, DynamoDB adds its own set of challenges because of the way it works. When you execute a Query, the resultset is divided into groups, or pages, of data up to 1 MB in size. So you need to find out whether there are remaining results to return after that first query. Also, you’ll likely need to return a fixed number of results, which adds a few more edge cases to the mix.
If you use Scan instead of Query, things get worse, because it reads the whole table, exhausting the assigned RCUs very quickly. I produced a first quick version using Scan; it works, but it’s not optimal for pagination, especially when you have many records – and it’s expensive too.
Not all is gloom and doom, though. The Query response contains an element, LastEvaluatedKey, that points to the last processed record. We can use this element to build a cursor that we pass back and forth – in the response and the request – to drive our pagination component. When there are no elements left, this element is absent, and therefore we have reached the end of the resultset.
LastEvaluatedKey is a Map-type object that contains the PK of the table. We shouldn’t pass it around like that, as we would expose our model to the world. A standard and better way is to pass the element Base64-encoded. You can use the Python module base64:
import base64

cursor_ascii = cursor.encode("ascii")
base64_bytes = base64.b64encode(cursor_ascii)
# we convert the bytes back into a string, or whatever we need
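The snippets below reference decode_base64 and encode_base64 helpers without showing them; here is a minimal sketch of what they might look like, assuming the cursor is the LastEvaluatedKey dict serialized as JSON before Base64 encoding:
import base64
import json

def encode_base64(last_evaluated_key):
    """Turn the LastEvaluatedKey dict into an opaque cursor string for the response."""
    if last_evaluated_key is None:
        return None
    payload = json.dumps(last_evaluated_key).encode("ascii")
    return base64.b64encode(payload).decode("ascii")

def decode_base64(cursor):
    """Turn the cursor from the request back into an ExclusiveStartKey dict."""
    if not cursor:
        return None
    return json.loads(base64.b64decode(cursor.encode("ascii")))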
The first thing we have to do is retrieve the cursor from the request, if it exists, and execute a first query, assigning the cursor – the decoded LastEvaluatedKey from the previous page – to the ExclusiveStartKey field. In this example, I’m retrieving a user’s data set using the id as the key condition.
from boto3.dynamodb.conditions import Key

# get the cursor from the request, if it exists
exclusiveStartKey = decode_base64(cursor)
if exclusiveStartKey is not None:
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId),
        ExclusiveStartKey=exclusiveStartKey
    )
else:  # no cursor: this is the first page
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId)
    )
Now, we keep fetching the remaining records – remember the 1 MB limit – while the LastEvaluatedKey element is present in the result object, or until we reach our imposed page limit. Finally, we keep track of the last LastEvaluatedKey so we can encode it and pass it back in the response.
lastEvaluatedKey = None
while 'LastEvaluatedKey' in response:
    key = response['LastEvaluatedKey']
    lastEvaluatedKey = key
    response = table.query(
        KeyConditionExpression=Key('id').eq(userId),
        ExclusiveStartKey=key
    )
    ............
# serialize and Base64-encode the last key for the response (see the helper sketch above)
cursor = encode_base64(lastEvaluatedKey)
I hope this helps to build your pagination component 🙂
What a surprise!
I was writing a piece about a component for DynamoDB – one that I’ve produced for a PoC – when the console completely changed!
The new console looks like the rest of the updated services; the experience becomes more cohesive across services.
It includes three new sections: Items, PartiQL editor and Export to S3.
So far, so good!