In the previous post, I described the serverless real-time ingestion of live tweets; you can have a look at the following link:
With those tweets, we can train a model for topic modelling; what is more, we’ll use a synthetic set of tweets as well, generated with Claude, Anthropic’s LLM.
To classify the tweets, the first model I used was the SageMaker built-in version of LDA (Latent Dirichlet Allocation), an unsupervised learning algorithm for topic modelling with a mind of its own. It often comes up with unexpected topics, but that’s part of the fun ;). As an alternative, you can use the NTM (Neural Topic Model) algorithm. Each has pros and cons; see the docs for a discussion.
Let’s have a look at the architecture for generating the synthetic tweets. As you can see in the diagram, in the upper part, the generation is relatively straightforward:
1 – The user invokes the Lambda Function URL with a POST request containing the desired prompt and the model_id: “anthropic.claude-instant-v1”
2 – The Lambda Function invokes the Claude model with the prompt, which produces and returns the desired dataset
3 – The function stores the dataset in an S3 bucket in CSV format, ready for training in SageMaker (it will need some pre-processing, though)
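From the caller's side, step 1 could look roughly like the sketch below. The URL is a hypothetical placeholder, and the JSON field names `prompt` and `model_id` are an assumption about how the function reads its input; only the standard library is used.

```python
import json
import urllib.request

# Hypothetical placeholder; a real Function URL looks like
# https://<url-id>.lambda-url.<region>.on.aws/
FUNCTION_URL = "https://example.lambda-url.eu-west-1.on.aws/"


def build_request(prompt: str,
                  model_id: str = "anthropic.claude-instant-v1",
                  url: str = FUNCTION_URL) -> urllib.request.Request:
    """Build the POST request carrying the prompt and the model_id."""
    payload = json.dumps({"prompt": prompt, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_dataset(prompt: str) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Calling `generate_dataset(...)` with a prompt would trigger the whole flow; if the Function URL uses IAM auth rather than no auth, the request would additionally need to be signed.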
As I write this article, on October 14th, 2023, the Lambda function needs a layer with “botocore>=1.31.57” for the Python 3.11 runtime. Without it, you will get an error (type not found) when invoking the model.
Let’s see a basic Lambda handler implementation:
For the prompt, the usual recommendations found everywhere apply, so I won’t repeat them here, apart from one: be specific. In the case of the Claude models, the prompt must start with “Human:” and end with “Assistant:” (the API expects each of these markers to be preceded by a blank line, that is, “\n\nHuman:” and “\n\nAssistant:”). For instance, a straightforward zero-shot example:
“””Human: Write a dataset about the following <topic> in CSV format,
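To avoid getting the turn markers wrong, the prompt can be assembled by a small helper. This is only a sketch: `build_claude_prompt` is a hypothetical name, and the instruction wording inside it is illustrative, not the exact prompt used for the results in this post.

```python
def build_claude_prompt(topic: str, rows: int = 20) -> str:
    """Wrap an instruction in the Human/Assistant turns Claude expects."""
    # The wording below is illustrative; adapt it to your own dataset.
    instruction = (
        f"Write a dataset of {rows} tweets about the following topic: {topic}. "
        "Return CSV only, one row per tweet, with the columns id and text."
    )
    # Each turn marker is preceded by a blank line, as the API requires.
    return f"\n\nHuman: {instruction}\n\nAssistant:"


prompt = build_claude_prompt("AWS machine learning announcements at re:Invent")
```

Asking for "CSV only" tends to cut down on introductory chatter in the completion, though the output should still be validated before training.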
Let’s see an example generated with the Claude Instant v1 model, which is the fastest and most affordable version, yet still very proficient, as you will see:
1,”I was impressed by the updates to #AWSML services at #reinvent. The new features in AWS Sagemaker like AutoPilot, Pipelines and Model Monitor will make it even easier to build and deploy ML models at scale.”
2, “Just read a great case study on how @Anthropic used AWS Lambda, DynamoDB and S3 to build their constitutional AI platform. Serverless architectures on AWS are perfect for building scalable AI applications: https://bit.ly/3fKDs2C“
3, “Did you know that AWS now has a managed service for ML workflows called Amazon SageMaker Pipelines? It handles setting up end-to-end ML workflows with steps for data processing, model training & evaluation, and deployment. Could really speed up my #ML projects!”
4, “Been playing around with AWS DeepRacer, a 1/18th scale race car you can control with reinforcement learning. Such a fun way to learn about RL! Apparently the fastest lap times are around 50 seconds which is impressive considering it’s a tiny car: https://amzn.to/3qLExR9 #AWS #Reinvent”
5, “@AWS re:Invent was awesome, but I think some of the implications of their #ML announcements flew over people’s heads. putting AI/ML services like SageMaker, DeepLens etc. together with IoT services like Greengrass means we’ll start seeing AI/ML capabilities popping up in all kinds of unexpected places!”
It’s pretty good, and similar to the results I got using the latest GPT-4 model.
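Output like the sample above is not perfectly regular: some rows have a space after the comma, and the quotes may come back as typographic ones. A rough sketch of turning the raw completion into clean `(id, text)` rows, assuming every data row starts with a numeric id (`parse_generated_csv` is a hypothetical helper name):

```python
import csv
import io


def parse_generated_csv(raw: str) -> list[tuple[int, str]]:
    """Parse 'id, "text"' rows from a model completion."""
    # Normalise typographic double quotes so the csv module recognises quoting.
    normalised = raw.replace("\u201c", '"').replace("\u201d", '"')
    reader = csv.reader(io.StringIO(normalised), skipinitialspace=True)
    rows = []
    for record in reader:
        if len(record) < 2 or not record[0].strip().isdigit():
            continue  # skip headers, blank lines and chatter around the CSV
        rows.append((int(record[0]), record[1].strip()))
    return rows
```

Rows whose text contains unescaped double quotes would still be split incorrectly, so a quick check of the row count against the number requested is worthwhile.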
Training the model with SageMaker
Now that we have two datasets (one from actual live data and one synthetic), both stored in S3, we can train the LDA model. Let’s review some important points:
- Have a look at the example Notebook for a quick reference.
- The dataset must be pre-processed for training: tokenizing, removing stop words and punctuation, and then applying stemming or lemmatization. I used the nltk and gensim libraries. Once the text has been tokenized, we need to convert it to a bag of words.
- The LDA model (SageMaker flavour) expects the data in CSV or RecordIO format.
- Now, we can train the model; I used a modest ml.c5.2xlarge instance. In the SageMaker console, under Training -> Training jobs, we can find all the details about the training job: hyperparameters, the S3 model artefact, logs and metrics.
- Finally, we can deploy the model to an endpoint to test it.
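Going back to the pre-processing point in the list above, here is a rough sketch of tokenizing the tweets and converting them to a bag of words. It deliberately uses only the standard library as a simplified stand-in for nltk and gensim: the stop-word list is a tiny illustrative one, and stemming or lemmatization is left out. The dense one-row-per-document layout reflects my understanding of what the CSV input should look like, so check it against the SageMaker docs.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; nltk's English list is far more complete.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "for", "is", "at", "with"}


def tokenize(text):
    """Lowercase, drop URLs and punctuation, remove stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def to_bow_rows(docs, vocab):
    """Return one dense count vector per document, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


tweets = [
    "Training LDA on AWS SageMaker!",
    "AWS Lambda and S3: https://example.com",
]
docs = [tokenize(t) for t in tweets]
vocab = build_vocabulary(docs)
rows = to_bow_rows(docs, vocab)
csv_lines = [",".join(map(str, r)) for r in rows]
# vocab order: aws, lambda, lda, s3, sagemaker, training
# csv_lines == ["1,0,1,0,1,1", "1,1,0,1,0,0"]
```

The vocabulary size (six terms here) is the value the training job needs as its feature dimension, and the same vocabulary must be reused to encode any tweet sent to the deployed endpoint.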