Deploying an LLM and DocsGPT on AWS SageMaker

Retrieval Augmented Generation chatbot on your private data

We have been fine-tuning LLMs for RAG on AWS SageMaker for some time now, and deploying the resulting models there has always been easy. This step-by-step guide shows you how to deploy an LLM for RAG on AWS SageMaker. If you want to learn how to fine-tune LLMs in general and deploy them on AWS, I recommend you check out Phil Schmid's blog.

Deploying docsgpt-7b-mistral on AWS SageMaker

We will use our fine-tuned model, which is based on Mistral-7B-v0.1. We used around 50k high-quality examples to fine-tune it. You can find it on Hugging Face as Arc53/docsgpt-7b-mistral.

To begin, open SageMaker and create a notebook, or run the code on your own device. For the notebook environment, I recommend `PyTorch 2.0.1 Python 3.10 CPU`.

This guide uses Python code, run in a notebook, to deploy the resources. For the code to run, your environment must be authenticated with AWS and have the right permissions, such as `AmazonSageMakerFullAccess` and `AmazonS3FullAccess`.
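A quick way to confirm the setup is to look at which managed policies are attached to the role you plan to use. The sketch below is illustrative only: `missing_policies` and `attached_policy_names` are hypothetical helper names, not part of any library, and it assumes `boto3` is installed and that your credentials are allowed to query IAM. Permissions granted through inline or custom policies will not show up in this check.

```python
REQUIRED_POLICIES = ("AmazonSageMakerFullAccess", "AmazonS3FullAccess")


def missing_policies(attached, required=REQUIRED_POLICIES):
    """Return the required policy names that are not in `attached`."""
    attached = set(attached)
    return [name for name in required if name not in attached]


def attached_policy_names(role_name):
    """List the managed policies attached to an IAM role (calls AWS)."""
    import boto3  # imported lazily so the pure helper above works without it

    iam = boto3.client("iam")
    names = []
    paginator = iam.get_paginator("list_attached_role_policies")
    for page in paginator.paginate(RoleName=role_name):
        names.extend(policy["PolicyName"] for policy in page["AttachedPolicies"])
    return names


# Example (the role name is an assumption; use the role your notebook runs as):
# print(missing_policies(attached_policy_names("sagemaker_execution_role")))
```

An empty list means both policies named in this guide are attached to the role.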

First we install the required dependency:
!pip install -U sagemaker --quiet

Next we prep the boto3 session:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it doesn't exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    # works inside SageMaker, where an execution role is attached
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker, look the role up by name instead
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

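Next we load Hugging Face's LLM container image:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
# (the arguments are cut off in the original listing; the "huggingface"
# backend and the session created above are assumed here)
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    session=sess,
)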

Next we define the model and endpoint configuration:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
# choose a desired or available instance type here
instance_type = "ml.g5.xlarge"
number_of_gpu = 1
health_check_timeout = 600

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "Arc53/docsgpt-7b-mistral",  # model ID on the Hugging Face Hub
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # Number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(3072),  # Max length of input text, in tokens
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of input plus generated text, in tokens
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Max total tokens across a batch of requests
}

 # check if token is set

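# create HuggingFaceModel with the image uri
# (the arguments are cut off in the original listing; the role, image and
# config defined in the previous steps are assumed here)
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
)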
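The three token limits in the configuration above are related. As far as I understand the text-generation-inference server that runs inside the Hugging Face LLM container, `MAX_INPUT_LENGTH` has to be smaller than `MAX_TOTAL_TOKENS`, and the difference between them is the room left for generated tokens. Here is a minimal sketch of that arithmetic. `generation_budget` is a hypothetical helper, not part of any library, and its checks are assumptions about the server's rules rather than a copy of them.

```python
import json


def generation_budget(config):
    """Return how many tokens can be generated after a maximum-length prompt."""
    max_input = int(json.loads(config["MAX_INPUT_LENGTH"]))
    max_total = int(json.loads(config["MAX_TOTAL_TOKENS"]))
    max_batch = int(json.loads(config["MAX_BATCH_TOTAL_TOKENS"]))
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch < max_total:
        raise ValueError("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")
    return max_total - max_input


# Same values as in the configuration used in this guide
example_config = {
    "MAX_INPUT_LENGTH": json.dumps(3072),
    "MAX_TOTAL_TOKENS": json.dumps(4096),
    "MAX_BATCH_TOTAL_TOKENS": json.dumps(8192),
}
print(generation_budget(example_config))  # 4096 - 3072 = 1024
```

With these values, a prompt that uses the full 3072 input tokens leaves 1024 tokens for the answer.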

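Finally, to deploy it, use:
# (the arguments are cut off in the original listing; the instance settings
# defined above are assumed here)
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)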
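Once the endpoint is in service, you can send it a test request. The sketch below assumes the request and response shapes used by the Hugging Face LLM container: `inputs` plus optional `parameters` going in, and a list containing `generated_text` coming back. `build_payload` and `ask` are hypothetical helpers, and the generation parameters are illustrative values, not tuned settings.

```python
import json


def build_payload(prompt, max_new_tokens=256, temperature=0.1):
    """Build a JSON-serializable request body for the endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "do_sample": True,
        },
    }


def ask(endpoint_name, region, prompt):
    """Send a prompt to a deployed endpoint and return the generated text (calls AWS)."""
    import boto3  # imported lazily; only needed when actually calling the endpoint

    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_payload(prompt)),
    )
    result = json.loads(response["Body"].read())
    return result[0]["generated_text"]
```

In the notebook itself, passing `build_payload("...")` to `llm.predict` on the predictor returned by `deploy` should produce the same kind of response.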

Once you have launched the DocsGPT application, make sure you set these environment variables:
SAGEMAKER_ENDPOINT: str = None # SageMaker endpoint name (docsgpt-7b-mistral)
SAGEMAKER_REGION: str = None # SageMaker region name
SAGEMAKER_ACCESS_KEY: str = None # SageMaker access key
SAGEMAKER_SECRET_KEY: str = None # SageMaker secret key
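For illustration, here is one way an application could turn these four settings into a `boto3` runtime client. This is a sketch, not DocsGPT's actual implementation. `sagemaker_client_kwargs` and `make_runtime_client` are hypothetical names, and it assumes the settings are exposed as environment variables with the names above.

```python
import os


def sagemaker_client_kwargs(env=None):
    """Map the SAGEMAKER_* settings to keyword arguments for boto3.client()."""
    env = os.environ if env is None else env
    if not env.get("SAGEMAKER_ENDPOINT"):
        raise ValueError("SAGEMAKER_ENDPOINT is not set")
    kwargs = {"service_name": "sagemaker-runtime"}
    if env.get("SAGEMAKER_REGION"):
        kwargs["region_name"] = env["SAGEMAKER_REGION"]
    # Pass explicit keys only when both are present; otherwise boto3 falls
    # back to its default credential chain.
    if env.get("SAGEMAKER_ACCESS_KEY") and env.get("SAGEMAKER_SECRET_KEY"):
        kwargs["aws_access_key_id"] = env["SAGEMAKER_ACCESS_KEY"]
        kwargs["aws_secret_access_key"] = env["SAGEMAKER_SECRET_KEY"]
    return kwargs


def make_runtime_client(env=None):
    """Create the runtime client (requires boto3 and valid credentials)."""
    import boto3

    return boto3.client(**sagemaker_client_kwargs(env))
```

The explicit keys are left optional so that a role-based setup, for example on an EC2 instance with an instance profile, works without storing secrets in the environment.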

Also make sure you switch to appropriate embeddings, for example local ones if you want everything to run locally.

Simplifying Deployment with AWS SageMaker

In conclusion, deploying your custom LLM and DocsGPT on AWS SageMaker is a streamlined and user-friendly process. If you encounter any challenges or need a solution tailored to your specific needs, don't hesitate to reach out for assistance. Our team is ready to help you optimize your deployment!
