Vector Databases on AWS: Enabling AI-Powered Search and RAG
The shift from keyword search to semantic search — and from static reports to AI-powered question-answering systems — has made vector databases one of the most rapidly adopted infrastructure components in 2023 and 2024. For data engineering teams, vector databases represent a new storage primitive that requires different design thinking than relational or columnar stores, but integrates with familiar AWS infrastructure in ways that make production deployment tractable.
This post explains what vector databases are, when they are the right choice, which AWS-native options exist and how they compare, and how to build a production-grade Retrieval-Augmented Generation (RAG) architecture on AWS.
What Vector Databases Actually Do
Traditional databases store and retrieve data based on exact or fuzzy matches against discrete values. Find the row where customer_id = 12345. Find all products where category = 'electronics'. Vector databases store and retrieve data based on mathematical similarity between high-dimensional vectors — numerical representations of unstructured content.
When you embed a text document, image, or audio clip using an embedding model, you transform it into a vector of several hundred to several thousand floating-point numbers. These numbers encode semantic meaning: documents about similar topics have vectors that are geometrically close to each other in high-dimensional space. A vector database’s core operation is approximate nearest-neighbour (ANN) search: given a query vector, find the stored vectors most similar to it.
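To make the similarity idea concrete, here is a minimal sketch of exact (brute-force) nearest-neighbour search using cosine similarity. The three-dimensional vectors and document names are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database replaces the full scan below with an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, stored, k=2):
    """Score every stored vector against the query and return the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" (illustrative values only)
toy_vectors = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]
top = nearest_neighbours(query_vector, toy_vectors, k=2)
print([doc_id for doc_id, _ in top])
```

The exact scan is linear in the number of stored vectors, which is why ANN indexes trade a small amount of recall for much lower query cost at scale.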
The practical consequence is that you can store your organisation’s documentation, product catalogue, support tickets, or transaction notes as vectors, then retrieve the most semantically relevant items for any natural-language query — without requiring exact keyword matches.
AWS Vector Database Options: A Comparison
This post compares four AWS services that offer vector search, each with different trade-offs:
Amazon OpenSearch Service with k-NN plugin: The most mature and flexible AWS-native option. OpenSearch’s k-NN plugin supports Hierarchical Navigable Small World (HNSW) and IVF (Inverted File Index) algorithms with configurable precision/speed trade-offs. Supports filtering on metadata fields alongside vector similarity, which is essential for real-world RAG systems. Scales horizontally. Best for: production RAG systems, enterprise search, high-volume semantic search.
Amazon Aurora PostgreSQL with pgvector: The pgvector extension adds vector similarity search to Postgres. Supports exact and approximate nearest-neighbour search using IVFFlat and HNSW indexes. Best for: organisations already using Aurora Postgres, use cases where vector search is one capability among many in a relational workload, smaller vector datasets (under 100M vectors).
Amazon MemoryDB for Redis with vector search: MemoryDB’s vector search keeps the index in memory and targets single-digit-millisecond query latency. Best for: real-time personalisation, session-based recommendations, use cases where vector search latency must be under 5ms.
Amazon Bedrock Knowledge Bases: Managed RAG service that abstracts the vector database layer, using OpenSearch Serverless as the default backing store (other supported vector stores, such as Aurora PostgreSQL with pgvector, can be configured instead). Best for: teams that want managed RAG without infrastructure responsibility. Less flexible than building your own stack but significantly faster to implement.
| Service | Scale | Latency | Filtering | Operational Overhead |
|---|---|---|---|---|
| OpenSearch k-NN | Hundreds of millions | 10–50ms | Excellent | Medium |
| Aurora pgvector | Tens of millions | 5–20ms | Full SQL | Low (if already using Aurora) |
| MemoryDB VSS | Tens of millions | <5ms | Good | Medium |
| Bedrock Knowledge Bases | Managed | 100–500ms | Limited | Very Low |
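As a rough illustration of how the table might be used, the sketch below encodes the upper end of each latency range and shortlists the options that fit a latency budget. The figures are copied from the table above; the class and function names are hypothetical, and a real selection would also weigh scale, filtering needs, and existing infrastructure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorOption:
    name: str
    latency_upper_ms: int  # upper end of the "Latency" column
    overhead: str          # "Operational Overhead" column


OPTIONS = [
    VectorOption("OpenSearch k-NN", 50, "Medium"),
    VectorOption("Aurora pgvector", 20, "Low (if already using Aurora)"),
    VectorOption("MemoryDB VSS", 5, "Medium"),
    VectorOption("Bedrock Knowledge Bases", 500, "Very Low"),
]


def shortlist(latency_budget_ms):
    """Return the options whose upper latency bound fits within the budget."""
    return [o.name for o in OPTIONS if o.latency_upper_ms <= latency_budget_ms]


print(shortlist(20))
```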
Building a RAG Architecture on AWS
Retrieval-Augmented Generation (RAG) is the pattern that makes vector databases operationally useful for most organisations. Instead of relying entirely on an LLM’s training data, RAG retrieves relevant context documents from a vector store and passes them to the LLM as part of the prompt. The LLM generates a response grounded in your specific data rather than generic training data.
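The "pass them to the LLM as part of the prompt" step can be sketched as a small prompt-assembly function. This is a minimal illustration, not a recommended template: the function name, the instruction wording, and the character budget are all assumptions, and a production system would budget by model tokens rather than characters.

```python
def build_rag_prompt(question: str, chunks: list[str], max_chars: int = 6000) -> str:
    """Combine retrieved chunks and the user question into one grounded prompt."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop adding context once the (assumed) character budget is exhausted
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 3 to 5 days."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supported each statement.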
A production RAG architecture on AWS:
Document Ingestion Pipeline:
  Documents (S3) → AWS Lambda (chunking + metadata extraction)
    → Amazon Bedrock (Titan Embeddings or Cohere Embed)
    → Amazon OpenSearch (vector store)
Query Pipeline:
  User query → Amazon API Gateway → AWS Lambda
    → Amazon Bedrock (embed query using same model)
    → Amazon OpenSearch k-NN search (retrieve top-k chunks)
    → Amazon Bedrock (Claude / other LLM, prompt + retrieved context)
    → Response to user
The critical implementation detail is using the same embedding model for both document ingestion and query embedding. Embedding spaces are model-specific — a vector from Titan Embeddings V2 is not comparable to a vector from Cohere Embed. If you change embedding models, you must re-embed all documents.
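One way to enforce this rule is to record which model produced an index's vectors and check it before every query. The sketch below is an assumption about how such a guard could look; the class names are hypothetical, the Cohere model ID is given only as an example, and where the index-side spec is stored (index metadata, a config table) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model_id: str
    dimension: int


class EmbeddingMismatchError(ValueError):
    """Raised when query and index embeddings come from different models."""


def assert_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Refuse to search if the query would be embedded in a different space."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index uses {index_spec}, query uses {query_spec}; "
            "re-embed the documents or switch the query model"
        )


titan = EmbeddingSpec("amazon.titan-embed-text-v2:0", 1024)
cohere = EmbeddingSpec("cohere.embed-english-v3", 1024)  # example ID, same dimension

assert_compatible(titan, titan)  # passes silently
try:
    assert_compatible(titan, cohere)
except EmbeddingMismatchError as exc:
    print(f"blocked: {exc}")
```

Comparing the model ID as well as the dimension matters: two models can emit vectors of the same length that are still not comparable.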
Implementing Vector Search with OpenSearch
Here is a concrete implementation using Amazon OpenSearch Service and Amazon Bedrock’s Titan Embeddings for a document Q&A system:
import json
from datetime import datetime, timezone
from typing import Optional

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Set up OpenSearch client with IAM (SigV4) authentication
region = "ca-central-1"
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    "es",  # service name for Amazon OpenSearch Service domains
    session_token=credentials.token,
)

os_client = OpenSearch(
    hosts=[{"host": "your-domain.ca-central-1.es.amazonaws.com", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Titan Text Embeddings V2 returns 256, 512, or 1024 dimensions (1024 by default);
# 1536 is the Titan V1 size. The index mapping and the embedding request must agree.
EMBEDDING_DIMENSION = 1024

# Create index with k-NN mapping
index_mapping = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 512,
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": EMBEDDING_DIMENSION,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 16,
                    },
                },
            },
            "document_id": {"type": "keyword"},
            "source": {"type": "keyword"},
            "created_at": {"type": "date"},
        }
    },
}

# Only create the index if it is missing, so the script can be re-run safely
if not os_client.indices.exists(index="documents"):
    os_client.indices.create(index="documents", body=index_mapping)

# Embed and index a document chunk
bedrock = boto3.client("bedrock-runtime", region_name=region)


def embed_text(text: str) -> list[float]:
    """Return the Titan V2 embedding for a piece of text."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps(
            {"inputText": text, "dimensions": EMBEDDING_DIMENSION, "normalize": True}
        ),
    )
    return json.loads(response["body"].read())["embedding"]


def index_document_chunk(chunk: str, document_id: str, source: str) -> None:
    """Embed one chunk and store it with its metadata."""
    embedding = embed_text(chunk)
    os_client.index(
        index="documents",
        body={
            "content": chunk,
            "embedding": embedding,
            "document_id": document_id,
            "source": source,
            # Record the actual ingestion time rather than a fixed date
            "created_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        },
    )


# Retrieve relevant chunks for a query
def retrieve_relevant_chunks(
    query: str, top_k: int = 5, source_filter: Optional[str] = None
) -> list[str]:
    """Embed the query with the same model and return the top-k chunk texts."""
    query_embedding = embed_text(query)
    # Note: a bool filter is applied after the k-NN search, so a restrictive
    # filter can return fewer than top_k results.
    filters = [{"term": {"source": source_filter}}] if source_filter else []
    search_body = {
        "size": top_k,
        "query": {
            "bool": {
                "must": [
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_embedding,
                                "k": top_k,
                            }
                        }
                    }
                ],
                "filter": filters,
            }
        },
        "_source": ["content", "document_id", "source"],
    }
    results = os_client.search(index="documents", body=search_body)
    return [hit["_source"]["content"] for hit in results["hits"]["hits"]]
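The bool query in retrieve_relevant_chunks filters after the k-NN search has already picked its candidates, so a narrow filter can leave fewer than top_k hits. OpenSearch also accepts a filter inside the knn clause itself (efficient k-NN filtering). The sketch below only builds that query body; it assumes the index was created with the Lucene or Faiss engine rather than nmslib, and it reuses the field names from the example above. The function name is hypothetical.

```python
from typing import Optional


def build_filtered_knn_query(
    query_embedding: list[float], top_k: int = 5, source_filter: Optional[str] = None
) -> dict:
    """Build a search body that applies the metadata filter during k-NN search."""
    knn_clause = {"vector": query_embedding, "k": top_k}
    if source_filter is not None:
        # Filter is evaluated as part of the k-NN search, not afterwards
        knn_clause["filter"] = {"term": {"source": source_filter}}
    return {
        "size": top_k,
        "query": {"knn": {"embedding": knn_clause}},
        "_source": ["content", "document_id", "source"],
    }


body = build_filtered_knn_query([0.1, 0.2, 0.3], top_k=3, source_filter="handbook")
print(body["query"]["knn"]["embedding"]["filter"])
```

The returned dictionary can be passed as the body argument of the client's search call in place of search_body.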
Chunking Strategy: The Engineering Detail That Determines RAG Quality
The quality of a RAG system is determined less by the LLM choice and more by the chunking strategy used during document ingestion. Chunking is how you split documents into retrievable segments, and the wrong strategy produces either chunks too small to be useful or chunks too large to be precise.
Three chunking strategies and when to use each:
Fixed-size chunking (e.g., 512 tokens with 50-token overlap): Simple, predictable, works well for homogeneous documents (support tickets, product descriptions). The overlap prevents important context from being split across chunk boundaries.
Semantic chunking: Split documents at semantic boundaries (paragraph breaks, section headers) rather than fixed token counts. Produces chunks that are semantically coherent and often retrieve better than fixed-size chunks. More complex to implement — requires a sentence splitter and boundary detection logic.
Hierarchical chunking: Store both large parent chunks and small child chunks. Retrieve using small chunks (for precision), then expand context to parent chunks before passing to the LLM. This “parent document retrieval” pattern significantly improves RAG answer quality for long documents.
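The fixed-size and hierarchical strategies can be sketched in a few lines. This is an illustration under simplifying assumptions: the input is an already-tokenised list (a real pipeline would use the embedding model's tokeniser), and the default sizes for the hierarchical variant are placeholders rather than recommendations.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def hierarchical_chunks(tokens, parent_size=2000, child_size=200, child_overlap=20):
    """Return parent chunks plus small child chunks that point back to their parent."""
    parents = {}
    children = []
    for parent_id, parent in enumerate(fixed_size_chunks(tokens, parent_size, 0)):
        parents[parent_id] = parent
        for child in fixed_size_chunks(parent, child_size, child_overlap):
            children.append({"parent_id": parent_id, "tokens": child})
    return parents, children


tokens = list(range(10))  # stand-in for real tokens
print(fixed_size_chunks(tokens, size=4, overlap=1))
parents, children = hierarchical_chunks(tokens, parent_size=6, child_size=3, child_overlap=1)
print(len(parents), len(children))
```

In the hierarchical case only the child chunks are embedded and searched; the parent_id on each hit is then used to fetch the larger parent text for the prompt.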
For most production use cases, semantic chunking with 200–800 token target chunk sizes and 10–20% overlap is a good starting point. Measure retrieval quality (recall@k) with a held-out evaluation set before optimising further.
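Recall@k is simple to compute once you have an evaluation set of queries with known relevant chunks. The sketch below assumes each evaluation item is a pair of (ranked retrieved IDs, set of relevant IDs); the function names and sample IDs are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant item")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(eval_set, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set]
    return sum(scores) / len(scores)


eval_set = [
    (["a", "b", "c"], {"a", "c"}),  # only "a" is in the top 2
    (["x", "y"], {"x"}),            # the single relevant item is found
]
print(mean_recall_at_k(eval_set, k=2))
```

Re-running the same evaluation set after each chunking change gives a like-for-like comparison between strategies.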
Data Engineering Considerations for Vector Pipelines
Vector databases require the same operational rigour as any production data store:
Embedding pipeline as a data pipeline: Document ingestion and embedding should be managed via the same DataOps practices as any other pipeline — version-controlled, tested, monitored. An embedding pipeline failure that silently indexes zero documents is a quality regression that needs detection, not just an error log.
Index refresh strategy: Documents change. When a source document is updated, the corresponding vectors must be re-computed and re-indexed. Design your ingestion pipeline to handle updates via document ID, not just inserts.
Metadata filtering: A vector search without metadata filtering is operationally limited. Real production RAG systems need to filter by document type, date range, department, or access control group. OpenSearch supports this by combining a k-NN query with a metadata filter; design your metadata schema before ingesting documents.
Cost management: Bedrock embedding API calls cost money per token. For large document corpora (millions of documents), embedding costs can be significant. Cache embeddings for documents that have not changed. Use batch embedding APIs where available to reduce per-call overhead.
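Two of the points above, updates by document ID and caching unchanged embeddings, can be sketched together. The ID scheme, the in-memory cache, and the stand-in embedding function are all assumptions for illustration; a production cache would live in a persistent store keyed the same way.

```python
import hashlib


def chunk_id(document_id: str, chunk_index: int) -> str:
    """Deterministic ID so re-ingesting a document overwrites its old chunks."""
    return f"{document_id}#{chunk_index}"


def content_hash(text: str) -> str:
    """Stable fingerprint of the chunk text, used as the cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Calls the embedding function only for text that has not been seen before."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # number of real embedding calls made

    def get(self, text: str):
        key = content_hash(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


# Stand-in embedding function so the sketch runs without calling Bedrock
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("unchanged paragraph")
cache.get("unchanged paragraph")  # served from cache, no second call
print(chunk_id("doc-1", 3), cache.misses)
```

Passing the deterministic ID as the id argument of the OpenSearch client's index call turns each write into an upsert, so an updated document replaces its earlier vectors instead of duplicating them.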
The patterns for managing large-scale document ingestion build on the same Modern Data Stack principles that govern any production data workflow: idempotent writes, monitoring, and automated quality checks.
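An automated quality check for an ingestion run might look like the sketch below, which flags the silent "zero documents indexed" case as well as an elevated failure rate. The field names and the 1% threshold are assumptions; the counts would come from whatever metrics your pipeline already emits.

```python
from dataclasses import dataclass


@dataclass
class IngestionRun:
    documents_seen: int
    chunks_indexed: int
    chunks_failed: int


def check_ingestion_run(run: IngestionRun, max_failure_rate: float = 0.01) -> list[str]:
    """Return a list of problems found in one ingestion run (empty means healthy)."""
    problems = []
    if run.documents_seen > 0 and run.chunks_indexed == 0:
        problems.append("documents were read but no chunks were indexed")
    attempted = run.chunks_indexed + run.chunks_failed
    if attempted and run.chunks_failed / attempted > max_failure_rate:
        problems.append(
            f"chunk failure rate {run.chunks_failed / attempted:.1%} exceeds threshold"
        )
    return problems


print(check_ingestion_run(IngestionRun(documents_seen=10, chunks_indexed=0, chunks_failed=0)))
```

A non-empty result should fail the pipeline run or raise an alert rather than only being written to a log.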
Connecting RAG to Your Existing Data Platform
The most powerful RAG implementations are not isolated question-answering systems — they are integrated with the broader data platform. A few integration patterns worth considering:
RAG over structured data via text-to-SQL: For BI-style questions over your Redshift warehouse, a vector search of pre-generated natural language descriptions of tables and columns helps an LLM generate accurate SQL. The LLM retrieves the relevant table descriptions, then generates a SQL query against Redshift via the Redshift Data API.
RAG over data catalogue: Embedding the descriptions, tags, and business glossary entries in your data catalogue (Amazon DataZone or AWS Glue Data Catalog) enables natural-language search over your data assets. Data consumers ask “which table has customer lifetime value by region?” and retrieve the relevant tables without knowing exact table names.
Audit and access logging: RAG queries are business intelligence queries. They should be logged with the same rigour as SQL queries against your warehouse — who asked what, when, what was retrieved, what was generated. This is essential for regulatory compliance in sectors like financial services and healthcare. The governance patterns from Data Lineage on AWS extend naturally to RAG systems — tracking which documents were retrieved to produce a given answer is the AI equivalent of column-level data lineage.
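The "who asked what, when, what was retrieved, what was generated" record can be sketched as one structured log line per query. The field names here are assumptions, and where the record is shipped (CloudWatch Logs, S3, a warehouse table) is left to your platform.

```python
import json
from datetime import datetime, timezone


def build_audit_record(user_id, question, retrieved_document_ids, answer, timestamp=None):
    """Serialise one RAG interaction as a JSON line suitable for an audit log."""
    ts = timestamp or datetime.now(timezone.utc)
    record = {
        "timestamp": ts.isoformat(),
        "user_id": user_id,
        "question": question,
        # Which chunks grounded the answer: the lineage of the response
        "retrieved_document_ids": list(retrieved_document_ids),
        "answer": answer,
    }
    return json.dumps(record, sort_keys=True)


line = build_audit_record(
    "u-1",
    "Which table has customer lifetime value by region?",
    ["catalogue#42"],
    "See the table described in catalogue entry 42.",
    timestamp=datetime(2024, 4, 9, tzinfo=timezone.utc),
)
print(line)
```

If questions or answers can contain personal data, apply the same retention and masking rules you use for warehouse query logs.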
Conclusion
Vector databases on AWS are no longer experimental infrastructure — they are production components available as managed services (OpenSearch, Aurora pgvector, MemoryDB) or fully managed RAG abstractions (Bedrock Knowledge Bases). The engineering patterns for building reliable, scalable RAG systems are well understood: consistent embedding models, metadata-filtered retrieval, semantic chunking, and operationalised ingestion pipelines.
For data engineering teams, the entry point is treating vector ingestion pipelines with the same rigour as any other production pipeline: version-controlled, monitored, tested, and governed. The AI capability sits on top of that data engineering foundation, and its reliability is determined by the foundation’s quality.
If you are designing a RAG architecture or vector search capability for your organisation on AWS, contact the Infra IT Consulting team. We help data teams in Canada, the UK, and Africa build AI-ready data infrastructure grounded in sound data engineering practice.