Retrieval-Augmented Generation (RAG) is a powerful pattern that enhances Large Language Models (LLMs) by grounding their responses in your specific documents and data. While GPT-4 is incredibly capable, it doesn’t know about your proprietary documents, internal knowledge bases, or recent updates that occurred after its training cutoff date. RAG solves this problem by retrieving relevant context from your documents before generating responses.
RAG offers a practical, scalable approach to injecting domain knowledge into AI systems without frequent model retraining. By retrieving information from your latest documents at query time, RAG ensures that responses remain current and aligned with rapidly changing data and business context.
Another important advantage is source attribution. Because answers are generated from retrieved content, it is possible to see exactly which documents influenced a response. This level of transparency is especially valuable in enterprise settings where trust, explainability, and auditability are essential.
RAG is also more cost-effective than fine-tuning large models for domain-specific knowledge. Instead of investing in expensive retraining cycles, teams can update or expand knowledge simply by managing documents. This makes RAG highly flexible, allowing content to be added or removed without touching the underlying model, while grounding responses in real data helps reduce hallucinations and improve accuracy.
The RAG process
The RAG process follows this workflow:
```
PDF Document → Text Extraction → Chunking → Embedding → Vector Store
                                                             ↓
User Question → Embedding → Similarity Search → Context → LLM → Answer
```
- Document Ingestion: PDF files are extracted and split into manageable chunks (1000 characters with 200 character overlap)
- Embedding: Each chunk is converted to a vector representation using OpenAI’s text-embedding-3-small model
- Storage: Vectors are stored in a vector database (in-memory for this demo)
- Query: User questions are converted to embeddings and matched against stored chunks using cosine similarity
- Response: The most relevant chunks are included as context in the prompt to GPT-4, which generates accurate, grounded answers
Setting Up RAG in Spring AI
Add the Spring AI OpenAI starter and Apache PDFBox dependencies to your pom.xml:
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.0</version>
</dependency>
```
Configure the chat and embedding models in application.properties:
```properties
# Chat Model
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o
spring.ai.openai.chat.options.temperature=0.7

# Embedding Model (for RAG)
spring.ai.openai.embedding.options.model=text-embedding-3-small
```
RagService Implementation
The RagService class handles the core RAG functionality through several key functions. Refer to the Git repository for the code.
1. Document Ingestion – ingestPdfDocument()
This function handles the complete document ingestion pipeline:
- Text Extraction: Uses Apache PDFBox to extract text from PDF files
- Chunking: Splits the document into 1000-character chunks with 200-character overlap. The overlap ensures context isn’t lost at chunk boundaries
- Embedding Generation: Converts each chunk to a vector using OpenAI’s text-embedding-3-small model
- Storage: Stores chunks with their embeddings and metadata (source file, chunk index) in an in-memory vector database
Key Configuration:
```java
private static final int CHUNK_SIZE = 1000;    // Characters per chunk
private static final int CHUNK_OVERLAP = 200;  // Overlap for context preservation
```
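The exact code is in the repository linked at the end of this post; as a rough illustration, a condensed version of the ingestion pipeline could look like the sketch below. The Chunk record and chunkStore list are simplified stand-ins for the demo’s in-memory vector store, and EmbeddingModel is Spring AI’s standard abstraction.

```java
// Condensed sketch of the ingestion pipeline (not the exact repository code).
// Assumes Spring AI's EmbeddingModel and Apache PDFBox 3.x on the classpath.
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.ai.embedding.EmbeddingModel;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a stored chunk: text, its embedding, and metadata
record Chunk(String text, float[] embedding, String sourceFile, int index) {}

class RagIngestionSketch {

    private static final int CHUNK_SIZE = 1000;
    private static final int CHUNK_OVERLAP = 200;

    private final EmbeddingModel embeddingModel;
    private final List<Chunk> chunkStore = new ArrayList<>(); // in-memory "vector store"

    RagIngestionSketch(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public int ingestPdfDocument(String pdfPath) throws IOException {
        // 1. Text extraction with PDFBox
        String text;
        try (PDDocument document = Loader.loadPDF(new File(pdfPath))) {
            text = new PDFTextStripper().getText(document);
        }

        // 2. Chunking with overlap so context is not lost at chunk boundaries
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += CHUNK_SIZE - CHUNK_OVERLAP) {
            chunks.add(text.substring(start, Math.min(start + CHUNK_SIZE, text.length())));
        }

        // 3. Embedding generation and 4. storage with metadata
        for (int i = 0; i < chunks.size(); i++) {
            float[] embedding = embeddingModel.embed(chunks.get(i));
            chunkStore.add(new Chunk(chunks.get(i), embedding, pdfPath, i));
        }
        return chunks.size();
    }
}
```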
2. Semantic Search – chatWithDocuments()
This is the heart of the RAG pattern, orchestrating the question-answering workflow:
- Question Embedding: Converts the user’s question to a vector representation
- Similarity Search: Finds the most similar document chunks using cosine similarity
- Context Assembly: Combines the top-K most relevant chunks into a single context string
- Prompt Construction: Builds a RAG-specific prompt with the retrieved context
- Answer Generation: Sends the contextualized prompt to GPT-4 for response generation
The topK Parameter: Controls how many relevant chunks to retrieve. Higher values (e.g., 5-7) provide more context but may include less relevant information. Lower values (e.g., 2-3) are more focused but might miss important details.
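Continuing the sketch class from the ingestion section (and assuming Spring AI’s fluent ChatClient plus the cosineSimilarity() and buildRagPrompt() helpers shown in the next two sections), the orchestration might look roughly like this. It is a sketch, not the exact repository code, and the ChatClient API shown is from recent Spring AI releases.

```java
// Sketch of the question-answering flow, continuing the class above.
// Additional members assumed: a ChatClient field, plus imports for
// java.util.Comparator and java.util.stream.Collectors.
public String chatWithDocuments(String question, int topK) {
    // 1. Question embedding
    float[] questionEmbedding = embeddingModel.embed(question);

    // 2. Similarity search: rank stored chunks by cosine similarity, keep the top-K
    List<Chunk> topChunks = chunkStore.stream()
            .sorted(Comparator.comparingDouble(
                    (Chunk c) -> cosineSimilarity(questionEmbedding, c.embedding())).reversed())
            .limit(topK)
            .toList();

    // 3. Context assembly: combine the retrieved chunks into one string
    String context = topChunks.stream()
            .map(Chunk::text)
            .collect(Collectors.joining("\n---\n"));

    // 4. Prompt construction and 5. answer generation
    String ragPrompt = buildRagPrompt(context, question);
    return chatClient.prompt().user(ragPrompt).call().content();
}
```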
3. Cosine Similarity – cosineSimilarity()
This function measures semantic similarity between vectors:
- Input: Two embedding vectors (the question and a document chunk)
- Output: A similarity score between -1 and 1
- 1.0 = Vectors point in the same direction (semantically very similar)
- 0.0 = Vectors are orthogonal (unrelated content)
- -1.0 = Vectors point in opposite directions (contradictory content)
The score is the dot product of the two vectors divided by the product of their magnitudes. This measures the angle between the vectors in high-dimensional embedding space, which correlates with semantic similarity.
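A plain-Java implementation is only a few lines (the repository version may differ cosmetically):

```java
// Cosine similarity: dot(a, b) / (|a| * |b|), returning a value between -1 and 1.
private double cosineSimilarity(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```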
4. RAG Prompt Engineering – buildRagPrompt()
Constructs a specialized prompt that grounds the LLM’s response in retrieved documents:
```
You are a helpful assistant. Answer the question based on the context
provided below. If the answer cannot be found in the context, say so.

CONTEXT:
[Retrieved document chunks]

QUESTION:
[User's question]

ANSWER:
```
This prompt engineering technique is crucial for preventing hallucinations. It explicitly instructs the model to:
- Use only the provided context
- Admit when information isn’t available
- Ground responses in specific documents rather than general knowledge
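In code, this can be as simple as filling a text-block template (a sketch; the repository may word the instructions differently):

```java
// Grounds the model's answer in the retrieved context.
private String buildRagPrompt(String context, String question) {
    return """
            You are a helpful assistant. Answer the question based on the context
            provided below. If the answer cannot be found in the context, say so.

            CONTEXT:
            %s

            QUESTION:
            %s

            ANSWER:
            """.formatted(context, question);
}
```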
5. Document Search – searchDocuments()
A utility function for debugging and transparency:
- Returns the actual chunks that would be retrieved for a given query
- Includes metadata (source file, chunk index, text preview)
- Helpful in understanding the context the LLM receives
- Helps diagnose poor search results
REST API Endpoints
The ChatController exposes five RAG endpoints that provide a complete document management and querying workflow:
/api/rag/status (GET)
Check the current state of the document store:
- Returns the total number of chunks ingested
- Indicates whether the system is ready for queries
/api/rag/ingest (POST)
Upload and process PDF documents:
- Accepts a file path in the request body
- Triggers the complete ingestion pipeline (extract, chunk, embed, store)
- Returns the number of chunks created
- Limited to 50 chunks in this demo to manage API costs
/api/rag/chat (POST)
The main RAG endpoint for asking questions:
- Accepts a question and an optional topK parameter
- Performs a semantic search to find relevant chunks
- Generates context-grounded answers using GPT-4
- Returns responses based on your specific documents
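As a rough sketch of what this endpoint can look like (request and response field names follow the curl examples later in this post; the repository version may differ):

```java
// Sketch of the chat endpoint. Assumes a RagService bean exposing
// chatWithDocuments(question, topK), as outlined earlier.
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

import java.util.Map;

@RestController
@RequestMapping("/api/rag")
class RagChatControllerSketch {

    private final RagService ragService;

    RagChatControllerSketch(RagService ragService) {
        this.ragService = ragService;
    }

    @PostMapping("/chat")
    public ResponseEntity<Map<String, Object>> chat(@RequestBody Map<String, Object> request) {
        String question = (String) request.get("question");
        int topK = ((Number) request.getOrDefault("topK", 3)).intValue(); // topK is optional
        String answer = ragService.chatWithDocuments(question, topK);
        return ResponseEntity.ok(Map.of("question", question, "answer", answer));
    }
}
```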
/api/rag/search (GET)
Debug endpoint to inspect retrieval results:
- Shows which chunks would be retrieved for a query
- Returns chunk text, metadata, and preview
- Helps understand and tune the retrieval process
- Useful for diagnosing irrelevant results
/api/rag/clear (POST)
Reset the document store:
- Removes all ingested documents and embeddings
- Returns the count of chunks removed
- Essential for starting fresh with new documents
Testing the RAG Implementation
1. Check RAG Status
To verify if the system is ready:
```bash
curl http://localhost:8080/api/rag/status
```
Response:
```json
{
  "total_chunks": 0,
  "ready": false
}
```
2. Ingest a PDF Document
Upload a document to the vector store:
```bash
curl -X POST http://localhost:8080/api/rag/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "filePath": "/path/to/your/document.pdf"
  }'
```
Response:
```json
{
  "message": "Document ingested successfully",
  "chunks_created": 45,
  "total_chunks": 45
}
```
3. Ask Questions
Now query your document:
```bash
curl -X POST http://localhost:8080/api/rag/chat \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the key findings in the document?",
    "topK": 3
  }'
```
The topK parameter controls how many relevant chunks to retrieve. Higher values provide more context but may include less relevant information.
4. Search for Specific Content
To see which chunks would be retrieved for a query:
```bash
curl "http://localhost:8080/api/rag/search?query=machine%20learning&topK=3"
```
This endpoint is useful for debugging and understanding what context the LLM receives.
Production Considerations
1. Use a Real Vector Database
While this implementation uses an in-memory vector store for simplicity, production deployments should consider a persistent vector database such as:
- PostgreSQL + pgvector: Great for existing PostgreSQL users
- Pinecone: Managed vector database with excellent performance
- Qdrant: Open-source vector database with rich features
- Weaviate: Semantic search with built-in ML capabilities
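For example, moving to pgvector is mostly a matter of adding Spring AI’s pgvector store starter and pointing it at your database. The artifact and property names below follow recent Spring AI releases and may differ in your version, so treat this as a starting point rather than exact configuration:

```xml
<!-- pgvector-backed VectorStore (check the artifact name for your Spring AI version) -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
```

```properties
# Standard datasource settings plus pgvector store options
spring.datasource.url=jdbc:postgresql://localhost:5432/ragdb
spring.datasource.username=rag
spring.datasource.password=secret

# text-embedding-3-small produces 1536-dimensional vectors
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
```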
2. Async Processing
For large documents, implement asynchronous ingestion:
```java
@Async
public CompletableFuture<Integer> ingestPdfDocumentAsync(String pdfPath) {
    // Process in a background thread (requires @EnableAsync on a configuration class)
}
```
3. Caching
Cache frequently asked questions and embeddings:
```java
// Caches embeddings by input text (requires @EnableCaching and a configured cache)
@Cacheable("embeddings")
public float[] embed(String text) {
    return embeddingModel.embed(text);
}
```
4. Security
Add authentication for sensitive operations:
```java
@PostMapping("/rag/ingest")
@PreAuthorize("hasRole('ADMIN')")
public ResponseEntity<Map<String, Object>> ingestDocument(...) {
    // Only admins can ingest documents
}
```
Conclusion
RAG is critical for building AI applications that need to work with your specific data. The Spring AI framework makes it straightforward to implement RAG patterns with minimal code, while providing the flexibility to scale to production use cases.
The combination of Spring Boot’s robust ecosystem and OpenAI’s embeddings and language models enables you to build sophisticated question-answering systems over your documents in just a few hundred lines of Java code.
The complete code for this RAG implementation is available in the GitHub repository, including additional examples and a Postman collection for testing.
Note: Full disclosure – I vibe-coded some of the RAG code using Claude. TBH, I am starting to question the value of traditional blogs as I go into 2026. Anything a developer needs can be “taught” by an LLM as long as you know what to ask, and you can quickly generate code to experiment with. Vibe coding adds an entirely different perspective, such that we may not even need to look at the code (yes, yes, that is controversial as of Dec 2025).