Step 1: How to Choose an Embeddings Model

If possible, we recommend using voyage-code-3, which will give the most accurate answers of any existing embeddings model for code. You can obtain an API key here. Because their API is OpenAI-compatible, you can use any OpenAI client by swapping out the URL.
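
As a rough sketch of what "swapping out the URL" means at the HTTP level, the snippet below builds an OpenAI-style embeddings request using only the standard library. The `/embeddings` path and the `model`/`input` body fields follow the OpenAI embeddings convention. The helper name `build_embedding_request` is made up for illustration, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

# Base URL taken from the LanceDB configuration shown later in this guide.
VOYAGE_BASE_URL = "https://api.voyageai.com/v1/"


def build_embedding_request(texts, api_key, model="voyage-code-3",
                            base_url=VOYAGE_BASE_URL):
    """Build (but do not send) an OpenAI-style embeddings request."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_embedding_request(
        ["def add(a, b): return a + b"],
        api_key=os.environ.get("VOYAGE_API_KEY", "YOUR_KEY"),
    )
    print(req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

Pointing the same function at a different provider should only require changing `base_url` and `model`, which is the property an OpenAI-compatible API is meant to give you.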

Step 2: How to Choose a Vector Database

There are a number of available vector databases, but because most vector databases will be able to performantly handle large codebases, we would recommend choosing one for ease of setup and experimentation. LanceDB is a good choice for this because it can run in-memory with libraries for both Python and Node.js. This means that in the beginning you can focus on writing code rather than setting up infrastructure. If you have already chosen a vector database, then using this instead of LanceDB is also a fine choice.

Step 3: How to Choose a “Chunking” Strategy

Most embeddings models can only handle a limited amount of text at once. To get around this, we “chunk” our code into smaller pieces. If you use voyage-code-3, it has a maximum context length of 32,000 tokens, which is enough to fit most files. This means that in the beginning you can get away with a more naive strategy of truncating files that exceed the limit. In order from easiest to most comprehensive, three chunking strategies you can use are:
  1. Truncate the file when it goes over the context length: in this case you will always have 1 chunk per file.
  2. Split the file into chunks of a fixed length: starting at the top of the file, add lines in your current chunk until it reaches the limit, then start a new chunk.
  3. Use a recursive, abstract syntax tree (AST)-based strategy: this is the most exact, but most complex. In most cases you can achieve high quality results by using (1) or (2), but if you’d like to try this you can find a reference example in our code chunker or in LlamaIndex.
As usual in this guide, we recommend starting with the strategy that gives 80% of the benefit with 20% of the effort.
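
Strategies (1) and (2) can be sketched in a few lines. This is a minimal illustration, not the chunker Continue ships: the function names are hypothetical, and `estimate_tokens` is a crude stand-in that assumes roughly 4 characters per token, so swap in your model's real tokenizer if you need exact limits.

```python
def estimate_tokens(text):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def truncate_file(text, max_tokens):
    """Strategy 1: keep whole lines from the top until the budget is used up."""
    kept, used = [], 0
    for line in text.splitlines(keepends=True):
        cost = estimate_tokens(line)
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return "".join(kept)


def chunk_by_lines(text, max_tokens):
    """Strategy 2: pack consecutive lines into chunks of at most max_tokens.

    A single line that is larger than max_tokens still becomes its own
    (oversized) chunk, since this sketch never splits inside a line.
    """
    chunks, current, used = [], [], 0
    for line in text.splitlines(keepends=True):
        cost = estimate_tokens(line)
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

Because `chunk_by_lines` only ever cuts at line boundaries, joining its chunks back together reproduces the original file, which makes it easy to sanity-check.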

Step 4: How to Put Together an Indexing Script

Indexing, in which we will insert your code into the vector database in a retrievable format, happens in three steps:
  1. Chunking
  2. Generating embeddings
  3. Inserting into the vector database
With LanceDB, we can do steps 2 and 3 simultaneously, as demonstrated in their docs. If you are using Voyage AI for example, it would be configured like this:
import os

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect("/tmp/db")

# Use LanceDB's OpenAI embedding function, pointed at the Voyage AI base URL
func = get_registry().get("openai").create(
    name="voyage-code-3",
    base_url="https://api.voyageai.com/v1/",
    api_key=os.environ["VOYAGE_API_KEY"],
)

class CodeChunks(LanceModel):
    filename: str
    text: str = func.SourceField()
    # 1024 is the default dimension for `voyage-code-3`: https://docs.voyageai.com/docs/embeddings#model-choices
    vector: Vector(1024) = func.VectorField()

table = db.create_table("code_chunks", schema=CodeChunks, mode="overwrite")

# Embeddings for the `text` field are generated automatically on insert
table.add([
    {"text": "print('hello world!')", "filename": "hello.py"},
    {"text": "print('goodbye world!')", "filename": "goodbye.py"},
])

# The query string is embedded with the same function before searching
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(CodeChunks)[0]
print(actual.text)
If you are indexing more than one repository, it is best to store these in separate “tables” (terminology used by LanceDB) or “collections” (terminology used by some other vector DBs). The alternative of adding a “repository” field and then filtering by this is less performant.
Regardless of which database or model you have chosen, your script should iterate over all of the files that you wish to index, chunk them, generate embeddings for each chunk, and then insert all of the chunks into your vector database.
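
The loop described above might look roughly like the following. It is a database-agnostic sketch: `chunk`, `embed`, and `insert` are hypothetical callables that you would supply (for example, a chunker of your choice, a call to your embeddings API, and your vector database's bulk-insert method), and the default file extensions are only an example.

```python
from pathlib import Path


def iter_source_files(root, extensions=(".py", ".js", ".ts")):
    """Yield files under root whose suffix is in extensions, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            yield path


def index_repository(root, chunk, embed, insert):
    """Chunk every file, embed each chunk, then insert all rows at once.

    chunk(text)  -> list[str]
    embed(texts) -> list of vectors, one per text
    insert(rows) -> writes the rows to the vector database
    Returns the number of rows inserted.
    """
    rows = []
    for path in iter_source_files(root):
        text = path.read_text(encoding="utf-8", errors="ignore")
        chunks = chunk(text)
        if not chunks:
            continue
        vectors = embed(chunks)
        for piece, vector in zip(chunks, vectors):
            rows.append({
                "filename": str(path.relative_to(root)),
                "text": piece,
                "vector": vector,
            })
    if rows:
        insert(rows)
    return len(rows)
```

If your database generates embeddings on insert, as in the LanceDB configuration above, you can drop the `embed` step and the `vector` field and insert only `filename` and `text`.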

Step 5: How to Run Your Indexing Script

In a perfect production version, you would want to build “automatic, incremental indexing”, so that whenever a file changes, that file and nothing else is automatically re-indexed. This has the benefits of perfectly up-to-date embeddings and lower cost. That said, we highly recommend first building and testing the pipeline before attempting this. Unless your codebase is being entirely rewritten frequently, an incremental refresh of the index is likely to be sufficient and reasonably cheap.
At this point, you’ve written your indexing script and tested that you can make queries from your vector database. Now, you’ll want a plan for when to run the indexing script. In the beginning, you should probably run it by hand. Once you are confident that your custom RAG is providing value and is ready for the long term, you can set up a cron job to run it periodically. Because codebases are largely unchanged in short time frames, you won’t want to re-index more than once a day. Once per week or month is probably even sufficient.
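
If you later move toward incremental re-indexing, one simple approach is to keep a manifest of content hashes from the previous run and re-index only the files whose hash changed. The sketch below shows that idea. The function names and the JSON manifest format are assumptions made for illustration, not something Continue or LanceDB prescribes.

```python
import hashlib
import json
from pathlib import Path


def hash_file(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def diff_against_manifest(paths, manifest):
    """Compare current files with the hashes saved by the previous run.

    Returns (changed, removed, new_manifest):
      changed: files that are new or whose content hash differs
      removed: files in the old manifest that are no longer present
    """
    new_manifest = {str(p): hash_file(p) for p in paths}
    changed = sorted(p for p, h in new_manifest.items() if manifest.get(p) != h)
    removed = sorted(p for p in manifest if p not in new_manifest)
    return changed, removed, new_manifest


def load_manifest(manifest_path):
    """Load the saved manifest, or an empty one on the first run."""
    path = Path(manifest_path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def save_manifest(manifest_path, manifest):
    """Persist the manifest so the next run can diff against it."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

On each scheduled run you would delete the rows for every `changed` and `removed` file, re-index only the `changed` files, and then save the new manifest.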

Step 6: How to Set Up an MCP Server

To integrate your custom RAG system with Continue, you’ll create an MCP (Model Context Protocol) server. MCP provides a standardized way for AI tools to access external resources.

Create your MCP server

Here’s a reference implementation using Python that queries your vector database:
"""Custom RAG MCP server for code retrieval"""
import lancedb
from mcp.server.fastmcp import FastMCP

# Initialize your vector database connection
db = lancedb.connect("/path/to/your/db")
table = db.open_table("code_chunks")

# FastMCP provides the tool() decorator and the stdio runner
mcp = FastMCP("custom-rag-server")

@mcp.tool()
def search_codebase(query: str, limit: int = 10) -> str:
    """
    Search the codebase using vector similarity.

    Args:
        query: The search query
        limit: Maximum number of results to return
    """
    # Query your vector database
    results = table.search(query).limit(limit).to_list()

    # Format results for Continue
    formatted_results = []
    for result in results:
        formatted_results.append(f"File: {result['filename']}\n\n{result['text']}")

    return "\n\n---\n\n".join(formatted_results)

@mcp.tool()
def get_file_context(filename: str) -> str:
    """
    Get all chunks from a specific file.

    Args:
        filename: The name of the file to retrieve
    """
    # Escape single quotes so the filename cannot break out of the SQL filter
    safe_filename = filename.replace("'", "''")
    # search() with no query performs a filter-only lookup
    results = table.search().where(f"filename = '{safe_filename}'").to_list()

    return "\n".join(r["text"] for r in results)

if __name__ == "__main__":
    # Runs over stdio by default, matching the command-based config below
    mcp.run()

Configure Continue to use your MCP server

Add your MCP server to Continue’s configuration, using whichever config format you have.
config.yaml:
mcpServers:
  - name: custom-rag
    command: python
    args:
      - /path/to/your/mcp_server.py
    env:
      VOYAGE_API_KEY: ${VOYAGE_API_KEY}
config.json:
{
  "mcpServers": [
    {
      "name": "custom-rag",
      "command": "python",
      "args": ["/path/to/your/mcp_server.py"],
      "env": {
        "VOYAGE_API_KEY": "${VOYAGE_API_KEY}"
      }
    }
  ]
}

Step 7 (Bonus): How to Set Up Reranking

If you’d like to improve the quality of your results, a great first step is to add reranking. This involves retrieving a larger initial pool of results from the vector database, and then using a reranking model to order them from most to least relevant. This works because the reranking model can perform a slightly more expensive calculation on the small set of top results, and so can give a more accurate ordering than similarity search, which has to search over all entries in the database. If you wish to return 10 total results for each query for example, then you would:
  1. Retrieve ~50 results from the vector database using similarity search
  2. Send all of these 50 results to the reranker API along with the query in order to get relevancy scores for each
  3. Sort the results by relevancy score and return the top 10
We recommend using the rerank-2 model from Voyage AI, which has examples of usage here.
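
The three steps above can be sketched as a small function. This is only an outline of the control flow: `search` and `score` are hypothetical callables standing in for your vector database query and your reranker API call, and `keyword_overlap` is a toy scorer included so the example runs offline. It is not how rerank-2 computes relevance.

```python
def retrieve_and_rerank(query, search, score, pool_size=50, top_k=10):
    """Fetch pool_size candidates, score each against the query, keep top_k.

    search(query, limit) -> list of dicts with a "text" key
    score(query, text)   -> relevance score, higher means more relevant
    """
    candidates = search(query, pool_size)
    scored = [(score(query, c["text"]), c) for c in candidates]
    # Sort by score only, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def keyword_overlap(query, text):
    """Toy stand-in for a reranker: fraction of query words found in text."""
    words = set(query.lower().split())
    if not words:
        return 0.0
    return len(words & set(text.lower().split())) / len(words)


if __name__ == "__main__":
    corpus = [
        {"filename": "a.py", "text": "parse json file"},
        {"filename": "b.py", "text": "open socket"},
        {"filename": "c.py", "text": "parse config json quickly"},
    ]
    fake_search = lambda query, limit: corpus[:limit]
    for hit in retrieve_and_rerank("parse json config", fake_search,
                                   keyword_overlap, top_k=2):
        print(hit["filename"])
```

In a real setup, `search` would wrap something like the `table.search(query).limit(pool_size)` call from Step 4, and `score` would be replaced by a single batched request to the reranker rather than one call per candidate.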