Semantic search with Drupal and Typesense

By Christophe Jossart (@colorfield)
Ever wondered how to enhance your Drupal search with AI? Following insights from a Search API Typesense maintainer and building on our previous exploration of Drupal Typesense full text + faceted search, this post will compare three approaches — lexical, semantic, and hybrid search — before showing you how to configure Search API Typesense for RAG (Retrieval-Augmented Generation).
Typesense is a fast alternative to Solr and a cost-effective alternative to Algolia, with both self-hosted and cloud options. Being open source, it is also an alternative to both Algolia and Pinecone, because it combines those two tools in one: it functions as a search engine and as a vector database.
The API integrates with many languages and frameworks like LangChain, Symfony, Laravel, and Drupal.
The project has a public roadmap and aims to provide a new release every 3 months.
Unlike Solr, it doesn't come with a dashboard/GUI out of the box, but a contributed project provides this functionality.
The main difference with other Search API implementations is that it fully skips Drupal for querying (similar to using the Solarium library for Solr). This means Search API is only used for backend indexing, not frontend querying. While this prevents using Views out of the box, the benefits are improved performance and the ability to easily implement soft-decoupled or fully decoupled UIs.
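Because the frontend queries Typesense directly, it should only ever hold a search-only key, ideally a scoped one that locks in certain parameters. A minimal sketch of scoped key generation, following the scheme described in Typesense's API key documentation (the parent key value here is just a placeholder):

```python
import base64
import hashlib
import hmac
import json

def scoped_search_key(search_only_key: str, params: dict) -> str:
    """Derive a scoped, search-only key that embeds fixed query parameters.

    Follows the scheme from the Typesense API key docs:
    base64(hmac_sha256_digest_b64 + key_prefix + params_json).
    """
    params_json = json.dumps(params)
    digest = base64.b64encode(
        hmac.new(search_only_key.encode(), params_json.encode(), hashlib.sha256).digest()
    ).decode()
    return base64.b64encode((digest + search_only_key[0:4] + params_json).encode()).decode()

# Example: a key restricted to published content, safe to ship to a decoupled UI.
key = scoped_search_key("RN23GFr1s6jQ9kgSNg2O7fYcAUXU7127", {"filter_by": "status:published"})
```

The scoped key can be handed to the browser; the embedded `filter_by` cannot be overridden by the client.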
Typesense Use Cases
We already explored facets and full-text search (keyword/lexical) in the previous post. Now let's dive into semantic search and hybrid search.
We'll illustrate these concepts with examples from a gardening site.
Keyword / Lexical Search
- Searches for exact word matches and variations
- Great for finding documents with specific terms, proper nouns, or technical terminology
Example: Searching for "composting" will find documents that literally contain that word.
Semantic / Vector Search
- Uses embeddings to understand meaning and context
- Can find conceptually related content even without exact word matches
- Captures synonyms, related concepts, and contextual similarities
Example: Searching for "composting" might also find content about "organic waste recycling" or "decomposition."
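To make "conceptually related" concrete: vector search ranks documents by the distance between their embeddings, typically cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embedding models output hundreds of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-crafted toy "embeddings": related concepts point in similar directions.
vectors = {
    "composting": [0.9, 0.8, 0.1],
    "organic waste recycling": [0.85, 0.75, 0.2],
    "firefox": [0.05, 0.1, 0.9],
}

query = vectors["composting"]
ranked = sorted(vectors, key=lambda k: cosine(query, vectors[k]), reverse=True)
# "organic waste recycling" ranks close to "composting" despite sharing no words.
```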
Hybrid Search
Both searches run on the same query:
- Keyword search returns results ranked by lexical relevance
- Vector search returns results ranked by semantic similarity
In brief, for a query like "reduce garden waste", results will be:
- Keyword search: Articles literally containing "reduce," "garden," "waste"
- Semantic search: Articles about "composting," "recycling organic matter," "sustainable practices," "minimizing yard debris"
- Hybrid search: Both exact matches AND conceptually related content, ranked by combined relevance
Benefits
- Keyword search misses synonyms and related concepts → Semantic search fills this gap
- Semantic search can be "too creative" and miss exact matches → Keyword search ensures precision
- Vector search struggles with proper nouns, acronyms, or very specific terms → Keyword search handles these well
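Typesense describes combining the two result lists with rank fusion: each document's hybrid score is a weighted sum of the reciprocals of its ranks in each list, with the vector-side weight tunable via the `alpha` option of `vector_query`. A small sketch of that idea (the hypothetical document ids are illustrative):

```python
def rank_fusion(keyword_ranked, vector_ranked, alpha=0.3):
    """Fuse two ranked lists of document ids by weighted reciprocal rank.

    score = (1 - alpha) * 1/keyword_rank + alpha * 1/vector_rank,
    along the lines of the rank fusion Typesense documents for hybrid search.
    """
    scores = {}
    for rank, doc in enumerate(keyword_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) * (1.0 / rank)
    for rank, doc in enumerate(vector_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha * (1.0 / rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc found by both searches outranks docs found by only one.
fused = rank_fusion(
    keyword_ranked=["doc_mulch", "doc_debris"],
    vector_ranked=["doc_compost", "doc_mulch"],
)
```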
Keyword, semantic, hybrid search demo
A query for, say, "firefox" is sent with the following parameters:
Keyword
{
  "searches": [
    {
      "collection": "hn-comments",
      "exclude_fields": "embedding",
      "facet_by": "by",
      "highlight_full_fields": "text",
      "max_facet_values": 20,
      "page": 1,
      "per_page": 15,
      "q": "firefox",
      "query_by": "text"
    }
  ]
}
Semantic
{
  "searches": [
    {
      "collection": "hn-comments",
      "exclude_fields": "embedding",
      "facet_by": "by",
      "highlight_full_fields": "embedding",
      "max_facet_values": 20,
      "page": 1,
      "per_page": 15,
      "q": "firefox",
      "query_by": "embedding",
      "vector_query": "embedding:([], k:200)"
    }
  ]
}
Hybrid
{
  "searches": [
    {
      "collection": "hn-comments",
      "exclude_fields": "embedding",
      "facet_by": "by",
      "highlight_full_fields": "text,embedding",
      "max_facet_values": 20,
      "page": 1,
      "per_page": 15,
      "q": "firefox",
      "query_by": "text,embedding",
      "vector_query": "embedding:([], k:200)"
    }
  ]
}
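These payloads target Typesense's multi_search endpoint. A minimal sketch of how a decoupled client could build the hybrid request over plain HTTP (host and API key are placeholders; use a search-only or scoped key client-side):

```python
import json
import urllib.request

def build_multi_search_request(host: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build a POST request for Typesense's multi_search endpoint."""
    return urllib.request.Request(
        url=f"{host}/multi_search",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "X-TYPESENSE-API-KEY": api_key,
        },
        method="POST",
    )

hybrid_body = {
    "searches": [
        {
            "collection": "hn-comments",
            "q": "firefox",
            "query_by": "text,embedding",
            "vector_query": "embedding:([], k:200)",
            "exclude_fields": "embedding",
        }
    ]
}
request = build_multi_search_request("http://localhost:8108", "xyz-search-only-key", hybrid_body)
# To send it against a running Typesense instance: urllib.request.urlopen(request)
```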
How Does It Work with Drupal?
1. Data Preparation Phase
- Drupal CMS stores content (e.g., plants) in its database
- Plant collection data is extracted and processed
- Content is sent to an embedding LLM to create vector representations
- Vector embeddings are stored in the Typesense vector database
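On the Typesense side, the index behind this pipeline is a collection whose schema includes a float-array field for the vectors. A hypothetical sketch of that shape (field names are illustrative, not the exact schema Search API Typesense generates; all-MiniLM-L12-v2 produces 384-dimensional embeddings):

```python
# Hypothetical collection schema for indexed plant content.
plants_schema = {
    "name": "plants",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "text", "type": "string"},
        # Vector field: dimension must match the embedding model's output.
        {"name": "embedding", "type": "float[]", "num_dim": 384},
    ],
}

# A document ready for indexing: the embedding comes from the LLM step above.
plant_doc = {
    "id": "plant-42",
    "title": "Comfrey",
    "text": "A deep-rooted perennial often used as a compost activator.",
    "embedding": [0.01] * 384,  # placeholder vector
}
```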
2. RAG Query Phase
- End user submits a search query
- Query is converted to embeddings using the same embedding LLM
- Vector search is performed against Typesense to find relevant plants
- Retrieved plant content provides context to a RAG LLM
- LLM generates a comprehensive response combining the query and retrieved context
- Final response is returned to the user
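The query phase above can be sketched end-to-end with stubs, where `embed_query`, `vector_search`, and `generate` stand in for the embedding model, Typesense, and the conversational LLM:

```python
def embed_query(query: str) -> list[float]:
    """Stub: a real implementation calls the same embedding model used at index time."""
    return [0.1, 0.2, 0.3]

def vector_search(embedding: list[float], k: int = 3) -> list[dict]:
    """Stub: a real implementation sends a vector_query to Typesense."""
    return [{"title": "Comfrey", "text": "Often used as a compost activator."}]

def generate(prompt: str) -> str:
    """Stub: a real implementation calls the conversational LLM with the prompt."""
    return "Based on the provided context: comfrey can act as a compost activator."

def rag_answer(query: str) -> str:
    # 1. Embed the query, 2. retrieve relevant documents,
    # 3. assemble them as context, 4. let the LLM answer from that context.
    hits = vector_search(embed_query(query))
    context = "\n".join(f"{h['title']}: {h['text']}" for h in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

answer = rag_answer("Which plants help my compost?")
```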
Drupal integration
Search API Typesense integrates with the Search API and AI ecosystems. It is already fully functional and supports stop words, scoped keys, synonyms, and curation. At the time of writing, the following features are on the roadmap:
- Parsing PDF files for embedding: https://www.drupal.org/project/search_api_typesense/issues/3459544
- Integrate with AI search: https://www.drupal.org/project/search_api_typesense/issues/3543841
- Streaming text responses
Configuration
Install and enable the Search API Typesense and the AI modules.
AI
Go to the Provider settings (/admin/config/ai/providers) and add a new API key.
Typesense
- Make sure that you have Typesense installed and create a new Typesense Search API Server (see previous blog)
- Configure the server: go to the Conversation models tab
- Id: typesense
- Model: pick your favourite one from the list
- System prompt: copy the prompt below

You are an assistant for question-answering. You can only make conversations based on the provided context. If a response cannot be formed strictly using the provided context, politely say you do not have knowledge about that topic.

- Max bytes: depends on your model (e.g. 16385 for gpt-3.5-turbo)
- Create a new content Search API index in /admin/config/search/search-api that uses this server (for more details, see also the previous blog), then in the Schema tab, under AI features:
- Check Enable embedding
- Check the fields to be used for embedding (e.g. text and title)
- Select the LLM embedding model; it can be a Typesense or an external provider one, let's use ts: all-MiniLM-L12-v2
- Add fields to prepend to all chunks (e.g. a taxonomy term)
- Set the chunk size to 1000 and the chunk overlap size to 10
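The chunk size and overlap control how long texts are split before embedding: each chunk starts slightly before the previous one ended, so sentences at chunk boundaries keep some context. A simple character-based sketch of the idea (the module's actual splitting strategy may differ):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    each overlapping the previous chunk by `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2500-character text yields chunks of 1000, 1000, and 520 characters.
chunks = chunk_text("a" * 2500)
```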
Re-index
You should now be able to use the Converse tab of your index as a starting point, then build on top of it with InstantSearch widgets.
Typesense documentation
Photo from A Chosen Soul on Unsplash