Beakr

Hybrid search

Beakr combines vector similarity search with full-text keyword search into a single retrieval system. Every query runs both pipelines in parallel, blends the results, and returns chunks ranked by a weighted score — giving you semantic understanding and keyword precision at the same time.

Why hybrid search

Neither vector search nor keyword search is sufficient on its own. Each has a blind spot that the other covers:

| Approach | Strengths | Weaknesses |
|---|---|---|
| Vector search only | Understands meaning, synonyms, paraphrases | Misses exact terms — searching "BRCA2" may return results about "gene mutations" without mentioning BRCA2 |
| Keyword search only | Precise term matching, fast, predictable | Misses meaning — searching "heart attack" will not find documents that only say "myocardial infarction" |
| Hybrid (Beakr) | Semantic understanding with keyword precision | Slightly more compute per query — but the quality difference is substantial |

Beakr blends both so that a search for "quarterly revenue projections" finds documents that use that exact phrase and documents that discuss "Q3 financial forecasts" without ever using the word "revenue."

Vector search

The vector pipeline converts text into high-dimensional numerical representations (embeddings) that capture semantic meaning. Similar concepts land near each other in vector space, so searching for a query means finding the nearest embedding vectors.
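To make "finding the nearest embedding vectors" concrete, here is a minimal sketch of cosine similarity over toy 3-dimensional vectors. Beakr's real embeddings are 768-dimensional and produced by an embedding model; the vectors and labels below are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative, not model output):
query = [0.9, 0.1, 0.0]   # "heart attack"
doc_a = [0.8, 0.2, 0.1]   # "myocardial infarction" — semantically close
doc_b = [0.1, 0.2, 0.9]   # unrelated topic

# The semantically related document scores higher despite no shared keywords.
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

Because similar concepts land near each other in the embedding space, the paraphrased document wins on similarity even though it shares no literal terms with the query.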

Embedding model

Beakr uses Google Gemini embedding-2-preview to generate 768-dimensional embedding vectors. Every chunk of ingested content — text, image captions, video transcripts — is embedded and stored for semantic retrieval.

The database retains multiple embedding generations for historical compatibility. Only the current generation (Gemini, 768-dim) is used for new queries.

Distance metric and indexing

Similarity is measured using cosine distance via a vector database extension. To avoid scanning every vector on every query, the database uses optimized indexes for approximate nearest neighbor search, providing sub-linear query times even as the number of vectors grows into the millions.

| Component | Value |
|---|---|
| Vector store | Vector database extension |
| Distance metric | Cosine distance |
| Index type | Optimized approximate nearest neighbor index |
| Dimensionality | 768 |

Full-text search

The lexical pipeline uses the database's built-in full-text search engine. Each chunk's text is converted into a normalized representation that strips stop words, applies stemming (via the English dictionary), and stores word positions.

At query time, the search input is parsed and matched against the stored representations. Results are ranked using keyword relevance scoring based on term frequency and positional proximity — similar in spirit to BM25-style ranking. Documents that contain the search terms more frequently and in closer proximity receive higher scores.
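As a rough illustration of frequency-based lexical scoring — not the database's actual ranking function, which also weighs word positions and applies real stemming — consider this toy scorer. The stop-word list and normalization are simplified stand-ins:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to", "for"}  # illustrative subset

def normalize(text):
    """Lowercase, tokenize, and drop stop words — a crude stand-in for the
    database's dictionary-based normalization (which also stems words)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def lexical_score(query, document):
    """Term-frequency score: how often query terms appear in the document,
    normalized by document length. Real FTS ranking also weighs proximity."""
    terms = normalize(query)
    doc_tokens = normalize(document)
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in terms) / len(doc_tokens)

print(lexical_score("revenue projections", "Quarterly revenue projections: revenue grew."))
print(lexical_score("revenue projections", "Q3 financial forecasts."))  # 0.0 — no term overlap
```

The second document scores zero here despite being topically relevant — exactly the blind spot the vector pipeline covers.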


Score blending

After both pipelines return their results, Beakr blends the scores using a weighted linear combination:

score = alpha * vector_similarity + (1 - alpha) * lexical_rank

The default value of alpha = 0.7 means semantic similarity contributes 70% of the final score and keyword relevance contributes 30%.

| Parameter | Default | Effect |
|---|---|---|
| alpha | 0.7 | Weight given to vector similarity. Higher values favor semantic understanding; lower values favor exact term matching. |
| vector_similarity | | Cosine similarity between the query embedding and chunk embedding. Range: 0 to 1. |
| lexical_rank | | Normalized keyword relevance score from full-text search. Range: 0 to 1. |
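The blending formula translates directly into code. This is an illustrative sketch — `hybrid_score` is a hypothetical name, not Beakr's actual API:

```python
def hybrid_score(vector_similarity, lexical_rank, alpha=0.7):
    """Weighted linear blend of the two pipeline scores (both in [0, 1])."""
    return alpha * vector_similarity + (1 - alpha) * lexical_rank

# A chunk with a strong semantic match but no keyword overlap:
print(hybrid_score(0.90, 0.00))  # ≈ 0.63
# A chunk with exact keyword hits but weaker semantic similarity:
print(hybrid_score(0.60, 0.95))  # ≈ 0.705
```

Note how the keyword-heavy chunk edges ahead in the second case: the 30% lexical contribution is enough to reward exact matches without letting them dominate.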

Why semantic gets more weight

In practice, most knowledge-base queries are natural-language questions ("What is our refund policy?") rather than keyword lookups ("refund-policy-v3"). Weighting semantic similarity at 70% means the system performs well for conversational queries while still boosting results that contain exact terms. The 30% keyword contribution ensures that technical identifiers, product names, and acronyms are not lost in the semantic space.

Multi-modal search

Beakr does not limit search to text. Every chunk carries a modality field that indicates its content type:

| Modality | What is embedded | Use case |
|---|---|---|
| text | Raw text content | Documents, knowledge base pages, messages |
| image | Image descriptions and captions | Diagrams, screenshots, figures |
| video | Transcripts and frame descriptions | Meeting recordings, tutorials |

All modalities share the same embedding space and the same ranking pipeline, while the underlying representations can be text-derived, modality-specific, or joint embeddings depending on the source. A text query can therefore surface relevant images or video segments alongside document chunks, because those assets carry captions, transcripts, frame descriptions, and modality metadata.
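A sketch of what a single cross-modal result list might look like — the `Chunk` type, field names, and scores here are hypothetical, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    content: str
    modality: str   # "text" | "image" | "video"
    score: float    # blended hybrid score

# One query, one ranked list — images and video compete directly with text:
results = [
    Chunk("Q3 revenue table", "text", 0.91),
    Chunk("Caption: bar chart of quarterly revenue", "image", 0.84),
    Chunk("Transcript: '...revenue projections for Q3...'", "video", 0.79),
]

for c in sorted(results, key=lambda c: c.score, reverse=True):
    print(c.modality, c.score)
```

Because every chunk carries a modality field, a caller can also filter to a single content type without changing the ranking logic.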

Tenant isolation in search

Every search query is automatically scoped to the authenticated tenant. This is not an application-level filter — it is enforced by PostgreSQL Row Level Security at the database layer.

The retriever joins chunks to their parent resources and relies on database-level security policies to ensure that only resources belonging to the authenticated organization are visible. Even if a query is malformed or a code path has a bug, the database will never return chunks from another tenant's data.

Tenant isolation is enforced at the database level, not in application code. It is transaction-scoped and cannot leak between requests. See Multi-tenancy & isolation for details.

How search feeds the agent

Search is one layer in Beakr's retrieval system, not the only one. When an agent processes a question, it has access to multiple retrieval strategies.

The agent decides which tools to use based on the question. A broad question ("What do we know about customer churn?") may start with hybrid search to find relevant pages, then follow links for deeper context. A precise question ("What is the BRCA2 variant classification in our latest report?") benefits from keyword-heavy retrieval.

This is the difference between search as a retrieval call and agentic search as a workflow. Beakr can search, read pages, traverse the graph, inspect provenance, compare dates, and then decide whether another retrieval step is needed before answering.

Performance at scale

The system is designed to remain fast as the knowledge base grows:

| Mechanism | What it does | Why it matters |
|---|---|---|
| Optimized vector indexes | Approximate nearest neighbor search with sub-linear complexity | Query time grows logarithmically, not linearly, with data volume |
| Full-text indexes | Inverted index for fast keyword matching | Keyword lookups remain fast regardless of corpus size |
| Tenant-scoped queries | Search only scans the authenticated tenant's data | Multi-tenant databases do not degrade per-tenant performance as total data grows |
| Parallel execution | Vector and text pipelines run concurrently | Latency is the max of the two pipelines, not the sum |
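The "latency is the max, not the sum" property can be demonstrated with two simulated pipelines run concurrently. The sleep durations are arbitrary stand-ins for real query latencies, and the function names are illustrative:

```python
import asyncio
import time

async def vector_pipeline():
    await asyncio.sleep(0.10)   # simulated ANN query time
    return {"chunk-1": 0.92}

async def lexical_pipeline():
    await asyncio.sleep(0.06)   # simulated full-text query time
    return {"chunk-1": 0.40}

async def hybrid_search():
    # asyncio.gather runs both pipelines concurrently, so total latency
    # tracks the slower pipeline rather than the sum of both.
    start = time.monotonic()
    vec, lex = await asyncio.gather(vector_pipeline(), lexical_pipeline())
    elapsed = time.monotonic() - start
    return vec, lex, elapsed

vec, lex, elapsed = asyncio.run(hybrid_search())
print(f"latency ~ {elapsed:.2f}s")  # close to 0.10s (the max), not 0.16s (the sum)
```

Running the pipelines sequentially would cost roughly 0.16s here; concurrent execution brings it down to roughly 0.10s, the slower pipeline's time.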

Beyond simple RAG

Most retrieval-augmented generation systems do one thing: embed a query, find similar chunks, and pass them to a language model. Beakr's retrieval system goes further.

The result is a retrieval system that behaves less like a search engine and more like a research assistant with structured access to your organization's knowledge.