Ingestion

Five pathways feed the knowledge base. Each one ends at the same checkpoint -- an ingestion log entry recording what changed, what it cost, and which agent wrote it. Multi-modal data normalizes into a common source-record model before compilation. The compiler then produces domain-specific pages for experiments, protocols, compounds, datasets, initiatives, people, organizations, decisions, and more.
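As an illustration only, the ingestion log entry described above could be modeled like this. The class name, field names, and the use of US dollars as the cost unit are all assumptions; the source only says that an entry records what changed, what it cost, and which agent wrote it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestionLogEntry:
    """Hypothetical shape of one ingestion log entry (not the real schema)."""

    pathway: str                    # which of the five pathways produced the write
    agent: str                      # which agent wrote it
    changed_pages: tuple[str, ...]  # what changed (page identifiers are assumed)
    cost_usd: float                 # what it cost (the unit is an assumption)
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def total_cost(entries):
    """Sum the recorded cost across a batch of log entries."""
    return sum(entry.cost_usd for entry in entries)
```

Because every pathway ends at the same checkpoint, a single record type like this is enough to answer attribution and cost questions across all of them.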

Five entry points, one knowledge base. Every ingest is attributed, cost-tracked, deduplicated, and compiled into the right page type. The diagram highlights domain-specific outputs; the full supported set is listed below.

What ingestion can create

Ingestion no longer ends at generic document pages. The compiler chooses a page type from the current knowledge model and captures the fields that matter for that type.

All page types

Topics, people, organizations, decisions, meetings, overviews, research notes, experiments, protocols, compounds, datasets, and initiatives.

Original knowledge pages

Topics, people, organizations, decisions, meetings, overviews, and research notes remain first-class outputs.

Domain pages

Experiments, protocols, compounds, datasets, and initiatives add life-sciences-specific structure on top of the original model.
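One way to picture the split between original and domain page types is as an enumeration with two named subsets. This is a sketch only: the identifiers below are hypothetical, and only the twelve type names come from the lists above.

```python
from enum import Enum


class PageType(Enum):
    # Original knowledge pages
    TOPIC = "topic"
    PERSON = "person"
    ORGANIZATION = "organization"
    DECISION = "decision"
    MEETING = "meeting"
    OVERVIEW = "overview"
    RESEARCH_NOTE = "research_note"
    # Domain pages
    EXPERIMENT = "experiment"
    PROTOCOL = "protocol"
    COMPOUND = "compound"
    DATASET = "dataset"
    INITIATIVE = "initiative"


DOMAIN_TYPES = frozenset({
    PageType.EXPERIMENT,
    PageType.PROTOCOL,
    PageType.COMPOUND,
    PageType.DATASET,
    PageType.INITIATIVE,
})

# Everything that is not a domain page belongs to the original model.
ORIGINAL_TYPES = frozenset(PageType) - DOMAIN_TYPES


def is_domain_page(page_type: PageType) -> bool:
    """True for the life-sciences-specific page types."""
    return page_type in DOMAIN_TYPES
```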

Topic / org pages

Capture concepts, programs, companies, labs, partners, competitors, regulators, aliases, and how they connect to the rest of the graph.

Decision / meeting pages

Capture what was discussed, what was decided, who was involved, open questions, and follow-up actions.

Experiment pages

Capture objective or hypothesis, dates, setup, materials or compounds, conditions, results, interpretation, and follow-ups.

Compound pages

Capture aliases, modality, target or mechanism, formulation or dose, indication, status, and supporting evidence.

Dataset pages

Capture source, scope, schema or measures, generation method, date or version, key findings, limitations, and location.

Resource layer

The engine also tracks underlying resources such as artifacts, protocols, experiment runs, samples, and instruments for ACLs, provenance, and lineage.

The five pathways

Pathway	Mechanism	Write mode
Connectors	Background enumeration + file and multimodal ingest via OAuth. Provider-specific. 3-tier fan-out.	async
Uploads	Direct file upload -> parse -> S3 -> compile.	async
Ask Agent Capture	After substantive agent responses, the capture agent evaluates the synthesis for knowledge-base-worthiness.	proposal
Manual Creation	Users or agents create pages via the knowledge base proposal tool.	proposal
Health Maintenance	The background health agent creates index pages to organize orphans and reparent drifting pages.	direct
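The table above can be read as a lookup from pathway to write mode. The sketch below encodes it that way; the pathway keys and the helper name are hypothetical, and only the three write modes and their assignments come from the table.

```python
from enum import Enum


class WriteMode(Enum):
    ASYNC = "async"
    PROPOSAL = "proposal"
    DIRECT = "direct"


# Pathway keys are illustrative identifiers, not names used by the product.
WRITE_MODE_BY_PATHWAY = {
    "connectors": WriteMode.ASYNC,
    "uploads": WriteMode.ASYNC,
    "ask_agent_capture": WriteMode.PROPOSAL,
    "manual_creation": WriteMode.PROPOSAL,
    "health_maintenance": WriteMode.DIRECT,
}


def pathways_with_mode(mode: WriteMode) -> list[str]:
    """List the pathways that use a given write mode, in table order."""
    return [name for name, m in WRITE_MODE_BY_PATHWAY.items() if m is mode]
```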

Connector flow: the 3-tier fan-out

One trigger fans out to N items. Each item is downloaded, stored in S3, and then passed to the compiler for structuring.
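The per-item path can be sketched as a plain loop over the enumerated items. This is a simplified, sequential illustration: the function and parameter names are hypothetical, the download, store, and compile steps are injected as callables rather than real connector, S3, or compiler calls, and the real pipeline presumably runs these steps as background jobs.

```python
def fan_out(item_ids, download, store, compile_item):
    """Run one trigger's items through download -> store -> compile.

    download(item_id) returns the raw bytes for an item.
    store(item_id, payload) persists the bytes and returns a storage key.
    compile_item(key) structures the stored item and returns the result.
    """
    results = []
    for item_id in item_ids:
        payload = download(item_id)    # fetch from the provider
        key = store(item_id, payload)  # persist before compiling
        results.append(compile_item(key))
    return results
```

Storing each item before compiling it means a failed compile can be retried from the stored copy without downloading from the provider again.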

Deduplication via source records

Every item ingested from a connector is tracked as a source record. The combination of provider (the source system) and external ID (the source identifier) uniquely identifies each source item. That key prevents re-ingestion of unchanged content and enables delta sync once checksums are populated.

Source system

The service the content came from (Slack, Drive, Notion, etc.).

Source identifier

The service's native identifier for the document.

Content hash

Detects whether content has changed since last sync.

Metadata snapshot

Last modified time, owner, size, file type -- used for staleness detection.

Source link

Join record linking a source record to the knowledge base page(s) it produced. One source can produce multiple pages; one page can have multiple sources.
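A minimal sketch of how these fields support deduplication, assuming an in-memory registry and SHA-256 as the content hash. Neither choice is confirmed by the source, and the class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SourceRecord:
    provider: str                       # source system, e.g. "slack"
    external_id: str                    # the service's native identifier
    content_hash: Optional[str] = None  # None until a checksum is populated
    metadata: dict = field(default_factory=dict)  # modified time, owner, size


class SourceRegistry:
    """Tracks source records keyed by (provider, external_id)."""

    def __init__(self):
        self._records = {}

    def record_if_changed(self, provider, external_id, content: bytes) -> bool:
        """Return True if the item is new or changed and should be ingested."""
        digest = hashlib.sha256(content).hexdigest()  # hash choice is assumed
        key = (provider, external_id)
        existing = self._records.get(key)
        if existing is not None and existing.content_hash == digest:
            return False  # unchanged content: skip re-ingestion
        self._records[key] = SourceRecord(provider, external_id, digest)
        return True
```

Keying on the pair rather than on the external ID alone keeps two services from colliding when they happen to issue the same identifier.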

Delta sync and change awareness

The infrastructure for delta sync exists but is not yet fully wired. Source records store fingerprints, modified timestamps, and source-to-page joins so that Beakr can identify what changed, when it changed, and which downstream pages or graph edges may need re-evaluation. See Connector sync for the full sync lifecycle.
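To illustrate how fingerprints and source-to-page joins could combine to find affected pages, here is a small sketch. It describes the idea rather than Beakr's implementation: the function names and the plain dict and list inputs are assumptions.

```python
def changed_sources(stored, observed):
    """Return source keys whose fingerprint is new or differs from the stored one.

    Both arguments map (provider, external_id) -> fingerprint string.
    """
    return {key for key, fp in observed.items() if stored.get(key) != fp}


def pages_to_reevaluate(stored, observed, links):
    """Return the pages linked to any changed source, sorted for stable output.

    links is an iterable of (source_key, page_id) pairs, i.e. the join records.
    """
    changed = changed_sources(stored, observed)
    return sorted({page for source_key, page in links if source_key in changed})
```

Because one page can have multiple sources, a page is flagged as soon as any one of its sources changes, even if the others are untouched.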