Knowledge Sources
A KnowledgeSourceConfig connects one of a tenant's data sources (Notion, Gmail, uploads, databases) to a knowledge space. Each source is synced, chunked, embedded, and indexed according to the space's configuration.
KnowledgeSourceConfig
type KnowledgeSourceConfig = {
  id: string;
  tenantId: string;
  spaceId: string;
  // Source type and location
  kind: "uploaded-document" | "url" | "email"
    | "notion" | "database-query" | "raw-text"
    | "slack" | "confluence" | "google-drive";
  location: string;
  // Sync configuration
  syncPolicy: {
    interval?: string; // e.g., "1h", "24h"
    webhook?: boolean;
    manual?: boolean;
  };
  // State
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
  // Metadata
  metadata?: Record<string, unknown>;
  createdAt: string;
  updatedAt: string;
};
Source types
Uploaded Documents
{
  kind: "uploaded-document",
  location: "s3://bucket/tenant-123/docs/product-spec.pdf",
  syncPolicy: { manual: true }
}
PDFs, Word docs, presentations uploaded by users
Notion
{
  kind: "notion",
  location: "https://notion.so/workspace/product-docs",
  syncPolicy: { interval: "1h", webhook: true }
}
Notion pages and databases with real-time webhook updates
Gmail / Email
{
  kind: "email",
  location: "support@company.com",
  syncPolicy: { webhook: true }
}
Email threads from Gmail or other providers
Database Query
{
  kind: "database-query",
  location: "SELECT * FROM products WHERE active = true",
  syncPolicy: { interval: "24h" }
}
Structured data from application databases
URL / Web Scraping
{
  kind: "url",
  location: "https://stripe.com/docs",
  syncPolicy: { interval: "24h" }
}
External documentation and web content
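The examples above suggest that each kind expects a particular shape of location. The following is a minimal sketch of a pre-save check, not part of ContractSpec itself: `validateSourceDraft`, the `SourceDraft` type, and the location patterns are all hypothetical and inferred only from the sample values shown in this section.

```typescript
// Local, trimmed-down copy of the fields this sketch needs.
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };
type SourceDraft = { kind: string; location: string; syncPolicy: SyncPolicy };

// Assumed location shapes, inferred from the examples above (not a documented rule).
const LOCATION_PATTERNS: Record<string, RegExp> = {
  "uploaded-document": /^s3:\/\//,
  notion: /^https:\/\//,
  url: /^https?:\/\//,
  email: /^[^@\s]+@[^@\s]+$/,
  "database-query": /^\s*select\b/i,
};

// Hypothetical helper: returns a list of problems, empty when the draft looks sane.
function validateSourceDraft(draft: SourceDraft): string[] {
  const problems: string[] = [];
  const { interval, webhook, manual } = draft.syncPolicy;

  // Assumption: a source needs at least one way to trigger a sync.
  if (!interval && !webhook && !manual) {
    problems.push("syncPolicy must enable interval, webhook, or manual");
  }

  // Kinds without a known pattern (e.g., "slack") are not checked.
  const pattern = LOCATION_PATTERNS[draft.kind];
  if (pattern && !pattern.test(draft.location)) {
    problems.push(`location does not look valid for kind "${draft.kind}"`);
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a UI show every issue at once before the source is created.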
Sync strategies
| Strategy | When to Use | Latency |
|---|---|---|
| webhook | Real-time updates (Notion, Gmail, Slack) | Seconds |
| interval | Periodic sync (databases, URLs) | Minutes to hours |
| manual | User-triggered (uploads, one-time imports) | On-demand |
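For the interval strategy, a scheduler has to turn strings such as "1h" or "24h" into a duration and decide whether a source is due. The sketch below shows one way to do that. The helper names are hypothetical, the "m" and "d" suffixes are assumptions (the examples here only use "h"), and `lastSyncedAt` is assumed to be an ISO 8601 timestamp.

```typescript
// Assumed unit suffixes; only "h" appears in the documented examples.
const UNIT_MS = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
} as const;

// Hypothetical helper: converts "1h" or "24h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([mhd])$/.exec(interval);
  if (match === null) {
    throw new Error(`Unsupported interval: ${interval}`);
  }
  const unit = match[2] as keyof typeof UNIT_MS;
  return Number(match[1]) * UNIT_MS[unit];
}

// Hypothetical helper: a source is due if it has never synced
// or if at least one full interval has elapsed since lastSyncedAt.
function isSyncDue(interval: string, lastSyncedAt: string | undefined, now: Date): boolean {
  if (lastSyncedAt === undefined) {
    return true;
  }
  return now.getTime() - Date.parse(lastSyncedAt) >= parseIntervalMs(interval);
}
```

Passing `now` in explicitly, instead of calling `new Date()` inside the function, keeps the check deterministic and easy to test.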
Example: Multi-source space
A single knowledge space can be fed by multiple sources:
// Product Canon space with multiple sources
{
  spaceId: "product-canon",
  sources: [
    {
      id: "src_database_schema",
      kind: "database-query",
      location: "SELECT * FROM schema_definitions",
      syncPolicy: { interval: "1h" }
    },
    {
      id: "src_notion_product_docs",
      kind: "notion",
      location: "https://notion.so/product-docs",
      syncPolicy: { interval: "1h", webhook: true }
    },
    {
      id: "src_uploaded_specs",
      kind: "uploaded-document",
      location: "s3://bucket/specs/",
      syncPolicy: { manual: true }
    }
  ]
}
Processing pipeline
When a source is synced, ContractSpec processes it through several stages:
- Fetch - Retrieve content from the source (API, database, file)
- Parse - Extract text from documents (PDF, Word, HTML)
- Chunk - Split the text into semantic chunks (paragraphs, sections)
- Embed - Generate vector embeddings (OpenAI, Cohere)
- Index - Store the chunks in a vector database (Qdrant) or a search engine
- Audit - Log the sync operation and its results
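To make the Chunk stage more concrete, here is a minimal sketch of paragraph-based chunking. It is an illustration only, not ContractSpec's actual chunker: `chunkByParagraph` and its `maxChars` limit are hypothetical, and a real implementation would likely count tokens and respect section boundaries as well.

```typescript
// Hypothetical helper: groups consecutive paragraphs into chunks of at most
// maxChars characters. A single paragraph longer than maxChars is kept whole.
function chunkByParagraph(text: string, maxChars: number): string[] {
  // Paragraphs are assumed to be separated by one or more blank lines.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current === "" ? paragraph : `${current}\n\n${paragraph}`;
    if (current === "" || candidate.length <= maxChars) {
      // Either the chunk is empty, or the paragraph still fits.
      current = candidate;
    } else {
      // Adding this paragraph would overflow, so close the current chunk.
      chunks.push(current);
      current = paragraph;
    }
  }
  if (current !== "") {
    chunks.push(current);
  }
  return chunks;
}
```

Each returned chunk would then be passed to the Embed stage; keeping whole paragraphs together tends to preserve more context per embedding than cutting at a fixed character offset.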
Best practices
- Use webhooks for real-time sources (Notion, Gmail) to minimize latency
- Set appropriate sync intervals - hourly for active docs, daily for stable content
- Monitor sync failures and set up alerts for critical sources
- Test sources in sandbox before enabling in production
- Document the purpose and ownership of each source for your team
- Use manual sync for sensitive or infrequently updated content
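The advice on monitoring sync failures can be sketched with the `status`, `lastSyncedAt`, and `lastErrorMessage` fields from KnowledgeSourceConfig. The helper below is hypothetical, as is the `maxAgeMs` staleness threshold; it assumes `lastSyncedAt` is an ISO 8601 timestamp and leaves the actual alert delivery to the caller.

```typescript
// Trimmed-down view of the KnowledgeSourceConfig fields used for monitoring.
type SourceState = {
  id: string;
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
};

// Hypothetical helper: returns one alert message per source that is in the
// "error" state, or that is active but has not synced within maxAgeMs.
function findSourcesNeedingAttention(
  sources: SourceState[],
  maxAgeMs: number,
  now: Date
): string[] {
  const alerts: string[] = [];
  for (const source of sources) {
    if (source.status === "error") {
      alerts.push(`${source.id}: ${source.lastErrorMessage ?? "unknown error"}`);
    } else if (source.status === "active") {
      const last = source.lastSyncedAt ? Date.parse(source.lastSyncedAt) : NaN;
      if (isNaN(last) || now.getTime() - last > maxAgeMs) {
        alerts.push(`${source.id}: no sync within the expected window`);
      }
    }
    // Paused sources are skipped on purpose.
  }
  return alerts;
}
```

A job could run this on a schedule and forward any returned messages to the team that owns the source.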