Knowledge Sources

A KnowledgeSourceConfig connects one of a tenant's data sources (Notion, Gmail, uploads, databases) to a knowledge space. Each source is synced, chunked, embedded, and indexed according to that space's configuration.

KnowledgeSourceConfig

type KnowledgeSourceConfig = {
  id: string;
  tenantId: string;
  spaceId: string;
  
  // Source type and location
  kind: "uploaded-document" | "url" | "email" 
        | "notion" | "database-query" | "raw-text"
        | "slack" | "confluence" | "google-drive";
  location: string;
  
  // Sync configuration
  syncPolicy: {
    interval?: string;  // e.g., "1h", "24h"
    webhook?: boolean;
    manual?: boolean;
  };
  
  // State
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
  
  // Metadata
  metadata?: Record<string, unknown>;
  createdAt: string;
  updatedAt: string;
};
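
Because every field of syncPolicy is optional, the type alone does not rule out a source that enables no sync strategy at all. The following is a minimal sketch of a pre-save check, not part of the documented API: enabledStrategies and validateSource are hypothetical helper names, and the rule that at least one strategy must be enabled is an assumption.

```typescript
// Hypothetical helpers for illustration; not part of the documented API.
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };
type SyncStrategy = "interval" | "webhook" | "manual";

// Lists which strategies a sync policy actually turns on.
function enabledStrategies(policy: SyncPolicy): SyncStrategy[] {
  const strategies: SyncStrategy[] = [];
  if (policy.interval) strategies.push("interval");
  if (policy.webhook) strategies.push("webhook");
  if (policy.manual) strategies.push("manual");
  return strategies;
}

// Returns a list of human-readable problems; an empty list means the source looks usable.
// Assumption: a source needs a non-empty location and at least one enabled strategy.
function validateSource(source: { location: string; syncPolicy: SyncPolicy }): string[] {
  const problems: string[] = [];
  if (source.location.trim() === "") {
    problems.push("location must not be empty");
  }
  if (enabledStrategies(source.syncPolicy).length === 0) {
    problems.push("syncPolicy enables no strategy");
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller report every issue with a source at once.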

Source types

Uploaded Documents

{
  kind: "uploaded-document",
  location: "s3://bucket/tenant-123/docs/product-spec.pdf",
  syncPolicy: { manual: true }
}

PDFs, Word documents, and presentations uploaded by users

Notion

{
  kind: "notion",
  location: "https://notion.so/workspace/product-docs",
  syncPolicy: { interval: "1h", webhook: true }
}

Notion pages and databases, kept current by real-time webhook updates plus an hourly interval sync

Gmail / Email

{
  kind: "email",
  location: "support@company.com",
  syncPolicy: { webhook: true }
}

Email threads from Gmail or other providers

Database Query

{
  kind: "database-query",
  location: "SELECT * FROM products WHERE active = true",
  syncPolicy: { interval: "24h" }
}

Structured data from application databases

URL / Web Scraping

{
  kind: "url",
  location: "https://stripe.com/docs",
  syncPolicy: { interval: "24h" }
}

External documentation and web content

Sync strategies

Strategy	When to Use	Latency
webhook	Real-time updates (Notion, Gmail, Slack)	Seconds
interval	Periodic sync (databases, URLs)	Minutes to hours
manual	User-triggered (uploads, one-time imports)	On-demand
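
For interval-based sources, a scheduler has to turn strings such as "1h" into a duration and compare it with lastSyncedAt. The sketch below shows one way to do that. parseIntervalMs and isSyncDue are hypothetical names, and support for minute ("m") and day ("d") units is an assumption, since the examples here only use hours.

```typescript
// Hypothetical scheduling helpers; not part of the documented API.
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };

// Assumption: only "m", "h", and "d" suffixes are accepted.
const UNIT_MS: Record<string, number> = {
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

// Parses an interval such as "1h" or "24h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([mhd])$/.exec(interval.trim());
  const amount = match?.[1];
  const unit = match?.[2];
  const unitMs = unit ? UNIT_MS[unit] : undefined;
  if (amount === undefined || unitMs === undefined) {
    throw new Error(`Unsupported interval: ${interval}`);
  }
  return Number(amount) * unitMs;
}

// Returns true when an interval-based source should be synced again.
function isSyncDue(policy: SyncPolicy, lastSyncedAt: string | undefined, now: Date): boolean {
  if (!policy.interval) return false; // webhook-only and manual sources are never "due"
  if (!lastSyncedAt) return true; // never synced before
  const elapsedMs = now.getTime() - new Date(lastSyncedAt).getTime();
  return elapsedMs >= parseIntervalMs(policy.interval);
}
```

Passing now as a parameter instead of calling new Date() inside the function keeps the check deterministic and easy to test.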

Example: Multi-source space

A single knowledge space can be fed by multiple sources:

// Product Canon space with multiple sources
{
  spaceId: "product-canon",
  sources: [
    {
      id: "src_database_schema",
      kind: "database-query",
      location: "SELECT * FROM schema_definitions",
      syncPolicy: { interval: "1h" }
    },
    {
      id: "src_notion_product_docs",
      kind: "notion",
      location: "https://notion.so/product-docs",
      syncPolicy: { interval: "1h", webhook: true }
    },
    {
      id: "src_uploaded_specs",
      kind: "uploaded-document",
      location: "s3://bucket/specs/",
      syncPolicy: { manual: true }
    }
  ]
}
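
One way to arrive at the grouped shape above, assuming sources are stored as flat records that each carry a spaceId (as KnowledgeSourceConfig does), is to group them by that field. groupBySpace and the trimmed-down SourceSummary type are hypothetical and only serve this illustration.

```typescript
// Hypothetical, reduced view of a source record; only the fields needed for grouping.
type SourceSummary = {
  id: string;
  spaceId: string;
  kind: string;
};

// Groups flat source records by the space they feed, preserving input order within each space.
function groupBySpace(configs: SourceSummary[]): Map<string, SourceSummary[]> {
  const bySpace = new Map<string, SourceSummary[]>();
  for (const config of configs) {
    const existing = bySpace.get(config.spaceId) ?? [];
    existing.push(config);
    bySpace.set(config.spaceId, existing);
  }
  return bySpace;
}

const grouped = groupBySpace([
  { id: "src_database_schema", spaceId: "product-canon", kind: "database-query" },
  { id: "src_notion_product_docs", spaceId: "product-canon", kind: "notion" },
  { id: "src_support_inbox", spaceId: "support", kind: "email" }, // hypothetical second space
]);
```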

Processing pipeline

When a source is synced, ContractSpec processes it through several stages:

1. Fetch - Retrieve content from source (API, database, file)
2. Parse - Extract text from documents (PDF, Word, HTML)
3. Chunk - Split into semantic chunks (paragraphs, sections)
4. Embed - Generate vector embeddings (OpenAI, Cohere)
5. Index - Store in vector database (Qdrant) or search engine
6. Audit - Log sync operation and results
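
The stage order above can be sketched as a single function with each stage injected as a dependency. This is an illustration under stated assumptions, not ContractSpec's implementation: every name here (PipelineDeps, chunkByParagraph, syncSource) is hypothetical, and splitting on blank lines is a simple stand-in for semantic chunking.

```typescript
// All names below are hypothetical and only illustrate the stage order.
type Chunk = { sourceId: string; index: number; text: string };
type IndexedChunk = Chunk & { embedding: number[] };
type AuditEntry = { sourceId: string; chunkCount: number; ok: boolean; error?: string };

type PipelineDeps = {
  fetchContent: (location: string) => Promise<string>; // Fetch
  parse: (raw: string) => string; // Parse
  embed: (text: string) => Promise<number[]>; // Embed
  index: (chunks: IndexedChunk[]) => Promise<void>; // Index
  audit: (entry: AuditEntry) => void; // Audit
};

// Chunk: split on blank lines and drop empty pieces.
function chunkByParagraph(sourceId: string, text: string): Chunk[] {
  return text
    .split(/\n\s*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part, index) => ({ sourceId, index, text: part }));
}

// Runs the stages in order and records an audit entry on success and on failure.
async function syncSource(sourceId: string, location: string, deps: PipelineDeps): Promise<number> {
  try {
    const raw = await deps.fetchContent(location);
    const text = deps.parse(raw);
    const chunks = chunkByParagraph(sourceId, text);
    const indexed: IndexedChunk[] = [];
    for (const chunk of chunks) {
      indexed.push({ ...chunk, embedding: await deps.embed(chunk.text) });
    }
    await deps.index(indexed);
    deps.audit({ sourceId, chunkCount: indexed.length, ok: true });
    return indexed.length;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    deps.audit({ sourceId, chunkCount: 0, ok: false, error: message });
    throw err;
  }
}
```

Injecting the stages keeps the sketch independent of any particular embedding provider or vector store, and lets each stage be replaced with a stub in tests.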

Best practices

Use webhooks for real-time sources (Notion, Gmail) to minimize latency
Set appropriate sync intervals - hourly for active docs, daily for stable content
Monitor sync failures and set up alerts for critical sources
Test sources in sandbox before enabling in production
Document the purpose and ownership of each source for your team
Use manual sync for sensitive or infrequently updated content
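
As a sketch of the monitoring advice above, the status, lastSyncedAt, and lastErrorMessage fields are enough to derive basic alerts. collectSyncAlerts is a hypothetical helper, and the staleness threshold (maxAgeMs) is an assumption that you would tune per source; manual sources in particular may need a much larger threshold or none at all.

```typescript
// Hypothetical, reduced view of a source record; field names follow KnowledgeSourceConfig.
type SourceHealth = {
  id: string;
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
};

// Returns one alert line per source that is in error, or active but not synced within maxAgeMs.
function collectSyncAlerts(sources: SourceHealth[], now: Date, maxAgeMs: number): string[] {
  const alerts: string[] = [];
  for (const source of sources) {
    if (source.status === "error") {
      alerts.push(`${source.id}: sync failed (${source.lastErrorMessage ?? "no message"})`);
      continue;
    }
    if (source.status === "paused") continue; // paused sources are not expected to sync
    const last = source.lastSyncedAt ? new Date(source.lastSyncedAt).getTime() : undefined;
    if (last === undefined || now.getTime() - last > maxAgeMs) {
      alerts.push(`${source.id}: stale (last synced ${source.lastSyncedAt ?? "never"})`);
    }
  }
  return alerts;
}
```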
