Knowledge Sources

A KnowledgeSourceConfig connects one of a tenant's data sources (Notion, Gmail, uploads, databases) to a knowledge space. Each source is synced, chunked, embedded, and indexed according to that space's configuration.

KnowledgeSourceConfig

type KnowledgeSourceConfig = {
  id: string;
  tenantId: string;
  spaceId: string;
  
  // Source type and location
  kind: "uploaded-document" | "url" | "email" 
        | "notion" | "database-query" | "raw-text"
        | "slack" | "confluence" | "google-drive";
  location: string;
  
  // Sync configuration
  syncPolicy: {
    interval?: string;  // e.g., "1h", "24h"
    webhook?: boolean;
    manual?: boolean;
  };
  
  // State
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
  
  // Metadata
  metadata?: Record<string, unknown>;
  createdAt: string;
  updatedAt: string;
};
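
Since every field of syncPolicy is optional, a config can be saved with no sync trigger at all. The following is a minimal sketch of a pre-save check. The names `SourceDraft` and `validateSource`, and the rules themselves (including the `<number><m|h|d>` interval format), are assumptions for illustration and not part of the documented API.

```typescript
// Hypothetical validation helper; the rules below are illustrative assumptions.
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };

type SourceDraft = {
  kind: string;
  location: string;
  syncPolicy: SyncPolicy;
};

// Returns a list of human-readable problems; an empty list means the draft looks usable.
function validateSource(draft: SourceDraft): string[] {
  const problems: string[] = [];

  if (draft.location.trim() === "") {
    problems.push("location must not be empty");
  }

  const { interval, webhook, manual } = draft.syncPolicy;
  if (interval === undefined && !webhook && !manual) {
    problems.push("syncPolicy needs at least one of interval, webhook, or manual");
  }

  // Assumed format: a positive integer followed by m, h, or d (e.g. "1h", "24h").
  if (interval !== undefined && !/^[1-9]\d*[mhd]$/.test(interval)) {
    problems.push(`interval "${interval}" does not match <number><m|h|d>`);
  }

  return problems;
}
```

Returning a list rather than throwing on the first problem lets a UI show every issue at once.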

Source types

Uploaded Documents

{
  kind: "uploaded-document",
  location: "s3://bucket/tenant-123/docs/product-spec.pdf",
  syncPolicy: { manual: true }
}

PDFs, Word documents, and presentations uploaded by users

Notion

{
  kind: "notion",
  location: "https://notion.so/workspace/product-docs",
  syncPolicy: { interval: "1h", webhook: true }
}

Notion pages and databases with real-time webhook updates

Gmail / Email

{
  kind: "email",
  location: "support@company.com",
  syncPolicy: { webhook: true }
}

Email threads from Gmail or other providers

Database Query

{
  kind: "database-query",
  location: "SELECT * FROM products WHERE active = true",
  syncPolicy: { interval: "24h" }
}

Structured data from application databases

URL / Web Scraping

{
  kind: "url",
  location: "https://stripe.com/docs",
  syncPolicy: { interval: "24h" }
}

External documentation and web content

Sync strategies

Strategy | When to Use                                | Latency
webhook  | Real-time updates (Notion, Gmail, Slack)   | Seconds
interval | Periodic sync (databases, URLs)            | Minutes to hours
manual   | User-triggered (uploads, one-time imports) | On-demand
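
To make the interval strategy concrete, here is a hedged sketch of how a scheduler might decide whether a source is due for a sync. It assumes interval strings are a positive integer plus a unit (m, h, or d) and that lastSyncedAt is an ISO 8601 timestamp. Neither is confirmed by this page, and `parseIntervalMs` and `isSyncDue` are hypothetical names.

```typescript
type SyncPolicy = { interval?: string; webhook?: boolean; manual?: boolean };

// Assumed units: m = minutes, h = hours, d = days.
function unitToMs(unit: string): number {
  switch (unit) {
    case "m":
      return 60 * 1000;
    case "h":
      return 60 * 60 * 1000;
    case "d":
      return 24 * 60 * 60 * 1000;
    default:
      throw new Error(`Unsupported interval unit: ${unit}`);
  }
}

// Converts strings such as "1h" or "24h" to milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^([1-9]\d*)([mhd])$/.exec(interval);
  if (match === null) {
    throw new Error(`Unsupported interval: ${interval}`);
  }
  const amount = match[1] ?? "";
  const unit = match[2] ?? "";
  return Number(amount) * unitToMs(unit);
}

// Only interval-based policies are polled; webhook and manual syncs are triggered elsewhere.
function isSyncDue(policy: SyncPolicy, lastSyncedAt: string | undefined, now: Date): boolean {
  if (policy.interval === undefined) {
    return false;
  }
  if (lastSyncedAt === undefined) {
    return true; // never synced, so sync immediately
  }
  const elapsedMs = now.getTime() - new Date(lastSyncedAt).getTime();
  return elapsedMs >= parseIntervalMs(policy.interval);
}
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic and easy to test.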

Example: Multi-source space

A single knowledge space can be fed by multiple sources:

// Product Canon space with multiple sources
{
  spaceId: "product-canon",
  sources: [
    {
      id: "src_database_schema",
      kind: "database-query",
      location: "SELECT * FROM schema_definitions",
      syncPolicy: { interval: "1h" }
    },
    {
      id: "src_notion_product_docs",
      kind: "notion",
      location: "https://notion.so/product-docs",
      syncPolicy: { interval: "1h", webhook: true }
    },
    {
      id: "src_uploaded_specs",
      kind: "uploaded-document",
      location: "s3://bucket/specs/",
      syncPolicy: { manual: true }
    }
  ]
}
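
The nested shape above is a convenient view. In the KnowledgeSourceConfig type, each source carries its own spaceId instead. As a sketch, assuming configs are stored as flat records, the per-space view could be derived as follows. `SourceSummary` and `groupBySpace` are hypothetical names.

```typescript
// Trimmed-down view of a source config; only the fields needed for grouping.
type SourceSummary = { id: string; spaceId: string; kind: string };

// Groups flat source records by the space they feed, preserving input order.
function groupBySpace(sources: SourceSummary[]): Map<string, SourceSummary[]> {
  const bySpace = new Map<string, SourceSummary[]>();
  for (const source of sources) {
    const group = bySpace.get(source.spaceId) ?? [];
    group.push(source);
    bySpace.set(source.spaceId, group);
  }
  return bySpace;
}
```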

Processing pipeline

When a source is synced, ContractSpec processes it through six stages:

  1. Fetch - Retrieve content from source (API, database, file)
  2. Parse - Extract text from documents (PDF, Word, HTML)
  3. Chunk - Split into semantic chunks (paragraphs, sections)
  4. Embed - Generate vector embeddings (OpenAI, Cohere)
  5. Index - Store in vector database (Qdrant) or search engine
  6. Audit - Log sync operation and results
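
As an illustration of the Chunk stage only, the sketch below splits text on blank lines and packs paragraphs into chunks up to a character budget. This is a simplified stand-in and not ContractSpec's actual chunker. `chunkByParagraph` and the `maxChars` parameter are assumptions for the example.

```typescript
// Hypothetical paragraph-based chunker; production semantic chunking is usually more involved.
function chunkByParagraph(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }

  // Paragraphs are separated by one or more blank lines.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current === "" ? paragraph : `${current}\n\n${paragraph}`;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the open chunk
    } else {
      if (current !== "") {
        chunks.push(current);
      }
      // A paragraph longer than maxChars is kept whole rather than cut mid-sentence.
      current = paragraph;
    }
  }
  if (current !== "") {
    chunks.push(current);
  }
  return chunks;
}
```

Keeping paragraphs intact trades strict size limits for chunks that remain readable on their own, which tends to matter for retrieval quality.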

Best practices

  • Use webhooks for real-time sources (Notion, Gmail) to minimize latency
  • Match sync intervals to how often content changes: hourly for active docs, daily for stable content
  • Monitor sync failures and set up alerts for critical sources
  • Test sources in sandbox before enabling in production
  • Document the purpose and ownership of each source for your team
  • Use manual sync for sensitive or infrequently updated content
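
For the monitoring practice, a periodic job could scan source state and flag anything in error or overdue. The sketch below assumes lastSyncedAt is an ISO 8601 timestamp and uses a caller-supplied staleness threshold. `collectAlerts`, `Alert`, and `maxAgeMs` are hypothetical names, not part of the documented API.

```typescript
// Subset of KnowledgeSourceConfig fields relevant to monitoring.
type SourceState = {
  id: string;
  status: "active" | "paused" | "error";
  lastSyncedAt?: string;
  lastErrorMessage?: string;
};

type Alert = { sourceId: string; reason: string };

// Flags sources in error state, never synced, or not synced within maxAgeMs.
function collectAlerts(sources: SourceState[], now: Date, maxAgeMs: number): Alert[] {
  const alerts: Alert[] = [];
  for (const source of sources) {
    if (source.status === "error") {
      alerts.push({ sourceId: source.id, reason: source.lastErrorMessage ?? "unknown error" });
      continue;
    }
    if (source.status === "paused") {
      continue; // paused sources are not expected to sync
    }
    if (source.lastSyncedAt === undefined) {
      alerts.push({ sourceId: source.id, reason: "never synced" });
      continue;
    }
    const ageMs = now.getTime() - new Date(source.lastSyncedAt).getTime();
    if (ageMs > maxAgeMs) {
      alerts.push({ sourceId: source.id, reason: "stale" });
    }
  }
  return alerts;
}
```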