Knowledge Management: External Source | Web Crawler – Knowledge Base

The Web Crawler indexes website content for use across the Talkdesk platform, including Copilot and Autopilot services, so that you can build a knowledge base from web content.

Key Features

Enhanced Content Extraction: Capable of extracting headings, links, and tables accurately, ensuring well-formatted data for your knowledge base.
Security: Supports HTTPS only and securely handles credentials for websites that require basic authentication.
Customizable Crawling: Users can define crawl depth and set limits on the number of links visited per page, offering precise control over the crawling process.

Configuration Options

Basic Settings

Username & Password: Required for accessing sites protected by basic authentication. At the moment, only Basic Auth is supported.
URLs: Provide the seed URLs of the websites you want to index. Ensure these sites are HTTPS-enabled.
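
As an illustration of the two basic settings, the sketch below checks that seed URLs use HTTPS and builds a standard HTTP Basic authentication header. It is a minimal example, not the crawler's actual implementation: the function names `is_https_url` and `basic_auth_header` and the sample URLs are hypothetical.

```python
import base64
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Return True when the URL uses the https scheme and names a host."""
    parts = urlparse(url)
    return parts.scheme == "https" and bool(parts.netloc)


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build a standard HTTP Basic Authorization header (RFC 7617)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical seed URLs: only the HTTPS one would be usable.
seed_urls = ["https://example.com/help", "http://example.com/legacy"]
accepted = [u for u in seed_urls if is_https_url(u)]
rejected = [u for u in seed_urls if not is_https_url(u)]
```

Checking seed URLs before saving the configuration may help catch non-HTTPS sites early, since the crawler does not support them.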

Advanced Settings

Crawl Depth: Determines how deeply the crawler follows links from the original page.
Maximum Links per Page: Limits the number of links the crawler visits on each page, so you can control how much of a site is indexed.
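
To show how these two settings might interact, here is a minimal sketch of a breadth-first crawl bounded by depth and by links per page. It works on an in-memory link map rather than real pages. The function `crawl`, the `links_by_page` structure, the choice to count the seed page as depth 0, and the choice to follow the first links found on a page are all assumptions made for illustration, not documented crawler behavior.

```python
from collections import deque


def crawl(seed, links_by_page, max_depth, max_links_per_page):
    """Breadth-first traversal bounded by depth and per-page link count.

    links_by_page is a hypothetical in-memory stand-in for fetching a page
    and extracting its links.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])  # the seed page is treated as depth 0
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        # Only the first max_links_per_page links on each page are considered.
        for link in links_by_page.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
site = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
}
pages = crawl("/", site, max_depth=1, max_links_per_page=2)
```

With a depth of 1 and two links per page, only the seed and its first two links are reached in this sketch. Raising either value widens the set of pages, which is worth weighing against the operational limits listed under Limitations.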

Basic Settings

Ring Groups and Segments: Facilitate categorization and organization of indexed content.
Initial Sync Time & Frequency: Schedule the first indexing and set how often content should be re-indexed to ensure up-to-date information.
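
The relationship between the initial sync time and the frequency can be pictured with a short sketch. It assumes the frequency is a fixed interval counted from the initial sync. The function `sync_times` and the 24-hour interval are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def sync_times(initial_sync, frequency, count):
    """Return the first `count` sync times, starting at initial_sync."""
    return [initial_sync + i * frequency for i in range(count)]


# Hypothetical schedule: first sync at 02:00 on 1 January 2025, then every 24 hours.
schedule = sync_times(datetime(2025, 1, 1, 2, 0), timedelta(hours=24), 3)
```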

Limitations

Tables: The crawler supports HTML tables with a basic structure but does not support custom styles.
Media Extraction: Only text is extracted. Media elements such as images, videos, and audio are not.
Iframes: This web crawler does not support iframes, meaning content embedded within iframes will not be indexed.
Operational Limits:
  - Maximum HTTP Response Body Size: 50 MiB. Larger responses are ignored to maintain performance.
  - Maximum Content Size per Article: 260 KB, with larger articles split into smaller segments.
  - Maximum Number of Articles per Sync: 40,000 articles per sync cycle.
  - Maximum Sync Duration: 1 hour.
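
The size limits above can be illustrated with a small sketch. It is not the crawler's actual splitting logic, which is not documented here. The names `accept_body` and `split_article` are hypothetical, 1 KB is assumed to mean 1,000 bytes, and sizes are assumed to be measured on the UTF-8 encoding of the text.

```python
MAX_BODY_BYTES = 50 * 1024 * 1024  # 50 MiB, from the limits above
MAX_ARTICLE_BYTES = 260 * 1000     # 260 KB, assuming 1 KB = 1,000 bytes


def accept_body(body: bytes) -> bool:
    """Return True when a response body is within the size limit."""
    return len(body) <= MAX_BODY_BYTES


def split_article(text: str, limit: int = MAX_ARTICLE_BYTES) -> list[str]:
    """Split text into segments whose UTF-8 encoding fits within `limit` bytes.

    Splits fall on character boundaries, so no multi-byte character is cut.
    `limit` is expected to be at least 4, the largest UTF-8 character size.
    """
    segments, current, size = [], [], 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        if current and size + ch_size > limit:
            segments.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        segments.append("".join(current))
    return segments
```

A real splitter would more likely break on paragraph or heading boundaries to keep segments readable. This sketch shows only the size constraint.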

Published November 14, 2024 14:49 • Last Updated March 13, 2025 16:40