Knowledge Management: External Source | Web Crawler

The Web Crawler indexes website content for use across the Talkdesk platform, including the Copilot and Autopilot services, so that you can build a knowledge base from web content.

 

Key Features

  • Enhanced Content Extraction: Extracts headings, links, and tables so that the data in your knowledge base stays well formatted.
  • Security: Supports HTTPS only and securely handles credentials for websites that require basic authentication.
  • Customizable Crawling: Users can define crawl depth and set limits on the number of links visited per page, offering precise control over the crawling process.

 

Configuration Options

Basic Settings

  • Username & Password: Required for accessing sites protected by basic authentication. At the moment, only Basic Auth is supported.
  • URLs: Provide the seed URLs of the websites you want to index. Ensure these sites are HTTPS-enabled.
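
Before saving the configuration, it can help to check the seed URLs and credentials locally. The following is a minimal sketch, not part of the Talkdesk product: the function names validate_seed_urls and basic_auth_header are hypothetical, and the header format is the one defined by the HTTP Basic authentication standard rather than anything specific to the crawler.

```python
import base64
from urllib.parse import urlsplit


def validate_seed_urls(urls):
    """Split seed URLs into (accepted, rejected) using the HTTPS-only rule."""
    accepted, rejected = [], []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        # A usable seed URL needs the https scheme and a host name.
        if parts.scheme == "https" and parts.netloc:
            accepted.append(url)
        else:
            rejected.append(url)
    return accepted, rejected


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


if __name__ == "__main__":
    ok, bad = validate_seed_urls(["https://example.com/help", "http://example.com"])
    print("accepted:", ok)
    print("rejected:", bad)
```

Any URL that lands in the rejected list would need an HTTPS address before the crawler could index it.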

 

Advanced Settings

  • Crawl Depth: Determines how many levels of links the crawler follows, starting from the seed URL.
  • Maximum Links per Page: Limits the number of links the crawler visits on each page, so you can control how much of a site is indexed.
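
To see how these two settings interact, the sketch below runs a breadth-first traversal over a small in-memory link graph. It is an illustration only: plan_crawl and the sample site are hypothetical, and it assumes that the seed page counts as depth 0 and that the limit applies to the first links found on each page. The actual crawler may count depth or choose links differently.

```python
from collections import deque


def plan_crawl(seed, links_by_page, crawl_depth, max_links_per_page):
    """Return pages in the order a breadth-first crawl would visit them."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])  # (page, depth); the seed is assumed to be depth 0
    while queue:
        page, depth = queue.popleft()
        if depth >= crawl_depth:
            continue  # do not follow links beyond the configured depth
        # Assumption: only the first N links on a page are considered.
        for link in links_by_page.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    site = {
        "/": ["/a", "/b", "/c"],
        "/a": ["/a1", "/a2"],
        "/b": ["/b1"],
        "/a1": ["/deep"],
    }
    print(plan_crawl("/", site, crawl_depth=1, max_links_per_page=2))
    print(plan_crawl("/", site, crawl_depth=2, max_links_per_page=2))
```

In this model, raising the crawl depth by one level can multiply the number of pages visited by up to the links-per-page limit, which is worth keeping in mind alongside the operational limits.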


Basic Settings

  • Ring Groups and Segments: Facilitate categorization and organization of indexed content.
  • Initial Sync Time & Frequency: Schedule the first indexing and set how often content should be re-indexed to ensure up-to-date information.
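
As a rough illustration of how an initial sync time and a frequency combine into a schedule, the sketch below lists the first few sync times. The function upcoming_syncs is hypothetical, and it assumes the frequency can be expressed as a fixed interval; the options actually offered in the configuration screen may differ.

```python
from datetime import datetime, timedelta


def upcoming_syncs(initial_sync, frequency, count):
    """Return the first `count` sync times, starting at the initial sync."""
    return [initial_sync + i * frequency for i in range(count)]


if __name__ == "__main__":
    first = datetime(2024, 1, 1, 2, 0)  # hypothetical initial sync time
    for when in upcoming_syncs(first, timedelta(days=1), 3):
        print(when.isoformat())
```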

 

Limitations

  • Tables: The crawler supports HTML tables with a basic structure but does not support custom styles.
  • Media Extraction: The crawler extracts text only. Media elements such as images, videos, and audio are not extracted.
  • Iframes: The crawler does not support iframes, so content embedded within iframes is not indexed.
  • Operational Limits:
      • Maximum HTTP Response Body Size: 50 MiB. Larger responses are ignored to maintain performance.
      • Maximum Content Size per Article: 260 KB, with larger articles split into smaller segments.
      • Maximum Number of Articles per Sync: 40,000 articles per sync cycle.
      • Maximum Sync Duration: 1 hour.
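
The size limits above can be approximated locally to estimate how a page might be treated. This is a sketch under stated assumptions, not the crawler's implementation: the names is_response_indexable and split_article are hypothetical, 1 KB is assumed to mean 1,000 bytes, sizes are measured in UTF-8 bytes, and the split here happens at arbitrary character boundaries because the source does not say how the crawler chooses segment boundaries.

```python
MAX_RESPONSE_BYTES = 50 * 1024 * 1024  # 50 MiB
MAX_ARTICLE_BYTES = 260 * 1000         # 260 KB, assuming 1 KB = 1,000 bytes


def is_response_indexable(body: bytes) -> bool:
    """Responses larger than the limit are ignored."""
    return len(body) <= MAX_RESPONSE_BYTES


def split_article(text: str, limit: int = MAX_ARTICLE_BYTES):
    """Split text into segments whose UTF-8 size does not exceed `limit`."""
    segments, current, size = [], [], 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        # Start a new segment when adding this character would exceed the limit.
        if current and size + ch_size > limit:
            segments.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        segments.append("".join(current))
    return segments


if __name__ == "__main__":
    article = "a" * 300_000
    print([len(s) for s in split_article(article)])
    print(is_response_indexable(b"x" * 1024))
```

A page that splits into many segments counts toward the per-sync article limit more quickly, assuming each segment is stored as its own article.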