The Web Crawler indexes website content for use across the Talkdesk platform, including the Copilot and Autopilot services, so that you can build a comprehensive knowledge base from web content.
Key Features
- Enhanced Content Extraction: Capable of extracting headings, links, and tables accurately, ensuring well-formatted data for your knowledge base.
- Security: Crawls HTTPS sites only and handles credentials securely for websites that require basic authentication.
- Customizable Crawling: Users can define crawl depth and set limits on the number of links visited per page, offering precise control over the crawling process.
Configuration Options
Basic Settings
- Username & Password: Required for accessing sites protected by basic authentication. At the moment, only Basic Auth is supported.
- URLs: Provide the seed URLs of the websites you want to index. Ensure these sites are HTTPS-enabled.
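Before entering these values, it can help to check them locally. The following is a minimal sketch, not part of the Web Crawler itself: the helper names `is_https_seed` and `basic_auth_header` are hypothetical. The first checks that a seed URL uses HTTPS, and the second shows the standard Authorization header that a Basic Auth site expects.

```python
import base64
from urllib.parse import urlsplit


def is_https_seed(url: str) -> bool:
    """Return True if the URL uses the https scheme and names a host."""
    parts = urlsplit(url)
    return parts.scheme == "https" and bool(parts.netloc)


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical values, for illustration only.
print(is_https_seed("https://example.com/help"))  # True
print(is_https_seed("http://example.com/help"))   # False
```

Because Basic Auth credentials are only encoded, not encrypted, they are protected in transit only by HTTPS, which fits the HTTPS-only requirement above.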
Advanced Settings
- Crawl Depth: Determines how deeply the crawler follows links from the original page.
- Maximum Links per Page: Controls the number of links visited per page, allowing for tailored indexing.
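To illustrate how these two settings interact, here is a minimal sketch of a depth-limited, breadth-first walk over an in-memory link map. It is not the crawler's actual implementation. The function `crawl_order`, the convention that seed pages sit at depth 0, and the choice to keep the first links found on each page are all assumptions made for illustration.

```python
from collections import deque


def crawl_order(links, seeds, max_depth, max_links_per_page):
    """Return pages in visit order for a breadth-first, depth-limited walk.

    links maps a page URL to the list of URLs it links to; it stands in
    for fetching and parsing real pages.
    """
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque((url, 0) for url in unique_seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        # Assumption: only the first max_links_per_page links are considered.
        for target in links.get(url, [])[:max_links_per_page]:
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure.
site = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "e": ["f"]}
print(crawl_order(site, ["a"], max_depth=1, max_links_per_page=2))  # ['a', 'b', 'c']
```

In this sketch, raising the crawl depth to 2 also reaches page "e", while page "d" is never visited because it is the third link on page "a".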
Basic Settings
- Ring Groups and Segments: Facilitate categorization and organization of indexed content.
- Initial Sync Time & Frequency: Schedule the first indexing and set how often content should be re-indexed to ensure up-to-date information.
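As a rough illustration of how an initial sync time and a frequency combine into a schedule, consider the sketch below. The function `upcoming_syncs` is hypothetical, and it assumes the frequency is a fixed interval; the frequency options actually offered in the product may differ.

```python
from datetime import datetime, timedelta, timezone


def upcoming_syncs(initial_sync, frequency, count):
    """List the first `count` sync times, starting at the initial sync."""
    return [initial_sync + i * frequency for i in range(count)]


# Hypothetical schedule: first sync on 1 January 2024 at 02:00 UTC, then daily.
start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
for when in upcoming_syncs(start, timedelta(days=1), 3):
    print(when.isoformat())
```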
Limitations
- Tables: The crawler supports HTML tables with a basic structure but does not support custom styles.
- Media Extraction: Only text is extracted. Media elements such as images, videos, and audio are not.
- Iframes: Iframes are not supported, so content embedded within an iframe is not indexed.
- Operational Limits:
  - Maximum HTTP Response Body Size: 50 MiB. Larger responses are ignored to maintain performance.
  - Maximum Content Size per Article: 260 KB, with larger articles split into smaller segments.
  - Maximum Number of Articles per Sync: 40,000 articles per sync cycle.
  - Maximum Sync Duration: 1 hour.
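The size limits above can be pictured with a short sketch. This is an illustration under stated assumptions, not the crawler's real logic: the names `should_ignore_response` and `split_article` are hypothetical, 1 KB is taken to mean 1,000 bytes (the limit does not specify), sizes are measured in UTF-8 bytes, and segments are cut at character boundaries rather than at headings or paragraphs.

```python
MAX_BODY_BYTES = 50 * 1024 * 1024  # 50 MiB, from the limits above
MAX_ARTICLE_BYTES = 260 * 1000     # assumption: 1 KB = 1,000 bytes


def should_ignore_response(body: bytes) -> bool:
    """Return True when a response body exceeds the 50 MiB limit."""
    return len(body) > MAX_BODY_BYTES


def split_article(text: str, max_bytes: int = MAX_ARTICLE_BYTES) -> list[str]:
    """Split text into consecutive segments of at most max_bytes UTF-8 bytes.

    max_bytes should be at least 4 so that any single character fits.
    """
    segments = []
    current = []
    size = 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        # Start a new segment when this character would overflow the current one.
        if current and size + ch_size > max_bytes:
            segments.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        segments.append("".join(current))
    return segments


# Tiny limit used only to make the behavior visible.
print(split_article("abcdefg", max_bytes=3))  # ['abc', 'def', 'g']
```

Joining the segments in order reproduces the original text, so no content is lost by splitting in this sketch.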