Knowledge Management: External Source | Web Crawler

The Web Crawler is designed to index website content for use across the Talkdesk platform, including Copilot and Autopilot services, facilitating the creation of a comprehensive knowledge base from web content. 

This feature enables the autonomous ingestion of external content into Talkdesk. It supports both public and password-protected (Basic Auth) sites. Configuring the Crawl Strategy correctly is the most important step to ensure your Copilot and Autopilot receive high-quality, relevant data without indexing unnecessary "web noise."

 

Configuration Guide

Source Details

  • Name & Description: Define the source identity.
  • Authentication: Select None (Public) or Basic Authentication (User/Pass) depending on the target site's security.

Crawl Strategy

Choose how Talkdesk discovers and indexes your content. The available settings change based on the method you select.

Sitemap-guided (Recommended)

Best for large enterprises and clean data ingestion. Instead of guessing links, the crawler fetches a specific list of URLs from your XML sitemap. This saves time and ensures you only index valid, canonical pages.

Sitemap Source Options:

  • Auto Detect: Talkdesk automatically scans standard locations (e.g., /sitemap.xml, /sitemap_index.xml) and checks the robots.txt file for references. Ideal for sites following standard SEO best practices.
  • URL Input: Manually specify the exact URLs of your sitemaps. Supports up to 10 sitemap URLs.
  • File Upload: Directly upload an XML sitemap file from your computer. Max file size is 10MB.
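
As a rough illustration of what a sitemap-guided crawl consumes, the sketch below pulls the page URLs out of an XML sitemap with Python's standard library. It is not Talkdesk's implementation: the sample sitemap, the example.com URLs, and the function name extract_sitemap_urls are assumptions made for illustration. Only the sitemaps.org namespace and the "Max URLs per connector" idea come from the published protocol and this article.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would come from /sitemap.xml or an upload.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/support/</loc></url>
  <url><loc>https://www.example.com/support/billing</loc></url>
  <url><loc>https://www.example.com/support/returns</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text, max_urls=2000):
    """Return the <loc> URLs listed in a sitemap, capped at max_urls."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text:
            urls.append(loc.text.strip())
        if len(urls) >= max_urls:
            break  # mirrors the idea of a per-connector URL cap
    return urls


for url in extract_sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```

Because the URL list is explicit, the crawler in this mode has nothing to guess: what is in the sitemap is what gets considered for indexing.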

Advanced Settings:

  • Max URLs per connector: (Default: 2000). The crawler stops once it reaches this limit (Range: 1–5000).
  • Ignore robots.txt restrictions: When active, the crawler proceeds to access pages explicitly disallowed by robots.txt, provided they are listed in the sitemap.
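
To make the robots.txt setting concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, the "ExampleCrawler" user agent, and the is_allowed helper are all hypothetical; the sketch only shows the general idea of honoring or bypassing Disallow directives, not how Talkdesk evaluates them internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, ignore_robots=False):
    """Return True if the URL may be fetched under the given setting."""
    if ignore_robots:
        return True  # setting active: Disallow directives are not consulted
    return parser.can_fetch("ExampleCrawler", url)


blocked_url = "https://www.example.com/internal/pricing-notes"
print(is_allowed(blocked_url))                      # disallowed by robots.txt
print(is_allowed(blocked_url, ignore_robots=True))  # allowed when the setting is on
print(parser.site_maps())                           # sitemap references found in robots.txt
```

The last line also hints at how a sitemap reference can be discovered from robots.txt, which is the kind of lookup the Auto Detect option describes.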

 

Automated Link Crawl

Best for smaller websites, simple marketing pages, or unstructured wikis. The crawler acts like a standard search engine bot: it starts at a "seed" URL and follows links recursively to discover pages. Enter a start URL to set where the crawler begins.

Advanced Settings:

  • Max URLs per connector: (Default: 2000). The crawler stops once it reaches this limit (Range: 1–5000).
  • Crawl Depth: (0-100). Defines how many "clicks" away from the start URL the crawler will go. 0 = Starting URL only.
  • Maximum links per page: (Max: 5000). Limits the number of hyperlinks processed on a single page to prevent processing loop traps.
  • Ignore robots.txt restrictions: When active, the crawler ignores standard Disallow directives in the site's robots.txt file.
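
The way these limits interact can be sketched as a breadth-first traversal. The example below is a simplified model under stated assumptions, not Talkdesk's crawler: the link graph is a hypothetical in-memory dictionary instead of real pages, the crawl function and its parameter names are invented for illustration, and taking the first N links on a page is just one plausible reading of "Maximum links per page."

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINK_GRAPH = {
    "/": ["/docs", "/blog", "/about"],
    "/docs": ["/docs/api", "/docs/faq", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}


def crawl(start_url, link_graph, max_urls=2000, max_depth=100,
          max_links_per_page=5000):
    """Breadth-first discovery bounded by URL count, depth, and links per page."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth 0 means "starting URL only": no links are followed
        for link in link_graph.get(url, [])[:max_links_per_page]:
            if len(visited) >= max_urls:
                return visited  # URL cap reached, stop the crawl
            if link not in seen:  # the seen set prevents revisiting loops
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", LINK_GRAPH, max_depth=0))
print(crawl("/", LINK_GRAPH, max_depth=1))
print(crawl("/", LINK_GRAPH, max_urls=3))
```

In this model a small Crawl Depth keeps the crawl close to the start URL, while Max URLs acts as a hard stop regardless of how many links remain undiscovered.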

 

Crawl Boundaries

The URL Path Rules feature provides granular control over which pages the crawler is allowed to index. By defining specific patterns, you ensure that the connector only ingests high-value content. You can define up to 50 rules per crawler. Exclusion rules take precedence. If a URL matches both an "Include" rule and an "Exclude" rule, it will not be crawled.

Rule Types

You can configure two categories of rules:

  • Exclude path rules:
    • Purpose: To block specific sections of your site that add no value or contain sensitive/irrelevant info.
    • Behavior: Any URL matching these rules is immediately discarded.
    • Strategic Examples: Remove dynamic pages like /search-results, /my-account, /cart, or /login.
  • Include only path rules:
    • Purpose: To strictly limit the crawler to a specific sandbox.
    • Behavior: If you add any rule here, the crawler becomes whitelist-only. It will only index pages that match these rules (unless they are also excluded).
    • Strategic Example: If your help center is hosted on www.brand.com/support, set an "Include only" rule for /support. The crawler will ignore the rest of your site.

Matching Logic (How to Build a Rule)

When adding a rule, you must select a matching operator. The interface supports four distinct logic types to handle simple to complex scenarios:

Operator | How it Works | Best Use Case | Example (from UI)
Contains | Matches if the text string appears anywhere in the URL. | Broad blocking of site sections. | /blog/ matches site.com/blog/article-1
Starts with | Matches if the URL begins exactly with this string. | Locking the crawler to a specific subdomain or directory. | www.abc.com/docs matches www.abc.com/docs/api
Ends with | Matches if the URL ends with this specific extension or string. | Filtering out specific file types. | .pdf matches annual-report.pdf
Regex | Uses "Regular Expressions" for advanced, flexible pattern matching. | Complex logic (e.g., "Exclude URLs with 4-digit years"). | ^.*/old/.*$ matches any URL containing an "old" directory
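
The four operators and the "exclusion takes precedence" rule can be modeled in a few lines. This is a minimal sketch, not the product's matching engine: the function names rule_matches and should_crawl, the lowercase operator keys, and the sample rules are hypothetical, and it assumes URLs are compared as plain strings without a scheme, as in the examples above.

```python
import re


def rule_matches(operator, pattern, url):
    """Evaluate one path rule against a URL string."""
    if operator == "contains":
        return pattern in url
    if operator == "starts_with":
        return url.startswith(pattern)
    if operator == "ends_with":
        return url.endswith(pattern)
    if operator == "regex":
        return re.search(pattern, url) is not None
    raise ValueError(f"unknown operator: {operator}")


def should_crawl(url, include_rules, exclude_rules):
    """Exclude rules win; include rules, if any, act as a whitelist."""
    if any(rule_matches(op, pat, url) for op, pat in exclude_rules):
        return False
    if include_rules:
        return any(rule_matches(op, pat, url) for op, pat in include_rules)
    return True


# Hypothetical rule set: stay inside /support, skip logins, PDFs, and /old/.
include = [("starts_with", "www.brand.com/support")]
exclude = [
    ("contains", "/login"),
    ("ends_with", ".pdf"),
    ("regex", r"^.*/old/.*$"),
]

for candidate in [
    "www.brand.com/support/billing",
    "www.brand.com/support/login",
    "www.brand.com/blog/post",
]:
    print(candidate, should_crawl(candidate, include, exclude))
```

Checking exclusions first keeps the precedence rule explicit: a URL that matches both lists is dropped before the include list is even consulted.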

 

Document Processing

Enable ingestion of non-HTML assets if needed:

  • PDF / Word / PowerPoint Documents: Enable these toggles to ingest files of these types when they are found on the site's domain or subdomains.

 

Pages Filtering

In Pages Filtering, you can enable “Strip Head & Footers” to automatically remove page headers and footers from the extracted content. If you want more granular control, you can use the Advanced Content Filtering mechanism described below.

 

Advanced Content Filtering (HTML/CSS)

The Advanced content filtering feature allows you to precisely control which parts of a web page are extracted and which are ignored. By utilizing HTML tags and CSS selectors, you can strip away noise (like navigation bars, ads, or footers) and focus the crawler strictly on the high-value content you need.

How It Works

This setting consists of two main fields:

  1. Exclude page elements: Removes specific parts of the page before processing.
  2. Include only page elements: Restricts processing to specific areas of the page.

Important Information: Exclusion always takes priority over inclusion. If an element matches rules in both boxes, it will be excluded.

 

1. Exclude Page Elements

Use this field to define elements that should be removed from the page HTML before the crawler extracts text. This is ideal for removing "boilerplate" content that appears on every page but adds no unique value.

Common use cases:

  • Removing navigation menus and sidebars.
  • Stripping out advertisements and cookie banners.
  • Hiding comments sections or "related posts."

Syntax: Enter a comma-separated list of HTML tags, Class names (start with .), or IDs (start with #).

Examples:

Goal | Selector to Use
Remove all scripts | script
Remove the navigation bar | nav, .navbar, #main-menu
Remove footer and ads | footer, .ad-banner, .sidebar
Remove a specific button | button.subscribe-popup

Input Example:

nav, footer, .cookie-consent, #advertisement-wrapper
 

2. Include Only Page Elements

Use this field when you want the crawler to ignore everything on the page except for specific sections. This acts as a strict whitelist. If you define selectors here, the crawler will discard everything else.

Common use cases:

  • Extracting only the main article text from a blog.
  • Grabbing only the product description and price from an e-commerce site.
  • Targeting a specific data table.

Examples:

Goal | Selector to Use
Get only the main article | article, .post-content, main
Get product details | #product-description, .price
Get body text only | #post-body

Input Example:

article, .main-content, h1
 

3. How Exclusion & Inclusion Work Together

The crawler applies these rules in a specific order:

  1. First, it looks at your Include list. It isolates those specific sections.
  2. Second, it looks at your Exclude list. It scans the included sections and removes any elements that match the exclusion rules.

 

Use Case: Cleaning up a Blog Post

Imagine you want to scrape a blog post (article), but inside that article, there is a "Sign up for our Newsletter" box (.newsletter-signup) that you don't want.

  • Include only: article (This tells the crawler: "Throw away the sidebar, header, and footer. Keep only the article.")
  • Exclude: .newsletter-signup (This tells the crawler: "Look inside the article and remove the newsletter box.")

Result: You get the clean text of the article without the surrounding site noise or the interruptions inside the text.
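
The blog-post scenario above can be approximated with Python's standard library. This is a sketch under several assumptions, not the crawler's actual extraction logic: the sample page is hypothetical and deliberately well-formed so that xml.etree.ElementTree can parse it, the helper names are invented, and only simple selectors (tag, .class, #id, tag.class) are handled rather than full CSS.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <nav class="navbar">Home | Blog</nav>
  <article>
    <h1>Crawler Tips</h1>
    <p>Keep rules simple.</p>
    <div class="newsletter-signup">Sign up for our Newsletter</div>
    <p>Test on a few pages first.</p>
  </article>
  <footer>Copyright</footer>
</body></html>
"""


def parse_selectors(text):
    """Split a comma-separated selector list into clean entries."""
    return [s.strip() for s in text.split(",") if s.strip()]


def matches(element, selector):
    """Match a simple selector: tag, .class, #id, or tag.class."""
    if selector.startswith("#"):
        return element.get("id") == selector[1:]
    tag, _, cls = selector.partition(".")
    if tag and element.tag != tag:
        return False
    if cls and cls not in element.get("class", "").split():
        return False
    return True


def collect_included(element, selectors):
    """Return the outermost elements that match any include selector."""
    if any(matches(element, s) for s in selectors):
        return [element]
    found = []
    for child in element:
        found.extend(collect_included(child, selectors))
    return found


def remove_matching(parent, selectors):
    """Recursively delete descendants that match any exclude selector."""
    for child in list(parent):
        if any(matches(child, s) for s in selectors):
            parent.remove(child)
        else:
            remove_matching(child, selectors)


def extract_text(html, include_only="", exclude=""):
    root = ET.fromstring(html)
    include_selectors = parse_selectors(include_only)
    exclude_selectors = parse_selectors(exclude)
    # Step 1: isolate the included sections (or keep the whole page).
    if include_selectors:
        sections = collect_included(root, include_selectors)
    else:
        sections = [root]
    kept = []
    for section in sections:
        # Step 2: exclusion wins, both for the section and for anything inside it.
        if any(matches(section, s) for s in exclude_selectors):
            continue
        remove_matching(section, exclude_selectors)
        kept.append(" ".join("".join(section.itertext()).split()))
    return " ".join(kept)


print(extract_text(PAGE, include_only="article", exclude=".newsletter-signup"))
```

With only the Include field set to article, the newsletter text would still appear in the output; adding the Exclude entry is what removes it from inside the kept section.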
