
Web crawler – Exclude content

Former Member

Dear Web crawler experts

I have an issue regarding the crawling of an intranet that consists of 5 iframes. I need to index the entire intranet, but I only want to index the content from one frame, and still display the full page in a search result. Is this possible?

I can call the content frame directly via a secondary ASP page, for example http://intranet/info.asp?sdddsd, and the entire page via http://intranet/default.asp.

I thought I might be able to add some tags to tell the crawler to include or exclude that part of the content. I have looked at help.sap.com but can't find any documentation on excluding sections with a web crawler.

Any ideas on how to solve this issue?

Cheers

Stubbe

Accepted Solutions (0)

Answers (1)


Former Member

Hi John,

You could create a new crawler and then assign it to your index (Global Services → Crawler Parameters); there you can find the Resource Filters property.

Perhaps this link will help you find a solution to your problem.

https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/library/kmc/knowledge management and collaboration developers guide.html

Crawler

The crawler service provides functions to create and manage crawlers. Crawlers are used to determine all the resources contained in a Content Management (CM) repository and to obtain references to them. The behavior of crawlers can be controlled in various ways. For example, they can be instructed to find resources that match certain conditions.

Various applications use the crawler. For example, the CM indexing service uses the crawler when it builds indexes to enable search and classification operations. It uses the crawler to get references to all the resources in a directory which must be indexed. It passes the references on to a search engine which then accesses and analyses the corresponding resources to build an index. The subscription service also makes use of the crawler. It schedules the crawler to find out the contents of directories at regular intervals. It can then determine whether any objects in the directories have changed in the time between the scheduled crawls.
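To make that division of labour concrete, here is a minimal generic sketch of the pattern described above: a crawler walks a resource tree and hands every document reference to an indexer callback. All names here (Repository, Indexer, SimpleCrawler) are invented for illustration and are not the com.sapportals.wcm API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only -- not the com.sapportals.wcm API.
interface Repository {
    boolean isFolder(String path);
    List<String> children(String path); // sub-resources of a folder
}

interface Indexer {
    void index(String resourcePath);    // e.g. hand-off to the search engine
}

class SimpleCrawler {
    // Breadth-first walk: collect references and pass each document
    // to the indexer, which is the hand-over the text describes.
    void crawl(Repository repo, Indexer indexer, String root) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(root);
        while (!frontier.isEmpty()) {
            String path = frontier.poll();
            if (!visited.add(path)) {
                continue;               // already processed
            }
            if (repo.isFolder(path)) {
                frontier.addAll(repo.children(path));
            } else {
                indexer.index(path);
            }
        }
    }
}
```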

Note that a new crawler has been implemented in the package com.sapportals.wcm.service.xcrawler. The new implementation allows crawlers to be resumed after a restart of the underlying SAP J2EE Engine. The previous implementation in the package com.sapportals.wcm.service.crawler has been deprecated.
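The benefit of a resumable crawl is easiest to see in a sketch: if the frontier of still-pending resources is persisted after every step, the crawl can continue where it stopped after an engine restart. The following is only a schematic illustration of that idea, not the actual xcrawler implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Schematic illustration of a resumable crawl -- not the xcrawler code.
class ResumableFrontier {
    private final Path checkpoint;
    private final Deque<String> pending = new ArrayDeque<>();

    ResumableFrontier(Path checkpoint) throws IOException {
        this.checkpoint = checkpoint;
        if (Files.exists(checkpoint)) {
            // After a restart, reload the resources that were still unprocessed.
            pending.addAll(Files.readAllLines(checkpoint));
        }
    }

    void add(String url) {
        pending.add(url);
    }

    String next() throws IOException {
        String url = pending.poll();
        // Persist the remaining frontier so a restart loses no work.
        Files.write(checkpoint, pending);
        return url;
    }
}
```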

Further Information

See the package description com.sapportals.wcm.service.xcrawler for:

- A detailed explanation of the API and underlying concepts of the new implementation
- UML diagrams of:
  - the architecture
  - the cached sets which store the resources collected by the crawler threads
  - the crawler threads
- A code sample that shows how to crawl a documents repository

Patricio.


Former Member

Hi Patricio

Thanks for the reply. Creating a new web crawler will involve development, right?

I was just hoping for a function where I could define which part of a web page should be included in the crawl.

Something like defining an HTML tag, e.g. <include>web content</include>. But that does not seem possible with the standard functionality on NetWeaver EP Stack 12.
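For illustration only: a convention like this could in principle be approximated as a pre-processing step that strips everything outside a marked region before a page is indexed. The <!-- index:on --> / <!-- index:off --> markers below are purely hypothetical; the standard KM web crawler does not support anything like them.

```java
// Hypothetical pre-processing step -- not standard KM functionality.
// Keeps only the region between the (invented) markers before indexing.
class MarkerFilter {
    static final String ON = "<!-- index:on -->";
    static final String OFF = "<!-- index:off -->";

    static String extractIndexableContent(String html) {
        int start = html.indexOf(ON);
        int end = html.indexOf(OFF);
        if (start < 0 || end < 0 || end < start) {
            return html; // no markers: index the whole page
        }
        return html.substring(start + ON.length(), end);
    }

    public static void main(String[] args) {
        String page = "<html>nav<!-- index:on -->real content<!-- index:off -->footer</html>";
        System.out.println(extractIndexableContent(page)); // prints "real content"
    }
}
```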

Cheers

John

Former Member

Hi John,

Creating a new web crawler will not require development.

You can create a new crawler under

System Administration → System Configuration → Knowledge Management → Content Management → Global Services → Crawler Parameters

Here you will see the list of available crawlers. The default crawler is called 'standard'.

You can create a new one by duplicating the standard crawler.

For filtering pages while crawling, you can use the scope filters and result filters.
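To illustrate the difference between the two filter types (the exact pattern syntax in the crawler profile may differ, so treat this as a sketch): a scope filter limits which links the crawler follows, while a result filter limits which of the fetched pages end up in the index. In John's case, a result filter matching only http://intranet/info.asp* would restrict the index to the content frame while the rest of the intranet is still crawled. The external URL in the sketch is invented, just to show the scope filter at work.

```java
// Sketch of scope vs. result filtering with simple prefix patterns.
// The real KM crawler configures these as URL patterns in the crawler
// profile; the trailing-'*' syntax here is illustrative only.
class UrlFilters {
    static boolean matches(String url, String pattern) {
        return pattern.endsWith("*")
                ? url.startsWith(pattern.substring(0, pattern.length() - 1))
                : url.equals(pattern);
    }

    public static void main(String[] args) {
        String scope = "http://intranet/*";           // follow links only inside the intranet
        String result = "http://intranet/info.asp*";  // but index only the content frame

        String[] urls = {
            "http://intranet/default.asp",
            "http://intranet/info.asp?sdddsd",
            "http://external.example.com/page.asp",   // invented URL for contrast
        };
        for (String url : urls) {
            boolean crawled = matches(url, scope);
            boolean indexed = crawled && matches(url, result);
            System.out.println(url + "  crawled=" + crawled + "  indexed=" + indexed);
        }
    }
}
```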

Regards

Prakash