
Web crawler – Exclude content

Former Member

Dear Web crawler experts

I have an issue regarding the crawling of an intranet that consists of 5 iframes. I need to index the entire intranet, but I only want to index the content from one frame, and still display the full page in a search result. Is this possible?

I can call the content frame directly via a secondary ASP page, for example http://intranet/info.asp?sdddsd, and the entire page via http://intranet/default.asp.

I thought I might be able to add some tags to tell the crawler to include or exclude that part of the content. I have looked at help.sap.com but can't find any documentation on excluding sections with a web crawler.

Any ideas on how to solve this issue?

Cheers

Stubbe

Accepted Solutions (0)

Answers (1)


Former Member

Hi John,

You could create a new crawler and then assign it to your index (Global Services → Crawler Parameters); there you can find the Resource Filters property.

Perhaps this link will help you find a solution to your problem.

https://www.sdn.sap.com/irj/servlet/prt/portal/prtroot/com.sap.km.cm.docs/library/kmc/knowledge management and collaboration developers guide.html

Crawler

The crawler service provides functions to create and manage crawlers. Crawlers are used to determine all the resources contained in a Content Management (CM) repository and to obtain references to them. The behavior of crawlers can be controlled in various ways. For example, they can be instructed to find resources that match certain conditions.

Various applications use the crawler. For example, the CM indexing service uses the crawler when it builds indexes to enable search and classification operations. It uses the crawler to get references to all the resources in a directory which must be indexed. It passes the references on to a search engine which then accesses and analyses the corresponding resources to build an index. The subscription service also makes use of the crawler. It schedules the crawler to find out the contents of directories at regular intervals. It can then determine whether any objects in the directories have changed in the time between the scheduled crawls.
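To make that division of labour concrete, here is a minimal generic sketch of the pattern described above: a crawler walks a resource tree and hands every document reference to an indexer callback. All names here (Repository, Indexer, SimpleCrawler) are invented for illustration and are not the com.sapportals.wcm API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only -- not the com.sapportals.wcm API.
interface Repository {
    boolean isFolder(String path);
    List<String> children(String path); // sub-resources of a folder
}

interface Indexer {
    void index(String resourcePath);    // e.g. hand-off to the search engine
}

class SimpleCrawler {
    // Breadth-first walk: collect references and pass each document
    // to the indexer, which is the hand-over the text describes.
    void crawl(Repository repo, Indexer indexer, String root) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(root);
        while (!frontier.isEmpty()) {
            String path = frontier.poll();
            if (!visited.add(path)) {
                continue;               // already processed
            }
            if (repo.isFolder(path)) {
                frontier.addAll(repo.children(path));
            } else {
                indexer.index(path);
            }
        }
    }
}
```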

Note that a new crawler has been implemented in the package com.sapportals.wcm.service.xcrawler. The new implementation allows crawlers to be resumed after a restart of the underlying SAP J2EE Engine. The previous implementation in the package com.sapportals.wcm.service.crawler has been deprecated.
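The benefit of a resumable crawl is easiest to see in a sketch: if the frontier of still-pending resources is persisted after every step, the crawl can continue where it stopped after an engine restart. The following is only a schematic illustration of that idea, not the actual xcrawler implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Schematic illustration of a resumable crawl -- not the xcrawler code.
class ResumableFrontier {
    private final Path checkpoint;
    private final Deque<String> pending = new ArrayDeque<>();

    ResumableFrontier(Path checkpoint) throws IOException {
        this.checkpoint = checkpoint;
        if (Files.exists(checkpoint)) {
            // After a restart, reload the resources that were still unprocessed.
            pending.addAll(Files.readAllLines(checkpoint));
        }
    }

    void add(String url) {
        pending.add(url);
    }

    String next() throws IOException {
        String url = pending.poll();
        // Persist the remaining frontier so a restart loses no work.
        Files.write(checkpoint, pending);
        return url;
    }
}
```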

Further Information

See the package description com.sapportals.wcm.service.xcrawler for:

- A detailed explanation of the API and underlying concepts of the new implementation
- UML diagrams of:
  - the architecture
  - the cached sets which store the resources collected by the crawler threads
  - the crawler threads
- A code sample that shows how to crawl a documents repository

Patricio.


Former Member

Hi Patricio

Thanks for the reply. Creating a new web crawler will involve development, right?

I was just hoping for a function where I could define which part of a web page should be included in the crawl.

Something like defining an HTML tag, e.g. <include>web content</include>. But that does not seem possible with the standard functionality on NetWeaver EP Stack 12.
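For illustration only: a convention like this could in principle be approximated as a pre-processing step that strips everything outside a marked region before a page is indexed. The <!-- index:on --> / <!-- index:off --> markers below are purely hypothetical; the standard KM web crawler does not support anything like them.

```java
// Hypothetical pre-processing step -- not standard KM functionality.
// Keeps only the region between the (invented) markers before indexing.
class MarkerFilter {
    static final String ON = "<!-- index:on -->";
    static final String OFF = "<!-- index:off -->";

    static String extractIndexableContent(String html) {
        int start = html.indexOf(ON);
        int end = html.indexOf(OFF);
        if (start < 0 || end < 0 || end < start) {
            return html; // no markers: index the whole page
        }
        return html.substring(start + ON.length(), end);
    }

    public static void main(String[] args) {
        String page = "<html>nav<!-- index:on -->real content<!-- index:off -->footer</html>";
        System.out.println(extractIndexableContent(page)); // prints "real content"
    }
}
```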

Cheers

John

Former Member

Hi John,

Creating a new web crawler will not require development.

You can create a new crawler under

System Administration → System Configuration → Knowledge Management → Content Management → Global Services → Crawler Parameters

Here you will see the list of available crawlers. The default crawler is called 'standard'.

You can create a new one by duplicating the standard crawler.

For filtering pages while crawling, you can use the scope filters and result filters.
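To illustrate the difference between the two filter types (the exact pattern syntax in the crawler profile may differ, so treat this as a sketch): a scope filter limits which links the crawler follows, while a result filter limits which of the fetched pages end up in the index. In John's case, a result filter matching only http://intranet/info.asp* would restrict the index to the content frame while the rest of the intranet is still crawled. The external URL in the sketch is invented, just to show the scope filter at work.

```java
// Sketch of scope vs. result filtering with simple prefix patterns.
// The real KM crawler configures these as URL patterns in the crawler
// profile; the trailing-'*' syntax here is illustrative only.
class UrlFilters {
    static boolean matches(String url, String pattern) {
        return pattern.endsWith("*")
                ? url.startsWith(pattern.substring(0, pattern.length() - 1))
                : url.equals(pattern);
    }

    public static void main(String[] args) {
        String scope = "http://intranet/*";           // follow links only inside the intranet
        String result = "http://intranet/info.asp*";  // but index only the content frame

        String[] urls = {
            "http://intranet/default.asp",
            "http://intranet/info.asp?sdddsd",
            "http://external.example.com/page.asp",   // invented URL for contrast
        };
        for (String url : urls) {
            boolean crawled = matches(url, scope);
            boolean indexed = crawled && matches(url, result);
            System.out.println(url + "  crawled=" + crawled + "  indexed=" + indexed);
        }
    }
}
```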

Regards

Prakash