I have an issue with indexing my web repository where once the crawler finds a link to an external web site, it goes wild indexing all of the content that it finds. What I'm looking to do is restrict the crawler so that it only indexes the first page that is referenced.
Let me try to give an example of what I'm referring to. The web site that I've got configured for my web repository is http://my.internal.site/index.htm. This index.htm page has a link in it to http://my.external.site/index.htm. I want the crawler to include http://my.external.site/index.htm in its content, but I do not want the crawler to crawl any of the content linked from the http://my.external.site/index.htm file.
This is a very simplistic example of what I'm trying to do. In reality, I've got hundreds of pages on my internal site that reference external links, and the crawler is indexing all of the content behind those links. What I'm looking to do is dynamically configure the crawler so that whenever it encounters an external site, it only crawls the top-level document. I do not want to have to add a crawler resource filter for each and every external site that is referenced.
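To make the behavior I'm after completely unambiguous, here is a rough sketch in Python (this is just an illustration of the logic I want, not the portal's actual crawler API; the `fetch_links` function and the host name are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical host name for illustration only.
INTERNAL_HOST = "my.internal.site"

def crawl(start_url, fetch_links):
    """Sketch of the desired behavior.

    fetch_links(url) is a placeholder for "fetch the page and return
    the URLs it links to". Every visited page is indexed, but only
    pages on the internal host have their links followed, so an
    external page is indexed as a leaf and never expanded.
    """
    seen = set()
    queue = [start_url]
    indexed = []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        indexed.append(url)  # every visited page gets indexed
        if urlparse(url).netloc == INTERNAL_HOST:
            # only internal pages are expanded further
            queue.extend(fetch_links(url))
    return indexed
```

In other words: external pages go into the index, but their outgoing links are never queued.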
Is this possible? I seem to remember that in the previous version of the portal there was a field where you could specify the depth to crawl on external sites. I do not see that as a specific setting now. I imagine that I could do it using the resource filters, but I am at a loss as to how. Can anyone help me with this?
Thanks!
-StephenS