
Does TREX follow external links in Web Repositories?

Hello,

We've configured a Dynamic Web Repository pointing to an internal web page that contains both internal and external links. When I index this repository, the crawler only follows the internal links and ignores the external ones. We use a customized crawler that is like the standard one but limits the depth to 3 levels, and the links I want to index are at the first level.

Does anybody know how to configure the crawler or the web repository so that it indexes these external links and their content?

Thanks in advance and best regards,

JuanCarlos

1 Answer

  • Best Answer
    Posted on Mar 27, 2009 at 06:23 PM

    Hi Juan,

    It's nice to finally see someone other than myself post questions about a web repository 😊

    I have found that the way to crawl external links is by changing the 'External Server URI Handling' setting in the web repository to either 'report' or 'rewrite'. I have not found any good documentation to explain why this is. The good news is that you can make this setting change without restarting the portal.
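
    Just to illustrate why the crawler skips those links out of the box (this is only a sketch with made-up names, not the actual TREX/KM implementation): a crawler normally compares each link's host against the repository's start URL and treats anything on another host as external, which is exactly the case that the 'External Server URI Handling' setting controls.

    import java.net.URI;

    // Illustration only (not the KM/TREX code): how a crawler typically decides
    // whether a link is internal or external relative to the repository start URL.
    public class LinkScope {

        // True when the link points to a different host than the start URL.
        public static boolean isExternal(String startUrl, String linkUrl) {
            String startHost = URI.create(startUrl).getHost();
            String linkHost = URI.create(linkUrl).getHost();
            // Relative links have no host of their own, so they count as internal.
            return linkHost != null && !linkHost.equalsIgnoreCase(startHost);
        }

        public static void main(String[] args) {
            String start = "http://intranet.example.com/start.html"; // made-up start page
            System.out.println(isExternal(start, "/docs/page2.html"));           // false -> crawled
            System.out.println(isExternal(start, "http://www.example.org/faq")); // true  -> skipped by default
        }
    }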

    I recommend turning on debugging for your crawler, if you haven't already done so. Set the 'Maximum Log Level' to 'info' in your crawler settings and set the 'Path for log files' to something like 'crawler'; the next time you reindex, you'll see the info and error log files in this folder:

    \usr\sap\J2E\<instance>\j2ee\cluster\server0\crawler

    If you don't find the folder under the 'server0' folder, check the other nodes' folders and you'll find it.
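
    If you want to save yourself some clicking, a tiny throwaway program like the one below can scan the node folders for you. The cluster path and the 'crawler' folder name are just examples matching the settings above; adjust them for your system (the instance directory name here is made up).

    import java.io.File;

    // Throwaway helper (example paths only): look in every serverN node folder
    // for the crawler log folder configured under 'Path for log files'.
    public class FindCrawlerLogs {
        public static void main(String[] args) {
            File cluster = new File("\\usr\\sap\\J2E\\JC00\\j2ee\\cluster"); // JC00 is a made-up instance name
            File[] nodes = cluster.listFiles();
            if (nodes == null) {
                System.out.println("Cluster folder not found: " + cluster);
                return;
            }
            for (File node : nodes) {
                if (!node.isDirectory() || !node.getName().startsWith("server")) {
                    continue;
                }
                File logs = new File(node, "crawler"); // the 'Path for log files' value
                if (logs.isDirectory()) {
                    System.out.println("Crawler logs in: " + logs.getAbsolutePath());
                    for (File log : logs.listFiles()) {
                        System.out.println("  " + log.getName());
                    }
                }
            }
        }
    }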

    Lastly, the reason it's important to have crawler logging enabled is that it gives you a little more information about why items fail. Sometimes items fail to be crawled and you won't see any mention of those files in the TREX monitor.

    A problem I had when trying to crawl external links was that my proxy server was preventing the crawler from getting out. The way around that is to establish a valid username/password and a proxy entry in the 'Default Proxy System' settings (see the sketch below the navigation path). You get to these settings by navigating to:

    System Admin -> System Config -> Knowledge Management -> Content Management -> Global Services -> System Landscape Definitions -> Systems -> Default Proxy System
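
    In case it helps to see what those settings amount to: at the HTTP level it's the same thing a plain Java client does when it fetches a page through an authenticating proxy. The sketch below is only an illustration with placeholder host names and credentials, not the portal's own code.

    import java.net.Authenticator;
    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.PasswordAuthentication;
    import java.net.Proxy;
    import java.net.URL;

    // Illustration only: fetching an external page through an authenticating proxy.
    // Proxy host, port, user and password are placeholders.
    public class ProxyFetch {
        public static void main(String[] args) throws Exception {
            // The username/password maintained in the Default Proxy System settings.
            Authenticator.setDefault(new Authenticator() {
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication("proxyuser", "secret".toCharArray());
                }
            });

            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy.example.com", 8080));

            URL url = new URL("http://www.example.org/faq"); // an external link the crawler should follow
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
            conn.setConnectTimeout(10000); // generous timeouts help with 'timeout' style errors
            conn.setReadTimeout(30000);
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }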

    The funny thing about the default proxy system is that there's a check box in the web repository settings called 'Use System Default Proxy Settings'. I have found that if this is checked, it doesn't work. If I leave it unchecked, it seems to work. Go figure 😊

    You'll notice that I've recently posted a forum message asking for help with some problems I'm having when trying to crawl external links: I'm getting a 'timeout' type of error for one link and an 'authentication required' type of message for another.

    Hope this helps!

    -StephenS
