"Above root" problems.

Started by ShadowWizard, February 01, 2019, 10:12:57 PM

ShadowWizard

So, I am trying to copy part of a website, including photos. However, some of the photos are located above the URL I am starting from. When I tell it to crawl "Above root" in the advanced options, it crawls the whole site (I assume there are links that take it back up). How do I tell it to only go forward on URL links, but to get photos from anywhere?
No, I cannot post the site I am pulling from, for confidentiality reasons.
Example.  Page I want to pull from: www.site.com/first/second/third/index.html
Photos are stored in: www.site.com/photos
This means any photos that are on www.site.com/first/second/third/index.html will not download.  I need them to.
However, it should NOT browse to www.site.com/first/second/index.html, even if there is a link to it on www.site.com/first/second/third/index.html
However, it SHOULD go to www.site.com/first/second/third/fourth/index.html (assuming there is a link to it on www.site.com/first/second/third/index.html, of course)

Richard Moss

Hello,

Welcome to the forums and thanks for the question.

The "Above Root" setting is a legacy of the oldest versions of WebCopy and really should be removed. After all, if you wanted to copy all pages, then you'd simply enter the domain rather than a nested page. At some point this year I'll be rewriting the crawl engine using more modern components and I'll be removing some of these obsolete settings and reworking the engine to actually make sense behind the scenes. There is a "Download All Resources" option (set by default) which allows you to automatically download non-HTML resources regardless of their location.

I started looking into your issue and realised that when I introduced said "Download All Resources" option it was conflicting rather badly with "above the root" pages and downloaded them in full. This has now been fixed and will be available as a nightly build this evening.

With this fix in place, you should now find your scenario works "out of the box" - if you are copying from /first/second/third/index.html and the "Download All Resources" option is set then it will include any non-HTML resources found elsewhere on the site, e.g. linked files in /photos.
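
To make the intended behaviour concrete, here is a rough Python sketch of the scoping rule described above. It is not WebCopy's actual code, only an illustration under some assumptions: the function name should_fetch is made up, "HTML page" is guessed from the file extension (a real crawler would more likely check the Content-Type header), and the example URLs are the placeholders from the original question.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: a URL is treated as a page if it has no extension or an HTML one.
HTML_EXTENSIONS = {"", ".html", ".htm"}


def _split(url):
    # Accept scheme-less URLs such as "www.site.com/first/..." as written in the thread.
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url)


def should_fetch(start_url, candidate_url):
    """Return True if candidate_url falls inside the copy scope of start_url."""
    start = _split(start_url)
    cand = _split(candidate_url)

    # Never leave the site being copied.
    if cand.netloc.lower() != start.netloc.lower():
        return False

    cand_path = cand.path or "/"
    extension = posixpath.splitext(cand_path)[1].lower()

    # Non-HTML resources (photos, stylesheets, ...) are allowed from anywhere on the site.
    if extension not in HTML_EXTENSIONS:
        return True

    # Pages are only followed when they sit at or below the starting directory.
    root_dir = posixpath.dirname(start.path or "/").rstrip("/") + "/"
    return cand_path.startswith(root_dir)


start = "www.site.com/first/second/third/index.html"
print(should_fetch(start, "www.site.com/photos/cat.jpg"))
print(should_fetch(start, "www.site.com/first/second/index.html"))
print(should_fetch(start, "www.site.com/first/second/third/fourth/index.html"))
```

With the example start page, the photo under /photos and the page under /third/fourth are accepted, while the parent page /first/second/index.html is rejected.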

Thanks again for finding this bug!

Regards,
Richard Moss