Webcopy is rather slow in updating a website

Started by thelogicmatrix, March 26, 2019, 02:54:16 PM


thelogicmatrix

So I have a situation where I only have internet for a set number of hours a day. This causes the download to cut off midway, and WebCopy then has to crawl back through the previous pages. This process is rather slow and it takes a couple of hours just to get back to where it was last time. How do I make this faster?

P.S. I am aware of how to make it update without redownloading the files; it's just that scanning takes too long.

Richard Moss

Hello,

Welcome to the forums and thanks for the question. Unfortunately there's probably not a lot that can be done right now to make it faster. One of the long-standing user requests has been for multi-threading support, so that multiple URIs can be processed at once. In my testing, this makes downloading a website significantly faster. However, large parts of WebCopy's object model were not designed with thread safety in mind, so while my limited testing works, it still requires time to ensure all paths are thread safe.
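For illustration only, here is a minimal Python sketch of the idea: several workers pull URIs from a shared queue, and a lock protects the shared "seen" set. That set is exactly the kind of shared mutable state that makes retrofitting thread safety onto an existing object model time-consuming. The in-memory SITE dictionary is a stand-in for real HTTP fetches; none of this reflects WebCopy's actual implementation.

```python
import queue
import threading

# Hypothetical in-memory "site": page -> links it contains.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/c"],
    "/c": [],
}

def crawl(start, workers=4):
    """Crawl SITE with several worker threads sharing one queue."""
    pending = queue.Queue()
    pending.put(start)
    seen = {start}
    lock = threading.Lock()  # guards `seen`, the shared mutable state

    def worker():
        while True:
            try:
                page = pending.get(timeout=0.1)
            except queue.Empty:
                return  # no work left, let this worker exit
            for link in SITE.get(page, []):
                with lock:  # only one thread may check/update `seen` at a time
                    if link in seen:
                        continue
                    seen.add(link)
                pending.put(link)
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Without the lock, two workers could both see a link as unvisited and enqueue it twice; a real crawler has many more such shared structures (download results, link maps, progress state), which is where the audit effort goes.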

Another user request is the ability to pause and resume a crawl, but I decided to delay that until after multi-thread support was added.

The other thing to do is more performance profiling; there are parts of the code that could be optimised, but again that requires time I simply don't have right now.

Sorry if this answer doesn't help.

Regards,
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

hajzlik

WebCopy really is ridiculously slow. A simple website takes on average 32 hours to download on a 500/500 optical connection.

The biggest reason for this is that it pings ALL links on the website. And waits 2-6 seconds for a response on each. Even in "Site only" mode.

There is absolutely no reason for this behavior. Link to another domain = let's ignore it and continue crawling. This process should take 2ms at most. Not 6 seconds.

Totally wasted time, CPU power and internet bandwidth.

hajzlik

OK now I understand this behavior.

When you check the "Download all resources" option, WebCopy actually downloads HTML files too, which makes the download much slower and also creates a huge mess in the downloads folder.

This is a bug since the checkbox says "Download any non-HTML documents".

hajzlik

...but even with "Download all resources" unchecked, WebCopy often spends the majority of the download time on skipped external links.

hajzlik

Skipping over an external link literally takes longer than downloading it.

Richard Moss

Quote from: hajzlik on November 25, 2021, 01:08:51 PM
WebCopy really is ridiculously slow. A simple website takes on average 32 hours to download on a 500/500 optical connection.

The biggest reason for this is that it pings ALL links on the website. And waits 2-6 seconds for a response on each. Even in "Site only" mode.

There is absolutely no reason for this behavior. Link to another domain = let's ignore it and continue crawling. This process should take 2ms at most. Not 6 seconds.

Totally wasted time, CPU power and internet bandwidth.

There is a reason for this behaviour: downloading all resources. Unfortunately, I'm between a rock and a hard place. If that option defaults to off, then people complain that content on CDNs etc. isn't downloaded. Leave it on and... I get threads like this. In all honesty this default setting sometimes irks me as well, because I often forget to switch it off and I myself mostly want it disabled.

I'll see if I can make the documentation clearer, as I think the default should remain as it is so that things work "out of the box" for novice users, especially given it can be a bit arcane to configure.

Quote from: hajzlik on November 26, 2021, 12:07:34 PM
OK now I understand this behavior.

When you check the "Download all resources" option, WebCopy actually downloads HTML files too, which makes the download much slower and also creates a huge mess in the downloads folder.

This is a bug since the checkbox says "Download any non-HTML documents".

Can you provide more information on this? WebCopy should only download an external resource if its content type is anything other than text/html. It is possible that if head checking is disabled (or gets auto-disabled due to a badly configured site that doesn't support it) the full content will be downloaded (I still need to confirm this and fix it), but it shouldn't be kept in the download folder even if it is mistakenly downloaded.
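For what it's worth, the head check described above can be sketched in Python roughly as follows. This is an illustrative assumption using the standard library's urllib, not WebCopy's actual code, and the function names are hypothetical: issue a HEAD request, inspect the Content-Type header, and only fetch the body if it is not HTML.

```python
from urllib.request import Request, urlopen

def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

def should_download_external(url, timeout=10):
    """HEAD-check an external URL; download only non-HTML resources."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
    return not is_html(content_type)
```

A server that rejects HEAD requests (or reports a wrong Content-Type) would defeat this check, which matches the "badly configured site" fallback described above.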

Quote from: hajzlik on November 27, 2021, 10:11:12 AM
...but even with "Download all resources" unchecked, WebCopy often spends majority of download time on skipped external links.

Quote from: hajzlik on November 28, 2021, 09:37:57 PM
Skipping over an external link literally takes longer than downloading it.

Again, if you have some information that could help isolate this it would be useful, because what you describe is not what I see. I'm not discounting the possibility of a bug, but in my testing I've never noted the behaviour you describe.

hajzlik

Hi Richard, thank you for taking the time to reply!

Here is a video of WebCopy skipping over external HTML links for 3 hours straight. Of course these skipped HTML files end up (and stay) in the download folder. WebCopy is set to "Site only" mode.

https://youtu.be/gO_ShjG5PQk