Are skipped files remembered in project?

nalle

Are skipped files remembered in project?
« on: June 28, 2020, 10:54:35 PM »
Hi, thanks for this amazing tool!

I have a question as to whether files that are listed as skipped in the "Skipped" tab due to an error are remembered between copy-sessions?
Let's say that I get a 403 error, solve the problem, then rerun copy website - would the program skip the previously skipped files?
It would make sense to permanently skip "known bad" links/files for efficiency, but that would cause a problem in the above scenario.

I would have thought that they aren't, but I wonder because the files remain listed in the Skipped tab even while the download is being rerun (their status may change, or they may get removed; the scan hasn't finished yet).

Richard Moss

Re: Are skipped files remembered in project?
« Reply #1 on: June 29, 2020, 06:35:00 AM »

Welcome to the forums and thanks for the interesting question.

WebCopy stores an internal map of all URLs it encounters, along with various bits of metadata, including whether the URL was skipped and why. This map is stored in the project file by default, so from that point of view WebCopy "remembers" between sessions.

However, this doesn't impact future crawls. If you tell WebCopy to re-crawl the website, it will merrily start scanning the root URL again and will update its internal map accordingly. Otherwise it ignores the stored map entirely, except to check the last downloaded date when trying to skip content that is unchanged.
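
To make that behaviour concrete, here is a rough Python sketch. It is purely illustrative and not WebCopy's actual code: the names `UrlEntry`, `recrawl` and `fetch` are hypothetical, and the list of URLs is passed in directly rather than discovered by following links from the root. The point it shows is that a stored "skipped" flag is never consulted, so a URL that previously failed with a 403 is simply tried again, while the stored last downloaded date is the only thing used to avoid re-downloading unchanged content.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class UrlEntry:
    """Hypothetical per-URL metadata, loosely modelled on the description above."""
    skipped: bool = False
    skip_reason: Optional[str] = None
    last_downloaded: Optional[datetime] = None


# Hypothetical fetcher: returns (HTTP status, last-modified time or None).
Fetch = Callable[[str], Tuple[int, Optional[datetime]]]


def recrawl(urls: Iterable[str], link_map: Dict[str, UrlEntry],
            fetch: Fetch, now: datetime) -> List[str]:
    """Visit every URL again and return the ones that were downloaded."""
    downloaded = []
    for url in urls:
        # Note: the old entry.skipped value is deliberately NOT checked here.
        entry = link_map.setdefault(url, UrlEntry())
        status, last_modified = fetch(url)

        if status != 200:
            # Record the failure; it only affects what is displayed.
            entry.skipped = True
            entry.skip_reason = f"HTTP {status}"
            continue

        # The request succeeded, so any earlier skip record is cleared.
        entry.skipped = False
        entry.skip_reason = None

        # The one use of stored data: avoid re-downloading unchanged content.
        if (entry.last_downloaded is not None and last_modified is not None
                and last_modified <= entry.last_downloaded):
            continue

        entry.last_downloaded = now
        downloaded.append(url)
    return downloaded
```

In this sketch, fixing the cause of a 403 and running `recrawl` again is enough for the file to be picked up; nothing has to be reset in the stored map first.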

This is actually due to change in a future version (1.11+), as one of WebCopy's apparent pain points is updating a previously downloaded website. So I'm going to juggle things around a little so that when rescanning a previously saved website it makes use of that internal map from the get-go. When I start work on this, I will make sure I introduce options so that you can control what happens with previously skipped files.

Hope this helps, if not please let me know.

Richard Moss

nalle

Re: Are skipped files remembered in project?
« Reply #2 on: June 29, 2020, 06:12:43 PM »
Hi Richard,
Thanks for your detailed reply! It's exciting to learn more about the inner workings of the application and the considerations that go into it.

If I follow you correctly, the "link map" is currently not used at all for recrawling a website, but in the future it can be used to update/recrawl a website more efficiently, rather than rescanning the entire website from the beginning.

At the moment, recrawling a website is no more efficient than the original crawl. Correct?

Thanks again for your time.