Remapping of links

Started by epiktek, December 09, 2018, 02:56:14 PM

epiktek

Hi,

I'm archiving a site that's 50GB+ with external media (hosted on AWS). Since this job was so large, there were errors partway through which made the process stop. What I didn't realize was that links would not get remapped until the very END of the job. Cyotek did a great job of downloading everything but since the operation was interrupted partway through, a lot of the links were not remapped.

What are my options now?
1) Can I exclude all external media and restart the process to download only the HTML/ASP files, letting it run to completion so the links get remapped? My concern is that links to the external media files won't be remapped because I excluded them from the download.
2) Can I force WebCopy to update only the HTML/ASP files? If I set "Always download latest version", will it re-download all the media files?
3) Delete all the HTML/ASP files (keeping the large media files), uncheck "Always download latest version" (so it won't force a re-download of the media files), and let WebCopy run through to completion so it remaps the links?

Any suggestions?

Thanks.

Richard Moss

Hello,

Welcome to the forums and thank you for the question. Unfortunately, if WebCopy crashed then it won't have any reference to anything you've already downloaded, and so won't be able to reuse those files.

What you could do is try splitting the job into two - I tested this process on the demonstration website which worked fine.

Firstly, make sure the "Folder | Empty website folder before copy" and "Links | Clear link information before scan" options are not set.

Secondly, add a rule for the expression .* with the Exclude and Crawl Content options set. This will instruct WebCopy to scan all HTML files, but not to download any resources or keep any of the HTML.

Next, add a rule to download the media. What expression you use depends on the contents of your site, but I used (\.png|\.jpg) to download images. For this rule you need to set the Include option. This rule overrides the first rule and states that any matching URLs should be fully downloaded.
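Conceptually, the two rules behave like an ordered regex filter where a later matching rule overrides an earlier one. A rough sketch of that idea (this is purely illustrative; the rule structure and function names are made up and not WebCopy's actual code):

```python
import re

# Hypothetical illustration of combining an exclude-all rule with a
# media include rule. Not WebCopy's real implementation.
rules = [
    {"pattern": re.compile(r".*"), "action": "exclude"},             # rule 1: exclude everything
    {"pattern": re.compile(r"(\.png|\.jpg)"), "action": "include"},  # rule 2: include images
]

def decide(url):
    # Later matching rules override earlier ones.
    action = "include"  # default when no rule matches
    for rule in rules:
        if rule["pattern"].search(url):
            action = rule["action"]
    return action

print(decide("https://example.com/page.html"))  # exclude (crawled, but not kept)
print(decide("https://example.com/photo.jpg"))  # include (fully downloaded)
```

With only these two rules in place, every URL matches rule 1 and is excluded, and only image URLs are rescued by rule 2.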

Now run the job. (Perhaps this should be tested on a subset of your site first to avoid having to download everything again only for it to fail).

Once complete, save the project so we can at least resume from this point.

Now, change the second rule from Include to Exclude and completely disable (or delete) the first rule.

If the job is then run again, WebCopy will start downloading all the HTML, CSS, and other files that it ignored before. It knows about the media files that were previously downloaded, so when it remaps the files it will use the local filenames of that existing media.
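The remapping step itself can be pictured as a search-and-replace over the saved HTML, substituting each remote URL with the local path it was downloaded to. A toy sketch (the mapping, URLs, and function name here are invented for illustration; this is not WebCopy's code):

```python
# Toy illustration of link remapping: rewrite remote URLs in saved HTML
# to the local filenames of already-downloaded media.
url_to_local = {
    "https://media.example-cdn.com/photo.jpg": "media.example-cdn.com/photo.jpg",
}

def remap_links(html):
    # Replace every known remote URL with its local equivalent.
    for remote, local in url_to_local.items():
        html = html.replace(remote, local)
    return html

page = '<img src="https://media.example-cdn.com/photo.jpg">'
print(remap_links(page))  # <img src="media.example-cdn.com/photo.jpg">
```

The key point is that the remote-to-local mapping only exists for files WebCopy knows it downloaded, which is why the project must survive to completion (or be saved) for remapping to work.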

Regarding your "Always download latest version" question, again this relies on WebCopy already having the metadata for a given URL, and on the website returning Last-Modified or ETag headers. So again, if WebCopy crashed and the project file wasn't saved, as far as it is concerned it doesn't know anything about any local files - it will ignore them and download fresh copies. I shall log something to reconsider this behaviour.
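For reference, the "latest version" check relies on standard HTTP conditional requests: the client stores the Last-Modified and ETag values from the first download and sends them back as If-Modified-Since and If-None-Match; a 304 Not Modified response means the local copy is still current. A minimal sketch of that mechanism using Python's urllib (the URL and validators are placeholders, and this is a generic HTTP example rather than WebCopy's code):

```python
import urllib.request
import urllib.error

def build_conditional_request(url, etag=None, last_modified=None):
    # Attach stored validators so the server can answer 304 Not Modified.
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

def fetch_if_changed(req):
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # 200: server sent a fresh copy
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # unchanged; keep the local file
        raise
```

Without the saved project, there are no stored validators to send, so every request comes back as a full 200 response and the file is downloaded again.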

I hope this helps, and I do apologise for the poor behaviour on the part of WebCopy. I do have plans to fix this issue (it is logged as #326), but I was leaving this until after #61 (multi-threaded crawling) was implemented.

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.