Run webcopy again but avoid duplicates

Started by granimati, May 11, 2020, 11:10:09 PM


granimati

I have a site that updates its files about once a month, and I'd like to download them regularly.

I'd like to make sure that I'm only downloading new or changed files, so I went to https://www.cyotek.com/support/kb/cyotek-webcopy/how-can-i-copy-only-changed-files-from-a-website.

I made the following selections in my current project:


  • The Save link information in project option should be set (General) - this is selected in General > Links
  • The Empty website folder before copy option should not be set (Copy) - this was in Folder Options
  • The Always download latest version option should not be set (Advanced) - in Advanced as advertised

However, when I try to run the project again to see what's new and updated on the site, WebCopy creates duplicates of the files (example: filename.pdf and filename-1.pdf). I'd like to skip over the files I've already downloaded without creating these duplicate files, only downloading the latest files.

Hope that makes sense!

Richard Moss

Hello,

Welcome to the forums and thanks for the feedback.

This functionality currently requires a little co-operation from the website itself. Use the "Test URL" tool to request one of the URLs that hasn't changed, and check whether the "Response Headers" section lists Last-Modified or ETag headers. Without these, WebCopy currently has no way of knowing whether the file has changed, and so downloads it again.
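For anyone who wants to run the same check outside WebCopy, here is a minimal Python sketch. It is not how WebCopy does it internally; the function names `find_validators` and `fetch_headers` are made up for illustration, and it assumes the server answers HEAD requests at all.

```python
import sys
from urllib.request import Request, urlopen

# The two response headers mentioned above that let a client detect changes.
VALIDATOR_HEADERS = ("Last-Modified", "ETag")


def find_validators(headers):
    """Return the validator headers present in a header mapping.

    Header names are matched case-insensitively, as HTTP requires.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        name: lowered[name.lower()]
        for name in VALIDATOR_HEADERS
        if name.lower() in lowered
    }


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_headers.py <url> [<url> ...]
    for url in sys.argv[1:]:
        found = find_validators(fetch_headers(url))
        if found:
            print(url, "->", found)
        else:
            print(url, "-> no Last-Modified or ETag header returned")
```

If neither header comes back for a file you know is unchanged, that matches the situation described above where the file gets downloaded again.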

With that said, I've spent the last few days rewriting the core logic of the crawler that decides if something should be downloaded. After 10 years of organic growth, the code was an unmaintainable mess with logic scattered all over. I completed that yesterday and got all tests passing again, but I noted a number of issues. One of these is that if the HEAD verb was disabled, any Last-Modified / ETag values appeared to be ignored. I haven't yet started writing new tests to expose this potential bug and get it fixed. Oh yes, and in an effort to resolve a problem where websites would sometimes return an unexpected status code such as 404 when a HEAD request wasn't supported, I made a change where this would cause header checking to be silently disabled, so these sites could be copied without users having to think about it. Unfortunately, I discovered when redoing the logic that this was too broad: even if header checking was previously working fine, it would be disabled as soon as it hit a problem.

It shouldn't be creating duplicates, as it should already know where the existing file is and overwrite it (but still download it!). This is something I will look into in case there's a bug.

I had been considering an option to skip downloading resources if the file already exists locally, as you're not the first person to have this sort of issue. I've made a note of this and will look into adding it to a future version.

Also thank you for pointing out the KB article is wrong - I forgot to update it after I moved a bunch of options around to try and make things easier to find by end users!

Regards;
Richard Moss
Read "Before You Post" before posting. Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

granimati

No worries, and thanks for taking the time to respond. I actually find WebCopy far easier and more intuitive than wget or Httrack, so I'd love to continue using it instead!

Not to push, but I would definitely prefer skipping resources if a file with the same name (or file size) already exists locally. Thanks so much again for responding!
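To make the requested behaviour concrete, here is a rough Python sketch of a "skip if it already exists locally" check. It is only an illustration of the idea, not an existing WebCopy option; `should_skip` and its `remote_size` parameter are hypothetical names, and the size would have to come from somewhere such as a Content-Length header.

```python
import os


def should_skip(local_path, remote_size=None):
    """Decide whether a download can be skipped.

    Skips when a file with the same name already exists locally. If the
    remote size is known, the local file must also have that exact size.
    """
    if not os.path.isfile(local_path):
        return False  # nothing local yet, so download it
    if remote_size is None:
        return True  # same name exists and there is no size to compare
    return os.path.getsize(local_path) == remote_size
```

Note that a name or size match cannot tell you that the content is unchanged, so a check like this trades accuracy for fewer downloads.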

Richard Moss

Hello,

I can look into adding the option as part of 1.8.1 or 1.9 as it isn't a big change. Did you run the Test URL tool and check to see if any appropriate headers were being returned?

Regards;
Richard Moss

rumia

I am currently using version 1.9.0.822.
Is there any news regarding the "-1" problem?
I currently have the same problem, although I have set the 3 settings from the link in the first post.
The filename-1.typ and filename.typ files have the same SHA-1 hash.

If not, would it be possible to add a feature so that if the file already exists locally (path\file), the download is skipped?
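
Until there is a fix, the hash comparison described above can be automated. The following Python sketch lists "-1" style copies whose SHA-1 matches the file they were derived from. It assumes the duplicates follow the name-N.ext pattern seen in this thread; `find_redundant_copies` is a hypothetical helper, and it only reports pairs without deleting anything.

```python
import hashlib
import re
from pathlib import Path

# Assumed naming pattern for duplicates: "filename-1.pdf" next to "filename.pdf".
SUFFIX_PATTERN = re.compile(r"^(?P<stem>.+)-\d+$")


def sha1_of(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_redundant_copies(folder):
    """Return (original, copy) pairs where the numbered copy is byte-identical."""
    redundant = []
    for candidate in sorted(Path(folder).rglob("*")):
        if not candidate.is_file():
            continue
        match = SUFFIX_PATTERN.match(candidate.stem)
        if not match:
            continue
        original = candidate.with_name(match.group("stem") + candidate.suffix)
        if original.is_file() and sha1_of(original) == sha1_of(candidate):
            redundant.append((original, candidate))
    return redundant
```

Review the reported pairs before removing anything, since a file whose real name happens to end in "-1" would also be listed if it is identical to its neighbour.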

szetakyu

Quote from: rumia on July 26, 2023, 08:00:35 PM
I am currently using version 1.9.0.822.
Is there any news regarding the "-1" problem?
I currently have the same problem, although I have set the 3 settings from the link in the first post.
The filename-1.typ and filename.typ files have the same SHA-1 hash.

If not, would it be possible to add a feature so that if the file already exists locally (path\file), the download is skipped?

Same question here.

dlyaverablyamit

I understand your concern about WebCopy creating duplicate files when attempting to download only new or changed files from a website. To avoid the creation of duplicate files, you can try the following steps:

1. Enable the "Rename existing files" option: In the WebCopy project settings, navigate to the "File Options" section. Look for an option similar to "Rename existing files" or "Rename files with conflicts." Enable this option to ensure that files with conflicting names are automatically renamed instead of creating duplicates.

2. Adjust the "Rename format" settings: In the same "File Options" section, you may find an option to specify the format for renaming conflicting files. This allows you to customize how the files are renamed when conflicts occur. You can choose a format that includes a timestamp or a unique identifier to ensure that the renamed files are distinct from the originals.