Unexplained filename changing results in improper copies of websites

Started by forsajt, April 25, 2015, 02:32:43 PM

forsajt

Hi!

I have just downloaded Cyotek WebCopy and to my disappointment I have encountered a very odd problem when trying to copy my first website. The website I'm trying to copy is using jQuery with some plugins:

jquery.soc-share.js
jquery.scrollTo.js
jquery.queryloader2.min.js
jquery.parallax.min.js
jquery.min.js
jquery.mb.YTPlayer.js
jquery.magnific-popup.min.js
jquery.isotope.sloppy-masonry.min.js

The problem is that Cyotek WebCopy renames these scripts, changing both the filenames on disk and the links in the HTML source, as follows:

jquery.soc-share.js -> jquery.js
jquery.scrollTo.js -> jquery.js
jquery.queryloader2.min.js -> jquery.queryloader2.js
jquery.parallax.min.js -> jquery.parallax.js
jquery.min.js -> jquery.js
jquery.mb.YTPlayer.js -> jquery.mb.js
jquery.magnific-popup.min.js -> jquery.magnific-popup.js
jquery.isotope.sloppy-masonry.min.js -> jquery.isotope.sloppy-masonry.js

As you can see, some scripts are renamed to the same filename (jquery.js), so the file on disk gets overwritten multiple times during download. This bizarre behavior results in nothing more than a bad, non-working copy of a website. Moreover, this behavior is not documented anywhere and there aren't any options to change it. Why is that? Why does Cyotek WebCopy even touch these scripts' names? It seems that for the moment Cyotek WebCopy is basically unable to properly copy most modern websites, which renders it a useless piece of code.

Could you please explain and fix that?


P.S.
I have noticed that it additionally changes the Unix separator to the Windows one in page content, without letting the user choose.

Richard Moss

Hello,

Thanks for the post. Bizarre behaviour, it definitely shouldn't be doing this and I have no explanation :) I'll take a look at this shortly. It's almost as if it is treating parts of the file name as the extension, but it makes no sense as to why this is happening.

Quote from: forsajt on April 25, 2015, 02:32:43 PM
P.S.
I have noticed that it additionally changes the Unix separator to the Windows one in page content, without letting the user choose.

Actually, there is an option for directory characters - it's labelled "Use alternate directory character" in the Copy section of a project's properties. I'll update the option so it's a little clearer for end users. It defaults to Windows as ... well... it's a Windows program!

Regards;
Moss

forsajt

Thanks for the quick response. Just for the record, this is happening for *.css files too, e.g.:

owl.carousel.css -> owl.css
font-awesome.min.css -> font-awesome.css

Regarding the directory separator, I stand corrected - the option is there, though I agree it would be clearer if we had a combobox named "Directory separator" with options "Unix type (/)", "Windows type (\)" for example ;)

Richard Moss

Hello,

Turns out this bug was introduced in the 1.0.10.x series - basically, the new code which ensures a file name is both unique and fits within 260 characters is slightly flawed: it also treats the part of the name after the second-to-last period as an extension, and removes it. None of the test files that WebCopy uses have multiple periods in their file names, so this regression wasn't picked up.
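To illustrate the kind of mistake described above, here is a minimal sketch in Python. It is not WebCopy's code, which isn't shown in this thread; the function names `buggy_local_name` and `fixed_local_name` are hypothetical, and the sketch only assumes that the extension is stripped one time too many before the name is rebuilt. That assumption is enough to reproduce the renames reported earlier in the thread.

```python
import os.path


def buggy_local_name(file_name):
    """Hypothetical model of the regression: strips the extension twice."""
    stem, ext = os.path.splitext(file_name)  # "jquery.min", ".js"
    stem, _ = os.path.splitext(stem)         # bug: ".min" is dropped as well
    return stem + ext


def fixed_local_name(file_name):
    """Only the text after the last period counts as the extension."""
    stem, ext = os.path.splitext(file_name)
    return stem + ext


reported = [
    "jquery.soc-share.js",
    "jquery.scrollTo.js",
    "jquery.min.js",
    "owl.carousel.css",
]
for name in reported:
    print(name, "->", buggy_local_name(name))
# jquery.soc-share.js -> jquery.js
# jquery.scrollTo.js -> jquery.js
# jquery.min.js -> jquery.js
# owl.carousel.css -> owl.css
```

A file name with a single period (for example index.htm) passes through both functions unchanged, which is consistent with the regression going unnoticed when no test file has more than one period in its name.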

This bug (and test oversight) has now been corrected and will be in the next update.

In the meantime, the "workaround" would be to uninstall that version, and install 1.0.9.1 instead - although this means you'd be at the mercy of any bugs fixed in 1.0.10.x.

Thanks again for the bug report - always happy to be able to make WebCopy a slightly less useless piece of code.

Regards;
Richard Moss

forsajt

Thanks Richard. I have already traced it down to the GetWorkFile method and patched the assembly for my own use (hope you don't mind), since I have little time and plenty of websites which urgently need to be archived. I think this would be a nice test as well, and I'll surely report a bug if I find any! Cheers.

Richard Moss

Personally I don't mind, given you used it to work around a bug, but speaking from the company perspective it's a violation of the license agreement that you agreed to (probably without reading it, like everyone else!) when installing the software, and the start of a slippery slope. After all, suppose the software was commercial and you patched out the license check. Just a few releases ago the software verified that its digital signature was valid. You'll have invalidated the signature with your edit, so if that check were still there, would you have removed it too?

Given the time argument you used, I would have thought it was much faster to install the previous version as recommended rather than poking around the guts of the program.

forsajt

Perhaps another bug in the same routine?

http://example.com/feed/, which is an XML feed, gets saved as c:\Downloaded Web Sites\example.com\index.htm, overwriting the original index.htm. Could you please check this too?

Thanks.

Richard Moss

Hello,

Yup, confirmed. Again, this was never noticed because the test URL was more than two levels deep. The method checks that the segment count is > 2 before it decides if it should be doing anything with the URI path, rather than just being smarter about checking for document names. That one has been there since 2011; strange it hasn't been noticed before now.
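A rough Python sketch of how a segment-count check like this can send a shallow URL to the root index.htm. This is a guess at the shape of the logic, not WebCopy's actual method: `uri_segments`, `buggy_relative_path` and `fixed_relative_path` are hypothetical names, the segment splitting only approximates what .NET's `Uri.Segments` returns, and the sketch covers directory-style URLs (ending in a slash) only.

```python
from urllib.parse import urlsplit

DEFAULT_DOCUMENT = "index.htm"


def uri_segments(url):
    """Split the path so that '/feed/' becomes ['/', 'feed/']."""
    path = urlsplit(url).path or "/"
    segments = ["/"]
    rest = path[1:]
    while rest:
        cut = rest.find("/") + 1
        if cut == 0:              # no further slash: last segment is a file
            segments.append(rest)
            break
        segments.append(rest[:cut])
        rest = rest[cut:]
    return segments


def path_from_segments(segments):
    """Map segments to a relative path, adding a default document to folders."""
    parts = [s.strip("/") for s in segments[1:]]
    if segments[-1].endswith("/"):
        parts.append(DEFAULT_DOCUMENT)
    return "/".join(p for p in parts if p)


def buggy_relative_path(url):
    """Assumed flaw: folder URLs with two or fewer segments ignore the path."""
    segments = uri_segments(url)
    if segments[-1].endswith("/") and not len(segments) > 2:
        return DEFAULT_DOCUMENT
    return path_from_segments(segments)


def fixed_relative_path(url):
    """Always derive the location from the path, whatever its depth."""
    return path_from_segments(uri_segments(url))


print(buggy_relative_path("http://example.com/"))       # index.htm
print(buggy_relative_path("http://example.com/feed/"))  # index.htm (collision)
print(fixed_relative_path("http://example.com/feed/"))  # feed/index.htm
```

Under these assumptions, http://example.com/feed/ has only two segments, so the buggy version never looks at the path, while a URL such as http://example.com/blog/feed/ has three and maps to blog/feed/index.htm in both versions.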

Thanks for the catch, now writing tests for it, and it will be in the next update.

Regards;
Richard Moss

Richard Moss

There's a new beta setup for the next release of WebCopy available here. It has a fair list of fixes which you can find here.

I think I've resolved all of the regressions introduced with the .10.x series, but you never know. Certainly all tests pass, it just depends if I'm covering all possible cases.

This isn't an official release as such, but given the issues you were experiencing, and the number of bugs resolved I wanted to get it out there for you sooner rather than later - I have some more non-bug related tinkering to do before I do a final release.

If you get a chance, please see if this resolves your problems (note: I haven't looked at issues to do with directory characters) and let me know if this helps (or not!)

(Note that this build requires .NET 4.5 and no longer supports Windows XP)

Thanks;
Richard Moss