Encountering 404 errors in URLs with special characters

Started by kb1985, April 21, 2015, 12:41:37 PM

Previous topic - Next topic

kb1985

Hi,
I am trying to make a local copy of my blog and I have encountered a problem.

It seems that some of the files on the server are marked by the software as 404, while they definitely exist on the server.
I figured out that the problem occurs with files that have special characters in their filenames (for example # or the letter "ł" from the Polish alphabet).

Could you please help me? I am no IT person so I can't figure this one out.

For example, check this file, which generates the problem:
http://semp.pl/fotosy/sieciowe/Bukowski%233_02_20130926_221615_430.jpg

I am getting a 404 error in your software, and the URL in the software looks like this (note the # character):
http://www.semp.pl/fotosy/sieciowe/Bukowski#3_01_IMG_1713_430.jpg

Greetings
K.

Richard Moss

#1
Quote from: kb1985 on April 21, 2015, 12:41:37 PM

Hello,

Thanks for your message, sorry you're having problems with WebCopy. Looks like you've found a new bug for me to fix!

Basically, the file on your server is actually named Bukowski#3_01_IMG_1713_430.jpg. However, # isn't a valid character in a URL path, so it is automatically encoded as %23 (you can learn more about that here).

Unfortunately, it looks like "somewhere" WebCopy is decoding the URL but not re-encoding it before submitting it to your server - and as # is used to mark a fragment, the server is probably being asked for a file named Bukowski instead, hence the 404.
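The behaviour described above can be sketched in a few lines of Python (an illustration of the general mechanism, not WebCopy's actual code):

```python
from urllib.parse import quote, unquote, urlsplit

# '#' is reserved in URLs, so a file name containing it must be
# percent-encoded before the URL is sent to the server.
encoded = quote("Bukowski#3_01_IMG_1713_430.jpg")
print(encoded)  # Bukowski%233_01_IMG_1713_430.jpg

# If a crawler decodes the URL but forgets to re-encode it, the raw '#'
# is interpreted as the start of a fragment and the path is truncated.
parts = urlsplit("http://www.semp.pl/fotosy/sieciowe/Bukowski#3_01_IMG_1713_430.jpg")
print(parts.path)      # /fotosy/sieciowe/Bukowski  - the file requested, hence the 404
print(parts.fragment)  # 3_01_IMG_1713_430.jpg      - never sent to the server
```

The fragment is stripped client-side before the request goes out, which is why the server only ever sees a request for a file named Bukowski.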

Do you have an example URL with a Polish character that is failing? I'd like to verify that the same bug causes this too, and not something else - a sample URL will help me replicate that locally too and determine a solution.

As with many bugs, there's not going to be a lot you can do to work around this until I get it fixed. As I don't know where it's happening yet, I need to create a new test case and do some debugging. I've replicated the bug here locally; I just need to find out where it is, fix it, and ensure the fix doesn't have any side effects.

Sorry for the inconvenience :(

Regards;
Richard Moss

kb1985

Quote from: Richard Moss on April 21, 2015, 05:04:15 PM
Quote from: kb1985 on April 21, 2015, 12:41:37 PM

Thank you for the quick reply and for the detailed explanation!

I guess I will just wait for the updated version and make the backup once again when the update is available. Could you tell me whether the software will have to re-download everything, or will it just download the missing files?

Some other files causing problems (case of Polish characters):

http://www.semp.pl/fotosy/sieciowe/Zdj%eacie0218_430.jpg
(http://www.semp.pl/fotosy/sieciowe/Zdjęcie0218_430.jpg)

http://www.semp.pl/fotosy/sieciowe/Rezydencja_przek%b3adki_430.jpg
(http://www.semp.pl/fotosy/sieciowe/Rezydencja_przekładki_430.jpg)
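For reference, the escapes in those URLs match the Windows-1250 (Central European) code page; a quick Python sketch, assuming cp1250 is the code page the server used:

```python
from urllib.parse import unquote

# 0xEA is 'ę' and 0xB3 is 'ł' in the Windows-1250 (Central European)
# code page, which would explain the %ea and %b3 escapes above.
name1 = unquote("Zdj%eacie0218_430.jpg", encoding="cp1250")
name2 = unquote("Rezydencja_przek%b3adki_430.jpg", encoding="cp1250")
print(name1)  # Zdjęcie0218_430.jpg
print(name2)  # Rezydencja_przekładki_430.jpg
```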

I just looked at the list of errors, and it seems all incidents of false 404s involve either a Polish character or a #.
If you need any further info please let me know. I will be happy to provide you with it.

And thank you for your software - I find it really user-friendly!

Richard Moss

Hello,

Thanks for the further information - that should help very nicely! I'll drop a line to this thread when it's fixed and available for download.

Glad you like the software (warts and all!), although I don't think it's the most user-friendly application myself. It gets the job done though!

Regards;
Richard Moss

Richard Moss

Well, the good news is the bug is fixed.

The bad news is that to fix it I had to revert a fix made in June 2013. The comment in the code said the fix was there to resolve an issue where a relative URI couldn't be combined, but it didn't give an example, the commit log has no details either, and I can't find any error reports explaining why the fix was made. There doesn't seem to be an automated test for it either, as none of the automated tests for URI combining fail when I revert it. Happy days.

At times we have a habit of accidentally reintroducing previously fixed bugs, so it would probably be better not to reintroduce them deliberately - I need to work out why that fix was originally made before I can decide what to do.

Richard Moss

There's a new beta setup for the next release of WebCopy available here. It has a fair list of fixes, which you can find here.

It includes a fix for the # encoding you experienced. However, I was unable to find a problem with Polish characters - the test page I added for this isn't causing any problems with WebCopy.

If you get a chance, can you see if this new build resolves your issues?

(Note that this build requires .NET 4.5 and no longer supports Windows XP)

Thanks;
Richard Moss

kb1985

Thx! Redownloading the website now! Will let you know if I encounter any problems.
I have two questions:
1. Can I clear the error log?
2. Do I need to re-download the website every time, or will the program download only new files?

Richard Moss

Hello,

If by error log you mean the error tab, this is dynamically populated based on the crawl results. So the answer is... "maybe". If I've successfully fixed the bug that was causing your 404s, then those should disappear. If there are other bugs causing some form of issue transforming URLs, or if there are actual problems with your website... maybe not.

WebCopy supports copying only changed files, but it depends on a specific combination of settings, some of which weren't set by default in the original builds of WebCopy for technical reasons. Now that I've been reminded of this, I think I'll change those defaults for the official release of this build.

There's a Knowledge Base article which tells you what you need to have set - bear in mind that if your project isn't currently set up this way, it will probably re-download everything, as it won't have the information it needs.

Regards;
Richard Moss

kb1985

Hello, here is some feedback.

Files seem to download without any problems - thank you!

However, there is one thing - in place of every Polish character there is a black square with a question mark in it. The file itself is not corrupted and the web browser opens it without any problems; it's just Windows that doesn't display the filenames correctly. Maybe there is something that can be fixed here too. All the characters are replaced by the same square-with-question-mark symbol, and yet the web browser seems to open those files correctly.

Files with # download without any problems.

Also, both kinds of files - those with # and those with Polish characters - didn't disappear from the error log. Can I clear it manually?

Richard Moss

Thanks for the follow up. Glad at least part of it is working, but ... sigh ... it should all be fixed.

Where exactly are you seeing the black squares? In the body of the HTML, or in the file name of the actual file?

If you copy http://demo.cyotek.com/, does this exhibit the same behaviour? (http://demo.cyotek.com/features/specialcharacters.php links to http://demo.cyotek.com/features/specialcharacters-%C4%99.html) - these are what I used for testing. For me, those particular files download fine, the file name includes the ę character, and everything "just works".
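As a small sketch of the "just works" case: with UTF-8, the ę character round-trips cleanly through percent-encoding (illustrative Python, not WebCopy's code):

```python
from urllib.parse import quote, unquote

# 'ę' (U+0119) is the byte pair C4 99 in UTF-8, so a correctly encoded
# URL carries it as %C4%99 - exactly what the demo link above contains.
encoded = quote("specialcharacters-ę.html")
decoded = unquote(encoded)
print(encoded)  # specialcharacters-%C4%99.html
print(decoded)  # specialcharacters-ę.html
```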

If you have a link to a webpage that in turn links to pages that are coming up with "black squares" in the copy, let me know that URL and I can scan it here using the source code, hopefully reproducing your issue and potentially fixing it.

In regards to the error tab - if you're saving the link map, it will have stored entries for the URIs that were transformed incorrectly. If you open the Tools menu and then click Remove Missing Links, it should offer to get rid of all the 404s, which should solve that problem.

Thanks;
Richard Moss

kb1985

Hi, here are examples of two files that have the symbol in their filename instead of the proper character. Note that it doesn't bother me that much, since the web browser opens the files correctly and they are visible on the downloaded site. It is just curious that the file itself on the HDD has this symbol instead of the character. Also, in the first file the Polish character is "ę" and in the second file it is "ź", and yet they are replaced by the same symbol on the HDD. Maybe there are several ways of encoding the character, and Windows can't display it properly while the web browser can decode it? That's not a major problem, but I thought I would let you know.

http://www.semp.pl/fotosy/sieciowe/Zdj%eacia-0097_430.jpg
http://www.semp.pl/fotosy/sieciowe/TVN%20Warszawa%20-%20Zapowied%9f%20Kodu%20Dziewi%eaciu%20Symboli_430.jpg

Also, I have one more question. Does the program crawl through all the files, or does it give up when it gets too many errors? I thought downloading stopped a few times because of some errors. Sorry to bother you with so many questions; I'm still trying to figure this out. Many things are intuitive here and many are not.

Richard Moss

#11
Hello,

Thanks for the follow up. I managed to reproduce the problem using the URL you provided. It does seem to be an encoding issue of sorts. Your filename is Zdjêcia-0097_430.jpg, which is then encoded as %ea. However, for me, this character is encoded and decoded as %C3%AA. If I attempt to decode %ea as UTF-8, I get the replacement character (�).

Right now I haven't the foggiest what is causing that, and I'm going to have to do some more digging. It might have to wait, though, as given the number of bugs in the currently live build, I really need to get the new update out sooner rather than later. I'm already going to have to ship without updating the help file (not that it's much good at the best of times) and some other minor enhancements; it's been a busy month for bugs :(

In regards to your other question (I don't mind answering them, but it's probably better to start new threads so they are easier for other people with similar questions to discover :) ) - by default, WebCopy will ignore most errors and continue. If WebCopy catches an exception while crawling a URI, it will cancel the crawl. It should also abort if certain 5xx codes are returned, but I suspect that might have been refactored away; I butchered the library in the last 3 or 4 releases for various reasons. I've added a note to check that, and logged an enhancement to look at adding a new option to abort after a given number of errors.

Oh - and I personally think WebCopy is a bit unfriendly to use. As with most of our stuff, it started off as a personal tool to fill a specific need. One of these days I'll get around to trying to improve UI flows etc, so please feel free to let me know of bits of the program which seem more complicated to use (or understand) than they should be - thanks!

Regards;
Richard Moss

Edit: It's the default Windows-1252 encoding, not UTF-8. Not even remotely sure how I'm going to resolve this one; if there are no headers to determine a specific encoding, I use UTF-8 by default. Definitely need to think this one over.
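The ambiguity is easy to demonstrate: the same %ea escape decodes three different ways depending on which code page is assumed (illustrative Python; WebCopy's internals may differ):

```python
from urllib.parse import unquote

# The same percent-escape yields different characters depending on the
# code page assumed while decoding - the URL alone doesn't say which
# one the server meant.
as_cp1250 = unquote("Zdj%eacia", encoding="cp1250")  # what was intended
as_cp1252 = unquote("Zdj%eacia", encoding="cp1252")  # what WebCopy saw
as_utf8 = unquote("Zdj%eacia")                       # default: UTF-8
print(as_cp1250)  # Zdjęcia
print(as_cp1252)  # Zdjêcia
print(as_utf8)    # Zdj�cia (invalid UTF-8, replacement character)
```

Without a header or other metadata naming the code page, a decoder can only guess, which is the crux of the problem described in the edit above.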

kb1985

Maybe the problem is on the server?
The file on my HDD is definitely "Zdjęcia-0097_430.jpg", and yet you say that on the server it is visible as "Zdjêcia-0097_430.jpg". It's like there is no difference between those two characters and the symbol. Anyway, that's not that big a deal, since the web browser opens the file correctly despite the changed filename.

As for your question regarding the UI - it's good that all the expert features are there, but a simple mode would definitely be a great thing:
1. Enter website you wish to download.
2. Enter subfolders you would like to exclude.
3. Enter domains you would like to include.
4. Download + list files that were not downloaded and why.

Richard Moss

Quote from: kb1985 on May 08, 2015, 06:22:08 PM
As for your question regarding UI - that's good that all expert features are there but definitely great thing would be simple mode:
1. Enter website you wish to download.
2. Enter subfolders you would like to exclude.
3. Enter domains you would like to include.
4. Download + list files that were not downloaded and why.
I don't really want this thread to derail too much, but that is fairly close to one of the ideas I was investigating for 2.0, except I got sidetracked by something else :)