Rules Exclusion Happens After Processing?

Started by titus, February 01, 2022, 07:11:56 PM


titus



Hey there, newbie here. I'm mirroring a Blogspot site and it is riddled with these links that I won't need later on when I read the site offline while travelling. Above you can see some of those links.

Now I've made a few rules to try to skip those things entirely:



The Rule Checker seems to indicate that these rules should catch instances of the share-post links, but as you can see the Skip Reason shown is Failed rather than Exclusion. Am I doing it wrong? Or does the program try to scan the link before deciding whether to exclude it?
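
To show what I mean, here is a rough Python sketch of the kind of URL matching I expect the rule to do; the pattern and the example share-post link are only assumptions for illustration, my actual rules are in the screenshot above.

import re

share_post_rule = re.compile(r"share-post\.g")  # assumed exclusion pattern

# An assumed example of the kind of share link Blogger adds under every post.
url = "https://www.blogger.com/share-post.g?blogID=123&postID=456&target=twitter"

print(bool(share_post_rule.search(url)))  # True, so the rule should exclude it

The match itself looks fine to me, which is why I'm wondering whether the exclusion is simply applied too late.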

aussiewan

I have a similar concern.

I am trying to copy a site that requires me to be logged in. However, because WebCopy appears to follow each link before it processes the rules, it loads the logout links, which deletes my cookies (and possibly does server-side things too), and I can no longer access the content I want for the rest of the copy session.

This is making this product completely unusable for me at the moment unfortunately.

I have tried to use the Browser mode and "normal" mode.
I have tried restructuring the Rules to have an "allow everything that isn't logout" instead.
I tried to set the cookies for the page in the project profile, but I couldn't get them right, so I'm using the browser login page to achieve this instead.
I have tried the nightly release of 1.9.1.824 and it is still the same.

Any assistance with this would be appreciated.

aussiewan

Here is a bit more info about what I'm seeing. Please see the attached image.

I have rules that clearly match, as you can see in the Rule Tester, which should exclude the logoff pages.

The Results tab shows that the pages are being skipped because of "redirect", but they have a size, which suggests to me that they are hitting the web server and triggering the logout function.

When capturing the site, the index.htm file shows the "you're logged in as x" message, and then after a little while it gets re-downloaded and has the "log in" link instead.

ThomasS

This was reported a year ago (link to thread), but for me it still doesn't work: I am trying to crawl a phpBB site that contains many logoff links. Although I excluded these through a rule (which apparently works), it seems like WebCopy calls the excluded URLs anyway before deciding to exclude them.

WebCopyProb.png

When I look at the local pages, "memberlist" (and all the other pages above it) still shows that I'm logged in, while "search" shows that I was logged out, even though the "mode=logout" link was excluded.

Is there anything I can do to stay logged in? I tried 1.9.0 and 1.9.1 (nightly), but to no avail: both logged off during crawling.

Thank you for your support
Thomas

Richard Moss

Welcome to another episode of "Richard Admits To An Embarrassing Bug", courtesy of your host, the one writing the embarrassing bugs.

I had checked this previously (multiple times!) and couldn't reproduce it, but when I was tweaking the rules so that distance limits applied only to text/html, I finally twigged why this is happening, wrote a reproduction test and sighed.

Basically, there are some advanced filtering options that need information from the server:

* The Download all resources option is set
* Rules that do processing based on content-type
* Minimum file sizes
* Maximum file sizes
* (1.9.1+ only) Maximum distance from base URI (this version wants to know the content type so that it applies the limit only to text/html)

If any of these options are set, then WebCopy makes an additional request as part of rule processing so that it can read the Content-Type and Content-Length headers. This means it sends a HEAD request to the logout page you excluded, and thus logs you out.  :-[
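
To make the ordering concrete, here's a minimal sketch in Python of the decision logic I'm describing; it isn't WebCopy's actual code, and the mode=logout pattern and helper names are just placeholders for illustration.

import re
import urllib.request

EXCLUDE_PATTERNS = [re.compile(r"mode=logout")]  # placeholder exclusion rule

def fetch_headers(url):
    # Issues a HEAD request; on a logout URL this alone is enough to end the session.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return dict(response.headers)

def should_skip_buggy(url, needs_headers):
    # Buggy order: options that need server headers (content-type rules, file
    # size limits, "download all resources") trigger a HEAD request as part of
    # rule processing, before the exclusion rules have been applied.
    headers = fetch_headers(url) if needs_headers else {}  # the logout fires here
    return any(p.search(url) for p in EXCLUDE_PATTERNS)

def should_skip_fixed(url, needs_headers):
    # Intended order: URL-only exclusion rules run first, so an excluded
    # logout link is never requested at all.
    if any(p.search(url) for p in EXCLUDE_PATTERNS):
        return True
    headers = fetch_headers(url) if needs_headers else {}
    # ... content-type and size checks would use 'headers' here ...
    return False

In other words, the skip decision in should_skip_buggy still comes out correct, but by the time it is made the server has already seen the request, which is exactly what you're all observing with the logout links.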

I don't know when this bug was introduced (file limits were introduced in 1.7 and content-type rules in 1.8), but I do know that in 1.9 I spent a very long time drawing a flow chart of the crawl decision logic and then rewriting all of the code to be less spaghetti and more maintainable, and I definitely (re-)introduced this in that build.

Not sure if this is the cause of all of the cases in this thread, but it seems a good bet given the difficulty in reproducing; please let me know if this isn't the case. I've logged issue #481 for this and it will be part of the mammoth 1.9.1 bug fixing release.

Another case of better late than never...

Edited to add: it also does this when "download all resources" is set. That makes sense (it needs to know the content type!), but it also means this affects pretty much every single crawl. Oops.
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.