WebCopy is copying way too much because the site is in a subdirectory

Started by gedanken, July 08, 2020, 10:44:15 PM


gedanken

I am trying to copy a URL that looks like:

https://web.archive.org/web/20181104190605/http://justonesite.com/

but it is copying everything on web.archive.org. Or at least, it finishes justonesite.com and then begins downloading 20,000 docs from /web/(various dates)/http://linkedin.com.

How do I get it to limit itself to just justonesite.com? Please?

gedanken

I seem to have "fixed it" by specifying Do not follow redirects... but now all I have is index pages with dead links =/ Still trying to figure this out.

gedanken

It is scraping the index page, with no images, and none of the links work. Any help appreciated.

Richard Moss

Hello,

Try making the following adjustments:


  • Crawl mode should be Site only (General category)
  • Disable Download all Resources (General category)
  • Enable Crawl above Root URL (Advanced category)
  • Add a Does not match exclusion rule for the expression justonesite\.com
These settings will tell WebCopy to ignore any URL that doesn't contain "justonesite.com" and should therefore limit the crawl.
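For illustration only, here is a sketch of that last rule's logic in plain Python (not WebCopy's actual code): a "does not match" exclusion skips any URL that fails to match the expression.

```python
import re

# Sketch of a "does not match" exclusion rule: URLs that do NOT
# contain the pattern are excluded from the crawl.
KEEP_PATTERN = re.compile(r"justonesite\.com")

def should_exclude(url: str) -> bool:
    """Return True when the crawler should skip this URL."""
    return KEEP_PATTERN.search(url) is None

# The target site survives; other Wayback captures do not.
print(should_exclude("https://web.archive.org/web/20181104190605/http://justonesite.com/"))  # False
print(should_exclude("https://web.archive.org/web/20181104190605/http://linkedin.com/"))     # True
```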
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

gedanken

Thanks!

For the exclusion rule, I have

Compare with "Path and Query"

and for Options I just checked Exclude, but the final rule says "Exclude, Reversed"

(and of the other three settings, I had two of them wrong :D)

Does that rule look right? And thanks! This is 12 years of data I am trying to save.

The tool is great; it felt good donating to good people. Thanks again for replying!

Matt

gedanken

This does a great job of not grabbing unwanted stuff, but it is still not copying any content at links. For example, I get the index.htm homepage, which has a link that leads to a directory called event, such as:

https://web.archive.org/web/20181104190605/http://justonesite.com/event/

On the website, that link is clickable and leads to content and images. On my copy, it's a link that leads to a 404 error.

As the index page has several of these links, how do I grab their content, please?

Richard Moss

Hello,

Glad the settings worked. The reason it says "Reversed" is because that is how the flag was originally implemented before it was put on the pending-deletion list; instead of a confusing checkbox that says "Reverse", there is now the "Match" / "Don't Match" option list. That flag is still technically deprecated and on its last legs, as I plan on adding more options to "Match" / "Don't Match" to make it possible for users to specify plain text instead of having to know or care about regular expressions. At the same time I'll add a new column to the rules list so that it shows "Match", "Don't Match", "Contains", etc.

I visited the URL you linked in your reply, but for me it is a 404, so WebCopy will ignore it. Did you paste the correct URL into your reply?

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

gedanken

Thank you again for the reply!

OK, it was a 404 for me as well. Let me copy and paste the good and the bad; not to be obtuse, but I suspect the same error causing this 404 issue is also causing the crawler one.

This works for me; I got it by clicking the link on the **scraped** index page:

https://web.archive.org/web/20180815191657/http://ripleyandfriends.com/events

Now here is the link I click to get there, copied by right-clicking the index page link:

https://web.archive.org/web/20180815191657/http://ripleyandfriends.com/events

And now the one I gave you:

https://web.archive.org/web/20181104190605/http://justonesite.com/event/

OK, so two issues: I forgot to change the domain name I had been hiding (general paranoia, sorry) AND I misspelled it.

I am going to try to add two snapshots: one is what I see on the archive site, and the other is the index page I get (with dead links, unlike the archive index page, which has working links).

Perhaps by just seeing what is missing in the retrieved version versus the "live" (archived) version, you'll instantly know what I am missing. I really appreciate this, as it's a lot of critical London history and art I am trying to rescue for the artists who made it.

[screenshot: the archived index page as seen on web.archive.org]

and I get

[screenshot: my copied index page, with dead links]

and this is all I have in my folder: an index file and none of the subdirectories or other pages (hence the links go nowhere in my copy)

[screenshot: the download folder contents]

Richard Moss

Hello,

Thanks for the information, that is helpful.

Judging from your image, it isn't downloading the CSS. When I checked the site via the URL you provided, I could see this is also why it is missing images, as they are defined in the CSS file.

The styles are coming from sites/all/themes/glossyblue/style.css?L, so we need to work out why this is getting skipped. I created a brand new project, configured it as I described in my previous reply, and also configured it to dump everything into one folder so I could see it more easily. The download had been running for five minutes before I cancelled it, and I could see plenty of content, including `style.css` and your missing `body-bg.jpg`.
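As an aside, the reason a skipped stylesheet takes the images down with it is that CSS references them via `url(...)` declarations, which a crawler only discovers by parsing the CSS. A rough illustration in plain Python (not WebCopy's code; the stylesheet content is made up):

```python
import re

# Hypothetical sketch: find image references inside a downloaded stylesheet.
# A crawler that never fetches the CSS never sees these URLs at all.
css = """
body { background: url('../images/body-bg.jpg') repeat-x; }
h1   { background-image: url("header.png"); }
"""

# Match url(...) with optional quotes around the path.
URL_REF = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

for ref in URL_REF.findall(css):
    print(ref)  # ../images/body-bg.jpg, then header.png
```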

Actually, as I type this reply, I realise I've configured a slightly different set of rules to my original reply (the intended effect is sketched after the list):


  • ripleyandfriends\.com (Path, Matches, Include, Stop Processing)
  • .* (Exclude)

If you open the Website Links dialogue (you can find it on either the View or Tools menu; I moved it in the last update), you will see a list of all URLs found during a crawl. Try typing style.css into the filter box for the URL column and see what is listed in the Status column for any matches. If it says _Skipped_, right-click, choose Properties, and then look for the reason in the Excluded field of the Properties dialogue (I hadn't noticed I'd omitted the reason from the links list; I will add that back).
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

gedanken

Thanks so much, I will donate more when I can! (Getting paid once a month stinks in July because it has five Wednesdays, not four...)

For the second rule, I chose

.*   (Path and Query String) (Matches) (Exclude)

I hope that's right; giving it a try now, thanks!

Matt

gedanken

[screenshot attachment no longer available]

Richard Moss

Hello,

You don't need to "fix" redirects. WebCopy reports the original URL as skipped because it was redirected, but it still follows the redirect. So if 1.html redirects to 2.html, WebCopy will report 1.html as skipped but will show 2.html as downloaded (and any links that pointed to 1.html will be remapped to 2.html instead).
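In other words (a rough sketch of the behaviour in plain Python, not WebCopy's code): the crawler follows the redirect, notes where the page actually ended up, and remembers the mapping so links to the old URL can later be rewritten to the new one.

```python
import urllib.request

# Hypothetical sketch of the redirect handling described above.
remapped = {}  # original URL -> final URL after redirects

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:  # redirects followed automatically
        final_url = response.geturl()
        if final_url != url:
            # e.g. ".../1.html" -> ".../2.html"; links pointing at the
            # old URL can be remapped to the new one when saving pages.
            remapped[url] = final_url
        return response.read()
```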

I'm sure I removed redirects from appearing in the Skipped report; I'll improve how they are presented in the Results pane as well, as this seems to be a frequent source of confusion.

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

gedanken

I changed redirects in the profile from Do not follow to Follow Internal (approximate text; it is still running).

It's definitely picking up way more files this run, and taking much longer by far. From trying to read the scrolling output, I think it's working? I will know more soon :D Thanks again! This project is hugely important to me and my client/friend both. Decades of work.

Richard Moss

Hello,

Yes, that sounds like it is working. Did you change the redirect setting originally? By default it shouldn't have been set to not follow.

The problem with using the Internet Archive is that WebCopy won't be able to download a single view: due to the nature of the Archive, it takes multiple snapshots which tend to cross-reference each other. WebCopy will have to download them all, so you'll likely get multiple copies of files (like style.css) from the different snapshots. Unfortunately, I don't think there's a real way of avoiding this; you'd have to download everything and then try to tidy it up some.
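If it helps with the tidying, one rough approach (my own hypothetical sketch, not a WebCopy feature; the folder name is made up) is to hash the downloaded files and list byte-identical copies pulled from different snapshots:

```python
import hashlib
from pathlib import Path

# Hypothetical clean-up sketch: group byte-identical files (e.g. the same
# style.css saved from several Wayback snapshots) by content hash.
def find_duplicates(root):
    seen = {}  # sha256 digest -> list of paths with that content
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen.setdefault(digest, []).append(path)
    return {h: paths for h, paths in seen.items() if len(paths) > 1}

for digest, paths in find_duplicates("downloaded-site").items():
    print(f"{len(paths)} copies:", *paths)
```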

Not that I'm complaining; the IA is a wonderful resource, as so much information is being lost when old websites go offline or get deindexed. Heck, even browsers now insist that anything not using SSL (and now anything not using TLS 1.2) is evil and should be swept under the rug. I hope you manage to recover your friend's resources!

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

gedanken

AHH, it worked, thanks a million!

To answer your question, I have no idea who/what/when/where or why it was set to not follow redirects. I think I did it when it was grabbing non-Ripley sites, which we later fixed with a rule, but by then I could not recall. Another example of when starting over with a new profile would have been a good idea.

I am so, so grateful for this product and all you have done!

I have sent another donation, as best I can at the moment.  I hope it helps you in some small way!

Matt