Rules setup to copy external resources?

Started by Neo, May 23, 2016, 11:07:51 AM


Neo

Hello,
I have tried Cyotek WebCopy, and in my case it works very well apart from one problem: I cannot figure out how to make it copy files external to the site, e.g. all external PDF files.
I have studied the rules help, but there are simply no examples showing how the rules are used, and I cannot use the simple "add every link" option because there are too many of them...
I have tried to input stuff like
^pdf
pdf
into the "include" rules box to grab PDF files regardless of whether they are on the site or on an external site, but it doesn't work...

Could you please be so kind as to give some examples of how to use rules to make the program include all files of a given type (like PDF) from external sites in the downloads?

Richard Moss

Hello,

Welcome to the forums; I'm sorry you're having trouble with WebCopy. The documentation is indeed a bit lacking - I am working on it, along with about a billion other things to do with WebCopy! In regard to your specific issue, did you try adding the host for your external site to the Additional Hosts list?

For example, assume you're copying http://www.mydomain.com, which has links to documents on another site - say http://downloads.superdocs.org/files/document1.pdf. By default, WebCopy won't copy anything from this second site. Adding downloads.superdocs.org to the Additional Hosts page in the Project Properties dialog will instruct WebCopy to also crawl that site. However, this may mean you need to add rules to stop it from crawling the entirety of the other site.
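As a rough, untested sketch of such a rule (your URLs will of course differ): rules are regular expressions matched against each URI, so an exclude rule along the lines of

^http://downloads\.superdocs\.org/(?!.*\.pdf$)

should block everything on downloads.superdocs.org except the PDF files, because the expression only matches URIs on that host which do not end in .pdf, leaving the PDFs themselves to be downloaded.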

Hope this helps; if not, please let me know. I'm currently reworking how WebCopy handles external sites, which should make it much easier to do this without having to specify lots of different options. When I finally get a chance to update the documentation, I'll try to include some examples to make it easier for new users to get to grips with WebCopy's sometimes obtuse settings.

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

Neo

Hi Richard,

Thank you very much for the kind and extensive answer.

Actually, I had already considered the option of adding a second URL to the hosts list, but as you wisely point out, that could make the program crawl the other site in an unwanted way. Also, being completely new to the program, I didn't know exactly what the consequences would be crawling-wise, or how much I would have to add to the rules to control it properly. And considering the problems I had just trying to make a rule to grab external files (not knowing that grabbing external content like PDFs listed as links on the original host's pages appears to be impossible with simple rules), I couldn't see how I could possibly control the crawling on a secondary URL...

I thought there had to be a simple way to grab external files referenced on the primary host's pages by adding rules (as in "HTTrack Website Copier", which for some reason won't copy the site in question), without causing the program to go crawling off to other sites in general...

Personally, I think a simple option to grab anything on other sites that fits a simple rule - a file extension or a pathname fragment, say - but nothing else from those external sites, would be really great and very useful; see the imaginary examples just below.
Of course, I do understand that what might look like a simple thing in the UI might take a lot of coding to accomplish ;)
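Something like these (purely imaginary) rule entries is what I have in mind:

extension = "pdf"
pathname contains "site-name.com/downloads/pdf-files"

i.e. a plain attribute test per entry, rather than a full pattern syntax.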

Incidentally, since the rules syntax appears not to be explained anywhere, it is still a bit confusing for a newbie to understand how adding rules works :-[ (I couldn't find the "confused" emoticon here, so I will just go with embarrassed, which I guess I also am - just a tad ;D )

Anyway, I am grateful that I have the opportunity to use your WebCopy program free of charge, and for the kindness that you show.
Thank you very much :)

Neo

P.S.

I managed to use the interface to add two entries to the "Additional Hosts" page (picked from the list after a site scan), and by doing so I got the site resources that I needed copied to my drive :)
However, when WebCopy was in its "Remapping local files" phase (or whatever it said), the program crashed with an error about being unable to copy the site, and below that it said something about the URL scheme being invalid... I filed two online reports to Cyotek about the crash.

As I wrote in the second report, the site in question had quite a large number of resources and links to resources placed on just a few pages, and the program didn't seem too happy with that when initially trying to do a quick scan before the copy (it quickly seemed to start suffering from congestion). The normal scan looked like it ran better, but both were aborted due to the large number of files. The copy itself ran surprisingly well and looked OK until the crash at the remapping phase... The Windows process itself was 32-bit, but the system has 16 GB of RAM and runs 64-bit Windows 7, so if there were any memory problems they would be due to 32-bit limitations...

Anyway, I consider the site copy done and over with, so I would just like to say thank you for the help and for the use of this otherwise nice website-copying program :)

Richard Moss

Hello,

Thanks for the detailed feedback. I'm glad you got your files, although I apologize for the crash you experienced when WebCopy tried to do the final remapping of the downloaded files. I have your error report; it's not a frequent issue, but you're not the only one it has happened to - hopefully I can pin it down and get it fixed.

WebCopy does include an experimental 64-bit version, although I don't think Setup creates icons for it yet. With that said, I was doing some memory profiling of WebCopy a few weeks ago (or months :-[) and observed that some of WebCopy's data structures were using a lot more memory than I anticipated; resolving that is going to be quite a job. But it's on the list.

The next update should include some simpler options to allow the copying of all resources, regardless of source, without having to do a lot of complicated configuring or having to know all the different hosts in use.

Glad you find the program useful, thanks again for the feedback!

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

Neo

My pleasure; giving feedback and a crash report was the least that I could do.
Thank you very much, Richard :)