Topic: Copy PDFs from redirected sites

dsdart
Copy PDFs from redirected sites
« on: November 07, 2019, 01:23:36 PM »
Hello,

I am trying to copy PDF files starting from this website: https://webje.yurls.net/nl/page/887516#boxes-container. It contains several links to pages of the form *.yurls.net/nl/page/[a number like 000001] (for example http://canon-pad-alettajacobs.yurls.net/nl/page/732302). Those pages in turn link to the PDF files I would like to copy, at web-jack.nl/Public/../*.pdf (for example https://web-jack.nl/Public/Kruiswoordpuzzels/Canon/aletta-kruiswoordpuzzel.pdf).

Does anyone have a solution for this? Thanks in advance ;D

I have attached the project file I have used so far. With it, I can copy the PDFs linked from a single page such as http://canon-pad-alettajacobs.yurls.net/nl/page/732302.
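To make the two-hop structure concrete, here is a minimal Python sketch of the link filtering involved. It is not part of the WebCopy project file and does not reflect how WebCopy works internally; the names `LinkCollector`, `extract_links`, `is_subpage` and `is_target_pdf` are hypothetical, and the URL patterns are assumptions taken from the examples above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Assumed pattern for the intermediate pages: <anything>.yurls.net/nl/page/<digits>
SUBPAGE_PATH = re.compile(r"^/nl/page/\d+$")


def is_subpage(url):
    """True for first-hop pages hosted on a yurls.net subdomain."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return host.endswith(".yurls.net") and bool(SUBPAGE_PATH.match(parts.path))


def is_target_pdf(url):
    """True for second-hop PDF files under web-jack.nl/Public/."""
    parts = urlparse(url)
    return (
        parts.hostname == "web-jack.nl"
        and parts.path.startswith("/Public/")
        and parts.path.lower().endswith(".pdf")
    )
```

Applied to the start page, `is_subpage` would select the links to follow; applied to each of those pages, `is_target_pdf` would select the files to download.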

Richard Moss (Cyotek Team, Administrator)
Re: Copy PDFs from redirected sites
« Reply #1 on: November 08, 2019, 06:14:15 PM »
Hello,

Thanks for the question. You'll need to add the other sites (such as web-jack.nl) to the Additional Hosts page in the Project Properties dialog.

Also note that using the "Everything" crawl mode is really not recommended when you are deliberately breaking out of the main site and going to other sites - WebCopy will just keep going and going and going until it runs out of memory.
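As a rough illustration of why the host list matters, the sketch below crawls a link graph but only follows URLs whose host is in an allow-list. This is a generic breadth-first crawl, not WebCopy's actual implementation; `in_scope`, `crawl` and the `get_links` callback are hypothetical names introduced for the example.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url, allowed_hosts):
    """True if the URL's host is one of the explicitly allowed hosts."""
    return (urlparse(url).hostname or "") in allowed_hosts


def crawl(start, get_links, allowed_hosts):
    """Breadth-first crawl from start, following only in-scope links.

    get_links is a callable that returns the link targets of a URL;
    in a real crawler it would download and parse the page.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen and in_scope(link, allowed_hosts):
                seen.add(link)
                queue.append(link)
    return seen
```

With only the main host allowed, links to a second site are never followed; adding that site to `allowed_hosts` brings its files into scope while still keeping the crawl bounded. Dropping the scope check altogether corresponds to following every link on every site, which has no natural stopping point.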

Regards,
Richard Moss

dsdart
Re: Copy PDFs from redirected sites
« Reply #2 on: November 11, 2019, 11:01:56 AM »
Hello Richard,

Thanks for the quick response. I have added all the websites (such as web-jack.nl and canon-pad-alettajacobs.yurls.net/nl/page/732302) to the Additional Hosts, but still no PDFs appear in the output folder.

Subsequently, I changed the crawl mode from Everything to sibling domains and tried enabling crawling above the root URL. Now I can see that, for example, canon-pad-alettajacobs.yurls.net/nl/page/732302 is downloaded (as are all the other *.yurls.net/nl/page/* pages). Unfortunately, this still did not download any PDFs from web-jack.nl.

Any ideas on how to download the PDFs?
I have attached the latest project file.

Kind regards,
Frank van Berlo

 
