Recent posts

#61
WebCopy / Re: Run webcopy again but avoi...
Last post by szetakyu - January 03, 2024, 05:20:27 AM
Quote from: rumia on July 26, 2023, 08:00:35 PM
I am currently using version 1.9.0.822.
Is there any news regarding the "-1" problem?
I currently have the same problem, although I have applied the 3 settings from the link in the first post.
The filename-1.typ and filename.typ files have the same SHA-1 hash.

If not, would it be possible to add a feature so that the download is skipped if the file already exists locally (path\file)?

Same question here.
#62
WebCopy / Re: I can pause copy and resum...
Last post by szetakyu - January 03, 2024, 05:13:32 AM
Requesting this feature in 2024
#63
WebCopy / Re: Skip existing [downloaded]...
Last post by szetakyu - January 03, 2024, 05:09:56 AM
Same question here.
#64
WebCopy / Re: Is it possible to resume i...
Last post by szetakyu - January 03, 2024, 05:09:18 AM
+1
#65
WebCopy / Re: exclude duplicate files
Last post by szetakyu - January 03, 2024, 05:06:50 AM
+1
#66
WebCopy / Error message - request was ab...
Last post by beige - December 28, 2023, 11:31:25 PM
I get this immediately as soon as I click on Scan or Copy. Link to website, or rather page, is below. It's literally one page with a couple hundred thumbnails that each link to one photo and that's it. There's no navigation per se other than that sadistically annoying thing that blocks the page from loading as one page and instead forces you to hit Page Down then wait a second or two before hitting Page Down again. I don't know if this is relevant but HTTrack can save a broken version of the site where the homepage will open but you can't click on anything or page down. If WebCopy's unable to save this kind of site and someone has other suggestions such as another program or settings I could change in HTTrack to get it to work that'd be great.

https://www.archivepdf.net/maison-margiela-cream-issue-09-2008
#67
WebCopy / Webcopy vs a site with 200K + ...
Last post by scylla - December 16, 2023, 04:10:42 PM
How does it fare against a site that has about 200K files or more? Will it crash or have performance issues before it finishes?

Also, does it follow robots tags in a page? Example HTML from one:

<title>Printer friendly output for Sky Dancer 2</title>
<meta name="ROBOTS" content="NOINDEX">
#68
WebCopy / Unable to use Internet Explore...
Last post by Yaaaahuoo - December 04, 2023, 10:10:45 AM
When I try "Manually logging into a website", WebCopy calls IE as the browser to access the website and capture the login data, but the website I need to download no longer supports IE. Is there any way to solve this problem? (E.g. is there any way to change the default browser for grabbing login data to Edge, Chrome or Firefox?)
#69
WebCopy / Re: Picture ref in <a> tag
Last post by Chris - November 18, 2023, 10:02:17 AM
Hi Richard,

Thank you very much for your detailed answer.
You've won a customer ;)

Best regards.
#70
WebCopy / Re: Picture ref in <a> tag
Last post by Richard Moss - November 18, 2023, 05:08:38 AM
Hello,

The issue appears to be that the image isn't actually referred to as part of the raw HTML, but is populated by JavaScript. As I note many times, WebCopy doesn't execute JavaScript.

So, using an example from your screenshot, the following is part of the raw HTML downloaded by the browser or tools like WebCopy:

<a tabindex="-1" href="#" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0">
The href attribute is #, essentially pointing back to the parent page, so WebCopy ignores it.
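
As a rough illustration of why a bare # leads nowhere new, the sketch below resolves an href against its page URL and drops the fragment. The page URL and the helper name is_self_link are hypothetical, and this is not WebCopy's actual code, only standard URL resolution as done by Python's urllib.

```python
from urllib.parse import urljoin, urldefrag


def is_self_link(page_url: str, href: str) -> bool:
    """Return True when href resolves to the page itself (ignoring fragments)."""
    # Resolve the href relative to the page, then strip any "#fragment" part.
    resolved = urldefrag(urljoin(page_url, href)).url
    return resolved == urldefrag(page_url).url


# Hypothetical page URL, used only for illustration.
PAGE = "https://example.com/gallery/"

print(is_self_link(PAGE, "#"))
print(is_self_link(PAGE, "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"))
```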

Once the JavaScript has run, the href is populated.

<a tabindex="-1" href="https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0" data-lbox-init="true">

Unfortunately, while WebCopy can read data from custom attributes (such as data-album above), it wasn't really designed to extract bits out of them. However, by combining a couple of features, we can at least extract the images - but the a tags won't get updated with the true URL.

Firstly, you need to tell WebCopy where to find the extra URLs:

  • Project Properties | Advanced | Custom Attributes
  • Value: //a/@data-album

(Documentation link: https://docs.cyotek.com/cyowcopy/current/customattributes.html)

As the blocks of JSON extracted by this method aren't valid URLs, we need to use URL Transforms to discard the bulk of the JSON and just keep the one attribute - I went with url in this case (again unfortunately WebCopy wasn't designed to be able to pull out multiple URLs from a single value except in some very specific places).

  • Project Properties | Advanced | URL Transforms
  • Add a new transform
  • Expression: \[{(.*?)"url":"(.*?)"}\]
  • Replacement: $2

(Documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html)

With the above in place "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" (and more!) is generated as a URL to scan by WebCopy.
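
To see what the two settings do together, here is a minimal Python sketch that mimics them outside WebCopy. It is only an approximation: the class and function names are made up for illustration, the attribute lookup stands in for the //a/@data-album expression, the braces in the pattern are escaped for Python's regex engine, and the sample tag is a shortened version of the one above.

```python
import re
from html.parser import HTMLParser


class DataAlbumCollector(HTMLParser):
    """Collects the data-album attribute of every <a> tag (like //a/@data-album)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "data-album" and value:
                    self.values.append(value)


# Same expression as the URL Transform, with the braces escaped for Python.
TRANSFORM = re.compile(r'\[\{(.*?)"url":"(.*?)"\}\]')


def extract_urls(html):
    """Return the url field of each data-album value found in the HTML."""
    collector = DataAlbumCollector()
    collector.feed(html)
    # r"\2" plays the role of the $2 replacement.
    return [TRANSFORM.sub(r"\2", value) for value in collector.values]


# Shortened version of the tag quoted above.
SAMPLE = (
    '<a tabindex="-1" href="#" class="pushed" '
    'data-album=\'[{"title":"","caption":"","width":"2500","height":"1866",'
    '"thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",'
    '"url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]\' '
    'data-lb-index="0">'
)

print(extract_urls(SAMPLE))
```

The lazy (.*?) groups matter here: the first one swallows everything up to the url key, and the second stops at the first closing "}], so only the URL itself survives the replacement.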

Regards,
Richard Moss