Recent posts

#41
WebCopy / Re: Skip existing [downloaded]...
Last post by szetakyu - January 03, 2024, 05:09:56 AM
Same question here.
#42
WebCopy / Re: Is it possible to resume i...
Last post by szetakyu - January 03, 2024, 05:09:18 AM
+1
#43
WebCopy / Re: exclude duplicate files
Last post by szetakyu - January 03, 2024, 05:06:50 AM
+1
#44
WebCopy / Error message - request was ab...
Last post by beige - December 28, 2023, 11:31:25 PM
I get this error as soon as I click Scan or Copy. A link to the website, or rather the page, is below. It's literally one page with a couple of hundred thumbnails that each link to one photo, and that's it. There's no navigation per se, other than that sadistically annoying thing that blocks the page from loading as one page and instead forces you to hit Page Down, then wait a second or two before hitting Page Down again. I don't know if this is relevant, but HTTrack can save a broken version of the site where the homepage will open but you can't click on anything or page down. If WebCopy is unable to save this kind of site and someone has other suggestions, such as another program or settings I could change in HTTrack to get it to work, that'd be great.

https://www.archivepdf.net/maison-margiela-cream-issue-09-2008
#45
WebCopy / Webcopy vs a site with 200K + ...
Last post by scylla - December 16, 2023, 04:10:42 PM
How does it fare against a site that has about 200K files or more? Will it crash or have performance issues before it finishes?

Also, does it follow robots tags in a page? Example HTML from one:

<title>Printer friendly output for Sky Dancer 2</title>
<meta name="ROBOTS" content="NOINDEX">
#46
WebCopy / Unable to use Internet Explore...
Last post by Yaaaahuoo - December 04, 2023, 10:10:45 AM
When I try "Manually logging into a website", WebCopy launches IE as the browser to capture the login data, but the website I need to download no longer supports IE. Is there any way to solve this problem (e.g. a way to change the browser used for capturing login data to Edge, Google Chrome or Firefox)?
#47
WebCopy / Re: Picture ref in <a> tag
Last post by Chris - November 18, 2023, 10:02:17 AM
Hi Richard,

Thank you very much for your detailed answer.
You've won a customer ;)

Best regards.
#48
WebCopy / Re: Picture ref in <a> tag
Last post by Richard Moss - November 18, 2023, 05:08:38 AM
Hello,

The issue appears to be that the image isn't actually referred to as part of the raw HTML, but is populated by JavaScript. As I note many times, WebCopy doesn't execute JavaScript.

So, using an example from your screenshot, the following is part of the raw HTML downloaded by the browser or tools like WebCopy:

<a tabindex="-1" href="#" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0">
The href attribute is #, essentially pointing back to the parent page, so WebCopy ignores it.

Once the JavaScript has run, the href is populated:

<a tabindex="-1" href="https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0" data-lbox-init="true">

Unfortunately, while WebCopy can read data from custom attributes (such as data-album above), it wasn't really designed to extract bits out of them. However, by combining a couple of features, we can at least extract the images - but the a tags won't get updated with the true URL.

Firstly, you need to tell WebCopy where to find the extra URLs:

  • Project Properties | Advanced | Custom Attributes
  • Value: //a/@data-album

(Documentation link: https://docs.cyotek.com/cyowcopy/current/customattributes.html)

As the blocks of JSON extracted by this method aren't valid URLs, we need to use URL Transforms to discard the bulk of the JSON and just keep the one attribute - I went with url in this case (again unfortunately WebCopy wasn't designed to be able to pull out multiple URLs from a single value except in some very specific places).

  • Project Properties | Advanced | URL Transforms
  • Add a new transform
  • Expression: \[{(.*?)"url":"(.*?)"}\]
  • Replacement: $2

(Documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html)

With the above in place "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" (and more!) is generated as a URL to scan by WebCopy.
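
To see what the transform does to the attribute value, here is a minimal Python sketch. It only mimics the regular expression step and is not how WebCopy itself is implemented; WebCopy's own regex engine may differ in detail. The function name `apply_album_transform` is hypothetical, the braces are escaped to be explicit, and Python writes the `$2` replacement as `\2`.

```python
import re

# Value of the data-album attribute from the example <a> tag above.
DATA_ALBUM = (
    '[{"title":"","caption":"","width":"2500","height":"1866",'
    '"thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",'
    '"url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]'
)

# Expression from the URL transform, with the literal braces escaped.
EXPRESSION = r'\[\{(.*?)"url":"(.*?)"\}\]'
# Equivalent of the "$2" replacement: keep only the second capture group.
REPLACEMENT = r'\2'


def apply_album_transform(value: str) -> str:
    """Hypothetical helper: reduce the JSON blob to the value of its url key."""
    return re.sub(EXPRESSION, REPLACEMENT, value)


print(apply_album_transform(DATA_ALBUM))
```

The first group lazily swallows everything up to the `"url":"` key, and the second group captures the URL itself, so only the URL survives the replacement.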

Regards;
Richard Moss

#49
WebCopy / Picture ref in <a> tag
Last post by Chris - November 15, 2023, 06:14:18 PM
Hi,

It seems that pictures referenced in the <a> tag are not downloaded.

Thank you in advance for your help.
#50
WebCopy / Re: Help copying a responsive ...
Last post by Richard Moss - November 11, 2023, 08:18:41 PM
Hello,

I had a look at the website and the image URLs actually have a size parameter that defaults to 1024. When viewing the page source, it was 1024, but when looking at the DOM in the browser it was 2056 - which happens to be the width of my primary monitor. I haven't looked further but it would seem clear that some JavaScript is being used to manipulate the style URLs.

So from that perspective, WebCopy is doing exactly as it should - the HTML specifies a size of 1024 and thus that is what it downloads.

However, if you want to try to mimic this aspect of the JavaScript behaviour (a reminder, if needed, that WebCopy doesn't execute the JavaScript) - WebCopy has a "Transform URLs" feature that will take a given URL and replace part of it.

I tried adding a transform (Project Properties | Advanced | URL Transforms) with the following attributes which seemed to do the trick - just set the new size as appropriate.

  • Expression: &size=(\d+)
  • Replacement: &size=2056

You can learn more about transforming URLs at the following documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html
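
As a rough illustration of the size transform, the Python sketch below applies the same expression and replacement to a made-up URL. The URL and the function name `apply_size_transform` are assumptions for the example only, and this mimics just the regex substitution, not WebCopy's internals.

```python
import re

# Expression and replacement from the transform described above.
EXPRESSION = r'&size=(\d+)'
REPLACEMENT = '&size=2056'


def apply_size_transform(url: str) -> str:
    """Hypothetical helper: rewrite the size parameter of an image URL."""
    return re.sub(EXPRESSION, REPLACEMENT, url)


# Made-up URL; the real site's URL structure is not shown in this thread.
example = 'https://example.com/image?id=123&size=1024'
print(apply_size_transform(example))
```

URLs without a `&size=` parameter pass through unchanged, so the transform only affects the image requests it is aimed at.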

Edit: There doesn't seem to be a need for the data-original attribute rule, as this attribute does not exist in the original source of this page, at least.

Regards;
Richard Moss