Title: Picture ref in <a> tag
Post by: Chris on November 15, 2023, 06:14:18 PM

Hi,

It seems that pictures referenced in <a> tags are not downloaded.

Thank you in advance for your help.

Title: Re: Picture ref in <a> tag
Post by: Richard Moss on November 18, 2023, 05:08:38 AM

Hello,

The issue appears to be that the image URL isn't actually present in the link's href in the raw HTML; it is populated later by JavaScript. As I have noted many times, WebCopy doesn't execute JavaScript.

So, using an example from your screenshot, the following is part of the raw HTML as downloaded by the browser or by tools like WebCopy:

<a tabindex="-1" href="#" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0">

The href attribute is #, essentially pointing back to the parent page, so WebCopy ignores it.

Once the JavaScript has run, the href is populated:

<a tabindex="-1" href="https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0" data-lbox-init="true">

Unfortunately, while WebCopy can read data from custom attributes (such as data-album above), it wasn't really designed to extract parts of their values. However, by combining a couple of features, we can at least download the images - but the <a> tags in the copied pages won't get updated with the true URL.

Firstly, you need to tell WebCopy where to find the extra URLs:

Project Properties | Advanced | Custom Attributes
Value: //a/@data-album

(Documentation link: https://docs.cyotek.com/cyowcopy/current/customattributes.html)
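
To illustrate what that expression selects, here is a rough Python sketch (it is not WebCopy code and doesn't reflect how WebCopy is implemented). It parses a shortened copy of the raw <a> tag quoted earlier without running any JavaScript, and shows that the href is still # while the real image URL is only available inside the data-album JSON. The AlbumCollector class and the variable names are hypothetical, made up for this example.

```python
import json
from html.parser import HTMLParser

# Shortened copy of the raw <a> tag quoted earlier (some attributes omitted).
RAW_HTML = (
    '<a tabindex="-1" href="#" class="pushed" '
    'data-album=\'[{"title":"","caption":"","width":"2500","height":"1866",'
    '"thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",'
    '"url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]\' '
    'data-lb-index="0">'
)


class AlbumCollector(HTMLParser):
    """Hypothetical helper: records href and data-album values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.albums = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        self.hrefs.append(attributes.get("href"))
        if "data-album" in attributes:
            self.albums.append(attributes["data-album"])


collector = AlbumCollector()
collector.feed(RAW_HTML)
collector.close()

# Each data-album value is a JSON array of objects; pull out every "url" field.
album_urls = [
    entry["url"]
    for album in collector.albums
    for entry in json.loads(album)
]

print(collector.hrefs)  # what a crawler that ignores JavaScript sees in href
print(album_urls)       # the image URLs hidden in the custom attribute
```

The point is simply that the attribute value is a block of JSON rather than a URL, which is why the next step is needed.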

As the blocks of JSON extracted by this method aren't valid URLs, we need to use URL Transforms to discard the bulk of the JSON and keep just one property. I went with url in this case (again, unfortunately, WebCopy wasn't designed to pull multiple URLs out of a single value except in some very specific places).

Project Properties | Advanced | URL Transforms
Add a new transform
Expression: \[{(.*?)"url":"(.*?)"}\]
Replacement: $2

(Documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html)
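
If you want to check what the transform does before running a copy, the same expression can be tried outside WebCopy. The following is only a sketch using Python's re module, on the assumption that this particular pattern behaves the same way there as in WebCopy's own regular expression engine; note that Python writes the replacement as \2 where WebCopy uses $2. The variable names are made up for the example.

```python
import re

# The data-album value from the example <a> tag quoted earlier.
ATTRIBUTE_VALUE = (
    '[{"title":"","caption":"","width":"2500","height":"1866",'
    '"thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",'
    '"url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]'
)

# Same expression as the URL transform above.
EXPRESSION = r'\[{(.*?)"url":"(.*?)"}\]'

# Python spells the second capture group as \2; WebCopy's replacement is $2.
REPLACEMENT = r"\2"

# The expression matches the whole JSON block, so substituting the second
# group leaves only the value of the "url" property.
transformed = re.sub(EXPRESSION, REPLACEMENT, ATTRIBUTE_VALUE)

print(transformed)
```

For the single-entry album in the example, the whole JSON block is replaced by the value of its url property.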

With the above in place "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" (and more!) is generated as a URL to scan by WebCopy.

Regards,
Richard Moss

Title: Re: Picture ref in <a> tag
Post by: Chris on November 18, 2023, 10:02:17 AM

Hi Richard,

Thank you very much for your detailed answer.
You've won a customer ;)

Best regards.

Cyotek Forums