Cyotek Forums

Products => WebCopy => Topic started by: Chris on November 15, 2023, 06:14:18 PM

Title: Picture ref in <a> tag
Post by: Chris on November 15, 2023, 06:14:18 PM
Hi,

It seems that all pictures that are referenced in the <a> tag are not dowloaded.

Thank you in advance for your help.
Title: Re: Picture ref in <a> tag
Post by: Richard Moss on November 18, 2023, 05:08:38 AM
Hello,

The issue appears to be that the image isn't actually referred to as part of the raw HTML, but is populated by JavaScript. As I note many times, WebCopy doesn't execute JavaScript.

So using an example from your screenshot, the following is part of the raw HTML downloaded by the browser or tools like WebCopy

<a tabindex="-1" href="#" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0">
The href attribute is #, essentially pointing back to the parent page, so WebCopy ignores it.

Once the JavaScript has ran, the href is populated.

<a tabindex="-1" href="https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0" data-lbox-init="true">

Unfortunately, while WebCopy can read data from custom attributes (such as data-album above), it wasn't really designed to extract bits out of them. However, by combining a couple of features, we can at least extract the images - but the a tags won't get updated with the true URL.

Firstly, you need to tell WebCopy where to find the extra URLS


(Documentation link: https://docs.cyotek.com/cyowcopy/current/customattributes.html (https://docs.cyotek.com/cyowcopy/current/customattributes.html))

As the blocks of JSON extracted by this method aren't valid URLs, we need to use URL Transforms to discard the bulk of the JSON and just keep the one attribute - I went with url in this case (again unfortunately WebCopy wasn't designed to be able to pull out multiple URLs from a single value except in some very specific places).


(Documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html (https://docs.cyotek.com/cyowcopy/current/uritransforms.html))

With the above in place "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" (and more!) is generated as a URL to scan by WebCopy.

Regards;
Richard Moss

Title: Re: Picture ref in <a> tag
Post by: Chris on November 18, 2023, 10:02:17 AM
Hi Richard,

Thank you very much for your detailed answer.
You've won a customer ;)

Best regards.