Picture ref in <a> tag

Started by Chris, November 15, 2023, 06:14:18 PM

Chris

Hi,

It seems that pictures referenced in an <a> tag are not downloaded.

Thank you in advance for your help.

Richard Moss

Hello,

The issue appears to be that the image isn't actually referred to as part of the raw HTML, but is populated by JavaScript. As I note many times, WebCopy doesn't execute JavaScript.

So, using an example from your screenshot, the following is part of the raw HTML downloaded by the browser or by tools like WebCopy:

<a tabindex="-1" href="#" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0">
The href attribute is #, which essentially points back to the parent page, so WebCopy ignores it.
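
To illustrate why a bare # gives a crawler nothing new to fetch, here is a minimal Python sketch. It is not WebCopy's actual logic, just the general idea of resolving a link against its page. The page URL and the function name are made up for the example.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical page URL, used only for illustration.
PAGE_URL = "https://example.com/gallery/"


def points_back_to_page(page_url: str, href: str) -> bool:
    """Return True when href resolves to the page it was found on."""
    resolved = urljoin(page_url, href)
    # Drop any fragment before comparing: "#..." never changes the document.
    return urldefrag(resolved).url == urldefrag(page_url).url


# The href as it appears in the raw HTML.
print(points_back_to_page(PAGE_URL, "#"))
# The href as it appears once the JavaScript has filled it in.
print(points_back_to_page(
    PAGE_URL,
    "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",
))
```

The first call reports that the link leads back to the same page, while the second points at a separate resource that a crawler could download.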

Once the JavaScript has run, the href is populated:

<a tabindex="-1" href="https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" class="pushed" data-caption="1" data-deep="gallery-742680_77759" data-lbox="ilightbox_gallery-742680_77759" data-options="width:2500,height:1866,thumbnail: 'https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg'" data-album='[{"title":"","caption":"","width":"2500","height":"1866","thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg","url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]' data-lb-index="0" data-lbox-init="true">

Unfortunately, while WebCopy can read data from custom attributes (such as data-album above), it wasn't really designed to extract bits out of them. However, by combining a couple of features, we can at least extract the images - but the <a> tags won't be updated with the true URL.

Firstly, you need to tell WebCopy where to find the extra URLs:

  • Project Properties | Advanced | Custom Attributes
  • Value: //a/@data-album

(Documentation link: https://docs.cyotek.com/cyowcopy/current/customattributes.html)
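
For anyone who wants to see what //a/@data-album selects, the following Python sketch collects the same attribute values from a shortened copy of the anchor above. It uses the standard library's HTMLParser rather than a real XPath engine, and the class name is invented for the example, so treat it as an approximation of the selection and not as how WebCopy does it.

```python
from html.parser import HTMLParser


class DataAlbumCollector(HTMLParser):
    """Collect the data-album attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.values: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "data-album" and value is not None:
                self.values.append(value)


# Shortened version of the anchor from the raw HTML (most attributes omitted).
RAW_HTML = (
    '<a tabindex="-1" href="#" class="pushed" '
    'data-album=\'[{"title":"","url":"https://vicenteromeroredondo.com/'
    'wp-content/uploads/2023/01/130x97-cm.jpg"}]\' data-lb-index="0">'
)

collector = DataAlbumCollector()
collector.feed(RAW_HTML)
print(collector.values)
```

Each collected value is the whole block of JSON, not a URL, which is why a second step is needed.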

As the blocks of JSON extracted by this method aren't valid URLs, we need to use URL Transforms to discard the bulk of the JSON and keep just one attribute. I went with url in this case (again, unfortunately, WebCopy wasn't designed to pull multiple URLs out of a single value, except in some very specific places).

  • Project Properties | Advanced | URL Transforms
  • Add a new transform
  • Expression: \[{(.*?)"url":"(.*?)"}\]
  • Replacement: $2

(Documentation link: https://docs.cyotek.com/cyowcopy/current/uritransforms.html)
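
The transform can be tried outside WebCopy as a quick sanity check. The sketch below applies the same expression with Python's re module to the data-album value from the example. One assumption to flag: Python writes the second capture group as \2 in the replacement, where the transform above uses $2.

```python
import re

# Same expression as the transform above.
EXPRESSION = r'\[{(.*?)"url":"(.*?)"}\]'
# Python spelling of the $2 replacement (second capture group).
REPLACEMENT = r"\2"

# The data-album value taken from the raw HTML in this thread.
data_album = (
    '[{"title":"","caption":"","width":"2500","height":"1866",'
    '"thumbnail":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg",'
    '"url":"https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg"}]'
)

transformed = re.sub(EXPRESSION, REPLACEMENT, data_album)
print(transformed)
```

The first group lazily swallows everything up to the url key and the second captures its value, so only the image URL is left.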

With the above in place, WebCopy generates "https://vicenteromeroredondo.com/wp-content/uploads/2023/01/130x97-cm.jpg" (and more!) as a URL to scan.

Regards;
Richard Moss

Chris

Hi Richard,

Thank you very much for your detailed answer.
You've won a customer ;)

Best regards.