'URL Transforms' do not work

Started by Paniz, January 14, 2024, 03:54:47 PM


Paniz


Hi,

I'm new to WebCopy, but I have experience with other web crawlers

Surprisingly, I found WebCopy almost as light as Teleport and almost as powerful as OE
Two contrasting advantages, in one package, plus some unique features


  The Late 'Tennyson Maxwell' Teleport Pro/Ultra/VLX: ‹http://www.tenmax.com/

  'MetaProducts' Offline Explorer ‹OE› Standard/Pro/Enterprise: ‹http://www.metaproducts.com/



───< 'Maybe Bug, maybe not' Report: >───

♦ Description:

  'URL Transforms' work fine during Test,
  but are ignored during Link Processing when downloading Files from the Internet



  I crawled a site's content by WebCopy, as per the following settings


  ↓ 'Cyotek WebCopy 1.9.1.872': Customize project properties

  ■ General

    ◊ Website

      URL: https://www.bulkrenameutility.co.uk/

    ◊ Crawl Mode

      (•) Sibling domains

    -----

    [√] Download all resources

    [ ] Limit crawl depth

    [ ] Limit distance from root URL

    ○ Folder

      ◊ Folder Options

        Save folder: C:\BR\

        [ ] Create folder for domain

        [ ] Flatten website folder

        [ ] Empty website folder before copy

            [ ] Use Recycle Bin

        Directory character: [Windows (\)]

    ○ Local Files

      ◊ File Options

        [√] Remap references within downloaded files

        [ ] Update local time stamps

        [ ] Use query string in local file names

      ◊ Remap file extensions by content type

        (•) Only for HTML

        [√] Keep original extension

    ○ Additional Hosts

      ...

    ○ Additional URLs

      ...

    ○ Content Types

      (•) Include all

    ○ Limits

      [ ] Maximum number of files: 0

      -----

      [ ] Minimum File Size: 0 KiB

      [ ] Maximum File Size: 0 KiB

    ○ Forms

      ...

    ○ Passwords

      ◊ Password Options

        [ ] Do not prompt for passwords

        [ ] Log in using web browser

      ◊ Saved Passwords

        ...

    ○ Rules

      Expression                                  |   Options
      ----------                                  |   -------
      [√] forum                                   |   Exclude
      [√] FileKicker/BRU_setup_3\.4\.4\.0\.exe$   |   Exclude

      ◊ Rule Properties

        Compare: [Path] [Matches]
        Expression: forum
        Options: [Exclude]

        ◊ Advanced
          [√] Enable this rule
          [ ] Stop processing more rules
          Download Priority [Normal]

        ↑ for ‹https://www.bulkrenameutility.co.uk/forum/

        -----

        Compare: [Path] [Matches]
        Expression: FileKicker/BRU_setup_3\.4\.4\.0\.exe$
        Options: [Exclude]

        ◊ Advanced
          [√] Enable this rule
          [ ] Stop processing more rules
          Download Priority [Normal]

        ↑ for ‹https://www.bulkrenameutility.co.uk/FileKicker/BRU_setup_3.4.4.0.exe

    ○ Speed Limits

      ◊ Limits

        (•) Limit to requests per second

            Maximum Requests per Second: 1

        -----

        [ ] Enforce Limit Checks

    ○ URL Normalization

      ◊ URL Normalization

        [ ] Honor canonical URI's {Disabled}

        [√] Ensure internal links match domain prefix

        [√] Ignore case (not recommended)

      ◊ Force HTTPS

        (•) Never

    ○ User Agent

        (•) Use custom user agent

            Mozilla/5.0 (Windows NT 6.1; rv:86.0) Gecko/20100101 Firefox/89.0

    ○ Web Browser

      [ ] Use web browser

    ○ Web Page Language

      ...

    ○ Sitemap

      [√] Create site map

      File name: index-wcopy.html

  ■ Advanced

    ◊ Crawl Behaviour

      [√] Use header checking (recommended)

      [√] Always download latest version

      -----

      [ ] Crawl above root URL (not recommended)

      [√] Keep alive (recommended)

      -----

      Origin report: [Create a single report for the entire project]

      [ ] Add to source HTML

    ◊ Abort HTTP Status Codes

      Status Codes:

    ○ Accepted Content Types

      ...

    ○ Cookies

      [ ] Discard session cookies

    ○ Custom Attributes

      ...

    ○ Custom Headers

      ...

    ○ HTTP Compression

      [√] Compress (Legacy)
      [√] Deflate
      [√] GZip
      [√] Brotli
      [√] BZip2

    ○ Link Map

      [√] Save link information in project

          [√] Include headers

      [ ] Clear link information before scan

    ○ Redirects

      (•) Follow internal redirects

      Maximum redirect chain length: 25

    ○ Security

      [√] Use SSL 3.0 (not recommended)
      [√] Use TLS 1.0 (not recommended)
      [√] Use TLS 1.1
      [√] Use TLS 1.2

      -----

      [ ] Ignore certificate errors (not recommended)

    ○ URL Transforms

      Expression         |   Replacement   |   URL Expression
      ----------         |   -----------   |   --------------
      [√] -400\.eot\?$   |   -400.eot      |
      [√] -900\.eot\?$   |   -900.eot      |

      ◊ Transform Properties

        Expression: -400\.eot\?$
        Replacement: -400.eot
        URL Expression:

        ↑ for:
          ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot?
          ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot?

        -----

        Expression: -900\.eot\?$
        Replacement: -900.eot
        URL Expression:

        ↑ for: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot?

  ■ Deprecated

    ○ Default Documents

      ...

    ○ Domain Aliases

      ...

    ○ Proxy

      ...

    ○ Query Strings

      [ ] Strip query string segments


  There are 6 links in the project, listed below:

  URL  1-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot
    File 1-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-brands-400.eot

    URL  1-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot?
    File 1-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-brands-400-1.eot

  URL  2-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot
    File 2-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-regular-400.eot

    URL  2-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot?
    File 2-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-regular-400-1.eot

  URL  3-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot
    File 3-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-solid-900.eot

    URL  3-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot?
    File 3-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-solid-900-1.eot

  With 'URL Transforms: ■ Expression: \.eot\?$ , ■ Replacement: .eot',
  I expect URLs 1-2, 2-2, 3-2  to be treated as  URLs 1-1, 2-1, 3-1, respectively

  So logically,

  1) I expect  Files 1-1, 2-1, 3-1  (*.eot)
     are not re-downloaded as  Files 1-2, 2-2, 3-2  (*-1.eot), respectively

  2) and that,  URLs 1-2, 2-2, 3-2  are translated offline
     and point to  Files 1-1, 2-1, 3-1  (*.eot), respectively

  But this does NOT happen ‼ (i.e., 'URL Transforms' has no effect)
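
For reference, here is a minimal sketch of what the two transforms would do if they were applied to the link URLs as plain regular-expression replacements. It uses Python's `re` module purely for illustration (WebCopy itself is a .NET application, so its regex engine is different, although these simple patterns should behave the same way). The helper name `apply_transforms` is hypothetical and not part of WebCopy.

```python
import re

# The two transforms from the project, as (expression, replacement) pairs.
TRANSFORMS = [
    (r"-400\.eot\?$", "-400.eot"),
    (r"-900\.eot\?$", "-900.eot"),
]

BASE = "https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/"


def apply_transforms(url):
    """Apply every transform in order and return the rewritten URL."""
    for expression, replacement in TRANSFORMS:
        url = re.sub(expression, replacement, url)
    return url


for name in ("fa-brands-400", "fa-regular-400", "fa-solid-900"):
    with_query = BASE + name + ".eot?"     # URLs 1-2, 2-2, 3-2
    without_query = BASE + name + ".eot"   # URLs 1-1, 2-1, 3-1
    # True means the transformed URL equals the URL without the '?'
    print(name, apply_transforms(with_query) == without_query)
```

So the expressions themselves do map each `.eot?` URL onto its `.eot` twin, which is consistent with the Test dialog working and points at the crawl step rather than the patterns.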

Paniz


By looking into the source, I found the following lines in a CSS file,

  ↓ https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/css/fontawesome-all.min.css

    src:url(../webfonts/fa-brands-400.eot);
    src:url(../webfonts/fa-brands-400.eot?#iefix) ...
    src:url(../webfonts/fa-regular-400.eot);
    src:url(../webfonts/fa-regular-400.eot?#iefix) ...
    src:url(../webfonts/fa-solid-900.eot);
    src:url(../webfonts/fa-solid-900.eot?#iefix) ...

So, I think the correct 'URL Transforms' is
' Expression: \.eot\?.* , Replacement: .eot'
not
' Expression: \.eot\?$  , Replacement: .eot'

Anyway, 'URL Transforms' still doesn't work
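
A small sketch of the difference between the two expressions, again using Python's `re` only as a stand-in for WebCopy's engine. Whether WebCopy hands the transform the raw CSS reference (with `#iefix`) or the fragment-stripped URL is an assumption in either direction; the link list above ends in a bare `?`, which suggests the fragment is dropped at some point.

```python
import re

css_reference = "../webfonts/fa-brands-400.eot?#iefix"

anchored = r"\.eot\?$"      # original expression
open_ended = r"\.eot\?.*"   # revised expression

# Against the raw reference, only the open-ended form matches.
print(re.search(anchored, css_reference))         # no match object
print(re.sub(open_ended, ".eot", css_reference))

# Assumption: the crawler removes the '#fragment' before transforming.
without_fragment = css_reference.split("#", 1)[0]
print(without_fragment)
print(re.sub(anchored, ".eot", without_fragment))
print(re.sub(open_ended, ".eot", without_fragment))
```

The open-ended form covers both cases, so it is the safer choice when it is unclear which form of the URL the transform receives.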

Maybe 'URL Transforms' only works for  "absolute" URLs  in HyperTexts ?


Suppose a Website is set up to run over both  HTTP  and  HTTPS
  and also the  non-WWW links  do NOT redirect to the  WWW links  (or vice-versa)

  http[s]://[www.]DownSite.com/index.html

  If all the links in HyperTexts are relative, that won't be a problem

  (e.g. < a href="[/]dir/faq.html" ...
        < img src="[/]images/banner.gif" ... )

  but just one absolute link with  different protocol  or  different prefix (WWW or nothing)
  can cause a Website to be downloaded twice

  (e.g. < a href="http[s]://[www.]DownSite.com/dir/faq.html" ...
        < img src="http[s]://[www.]DownSite.com/images/banner.gif" ... )

  So, in theory, a Website may be downloaded up to 4 times per session

  e.g.
  URL1: ‹https://www.DownSite.com/dir/faq.html› File1: ‹RootWC\dir\faq.html
  URL2: ‹https://DownSite.com/dir/faq.html›      File2: ‹RootWC\dir\faq-1.html
  URL3: ‹http://www.DownSite.com/dir/faq.html›  File3: ‹RootWC\dir\faq-2.html
  URL4: ‹http://DownSite.com/dir/faq.html›      File4: ‹RootWC\dir\faq-3.html
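
The four-copies case can be sketched as follows. This is only an illustration of a crawler that keys downloads by the exact URL string; the function name `url_variants` and the `-N` suffix logic are assumptions that mirror the file names in the example above, not WebCopy's actual code.

```python
from itertools import product


def url_variants(host, path):
    """Return the scheme / WWW-prefix combinations for one page."""
    return [
        f"{scheme}://{prefix}{host}{path}"
        for scheme, prefix in product(("https", "http"), ("www.", ""))
    ]


variants = url_variants("DownSite.com", "/dir/faq.html")

# Keyed by the exact URL string, every variant counts as a new resource
# and gets its own local file name.
saved = {}
for url in variants:
    if url not in saved:
        n = len(saved)
        saved[url] = "faq.html" if n == 0 else f"faq-{n}.html"

for url, filename in saved.items():
    print(url, "->", filename)
```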

  Ticking the option ‹ URL Normalization: [√] Ensure internal links match domain prefix
  makes WebCopy interpret URL2 as URL1 and URL4 as URL3

  but to make WebCopy take URL1 & URL3 the same,
  I have no option but to ‹ Force HTTPS ›  ( i.e., URL3 as URL1, not vice-versa )

  ‹ Force HTTPS: (•) Always   or   (•) Only for the following hosts: www.DownSite.com

  What if a Website has a bad SSL Certificate and HTTP is preferable ?

  What if  Mixed Content ?
  ( i.e., the Website is being loaded over  HTTPS,
         but some of its resources are being loaded over  HTTP )

  In these cases, forcing  HTTPS  to avoid duplication
  can result in the loss of some URLs

  i.e., some URLs cannot be retrieved due to forcing the HTTPS protocol


With MetaProducts Offline Explorer ‹OE› Enterprise: 'URL Substitutes',
  which is  case-insensitive,
  the user has a wide variety of choices, depending on how the Website is set up


  Apply to | URL | Replace     | With
  -------- | --- | -------     | ----

  [URLs]   |  *  | https://    | http://  Force HTTP  for  all hosts
  or
  [URLs]   |  *  | http://     | https:// Force HTTPS for  all hosts

  [URLs]   |  *  | ://www.     | ://      Remove WWW prefix from  all hosts
  or
  [URLs]   |  *  | ://         | ://www.
  [URLs]   |  *  | ://www.www. | ://www.  Add WWW prefix to  all hosts


  Apply to | URL                 | Replace      | With
  -------- | ---                 | -------      | ----

  Force HTTP , Remove WWW for  [www.]DownSite.com
  [URLs]   | ://DownSite.com     | https://     | http://
  [URLs]   | ://www.DownSite.com | http*://www. | http://

  Force HTTPS, Remove WWW for  [www.]DownSite.com
  [URLs]   | ://DownSite.com     | http://      | https://
  [URLs]   | ://www.DownSite.com | http*://www. | https://

  Force HTTP , Add WWW for  [www.]DownSite.com
  [URLs]   | ://DownSite.com     | http*://     | http://www.
  [URLs]   | ://www.DownSite.com | https://     | http://

  Force HTTPS, Add WWW for  [www.]DownSite.com
  [URLs]   | ://DownSite.com     | http*://     | https://www.
  [URLs]   | ://www.DownSite.com | http://      | https://
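
To make the intent of these tables concrete, here is a rough model of such a substitution list. The semantics are assumptions on my part (a rule fires when the 'URL' column is `*` or occurs in the URL, `*` in 'Replace' is a wildcard, matching is case-insensitive, rules run in order); the function names are hypothetical and this is not OE's actual implementation.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a literal pattern with '*' wildcards into a case-insensitive regex."""
    return re.compile(re.escape(pattern).replace(r"\*", ".*?"), re.IGNORECASE)


def substitute(url, rules):
    """Apply (url_filter, replace, with_) rules in order.

    A rule fires only when url_filter is '*' or occurs in the URL
    (compared case-insensitively); only the first match is replaced.
    """
    for url_filter, replace, with_ in rules:
        if url_filter != "*" and url_filter.lower() not in url.lower():
            continue
        # A function replacement keeps 'with_' literal (no backslash escapes).
        url = wildcard_to_regex(replace).sub(lambda m: with_, url, count=1)
    return url


# "Force HTTPS, Remove WWW for [www.]DownSite.com" from the table above.
FORCE_HTTPS_REMOVE_WWW = [
    ("://DownSite.com", "http://", "https://"),
    ("://www.DownSite.com", "http*://www.", "https://"),
]

for url in (
    "https://www.DownSite.com/dir/faq.html",
    "https://DownSite.com/dir/faq.html",
    "http://www.DownSite.com/dir/faq.html",
    "http://DownSite.com/dir/faq.html",
):
    print(url, "->", substitute(url, FORCE_HTTPS_REMOVE_WWW))
```

Under these assumptions, all four variants end up as the same `https://DownSite.com/...` URL, which is the de-duplication I am after.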


  By changing 'Apply to: 1) URLs' to 'Apply to: 2) Filenames',
  a matched URL is "virtually" converted to another URL

  If the converted URL has already been downloaded, nothing will be downloaded
  and the matched URL will be interpreted offline to point to the downloaded file

  If not, the matched URL will be downloaded as it is,
  but it is treated as if it was the converted URL
  i.e., the offline directory structure and filename for the downloaded file
  will be based on the converted URL, not the matched URL


  Or as is written in its manual,

  « Note: Apply to filenames means using the URL substitution rule only for downloaded files.
  I.e., URLs will be downloaded as they are,
  but offline filenames and links will be changed according to the substitution rule »
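
  The 'Apply to: Filenames' behaviour described above can be sketched roughly like this. It is a simplified model with hypothetical names (`plan`, `as_http`), not OE's code: the original URL is fetched, but the converted URL decides whether a download is needed and where the file is stored.

```python
def as_http(url):
    """Hypothetical filename rule: Replace 'https://' With 'http://'."""
    return url.replace("https://", "http://", 1)


def plan(url, convert, downloaded):
    """Return (url_to_fetch_or_None, url_used_for_the_local_file)."""
    converted = convert(url)
    if converted in downloaded:
        # Already on disk under the converted URL: fetch nothing,
        # just point the offline link at the existing file.
        return None, converted
    downloaded.add(converted)
    # Fetch the matched URL as it is, but store it as the converted URL.
    return url, converted


downloaded = set()
first = plan("https://DownSite.com/dir/faq.html", as_http, downloaded)
second = plan("http://DownSite.com/dir/faq.html", as_http, downloaded)
print(first)
print(second)
```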



  So, the following lines mean,

  « Download as it is, if the converted URL has not already been downloaded
  but after download or not, consider it as ... »



  Apply to    | URL | Replace     | With
  --------    | --- | -------     | ----

  ... consider it as     http://etc
  [Filenames] |  *  | https://    | http://
  or
  ... consider it as     https://etc
  [Filenames] |  *  | http://     | https://

  ... consider it as     http[s]://etc
  [Filenames] |  *  | ://www.     | ://
  or
  ... consider it as     http[s]://www.etc
  [Filenames] |  *  | ://         | ://www.
  [Filenames] |  *  | ://www.www. | ://www.


  Apply to    | URL                 | Replace      | With
  --------    | ---                 | -------      | ----

  ... consider it as     http://DownSite.com/etc
  [Filenames] | ://DownSite.com     | https://     | http://
  [Filenames] | ://www.DownSite.com | http*://www. | http://

  ... consider it as     https://DownSite.com/etc
  [Filenames] | ://DownSite.com     | http://      | https://
  [Filenames] | ://www.DownSite.com | http*://www. | https://

  ... consider it as     http://www.DownSite.com/etc
  [Filenames] | ://DownSite.com     | http*://     | http://www.
  [Filenames] | ://www.DownSite.com | https://     | http://

  ... consider it as     https://www.DownSite.com/etc
  [Filenames] | ://DownSite.com     | http*://     | https://www.
  [Filenames] | ://www.DownSite.com | http://      | https://


By UnTicking the option ‹ URL Normalization: [ ] Ensure internal links match domain prefix
  Ticking the option ‹ URL Normalization: [√] Ignore case (not recommended)
  and setting ‹ Force HTTPS: (•) Never
  I expect WebCopy: 'URL Transforms' to behave
  like OE: 'URL Substitutes: Apply to: 1) URLs', but it doesn't


  To be more precise, I guess 'URL Transforms' only works for
  absolute URLs  in HyperTexts and ignores  relative URLs



  Expression      | Replacement
  ----------      | -----------

  [√] https://    | http://    Force HTTP  for  all hosts  [?]
  or
  [√] http://     | https://    Force HTTPS for  all hosts  [?]

  [√] ://www\.    | ://        Remove WWW prefix from  all hosts  [?]
  or
  [√] ://(www\.)? | ://www.    Add WWW prefix to  all hosts  [?]
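
  As a sanity check of the expressions only (not of WebCopy itself), here is what they do as plain regex replacements in Python. The constant and function names are hypothetical. Note that none of them can touch a relative link, because a relative link contains no `://`; they could only take effect if the transform is applied after the link has been resolved to an absolute URL, which is exactly what I am unsure about.

```python
import re

FORCE_HTTP = (r"https://", "http://")
FORCE_HTTPS = (r"http://", "https://")
REMOVE_WWW = (r"://www\.", "://")
ADD_WWW = (r"://(www\.)?", "://www.")


def transform(url, rule):
    """Apply one (expression, replacement) pair to a URL string."""
    expression, replacement = rule
    return re.sub(expression, replacement, url)


print(transform("https://www.DownSite.com/dir/faq.html", FORCE_HTTP))
print(transform("http://www.DownSite.com/dir/faq.html", FORCE_HTTPS))
print(transform("https://www.DownSite.com/dir/faq.html", REMOVE_WWW))
# The optional group makes ADD_WWW safe to apply to a URL that already has www.
print(transform("https://DownSite.com/dir/faq.html", ADD_WWW))
print(transform("https://www.DownSite.com/dir/faq.html", ADD_WWW))
# A relative link is left untouched by all four rules.
print(transform("/dir/faq.html", ADD_WWW))
```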


  Also, I found out that the 'URL Expression' field
  is a condition on the parent URL, not on the URL itself
  So, the setting below is incorrect


  Expression      | Replacement | URL Expression
  ----------      | ----------- | --------------

  Force HTTP  for  [www.]DownSite.com  [×]
  [√] https://    | http://     | https?://(www\.)?DownSite\.com
  or
  Force HTTPS for  [www.]DownSite.com  [×]
  [√] http://     | https://    | https?://(www\.)?DownSite\.com

  Remove WWW prefix from  [www.]DownSite.com  [×]
  [√] ://www\.    | ://         | https?://(www\.)?DownSite\.com
  or
  Add WWW prefix to  [www.]DownSite.com  [×]
  [√] ://(www\.)? | ://www.     | https?://(www\.)?DownSite\.com
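
  My understanding of the 'URL Expression' field can be modelled like this. It is a sketch of the behaviour as I observed it, with a hypothetical function name and made-up hosts (`cdn.example.net`, `other.example.org`); it is not taken from WebCopy's source.

```python
import re


def apply_rule(link, parent_url, expression, replacement, url_expression=""):
    """Rewrite 'link' only if the page it was found on matches url_expression."""
    if url_expression and not re.search(url_expression, parent_url):
        return link
    return re.sub(expression, replacement, link)


SITE = r"https?://(www\.)?DownSite\.com"

# A foreign link found on a DownSite page IS rewritten ...
print(apply_rule(
    "https://cdn.example.net/a.js",
    "https://www.DownSite.com/index.html",
    r"https://", "http://", SITE,
))

# ... while a DownSite link found on a foreign page is NOT.
print(apply_rule(
    "https://www.DownSite.com/dir/faq.html",
    "https://other.example.org/page.html",
    r"https://", "http://", SITE,
))
```

  In other words, the condition selects the page being scanned, so it cannot be used to restrict a transform to links of one host.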


  Instead, I must use the settings below
  The 'URL Expression' field is not needed in this case,
  although it does give WebCopy extra functionality that OE lacks


  Expression                         | Replacement
  ----------                         | -----------

  Force HTTP , Remove WWW for  [www.]DownSite.com  [?]
  [√] https?://(www\.)?DownSite\.com | http://DownSite.com

  Force HTTPS, Remove WWW for  [www.]DownSite.com  [?]
  [√] https?://(www\.)?DownSite\.com | https://DownSite.com

  Force HTTP , Add WWW for  [www.]DownSite.com  [?]
  [√] https?://(www\.)?DownSite\.com | http://www.DownSite.com

  Force HTTPS, Add WWW for  [www.]DownSite.com  [?]
  [√] https?://(www\.)?DownSite\.com | https://www.DownSite.com
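
  A quick check of the single-expression approach, written in Python as an illustration only. The `canonicalize` helper is hypothetical, and the `re.IGNORECASE` flag is my assumption for how the match would behave when host names differ in case.

```python
import re

EXPRESSION = r"https?://(www\.)?DownSite\.com"

VARIANTS = [
    "https://www.DownSite.com/dir/faq.html",
    "https://DownSite.com/dir/faq.html",
    "http://www.DownSite.com/dir/faq.html",
    "http://DownSite.com/dir/faq.html",
]


def canonicalize(url, replacement):
    """Rewrite scheme and WWW prefix of a DownSite URL in one step."""
    # Assumption: case-insensitive match; only the first hit is replaced.
    return re.sub(EXPRESSION, lambda m: replacement, url, count=1,
                  flags=re.IGNORECASE)


for replacement in (
    "http://DownSite.com",
    "https://DownSite.com",
    "http://www.DownSite.com",
    "https://www.DownSite.com",
):
    results = {canonicalize(url, replacement) for url in VARIANTS}
    # One distinct result means all four variants collapse to a single URL.
    print(replacement, "->", results)
```

  Each replacement collapses the four variants into one URL. Since the expression is not anchored, prefixing it with `^` may be worth considering so that it cannot match a DownSite address embedded later in a query string.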