Hi,
I'm new to WebCopy, but I have experience with other web crawlers
Surprisingly, I found WebCopy almost as light as Teleport and almost as powerful as OE
Two contrasting advantages, in one package, plus some unique features
■ The Late 'Tennyson Maxwell' Teleport Pro/Ultra/VLX: ‹http://www.tenmax.com/›
■ 'MetaProducts' Offline Explorer ‹OE› Standard/Pro/Enterprise: ‹http://www.metaproducts.com/›
───< 'Maybe Bug, maybe not' Report: >───
♦ Description:
'URL Transforms' work fine during Test,
but are ignored during Link Processing when downloading Files from the Internet
I crawled a site with WebCopy, using the following settings
↓ 'Cyotek WebCopy 1.9.1.872': Customize project properties
■ General
◊ Website
URL: https://www.bulkrenameutility.co.uk/
◊ Crawl Mode
(•) Sibling domains
-----
[√] Download all resources
[ ] Limit crawl depth
[ ] Limit distance from root URL
○ Folder
◊ Folder Options
Save folder: C:\BR\
[ ] Create folder for domain
[ ] Flatten website folder
[ ] Empty website folder before copy
[ ] Use Recycle Bin
Directory character: [Windows (\)]
○ Local Files
◊ File Options
[√] Remap references within downloaded files
[ ] Update local time stamps
[ ] Use query string in local file names
◊ Remap file extensions by content type
(•) Only for HTML
[√] Keep original extension
○ Additional Hosts
...
○ Additional URLs
...
○ Content Types
(•) Include all
○ Limits
[ ] Maximum number of files: 0
-----
[ ] Minimum File Size: 0 KiB
[ ] Maximum File Size: 0 KiB
○ Forms
...
○ Passwords
◊ Password Options
[ ] Do not prompt for passwords
[ ] Log in using web browser
◊ Saved Passwords
...
○ Rules
Expression | Options
---------- | -------
[√] forum | Exclude
[√] FileKicker/BRU_setup_3\.4\.4\.0\.exe$ | Exclude
◊ Rule Properties
Compare: [Path] [Matches]
Expression: forum
Options: [Exclude]
◊ Advanced
[√] Enable this rule
[ ] Stop processing more rules
Download Priority [Normal]
↑ for ‹https://www.bulkrenameutility.co.uk/forum/›
-----
Compare: [Path] [Matches]
Expression: FileKicker/BRU_setup_3\.4\.4\.0\.exe$
Options: [Exclude]
◊ Advanced
[√] Enable this rule
[ ] Stop processing more rules
Download Priority [Normal]
↑ for ‹https://www.bulkrenameutility.co.uk/FileKicker/BRU_setup_3.4.4.0.exe›
○ Speed Limits
◊ Limits
(•) Limit to requests per second
Maximum Requests per Second: 1
-----
[ ] Enforce Limit Checks
○ URL Normalization
◊ URL Normalization
[ ] Honor canonical URI's {Disabled}
[√] Ensure internal links match domain prefix
[√] Ignore case (not recommended)
◊ Force HTTPS
(•) Never
○ User Agent
(•) Use custom user agent
Mozilla/5.0 (Windows NT 6.1; rv:86.0) Gecko/20100101 Firefox/89.0
○ Web Browser
[ ] Use web browser
○ Web Page Language
...
○ Sitemap
[√] Create site map
File name: index-wcopy.html
■ Advanced
◊ Crawl Behaviour
[√] Use header checking (recommended)
[√] Always download latest version
-----
[ ] Crawl above root URL (not recommended)
[√] Keep alive (recommended)
-----
Origin report: [Create a single report for the entire project]
[ ] Add to source HTML
◊ Abort HTTP Status Codes
Status Codes:
○ Accepted Content Types
...
○ Cookies
[ ] Discard session cookies
○ Custom Attributes
...
○ Custom Headers
...
○ HTTP Compression
[√] Compress (Legacy)
[√] Deflate
[√] GZip
[√] Brotli
[√] BZip2
○ Link Map
[√] Save link information in project
[√] Include headers
[ ] Clear link information before scan
○ Redirects
(•) Follow internal redirects
Maximum redirect chain length: 25
○ Security
[√] Use SSL 3.0 (not recommended)
[√] Use TLS 1.0 (not recommended)
[√] Use TLS 1.1
[√] Use TLS 1.2
-----
[ ] Ignore certificate errors (not recommended)
○ URL Transforms
Expression | Replacement | URL Expression
---------- | ----------- | --------------
[√] -400\.eot\?$ | -400.eot |
[√] -900\.eot\?$ | -900.eot |
◊ Transform Properties
Expression: -400\.eot\?$
Replacement: -400.eot
URL Expression:
↑ for:
‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot?›
‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot?›
-----
Expression: -900\.eot\?$
Replacement: -900.eot
URL Expression:
↑ for: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot?›
■ Deprecated
○ Default Documents
...
○ Domain Aliases
...
○ Proxy
...
○ Query Strings
[ ] Strip query string segments
There are 6 links in the project, listed below:
• URL 1-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot›
File 1-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-brands-400.eot›
URL 1-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-brands-400.eot?›
File 1-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-brands-400-1.eot›
• URL 2-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot›
File 2-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-regular-400.eot›
URL 2-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-regular-400.eot?›
File 2-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-regular-400-1.eot›
• URL 3-1: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot›
File 3-1: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-solid-900.eot›
URL 3-2: ‹https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/fa-solid-900.eot?›
File 3-2: ‹C:\BR\assets\vendor\font-awesome\webfonts\fa-solid-900-1.eot›
With 'URL Transforms: ■ Expression: \.eot\?$ , ■ Replacement: .eot',
I expect URLs 1-2, 2-2, 3-2 to be treated as URLs 1-1, 2-1, 3-1, respectively
So logically,
1) I expect Files 1-1, 2-1, 3-1 (*.eot)
not to be re-downloaded as Files 1-2, 2-2, 3-2 (*-1.eot), respectively
2) and I expect URLs 1-2, 2-2, 3-2 to be remapped offline
so that they point to Files 1-1, 2-1, 3-1 (*.eot), respectively
But this does NOT happen ‼ (i.e., 'URL Transforms' has no effect)
By looking into the source, I found the following lines in a CSS file,
↓ https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/css/fontawesome-all.min.css
src:url(../webfonts/fa-brands-400.eot);
src:url(../webfonts/fa-brands-400.eot?#iefix) ...
src:url(../webfonts/fa-regular-400.eot);
src:url(../webfonts/fa-regular-400.eot?#iefix) ...
src:url(../webfonts/fa-solid-900.eot);
src:url(../webfonts/fa-solid-900.eot?#iefix) ...
So, I think the correct 'URL Transforms' is
'■ Expression: \.eot\?.* , ■ Replacement: .eot'
not
'■ Expression: \.eot\?$ , ■ Replacement: .eot'
Anyway, 'URL Transforms' still doesn't work
Maybe 'URL Transforms' only works for "absolute" URLs in HyperTexts ?
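To compare the two expressions outside WebCopy, here is a minimal sketch in Python. It is only an approximation: WebCopy uses its own regular expression engine, and the helper names (transform, strip_fragment) are mine, not part of WebCopy. It applies both expressions to the font URLs as written in the CSS, and again after the '#iefix' fragment is dropped.

```python
import re

# Font URLs as referenced in fontawesome-all.min.css, resolved to absolute form.
BASE = "https://www.bulkrenameutility.co.uk/assets/vendor/font-awesome/webfonts/"
NAMES = ("fa-brands-400", "fa-regular-400", "fa-solid-900")

STRICT = re.compile(r"\.eot\?$")   # expression anchored at the end of the URL
LOOSE = re.compile(r"\.eot\?.*")   # expression that also swallows what follows '?'

def transform(pattern, url):
    """Apply one expression/replacement pair (hypothetical helper)."""
    return pattern.sub(".eot", url)

def strip_fragment(url):
    """Drop '#...' from a URL; assumed to happen before the request is made."""
    return url.split("#", 1)[0]

for name in NAMES:
    raw = BASE + name + ".eot?#iefix"      # as written in the CSS
    requested = strip_fragment(raw)        # ends in '.eot?'
    print(name)
    print("  strict on raw      :", transform(STRICT, raw))
    print("  loose  on raw      :", transform(LOOSE, raw))
    print("  strict on requested:", transform(STRICT, requested))
```

The anchored expression leaves the raw CSS URL untouched but does rewrite the fragment-less form. Since the link list above shows the URLs ending in '.eot?' (no fragment), the original expression may already be matching what WebCopy sees; in that case the cause would lie elsewhere, but that is only a guess.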
○ Suppose a Website is set up to run over both HTTP and HTTPS
and also the non-WWW links do NOT redirect to the WWW links (or vice-versa)
http[s]://[www.]DownSite.com/index.html
If all the links in HyperTexts are relative, that won't be a problem
(e.g. < a href="[/]dir/faq.html" ...
< img src="[/]images/banner.gif" ... )
but just one absolute link with different protocol or different prefix (WWW or nothing)
can cause a Website to be downloaded twice
(e.g. < a href="http[s]://[www.]DownSite.com/dir/faq.html" ...
< img src="http[s]://[www.]DownSite.com/images/banner.gif" ... )
So, in theory, a Website may be downloaded up to 4 times per session
e.g.
• URL1: ‹https://www.DownSite.com/dir/faq.html› → File1: ‹RootWC\dir\faq.html›
• URL2: ‹https://DownSite.com/dir/faq.html› → File2: ‹RootWC\dir\faq-1.html›
• URL3: ‹http://www.DownSite.com/dir/faq.html› → File3: ‹RootWC\dir\faq-2.html›
• URL4: ‹http://DownSite.com/dir/faq.html› → File4: ‹RootWC\dir\faq-3.html›
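The duplication can be illustrated with a small Python sketch. It does not reproduce WebCopy's internals; canonical_key is a hypothetical helper, and DownSite.com is the placeholder host used above. A crawler that keys its download list on the raw URL string sees four entries, while one that normalizes the protocol and the WWW prefix sees one.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_key(url, scheme="https", add_www=True):
    """Collapse protocol and WWW variants of a URL into one key (illustrative only)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    host = "www." + bare if add_www else bare
    # The fragment is dropped; path and query are kept as they are.
    return urlunsplit((scheme, host, parts.path, parts.query, ""))

variants = [
    "https://www.DownSite.com/dir/faq.html",
    "https://DownSite.com/dir/faq.html",
    "http://www.DownSite.com/dir/faq.html",
    "http://DownSite.com/dir/faq.html",
]

raw_keys = set(variants)                              # keyed on the raw string
canonical_keys = {canonical_key(u) for u in variants} # keyed on the normalized form

print(len(raw_keys), "downloads without normalization")
print(len(canonical_keys), "download with normalization:", canonical_keys)
```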
Ticking the option ‹ URL Normalization: [√] Ensure internal links match domain prefix ›
makes WebCopy interpret URL2 as URL1 and URL4 as URL3
but to make WebCopy treat URL1 & URL3 as the same,
I have no option but to ‹ Force HTTPS › ( i.e., URL3 as URL1, not vice-versa )
‹ Force HTTPS: (•) Always or (•) Only for the following hosts: www.DownSite.com ›
What if a Website has a bad SSL Certificate and HTTP is preferable ?
What if the Website has Mixed Content ?
( i.e., the Website is being loaded over HTTPS,
but some of its resources are being loaded over HTTP )
In these cases, forcing HTTPS to avoid duplication
can result in the loss of some URLs
i.e., some URLs cannot be retrieved due to forcing the HTTPS protocol
○ With MetaProducts Offline Explorer ‹OE› Enterprise: 'URL Substitutes',
which is case-insensitive,
the user has a wide variety of choices, depending on how the Website is set up
Apply to | URL | Replace | With
-------- | --- | ------- | ----
[URLs] | * | https:// | http:// → Force HTTP for all hosts
or
[URLs] | * | http:// | https:// → Force HTTPS for all hosts
[URLs] | * | ://www. | :// → Remove WWW prefix from all hosts
or
[URLs] | * | :// | ://www.
[URLs] | * | ://www.www. | ://www. → Add WWW prefix to all hosts
Apply to | URL | Replace | With
-------- | --- | ------- | ----
↓ Force HTTP , Remove WWW for [www.]DownSite.com
[URLs] | ://DownSite.com | https:// | http://
[URLs] | ://www.DownSite.com | http*://www. | http://
↓ Force HTTPS, Remove WWW for [www.]DownSite.com
[URLs] | ://DownSite.com | http:// | https://
[URLs] | ://www.DownSite.com | http*://www. | https://
↓ Force HTTP , Add WWW for [www.]DownSite.com
[URLs] | ://DownSite.com | http*:// | http://www.
[URLs] | ://www.DownSite.com | https:// | http://
↓ Force HTTPS, Add WWW for [www.]DownSite.com
[URLs] | ://DownSite.com | http*:// | https://www.
[URLs] | ://www.DownSite.com | http:// | https://
By changing 'Apply to: 1) URLs' to 'Apply to: 2) Filenames',
a matched URL is "virtually" converted to another URL
If the converted URL has already been downloaded, nothing will be downloaded
and the matched URL will be interpreted offline to point to the downloaded file
If not, the matched URL will be downloaded as it is,
but it is treated as if it was the converted URL
i.e., the offline directory structure and filename for the downloaded file
will be based on the converted URL, not the matched URL
Or as is written in its manual,
« Note: Apply to filenames means using the URL substitution rule only for downloaded files.
I.e., URLs will be downloaded as they are,
but offline filenames and links will be changed according to the substitution rule »
So, the following lines mean,
« Download as it is, if the converted URL has not already been downloaded
but after download or not, consider it as ... »
Apply to | URL | Replace | With
-------- | --- | ------- | ----
↓ ... consider it as http://etc
[Filenames] | * | https:// | http://
or
↓ ... consider it as https://etc
[Filenames] | * | http:// | https://
↓ ... consider it as http[s]://etc
[Filenames] | * | ://www. | ://
or
↓ ... consider it as http[s]://www.etc
[Filenames] | * | :// | ://www.
[Filenames] | * | ://www.www. | ://www.
Apply to | URL | Replace | With
-------- | --- | ------- | ----
↓ ... consider it as http://DownSite.com/etc
[Filenames] | ://DownSite.com | https:// | http://
[Filenames] | ://www.DownSite.com | http*://www. | http://
↓ ... consider it as https://DownSite.com/etc
[Filenames] | ://DownSite.com | http:// | https://
[Filenames] | ://www.DownSite.com | http*://www. | https://
↓ ... consider it as http://www.DownSite.com/etc
[Filenames] | ://DownSite.com | http*:// | http://www.
[Filenames] | ://www.DownSite.com | https:// | http://
↓ ... consider it as https://www.DownSite.com/etc
[Filenames] | ://DownSite.com | http*:// | https://www.
[Filenames] | ://www.DownSite.com | http:// | https://
○ By UnTicking the option ‹ URL Normalization: [ ] Ensure internal links match domain prefix ›
Ticking the option ‹ URL Normalization: [√] Ignore case (not recommended) ›
and setting ‹ Force HTTPS: (•) Never ›
I expected WebCopy's 'URL Transforms' to behave like
OE's 'URL Substitutes: Apply to: 1) URLs', but it does not
To be more precise, I guess 'URL Transforms' only works for
absolute URLs in HyperTexts and ignores relative URLs
Expression | Replacement
---------- | -----------
[√] https:// | http:// → Force HTTP for all hosts [?]
or
[√] http:// | https:// → Force HTTPS for all hosts [?]
[√] ://www\. | :// → Remove WWW prefix from all hosts [?]
or
[√] ://(www\.)? | ://www. → Add WWW prefix to all hosts [?]
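As a sanity check of the four expressions themselves (not of WebCopy), a minimal Python sketch follows. The names TRANSFORMS and apply are mine, and two things are assumptions: that only the first match is replaced, and that matching ignores case, as with the 'Ignore case' option ticked.

```python
import re

# Expression / replacement pairs from the table above.
TRANSFORMS = {
    "force_http":  (r"https://", "http://"),
    "force_https": (r"http://", "https://"),
    "remove_www":  (r"://www\.", "://"),
    "add_www":     (r"://(www\.)?", "://www."),
}

def apply(name, url):
    """Apply one named transform to a URL string (hypothetical helper)."""
    pattern, replacement = TRANSFORMS[name]
    # Assumption: first match only, case-insensitive.
    return re.sub(pattern, replacement, url, count=1, flags=re.IGNORECASE)

samples = [
    "https://www.DownSite.com/dir/faq.html",
    "http://DownSite.com/dir/faq.html",
    "/dir/faq.html",                      # relative link, as written in the HTML
]

for url in samples:
    for name in TRANSFORMS:
        print(f"{name:11} {url} -> {apply(name, url)}")
```

The relative link contains no '://', so none of the expressions can match it as written. If a tool applied transforms to the raw attribute text rather than to the resolved absolute URL, relative links would pass through unchanged, which would be consistent with my guess above.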
Also, I found out that the 'URL Expression' field
is a condition on the parent URL, not on the URL itself
So, the setting below is incorrect
Expression | Replacement | URL Expression
---------- | ----------- | --------------
↓ Force HTTP for [www.]DownSite.com [×]
[√] https:// | http:// | https?://(www\.)?DownSite\.com
or
↓ Force HTTPS for [www.]DownSite.com [×]
[√] http:// | https:// | https?://(www\.)?DownSite\.com
↓ Remove WWW prefix from [www.]DownSite.com [×]
[√] ://www\. | :// | https?://(www\.)?DownSite\.com
or
↓ Add WWW prefix to [www.]DownSite.com [×]
[√] ://(www\.)? | ://www. | https?://(www\.)?DownSite\.com
Instead, I must use the setting below
The 'URL Expression' field is of no use in this case,
although it does add functionality to WebCopy that OE lacks
Expression | Replacement
---------- | -----------
↓ Force HTTP , Remove WWW for [www.]DownSite.com [?]
[√] https?://(www\.)?DownSite\.com | http://DownSite.com
↓ Force HTTPS, Remove WWW for [www.]DownSite.com [?]
[√] https?://(www\.)?DownSite\.com | https://DownSite.com
↓ Force HTTP , Add WWW for [www.]DownSite.com [?]
[√] https?://(www\.)?DownSite\.com | http://www.DownSite.com
↓ Force HTTPS, Add WWW for [www.]DownSite.com [?]
[√] https?://(www\.)?DownSite\.com | https://www.DownSite.com
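A short Python sketch of what the host-specific expression should do if it is honored. The names HOST_EXPR and canonicalize are mine, and case-insensitive, first-match-only replacement is an assumption. All four variants of the placeholder host collapse to the chosen target, while other hosts are left alone.

```python
import re

# Expression from the table above; case-insensitive by assumption.
HOST_EXPR = re.compile(r"https?://(www\.)?DownSite\.com", re.IGNORECASE)

def canonicalize(url, target="https://www.DownSite.com"):
    """Rewrite any protocol/WWW variant of the host to the target prefix."""
    return HOST_EXPR.sub(target, url, count=1)

variants = [
    "https://www.DownSite.com/dir/faq.html",
    "https://DownSite.com/dir/faq.html",
    "http://www.DownSite.com/dir/faq.html",
    "http://DownSite.com/dir/faq.html",
]

results = {canonicalize(u) for u in variants}
print(results)                                          # a single URL remains
print(canonicalize("http://other.example/dir/faq.html"))  # printed unchanged
```

The other three rows only differ in the target, e.g. canonicalize(url, target="http://DownSite.com") for 'Force HTTP, Remove WWW'. Note that the expression is not anchored, so a stricter version would start with '^'.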