Recent posts

#1
WebCopy / Re: Limit distance from root U...
Last post by david72 - Yesterday at 03:05:36 PM
Hey

I found that the scan folder is limited to a maximum of 1000 pages.

Is there any other way?

Also, I don't understand Max Depth and Limit Crawl Depth; are they the same?

Thanks
#2
WebCopy / Re: Want to clone a website wh...
Last post by Richard Moss - March 19, 2023, 01:10:41 PM
Hello,

Belated Happy New Year! And 2023 is shaping up nicely so far, hence even replying to support tickets and forum posts, months late as they mostly are.

Simply put, you probably can't. Captchas require JavaScript, and WebCopy doesn't execute it.

Regards;
Richard Moss
#3
WebCopy / Re: Only http and https URI sc...
Last post by Richard Moss - March 19, 2023, 01:08:28 PM
Necro reply, sorry.

This was logged as issue #441 and is fixed in WebCopy 1.9.1.

When using the positional argument that can be either a URI or a file, file paths can be relative or absolute, but URIs must be fully qualified.
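As a rough illustration of that rule (a minimal sketch in Python, not WebCopy's actual argument handling; the helper name and checks are assumptions):

```python
# Hypothetical sketch: deciding whether a command-line argument is a fully
# qualified URI or a file path. Not WebCopy's actual code.
from pathlib import Path
from urllib.parse import urlparse

def classify_argument(value: str) -> str:
    parsed = urlparse(value)
    # A fully qualified URI needs both a scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "uri"
    # File paths may be relative or absolute.
    return "file" if Path(value).expanduser().exists() else "unknown"

print(classify_argument("https://example.com/"))   # uri
print(classify_argument("projects/site.cwp"))      # file (if it exists)
```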

Regards;
Richard Moss
#4
WebCopy / Re: Limit distance from root U...
Last post by Richard Moss - March 19, 2023, 09:37:48 AM
Quote from: david72 on March 19, 2023, 08:04:24 AM

I am trying to scan only the folders, without saving HTML files.

Does anyone have an idea how to do this? Due to huge files and deep folders, it's difficult to store index files.

Thanks
Dav


I haven't quite worked out if this is an AI-generated spam post, but decided to err on the side of caution this time instead of deleting the account and post as I usually do. I did edit the post to remove the link, though.

Open the Project menu and click Scan Website.
#5
WebCopy / Re: Limit distance from root U...
Last post by Richard Moss - March 19, 2023, 09:35:02 AM
Quote from: hajzlik on December 09, 2021, 11:02:09 AM

There should be an option to set different limits for HTML and for non-HTML files.

Let's say you have a sitemap. You want to download all the linked pages, but not any further links.

You can limit the distance from root URL to 1, but then you end up with HTML pages without images and other content.

You can limit the distance from root URL to 2, and the other content will download, but you will end up with a bunch of unwanted HTML files (but without any images they contain, which makes them even more useless).

Hello,

Belated thanks for the feedback. This makes a lot of sense! I don't think it makes sense to add another option (or maybe add it but not expose it for now, or hide it away somewhere), but I do think it makes sense to completely ignore distance for non-HTML. I've logged that as issue #464.
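To make the idea concrete, here is a minimal sketch (in Python, purely illustrative and not WebCopy's implementation; the callback names are assumptions) of a crawl where only HTML pages count toward the distance limit:

```python
# Hypothetical sketch of the idea behind issue #464: only HTML pages count
# toward the distance-from-root limit; assets linked from an in-limit page
# are always downloaded. Not WebCopy's actual code.
from collections import deque

def crawl(root_url, max_html_depth, fetch, extract_links, looks_like_html):
    # looks_like_html(url) is a heuristic, e.g. based on extension or content type.
    seen = {root_url}
    queue = deque([(root_url, 0)])             # (url, distance from root)
    while queue:
        url, depth = queue.popleft()
        body = fetch(url)                       # download and save the resource
        if not looks_like_html(url):
            continue                            # non-HTML is saved, never expanded
        for link in extract_links(body):
            # The distance limit applies to HTML links only; images, CSS and
            # other content referenced by an in-limit page are always queued.
            over_limit = looks_like_html(link) and depth + 1 > max_html_depth
            if link not in seen and not over_limit:
                seen.add(link)
                queue.append((link, depth + 1))
```

With a limit of 1 this would download the sitemap, every page it links to, and the images on those pages, without pulling in a further layer of HTML.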

Thanks again!

Regards;
Richard Moss
#6
WebCopy / Re: Limit distance from root U...
Last post by david72 - March 19, 2023, 08:04:24 AM
I am trying to scan only the folders, without saving HTML files.

Does anyone have an idea how to do this? Due to huge files and deep folders, it's difficult to store index files.

Thanks
Dav
#7
WebCopy / Re: Could not create SSL/TLS s...
Last post by Jimbo - March 14, 2023, 09:44:34 AM
I have just encountered this error message using an existing WebCopy project which was working fine back on February 25.

From what Richard says, does:

Quote... it doesn't yet support TLS 1.3

... sort of imply that the TLS version on the target site has been changed in the interim?



I'm no expert at all in this sort of thing. I Googled how to check a site's TLS level, and one site that actually does work without generating that message is showing v1.3 (see attachment).

sec.jpg


The site that generates the error message has this:

gg sec.jpg

They look the same to me. I am puzzled.
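(For anyone wanting to check this locally rather than via an online checker, a minimal sketch in Python; the hostname is a placeholder, not the site from this thread:)

```python
# Print the TLS version a server negotiates with a default client.
import socket
import ssl

def negotiated_tls_version(host: str, port: int = 443) -> str:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()              # e.g. "TLSv1.2" or "TLSv1.3"

print(negotiated_tls_version("example.com"))  # replace with the site under test
```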
#8
WebCopy / Keeps logging out during crawl
Last post by ThomasS - March 13, 2023, 07:25:34 PM
This was reported a year ago (link to thread), but for me it still doesn't work: I am trying to crawl a phpBB site that contains many logoff links. Although I excluded these through a rule (which apparently works), it seems like WebCopy calls the excluded URLs anyway before deciding to exclude them.

WebCopyProb.png

When I look at the local pages, "memberlist" (and all the other pages above) still shows that I'm logged in, while "search" shows that I was logged out - although the "mode=logout" link was excluded.

Is there anything I can do to stay logged in? I tried 1.9.0 and 1.9.1 (nightly), but to no avail: both logged off during crawling.
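The behaviour I am hoping for is roughly this (a minimal sketch in Python, just to illustrate the expected order of operations; the rule pattern and function names are assumptions, not WebCopy internals):

```python
# Sketch of the expected behaviour: evaluate exclusion rules *before* a URL is
# requested, so links such as "mode=logout" are never hit and the session
# stays logged in. Not WebCopy's actual code.
import re
from collections import deque

EXCLUDE_RULES = [re.compile(r"mode=logout")]      # example phpBB logoff rule

def is_excluded(url: str) -> bool:
    return any(rule.search(url) for rule in EXCLUDE_RULES)

def crawl(start_url, fetch, extract_links):
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        body = fetch(url)                          # only non-excluded URLs reach here
        for link in extract_links(body):
            if link not in seen and not is_excluded(link):
                seen.add(link)
                queue.append(link)
```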

Thank you for your support
Thomas
#9
WebCopy / Re: Could not create SSL/TLS s...
Last post by BlueSky - March 01, 2023, 03:40:16 AM
Richard,

Thanks for the in-depth dissection. I did not notice that first URL; the target website contains a trove of information on undocumented Microsoft protocols. I am not a certificate or cryptography expert at all; I just read that TLS is part of HTTPS. I did try HTTrack, and it works, except it doesn't replicate the HTML layout of the tree menu adjacent to the menu content, which is OK.
#10
WebCopy / Re: Could not create SSL/TLS s...
Last post by Richard Moss - February 28, 2023, 07:36:03 AM
Hello,

Thanks for the report. WebCopy already supports TLS 1.2 but it doesn't yet support TLS 1.3 - that is planned for the next-but-one version (first a point release to resolve some bugs, then a major release to update to .NET 4.8 and add TLS 1.3 support). Trying to work around it with context switches probably won't work because WebCopy allows you to choose which protocols you want to support, so it always sets a value.
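(As an aside, the protocol-selection point can be illustrated outside WebCopy; the sketch below is Python rather than the .NET stack WebCopy actually uses, and the hostname is a placeholder:)

```python
# Illustration only: when the client explicitly pins its allowed TLS versions,
# a server that insists on a newer one fails the handshake outright.
import socket
import ssl

context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.maximum_version = ssl.TLSVersion.TLSv1_2   # offer TLS 1.2 only

host = "example.com"                                # substitute the failing site
try:
    with socket.create_connection((host, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print("Negotiated", tls.version())
except ssl.SSLError as exc:
    # A TLS 1.3-only server ends up here, much like the
    # "Could not create SSL/TLS secure channel" error above.
    print("Handshake failed:", exc)
```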

Some other points - firstly, the only URL from that scan which had the communication issue was a third party site linked by the main site.

2023-02-28 06_44_26-www.geoffchappell.com.cwp - Cyotek WebCopy.png

Secondly, I did a test where I bumped to support TLS 1.3 and it still didn't work. Instead of "The request was aborted: Could not create SSL/TLS secure channel." I got "The client and server cannot communicate, because they do not possess a common algorithm".

Interestingly, I tried using cURL and it couldn't do it either: "curl: (35) schannel: next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE (0x80090326) - This error usually occurs when a fatal SSL/TLS alert is received (e.g. handshake failed). More detail may be available in the Windows System event log.". This was logged as an OS-level error, so I wonder if some ciphers are disabled by default. Tested on Windows 10 Pro 22H2, so not that far out of date. More interestingly, Firefox is quite happy to load the page.

2023-02-28 07_27_30-C__WINDOWS_system32_cmd.exe.png

2023-02-28 07_15_55-Event Viewer.png

With a URL of `hacke.rs` it's almost inevitable they'll be doing something clever. WebCopy's crawl engine focuses on what to do with the content it downloads; it leaves encryption and networking to the framework, which works many nines most of the time, so I don't have much I can offer at this point.

I need to do some more digging, but there's not much more to be done until TLS 1.3 support is added which won't be for a little while yet. Meanwhile I need to rethink why Quick Scan feels the need to display a blocking error message for a secondary site.

Regards;
Richard Moss