Download dynamic pages without extension

Started by rasved, August 21, 2020, 01:43:29 PM

rasved

I'm trying to make a local copy of Atlassian Confluence/Jira sites (only the pages I can access with my login - I am not interested in information I'm not allowed to see).

The problem is that WebCopy only seems to download pages that have an extension.
Any idea how I can make WebCopy download pages without an extension?

Example: How can I download https://community.atlassian.com/, where the subpages (e.g. https://community.atlassian.com/t5/Confluence-questions/Cannot-see-the-edit-button/qaq-p/1463638) don't have an extension?

Thanks in advance,

Rasmus

Richard Moss

Hello,

Apologies for the delay in responding, but congratulations on the number of different ways you have tried in an attempt to get a response. Unfortunately, I am not in a position to be able to provide prompt support so it usually takes me a while to get around to forum posts and helpdesk tickets.

WebCopy doesn't care whether a URL has an extension or not; for the most part, the extension is meaningless. All WebCopy cares about is whether, given a URL, it can download and scan that URL.

As far as I can see, WebCopy is downloading the main index and sub pages for Confluence.

However, what WebCopy cannot do is download Jira sites. This is because Jira uses JavaScript to build the page, and WebCopy doesn't execute JavaScript.

I recently added an experimental mode where WebCopy uses a headless browser in conjunction with its own crawling to try and work with JavaScript applications such as Jira. However, it is a) experimental and b) currently uses Internet Explorer which many modern websites rightfully don't support. But I don't have the time to extend it at this point to support Chromium and Gecko instead.

Regards;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

rasved

;-)

Thank you for the compliment ;-)
(I get the tone - I'm just really struggling to get the site downloaded)

Can I ask you for a configuration file for a Confluence site? I have tried the headless "web browser" option, but the result isn't significantly different - probably due to the reasons you mention.

Glad WebCopy doesn't care about the extension - I agree, that wouldn't make much sense.
Do you have any idea why I just get an Atlassian logo (see the attached file), and none of the pages linked below it?

If I manage to solve this issue, I have a support pool from which I'll gladly donate to your project.



Richard Moss

Hello,

As far as I know, I don't have access to any Confluence sites, so I just started scanning the community link you posted. After leaving the scan for a few minutes, it had looked at dozens of URLs and had several thousand in the queue, which would indicate it was working.

The overview file you posted doesn't have much in the way of HTML, which implies that JavaScript is used to build the page in the browser, and that is why WebCopy isn't having any luck with it. If the experimental headless mode doesn't work then there's probably not a lot that the current versions of WebCopy can do.

I did a quick test on a public facing Jira instance I'm familiar with - scan only, externals disabled, and with a rule to exclude anything with "login" in the URL. As expected, this only detected a few dozen URLs, mostly images, and completed after a short time.
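For illustration only, the "exclude anything with login in the URL" rule behaves roughly like the filter below. This is a sketch in Python, not WebCopy's rule engine; the function name `should_crawl` and the example URLs are made up, and the case-insensitive match is an assumption.

```python
import re

# Hypothetical exclusion pattern: any URL containing "login" is skipped.
# Matching case-insensitively is an assumption made for this sketch.
EXCLUDE = re.compile(r"login", re.IGNORECASE)


def should_crawl(url: str) -> bool:
    """Return True when the URL does not match the exclusion pattern."""
    return EXCLUDE.search(url) is None


if __name__ == "__main__":
    # Made-up URLs, used only to show which ones pass the filter.
    for url in (
        "https://jira.example.com/browse/ABC-1",
        "https://jira.example.com/login.jsp",
    ):
        print(url, should_crawl(url))
```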

On checking the root URL for this instance, I confirmed that it doesn't have many links in the actual HTML, most of it is constructed on the client via JS.
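If you want to run the same kind of check on a saved page yourself, a rough way is to count the links that exist in the raw HTML before any JavaScript runs. The sketch below uses only Python's standard library; the `SAMPLE` markup is invented to mimic a page that is built on the client, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collects <a href> targets and counts <script> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "script":
            self.scripts += 1


def summarize(html: str):
    """Return (number of links, number of script tags) found in the markup."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return len(parser.links), parser.scripts


# Invented markup: an empty container that a script would fill in later.
SAMPLE = """<html><body><div id="root"></div>
<script src="/app.js"></script><a href="/login">Log in</a></body></html>"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A page with very few links but one or more scripts and an empty container is a hint that a crawler which doesn't execute JavaScript will find little to follow.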

Next I enabled the experimental browser mode and tried again, and waited, and waited (using this mode is slow!). However, I could see that it was working - the queue quickly filled up with around a thousand or so items and I could see it crawling issue lists that it previously couldn't. After 5 mins I cancelled the scan.

So, this implies that the experimental mode should work.

Try this on your Confluence site and, once it has finished, take a look at the Skipped tab to see if there are things in there that you wanted downloaded. For example, one common cause of confusion is users downloading something like example.com/level1/level2 which links to example.com/levela and wondering why levela is missing. (It is missing because it is "above the root", which by default WebCopy assumes you don't want as you deliberately started with a deep link.)
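To make the "above the root" idea concrete, here is a small Python sketch of that kind of check. It is not how WebCopy implements it; the function name `is_above_root` is hypothetical, and treating the start URL's path as a folder is an assumption made for the example.

```python
from urllib.parse import urlsplit


def _as_folder(path: str) -> str:
    """Normalise a path so that it always ends with a slash."""
    return path if path.endswith("/") else path + "/"


def is_above_root(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the same host as start_url
    but outside the folder that the crawl started in."""
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    if start.netloc.lower() != candidate.netloc.lower():
        return False  # a different host is external, not "above the root"
    # Assumption: the start URL's path is treated as the root folder.
    root = _as_folder(start.path)
    return not _as_folder(candidate.path).startswith(root)


if __name__ == "__main__":
    start = "https://example.com/level1/level2"
    print(is_above_root(start, "https://example.com/levela"))
    print(is_above_root(start, "https://example.com/level1/level2/page"))
```

With the URLs from the example above, levela is flagged as above the root, while a page underneath level1/level2 is not.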

Hope something in this ramble helps!

Regards;
Richard Moss