Absolute Paths Not Being Rewritten

Started by shikage, March 08, 2019, 12:34:36 AM

shikage

Greetings, I've done a cursory search of the site, but nothing recent seems to cover the issue I am encountering. I am attempting to save a site for offline viewing, but most links on this site use absolute paths for their hrefs, such as /mypage/childpage1, and these are not being rewritten when the site is saved. What should be a URL to file:///C:/savedsites/mysite/mypage/childpage1 is instead a link to file:///mypage/childpage1, which does not work. Is there a way to address this within the project settings that I'm missing?
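
The symptom described above follows from ordinary URL resolution: a root-relative href such as /mypage/childpage1 is resolved against the root of whatever scheme and host served the page, and a file:// URL has no host. The Python sketch below illustrates this with the standard library's urljoin. The host www.example.com and the file name index.html are assumptions made up for the illustration, and the code shows generic URL resolution only, not anything WebCopy itself does.

```python
from urllib.parse import urljoin

# Root-relative href as it appears in the page source.
href = "/mypage/childpage1"

# Hypothetical locations of the same page: online and in the saved copy.
online_page = "https://www.example.com/mypage/index.html"
saved_page = "file:///C:/savedsites/mysite/mypage/index.html"


def resolve(page_url: str, link: str) -> str:
    """Resolve a link against the URL of the page that contains it."""
    return urljoin(page_url, link)


# Online, the link stays under the site's host.
print(resolve(online_page, href))

# In the saved copy, the folder prefix of the saved site is dropped,
# because the href is resolved from the root of the file:// URL.
print(resolve(saved_page, href))
```

This is why a saved copy only works offline if such hrefs are rewritten relative to the page that contains them.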

Richard Moss

Hello,

WebCopy should process these links correctly and convert them into the appropriate relative path for the offline copy. Are you able to share the address of a page that is affected by the issue so that I can run some tests? (If you don't want to share publicly you can send a message or an email)

Thanks;
Richard Moss

shikage

It's an LMS site. I have enrolled in a number of their courses and want to save these web-based courses offline so that I can read through them more easily on the bus and while traveling. You can find them at https://courses.nihongoshark.com/

Unfortunately, when pulling the content without logging in, the links all work and get rewritten, at least for the limited content it has access to; most of the pages just ask you to log in to view the content. When I use the steps to log in using the web browser, I am able to authenticate and it pulls down all of the pages, but the tiles that link to courses have incorrect URLs, and if I open a course directly none of the side nav links work.

EDIT: I tried copying again, and while the main page isn't working, if I go to one of the courses I can navigate the course and read the content, which is really my primary goal. So I think I am fine.

RichardDavies

There's definitely a problem with it not rewriting absolute URLs. I've experienced this issue on two different websites. For example, http://www.csszengarden.com/

Richard Moss

Hello both,

Thanks for the follow-up. I'm just starting work on 1.8 now and I'll definitely run some tests on CSS Zen Garden (wow, can't believe that is still around, I remember buying the book a long time ago). It should still remap absolute URLs, but if there's a bug and I can find it, I'll get it fixed as soon as I can.

Regards;
Richard Moss

Richard Moss

Hello,

I just tested absolute paths and can confirm they aren't being remapped. I should be used to writing stupid bugs by now, but even so this is pretty silly :-[. I'll probably push a new 1.7 release in the next day or two with a fix for this, rather than waiting for 1.8.

Regards;
Richard Moss

Richard Moss

Hello,

In contradiction to my previous post, I haven't been able to reproduce any issues with absolute paths at all. When I ran a test yesterday, I misread the results: the demo site includes the URLs as title attributes as well, and naturally WebCopy doesn't look at title attributes, so when I opened the file and saw an absolute path I immediately assumed the worst instead of spending half a second more to look harder.

In addition, today I downloaded the CSS Zen Garden website. Out of the 6000+ HTML files generated by this, only one has an absolute reference left behind. This is for the RSS link on the main page, although I wasn't able to reproduce it with a smaller scan (the full scan took long enough that I'm not willing to do it again!). It is definitely something to investigate, but I'm unable to reliably reproduce the problem in order to fix it.

The only concrete way I know of to cause this is by cancelling a crawl: if a crawl is cancelled (either manually or automatically), the stage that remaps local files is skipped. However, this means that no links are remapped at all, not just absolute ones.

If you're able to specify an exact set of circumstances where absolute links aren't being remapped that would be of great help.

Regards;
Richard Moss

zephyrus00jp

Hi,

I am new and not yet very familiar with the tool's options.

But could it be that the original poster enabled the option to capture the subdomain (or was it the siblings) of the named site?

When I captured a site, the absolute path was correctly rewritten. However, this was not the case when subdomain/sibling (I can't recall exactly which) capture was enabled.

Come to think of it, in such multiple-domain cases it is not entirely clear against which root the relative path should be created, so I presume rewriting is very difficult, and I suppose the absolute path is therefore used automatically.

Initially I was not sure whether the target website used subdomains or siblings, so I enabled the option, and then the paths stayed absolute and could not be used, as the original poster mentioned. When I disabled the option (meaning that a copy of a single site was attempted), the paths were rewritten as relative paths and all was well.

I think I have to learn the options so that I can copy websites effectively.

TIA

Jelle_S

I'm experiencing the same issue.

When, for example, you're copying a website (https://example.com) and a page on it contains an a tag such as <a href="/page">page</a>, that tag links to https://example.com/page

When you're browsing the copy of that website, the homepage becomes file:///C:/Downloads/example.com, but clicking that same link takes you to file:///C:/page and not file:///C:/Downloads/example.com/page as expected


Richard Moss

Hello,

This isn't something I can reproduce and I need a concrete example. The demonstration website used for integration testing is littered with absolute paths and these are remapping fine.

For example, this HTML from https://demo.cyotek.com/html/elements/a.php:

  <ul>
    <li><a href="/">Home Page</a></li>
    <li><a href="/statuscodes/index.php">HTTP Status Codes</a></li>
  </ul>


comes out as

  <ul>
    <li><a href="..\..\index.htm">Home Page</a></li>
    <li><a href="..\..\statuscodes\index.php.html">HTTP Status Codes</a></li>
  </ul>


And this is correct. If someone is able to provide the URL of a publicly accessible page that will instantly trigger this issue it would be very helpful - if there is a bug in this area, it needs to be fixed.

Regards;
Richard Moss
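
The kind of rewrite shown in the example above can be approximated by computing the path from the directory of the page to the link target. The sketch below does this with Python's posixpath.relpath. The function name relative_href is hypothetical, and the code only illustrates the idea; it is not WebCopy's actual implementation, which, as the output above shows, also renames files and uses backslashes.

```python
import posixpath


def relative_href(page_path: str, target_path: str) -> str:
    """Return target_path expressed relative to the directory of page_path.

    Both arguments are root-relative URL paths such as "/a/b.php".
    (Hypothetical helper, for illustration only.)
    """
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(target_path, start=page_dir)


# The page and one of its root-relative links from the example above.
page = "/html/elements/a.php"
print(relative_href(page, "/statuscodes/index.php"))
# ../../statuscodes/index.php
```

Because the result is relative to the page rather than to the site root, it keeps working wherever the saved copy is placed on disk.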

sullivan1337

Not sure if this is the same issue, but remapping doesn't seem to be working for this publicly accessible site?

https://zilvia.net/f/archive/index.php
