Problem with URL rewriting

Started by matthew.zimmerman, April 12, 2021, 06:29:19 PM

matthew.zimmerman

Hi,

First of all, thanks for working on this great tool!

I'm trying to use WebCopy to generate a static copy of a Bugzilla site. My goal is to "archive" the information that I have hosted in Bugzilla and then shut down the Bugzilla server. I'd like to crawl a Bugzilla buglist (that I have previously generated from a saved search query) and download each of the listed bug pages. The URL of each bug page is of the format `PATH/show_bug.cgi?id=XXX` where `XXX` is a numeric id. However, when each bug page is downloaded, its file name is of the format `show_bug.cgi-N.html` where `N` is an incrementing integer. This means that the "data" of each bug is downloaded, but the URL in the downloaded buglist does not link correctly to the downloaded bug page.

My understanding of WebCopy is if I have "remap references within downloaded files" checked, then the URLs in downloaded files will be rewritten so that they work internally. However, this does not appear to be happening.

An example of this problem can be reproduced from this public bugzilla server:

  • Crawl this URL as the root of the WebCopy: https://bugzilla.mozilla.org/buglist.cgi?product=Core&component=DMD&resolution=---
  • The URL for the first bug on the bugzilla page and the downloaded copy is: show_bug.cgi?id=1082934
  • The downloaded file name for the first bug is: show_bug.cgi.html
  • The downloaded file from the root URL is buglist.cgi but it does not appear to have rewritten URLs for its links to each bug page
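To illustrate the kind of remapping being asked for (this is only a sketch of a post-processing workaround, not WebCopy's actual behaviour — the filename scheme and helper names here are hypothetical), each query-string URL can be turned into a deterministic local filename and matching hrefs rewritten to point at it:

```python
import re
from urllib.parse import urlsplit, parse_qsl

def local_name(url):
    """Map a query-string URL such as 'show_bug.cgi?id=1082934' to a
    deterministic local filename, e.g. 'show_bug.cgi_id-1082934.html'.
    (Hypothetical scheme; WebCopy uses its own incrementing names.)"""
    parts = urlsplit(url)
    base = parts.path.rsplit("/", 1)[-1] or "index"
    query = "_".join(f"{k}-{v}" for k, v in parse_qsl(parts.query))
    return f"{base}_{query}.html" if query else f"{base}.html"

def rewrite_links(html):
    """Rewrite hrefs of the form show_bug.cgi?id=NNN so they reference
    the deterministic local filename instead of the live CGI URL."""
    return re.sub(
        r'href="(show_bug\.cgi\?id=\d+)"',
        lambda m: f'href="{local_name(m.group(1))}"',
        html,
    )

print(local_name("show_bug.cgi?id=1082934"))
# show_bug.cgi_id-1082934.html
```

Because the filename is derived from the URL itself rather than from download order, a buglist rewritten this way keeps working even if pages are re-downloaded in a different order.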

I'm using `Version 1.8.3.768  (64bit)`.

Do you have any suggestions as to what I'm doing wrong?

Thanks in advance.

matthew.zimmerman

Here's a .cwp that demonstrates the problem with the example bugzilla site.

Richard Moss

Hello,

Thanks for the information, and more importantly the details to reproduce. I believe I have reproduced the issue from this and will next do some debugging to try to find the cause and a fix.

Regards,
Richard Moss

matthew.zimmerman

Hi Richard,

Thanks for looking into this. Please let me know if there is anything I can do to assist.

-Matt

matthew.zimmerman

Hi Richard,

Anything else I can do to assist in the debugging or troubleshooting process?

-Matt

matthew.zimmerman

Thanks for your work on the latest release. Unfortunately, I can confirm that this issue is not resolved in version 1.9.0.822.

The_yoyo

Bump this.
I have a very similar issue: the website has a fairly basic structure, but slightly complex URIs, like:
www.example.com/m/?Cat=Selection&V_Sub=X&Page=1&SortBy=Z
and
www.example.com/m/?Cat=Selection&V_Sub=X&Page=2&SortBy=Z

I limited the crawl depth from the root so as to limit the page range (not sure if it worked).
Anyway, the files I get are named:
Index.htm
Index-1.htm
Index-2.htm
...
Index-45678.htm

Meanwhile, the hrefs inside the .htm files refer to the URIs without the index, like:
C/DownloadedWebsites/example.com/m/?Cat=Selection&V_Sub=X&Page=2&SortBy=Z
This means the internal links don't work (404s).

Any thoughts?
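One way to confirm this is the same un-rewritten-link problem is to scan the downloaded pages for hrefs that still carry a query string — each hit is a link the copier never remapped to a local file. A small diagnostic sketch (the `DownloadedWebsites` directory name is taken from the path above and may differ in your setup):

```python
import re
from pathlib import Path

def unrewritten_links(html):
    """Return href values that still contain a '?' query string --
    each one is a link left pointing at the live site instead of a
    local Index-N.htm file."""
    return re.findall(r'href="([^"]*\?[^"]*)"', html)

# Scan every downloaded page under the (assumed) output directory and
# report the hrefs that were never remapped.
for page in Path("DownloadedWebsites").rglob("*.htm*"):
    for href in unrewritten_links(page.read_text(errors="ignore")):
        print(f"{page}: {href}")
```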