Hello!
I would like to collect links from forum pages. Read: use your product for something it wasn't designed for. ;+)
It seems to be able to do this. Can you help me to refine the method?
Requirements:
- Start from a forum page
- Collect links for articles / posts on that page
- Walk the page controls to see all pages in forum
- Stay within that forum; don't wander into another forum on the same website
Example, the Cyotek WebCopy forum:
https://forums.cyotek.com/cyotek-webcopy/
There are 9 pages within the WebCopy forum topic, with 20 posts / links on each page.
WebCopy will copy all of the individual forum post links into the RESULTS tab, which I can export to CSV. That will give me the start page (page 1), and pages 2, 3, and 9 (links visible in the page controls on page 1).
I think that I have to create a rule to crawl the page controls, which should work through all of the forum pages (ie: 1 through 9, inclusive).
No downloads. To keep it fast, WebCopy shouldn't crawl through each post.
What would the project settings and rules look like?
It should be able to do this, although forum scanning can be tricky - there's so many links in this "edit, reply" etc that you would generally want to exclude from downloads so often you would need to spend a lot of time creating rules. You would still probably want to do this even in this case as you might not be making a permanent copy, but WebCopy still needs to scan the HTML to discover where else to search. As you are looking for links, that implies ones embedded in content so the usual edit, reply, new topic etc stuff should probably be excluded.
If you have a look at the tutorial documentation on how to copy only images (https://docs.cyotek.com/cyowcopy/current/examplecopyimages.html) this describes how to create a rule that will scan the HTML, but not preserve it.
You could use this rule to scan HTML content without keeping it. You'd then need to add other rules to prevent everything else from being downloaded.
However, there might be a simpler way - WebCopy includes two modes; Scan and Copy.
When you Scan a website, WebCopy will essentially do what I described above automatically - it will look at all HTML, but not keep a copy of anything. It will automatically ignore any non-HTML content.
You would still need to add rules to exclude the bits of the forums you don't care about, and possibly tinker with some of the options on the General page of the Project Properties - for example, if Download All Resources is set, WebCopy will make requests to external URLs to determine their content type, which shouldn't be necessary for just a scan.
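As a rough sketch of the kind of exclusion rules involved - WebCopy rules match URLs by regular expression, and you can sanity-check candidate patterns in Python before adding them to a project. The URL shapes below are hypothetical; inspect your forum's actual links first:

```python
import re

# Hypothetical forum URLs -- check the real links on your forum pages.
urls = [
    "https://forums.example.com/cyotek-webcopy/",                  # forum index
    "https://forums.example.com/cyotek-webcopy/2/",                # page control
    "https://forums.example.com/cyotek-webcopy/some-post/",        # post link
    "https://forums.example.com/cyotek-webcopy/some-post/reply",   # action link
    "https://forums.example.com/another-forum/",                   # other forum
]

# Candidate exclusion patterns, in the same regex form WebCopy rules use.
# Both patterns are assumptions for illustration, not tested project rules.
exclusions = [
    re.compile(r"/(reply|edit|new-topic)\b"),  # per-post action links
    re.compile(r"^(?!.*/cyotek-webcopy/)"),    # anything outside the target forum
]

for url in urls:
    excluded = any(p.search(url) for p in exclusions)
    print(("EXCLUDE" if excluded else "keep"), url)
```

Running this marks the reply link and the other forum as excluded while keeping the index, page controls and post links, which is roughly the behaviour you want from the rules.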
Once you have completed your scan you have two options:
1. You can export (https://docs.cyotek.com/cyowcopy/current/csvexport.html) the entire links list to CSV and then do whatever filtering you need.
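For that filtering step, a short script can prune the exported list down to post links only. This is a sketch: the single "URL" column and the URL shapes are assumptions, so adjust them to match your actual CSV export:

```python
import csv
import io
import re

# Stand-in rows for a WebCopy CSV export; the "URL" header and the
# URL shapes are assumptions -- match them to your real file.
sample_csv = """URL
https://forums.example.com/cyotek-webcopy/
https://forums.example.com/cyotek-webcopy/2/
https://forums.example.com/cyotek-webcopy/how-to-exclude-pages/
https://forums.example.com/other-forum/unrelated-post/
"""

# Keep rows inside the target forum, dropping the index page and the
# page-control pages (a bare forum root or a trailing page number).
post_link = re.compile(r"/cyotek-webcopy/(?!\d+/?$)[^/]+")

with io.StringIO(sample_csv) as f:  # swap for open("export.csv") in practice
    reader = csv.DictReader(f)
    posts = [row["URL"] for row in reader if post_link.search(row["URL"])]

for url in posts:
    print(url)
```

With the sample data this keeps only the post link and drops the index, pagination and other-forum rows.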
2. There is a built-in report which lists only external URLs, which you can then export.
Unfortunately, however, this will be a manual process using the GUI - the CLI is currently sorely lacking in features, so it isn't possible to automate.
Hope this helps.
Regards,
Richard Moss