Author Topic: Critical speed decrease for big web sites (pull request)  (Read 112 times)

Offline unlimited

  • Newbie
  • *
  • Posts: 1
  • Karma: +0/-0
Critical speed decrease for big web sites (pull request)
« on: April 20, 2021, 11:16:22 AM »
Hi,

I'm crawling a web site with about 200K+ pages to be downloaded and another 3M+ pages ignored by rules.

The web server generates each of those pages in 0.1-0.3 seconds (this is my web site, so I can see the stats).

Initially WebCopy could process at least 1 page per second.

After 12 hours of crawling, the speed of the software decreased significantly.
The statistics at the bottom of the window now show: 309000 / 12 hours / 985 MiB
It now takes up to 20 seconds to process a single page.

It seems WebCopy is using an inefficient data structure internally.
I'm sure everything WebCopy does to process a single page could be done in O(1), with no dependency on how many pages have already been crawled.
The speed decrease looks as if there is a dependency on the page count somewhere near O(log N).

Maybe you are using a binary tree instead of a hash map?
Maybe you are using a priority queue instead of three non-priority queues?
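
Of course, I can only guess at the internals. Just to show what I mean, here is a minimal sketch in Java (my own illustration, not WebCopy code): checking whether a URL was already crawled against a hash set stays O(1) on average, while the same check against a tree-based set costs O(log N) and gets slower as the crawl grows.

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class VisitedUrlSketch {
    public static void main(String[] args) {
        // Hash-based set: membership checks are O(1) on average,
        // regardless of how many URLs are already stored.
        Set<String> visitedHash = new HashSet<>();

        // Tree-based set: the same check costs O(log N) comparisons,
        // so it slows down as the crawl grows.
        Set<String> visitedTree = new TreeSet<>();

        for (int i = 0; i < 300_000; i++) {
            String url = "https://example.com/page/" + i;
            visitedHash.add(url);
            visitedTree.add(url);
        }

        // Both structures answer the same question; only the cost per
        // lookup differs.
        System.out.println(visitedHash.contains("https://example.com/page/12345"));
        System.out.println(visitedTree.contains("https://example.com/page/12345"));
    }
}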

If the sources were available, I could try to find the cause and prepare a pull request to fix this problem.


P.S. Downloading and parsing pages in parallel over a few HTTP connections would be a great performance feature too, but I saw you are already working on that.
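
Just to illustrate the kind of thing I mean, a minimal sketch in Java (my own illustration with made-up URLs, not WebCopy code): a small fixed pool bounds the number of simultaneous connections while several pages are downloaded concurrently.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFetchSketch {
    public static void main(String[] args) {
        // A small fixed pool keeps the number of simultaneous connections
        // bounded so the target server is not overwhelmed.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        HttpClient client = HttpClient.newBuilder().executor(pool).build();

        List<String> urls = List.of(
                "https://example.com/page/1",
                "https://example.com/page/2",
                "https://example.com/page/3");

        // Fetch the pages concurrently; each body could be handed to the
        // parser as soon as it arrives.
        List<CompletableFuture<Void>> downloads = urls.stream()
                .map(url -> client.sendAsync(
                                HttpRequest.newBuilder(URI.create(url)).build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenAccept(response -> System.out.println(
                                url + " -> " + response.body().length() + " chars")))
                .toList();

        downloads.forEach(CompletableFuture::join);
        pool.shutdown();
    }
}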
« Last Edit: April 21, 2021, 06:53:21 PM by unlimited »

Offline Richard Moss

  • Cyotek Team
  • Administrator
  • Sr. Member
  • *****
  • Posts: 450
  • Karma: +25/-0
    • cyotek.com
Re: Critical speed decrease for big web sites (pull request)
« Reply #1 on: April 26, 2021, 04:32:10 AM »
Hello,

Thanks for the information. While I have mulled over releasing the source code of the crawler component in the past, it isn't something that will be happening at the moment. I've logged issue #399 to look into this further. My initial assumption, given your description, is that the slowdown is in querying the link map, but I won't know until I do some profiling once I have set up a somewhat larger dataset than the one I usually use for testing.

Regards;
Richard Moss

Offline Richard Moss

  • Cyotek Team
  • Administrator
  • Sr. Member
  • *****
  • Posts: 450
  • Karma: +25/-0
    • cyotek.com
Re: Critical speed decrease for big web sites (pull request)
« Reply #2 on: May 03, 2021, 08:28:27 AM »
Hello,

Profiling a test website with 1 million pages via the CLI didn't show any major issues, and speed was consistent throughout. While I found nothing major, I did tweak some functionality based on the profiling to improve performance slightly.

That means my original assumption was wrong (which I'm actually happy about!), but while the CLI profile was running I started looking through the UI code and found a facepalm-inducing issue. When I virtualised the reporting lists, I wrote the lookup code in a very inefficient way, and profiling with just 10 thousand pages showed it to be a major bottleneck.

Reworking that code yielded around a 30% performance increase on the 10-thousand-page test. As the lookup no longer operates in O(n), performance should stay constant rather than getting worse and worse over time as you reported.
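
To give a rough idea of the kind of mistake, here is an illustrative sketch in Java with invented names (not the actual WebCopy code). If the results are stored keyed by URL and the virtual list's "give me row n" request is answered by walking that collection, each row costs O(n); keeping an indexable view alongside makes the same request O(1).

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: crawl results keyed by URL, displayed in a virtual
// list that requests rows by display index.
class CrawlReport {
    private final Map<String, String> statusByUrl = new LinkedHashMap<>();
    private final List<String> urlsInOrder = new ArrayList<>();

    void record(String url, String status) {
        if (statusByUrl.put(url, status) == null) {
            urlsInOrder.add(url);
        }
    }

    // The mistake: answering "give me row n" by walking the map,
    // which is O(n) for every row the UI paints.
    String rowByScanning(int displayIndex) {
        int i = 0;
        for (Map.Entry<String, String> entry : statusByUrl.entrySet()) {
            if (i++ == displayIndex) {
                return entry.getKey() + " " + entry.getValue();
            }
        }
        throw new IndexOutOfBoundsException();
    }

    // The fix: keep an indexable view alongside the map so the same
    // request is answered in O(1).
    String rowByIndex(int displayIndex) {
        String url = urlsInOrder.get(displayIndex);
        return url + " " + statusByUrl.get(url);
    }
}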

The original virtual list implementation was written for fixed data sets and wasn't designed to work with one that is continuously extended, so there are still more improvements to be made (although they won't match the scale of this first fix).

A build with at least the lookup fix in place should be available by the end of the day (build 789 or above), but I will also do some more work on the other improvements I noted.

Lesson learned: profile the UI more often. (I usually profile just the CLI as it is simpler to automate, but I've now extended the GUI to support the same parameters as the CLI so that it can be automated as well.)

Many thanks for finding this issue and helping to improve WebCopy for other users.

Regards;
Richard Moss