Desperate for help - WebSphere Portal Site copy

Started by JohnnyJohn, January 13, 2016, 08:33:26 PM

JohnnyJohn

Hello all,

I really need some help here. No matter what I've tried, I can never succeed at copying a WebSphere Portal website. The one I'm trying to copy is 99.9% identical in its infrastructure to: https://www.simplemobile.com/wps/portal/home

Can someone please assist with this? What settings are needed? I've tried WebCopy and HTTrack. I specified URLs, the 8.3 URL format/extension, limited crawling depth to 0 and 1, limited external crawling to 0, and tried many other options, with no luck.

Can someone please assist???   :( :( :(

Thank you in advance

Richard Moss

Hello,

Can you provide some detail on what actually isn't working with WebCopy? Some of the options you mention aren't supported by WebCopy, so I assume they belong to the other program you mention - I can't really provide support for that :)

Regards;
Richard Moss

JohnnyJohn

Hi Richard,

Sure,

Using WebCopy to skin/copy a regular HTML/WordPress/CMS site is a breeze, yet with a WebSphere Portal site the generated URLs are usually dynamic, since the site does not have clean URLs. So a website with about 30 regular page URLs might take 2 minutes to download, yet for Portal it would run for a day or two trying to download a million variations of a URL. I don't know if there's a way to stop it from doing so, and I haven't been successful at copying a site running on that platform.

You can try for yourself and see what I mean. I tried the copy website option on another site, https://www.straighttalk.com/wps/portal/home, which has cleaner URLs, and it actually downloaded the site in around 5 minutes but had CSS problems all over the site.

Any thoughts on this?

Thank you

Richard Moss

Hello,

Thanks for the information. I'll take a look at it as soon as I can, but it's likely to be the weekend before I get a chance to look at this in any depth.

I had a very quick look at the site you referenced, and as you noted it has some pretty nasty URLs that seem to break pretty much every URL-based SEO guideline you could think of. I suspect that you might not have any luck with WebCopy either, however, partly because the current version doesn't have the ability to manipulate URLs, e.g. to chop off everything after the first ! character.
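
To illustrate the kind of URL manipulation meant here, below is a minimal Python sketch. It is not something WebCopy does. It cuts each URL at the first `!` and keeps one copy of each result. The function names and the sample URLs are made up for illustration, and the sketch assumes that everything after the `!` is state that does not change which page is served, which may not hold for every portal page.

```python
def truncate_at_bang(url: str) -> str:
    """Return the URL up to (not including) the first '!' character."""
    head, _sep, _tail = url.partition("!")
    return head


def unique_truncated(urls):
    """Truncate every URL and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        short = truncate_at_bang(url)
        if short not in seen:
            seen.add(short)
            result.append(short)
    return result


if __name__ == "__main__":
    # Hypothetical URLs, shaped like portal links but not taken from any site.
    sample = [
        "https://example.com/wps/portal/home/!ut/p/abc123/",
        "https://example.com/wps/portal/home/!ut/p/def456/",
        "https://example.com/wps/portal/plans",
    ]
    for u in unique_truncated(sample):
        print(u)
```

A crawler that applied something like this before queueing a link would treat the many variations of one page as a single URL, provided the assumption above is true for the site.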

I'll get back to you when I have had a chance to run some actual tests.

What a peculiar system...

Regards;
Richard Moss

JohnnyJohn

Thank you so much Richard.

Yes, these websites are hosted on major platforms - IBM WebSphere Portal - which are extremely expensive and can handle loads beyond imagination (used by airlines, telecom and Fortune 100 corporations worldwide).

It's too bad that many companies don't take SEO best practices into consideration, but it is what it is.

I did run WebCopy overnight and it finally finished downloading the entire site with all million variations of the pages. Everything did work fine, yet after a while the cookies expired, I guess, and that broke the CSS.

Hope you can suggest a workaround to this problem. I would highly appreciate it!


JohnnyJohn

Hi Richard,

Any updates on this?  Were you successful copying the site with its assets?

thank you

Richard Moss

Hello,

I looked into it further but don't have a solution yet unfortunately. I'll be taking another look shortly, once I've made some changes to WebCopy.

Regards;
Richard Moss

JohnnyJohn

Great!!  Waiting anxiously for your results!

Thanks once again Richard!!

Richard Moss

I've spent some time now scanning this site and these things are just monsters - I'm going to have to concede defeat for the time being. The number of redirects, missing resources, and the URL patterns are just awful.

Their practice of using different URLs for the same actions (forgot username and forgot password is one pair of actions I spotted doing this) doesn't help either. The length of the URLs and the randomized/encoded/whatever nature of them just makes my eyes bleed.

Unfortunately I'm not going to be able to help further on this issue as I've spent longer on it than I normally do with support requests and I'm not getting anywhere with it at all - I can't even pin down if it's a WebCopy issue or just a massive mass of redirecting pages. If possible I will take another look at this when time allows, but right now there's nothing more I can do.

Sorry for the inconvenience, I hope you find a tool that handles it.

JohnnyJohn

Richard,

Thank you for trying, and I'm 99% sure it's not your software. It's definitely Portal: since it doesn't have any clean URLs, this is bound to happen, as you suggested, given how the URLs are dynamically generated.

I'm going to attempt this on another site with clean URLs and see if this helps. Without spending too much time, do you recommend any settings for a site with clean URLs, for example: https://www.straighttalk.com/wps/portal/home

Maybe another idea/suggestion: can I have your software crawl a certain set of URLs (that I get from the sitemap, for example)?

Thank you once again

Richard Moss

Hello,

I think the default settings should be fine - the only one that I think is problematic is extension remapping. By default, it will keep extensions, so if you have a file named index.php, the downloaded version will also have the extension php, even though it's HTML. This has been on the todo list to overhaul for a while.
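
As a rough illustration of what extension remapping could look like, here is a short Python sketch. It is not WebCopy's implementation. It picks a local file name from the content type the server reported instead of from the extension in the URL. The function name and the small mapping table are assumptions made for this example.

```python
import posixpath

# Assumed mapping from media type to local extension; extend as needed.
CONTENT_TYPE_EXTENSIONS = {
    "text/html": ".html",
    "text/css": ".css",
    "application/javascript": ".js",
    "text/javascript": ".js",
}


def remap_extension(path: str, content_type: str) -> str:
    """Return a local path whose extension matches the content type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    wanted = CONTENT_TYPE_EXTENSIONS.get(media_type)
    if wanted is None:
        return path  # unknown type: keep the original name
    root, ext = posixpath.splitext(path)
    if ext.lower() == wanted:
        return path  # extension is already right
    return root + wanted


if __name__ == "__main__":
    print(remap_extension("index.php", "text/html; charset=utf-8"))
    print(remap_extension("wps/portal/home", "text/html"))
```

With this approach index.php served as HTML would be saved as index.html, and a page with no extension at all would gain one. Links inside the saved pages would need rewriting to match, which the sketch does not attempt.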

You can give WebCopy a fixed list of URLs to scan, but you can't effectively say "only scan what you find in a given location"; that's a limitation of the current rules system. I did a prototype of a new rule engine a good couple of years ago, and due to lack of resources that has never been integrated with WebCopy either.
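
For building such a fixed list from a sitemap, a small Python sketch along these lines might help. It assumes the site publishes a standard sitemap.xml using the sitemaps.org namespace and that you have already saved its contents as text. The function name and the sample XML are made up, and how the resulting list is entered into WebCopy is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str):
    """Return the <loc> values of a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


if __name__ == "__main__":
    # Hypothetical sitemap content, not taken from any real site.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/wps/portal/home</loc></url>"
        "<url><loc>https://example.com/wps/portal/plans</loc></url>"
        "</urlset>"
    )
    # One URL per line, ready to paste into a URL list.
    print("\n".join(urls_from_sitemap(sample)))
```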

Regards;
Richard Moss