Is manual user authentication possible?

Started by keyboarder, May 04, 2016, 08:01:08 PM


keyboarder

I am aware that others have posted questions about authentication on Google Sites.  I have read through those posts, but didn't really find a solution that works.  I appreciate any help to get this working.

Before trying automated solutions like forms, I wonder if manual authentication is possible.  Can I simply log in to the website in a browser window (like when I click "Capture"?) and then have WebCopy continue from there?

For instance, if I click the "Capture" button, the main web address is loaded in a browser window (using MSIE as far as I can tell).  I can log in, navigate to various pages, etc.  If I close the Capture window and come back, I am still logged in.  However, trying to download the website fails because it seems that WebCopy is somehow navigating back to the authentication and account pages. 

After adding some entries like *.google.com to the Additional Hosts list, I finally get WebCopy to crawl through some pages, but it seems to be stuck in an endless loop of Google authentication and account pages.  It never gets back to my primary URI to crawl from there.  When I browse the website myself, all the pages I want are simply in the same domain with the same base URI, e.g. https://sites.google.com/site/mysitename, so it should be simple if I could only get it through the authentication.

Richard Moss

Hello,

Thanks for your message. Interesting question!

Yes, the Capture Form tool uses Internet Explorer, but I wasn't aware it was "keeping" cookies after the window had been closed - I need to see if I can influence that behaviour.

As for manually logging in, it's an interesting idea. I'd rather the software logged in automatically, so I need to take another look at Google Sites given the changes that have been made to WebCopy since the original request. Have you tried the latest nightly builds, which include changes to how logins are processed? It would be interesting to know if those changes let you log in to Google Sites now.

I'll try and get a test site created on Google Sites and see what's going wrong shortly.

Thanks;
Richard Moss
Read "Before You Post" before posting (https://forums.cyotek.com/cyotek-webcopy/before-you-post/). Do not send me private messages. Do not expect instant replies.

All responses are hand crafted. No AI involved. Possibly no I either.

keyboarder

Thanks Richard.  I appreciate you looking into this.

I downloaded the latest build (1.1.2 build 160). I entered my website URI then clicked Capture and a crash reporter window popped up (4 windows actually).  I submitted the report with comments, so I suppose you'll see it.  I uninstalled the nightly build and went back to stable 1.1.1.4.

keyboarder

Richard, I like the idea of automated login and appreciate your effort to have WebCopy handle it, but I asked about manual authentication because I probably just don't understand how to completely and properly configure WebCopy for automated login.  Here's what I experience, with some questions in between:

My primary website URL is https://sites.google.com/site/siborlab/.
Using the Test URL tool, here's what I see:
GET https://sites.google.com/site/siborlab/ returns a redirect like this
<HTML>
<HEAD>
<TITLE>Moved Temporarily</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Temporarily</H1>
The document has moved <A HREF="https://www.google.com/a/UniversalLogin?service=jotspot&amp;passive=1209600&amp;continue=https://sites.google.com/site/siborlab/&amp;followup=https://sites.google.com/site/siborlab/">here</A>.
</BODY>
</HTML>
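
For anyone reading along, the link in that redirect body can be picked apart to see where the login page intends to send the browser afterwards. This is only a rough sketch using Python's standard library; the helper name describe_redirect is made up for illustration and has nothing to do with how WebCopy parses responses.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# The HREF exactly as it appears in the "Moved Temporarily" body above,
# including the HTML-escaped ampersands.
href = (
    "https://www.google.com/a/UniversalLogin?service=jotspot"
    "&amp;passive=1209600"
    "&amp;continue=https://sites.google.com/site/siborlab/"
    "&amp;followup=https://sites.google.com/site/siborlab/"
)

def describe_redirect(raw_href):
    """Return the redirect host and its query parameters as a flat dict.

    Hypothetical helper for illustration only.
    """
    # Undo the HTML escaping (&amp; -> &) before splitting the URL.
    parts = urlsplit(unescape(raw_href))
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, params

host, params = describe_redirect(href)
print(host)                # host that handles the login
print(params["continue"])  # where the login flow should return to
```

The continue and followup parameters both point back at the original site, which suggests the login host is expected to hand control back to the primary URI once authentication succeeds.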


If I then inspect the redirect URL, I get a sequence of redirects:
However, when I browse the website, I end up at https://accounts.google.com/AccountLoginInfo.  I never see the other redirects.  I created a Form and/or Password entry for this page.  Without changing any other settings, I first get this:


I assume that "External" means it is outside the primary URL domain.  As mentioned in the original post, I added some google domains to Additional Hosts in Project Properties.  Following image is the results tab after that.  The password and forms don't seem to work and I don't even see https://accounts.google.com/AccountLoginInfo in the list.  I don't understand how WebCopy navigates differently from what I would see and do in a browser.  Am I configuring WebCopy incorrectly?  Do I need to establish rules for login pages instead of adding domains to Additional Hosts?  I'm a bit confused as to what options to use.




Richard Moss

Details, details, details... thanks for the detailed follow up.

Quote from: keyboarder on May 06, 2016, 03:06:44 PM
Thanks Richard.  I appreciate you looking into this.

I downloaded the latest build (1.1.2 build 160). I entered my website URI then clicked Capture and a crash reporter window popped up (4 windows actually).  I submitted the report with comments, so I suppose you'll see it.  I uninstalled the nightly build and went back to stable 1.1.1.4.

Ah, that one. Quite a lot of people are getting that Access Denied crash. I dislike IE quite a lot... I think I know what's going wrong with that, and I'll get it corrected in the next update.

Quote from: keyboarder on May 06, 2016, 04:21:31 PM
Richard, I like the idea of automated login and appreciate your effort to have WebCopy handle it, but I asked about manual authentication because I probably just don't understand how to completely and properly configure WebCopy for automated login.  Here's what I experience with some question in between:

(lots of stuff)

In theory there shouldn't be a difference. Make request, read response, make new request. It's hardly rocket science. And yet logging into sites remains WebCopy's number 1 point of failure.  :-[ Of course, if JavaScript is involved all bets are off, and WebCopy won't be able to handle it right now, possibly never.
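
To make the "make request, read response, make new request" loop concrete, here is a minimal sketch in Python. It is not WebCopy's code. The fetch callable, the follow_redirects name and the example.test hosts are all assumptions for illustration; fetch stands in for whatever actually performs the HTTP request. The sketch also stops when a URL repeats, which is one way a crawler might notice the kind of endless login loop described earlier in the thread.

```python
from urllib.parse import urljoin

def follow_redirects(fetch, start_url, max_hops=10):
    """Follow Location headers by hand and return every URL visited.

    `fetch` is a hypothetical callable: url -> (status_code, headers_dict).
    Raises RuntimeError if a URL repeats or the hop limit is exceeded.
    """
    visited = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in (301, 302, 303, 307, 308) or not location:
            return visited  # final (non-redirect) response reached
        url = urljoin(url, location)  # Location may be relative
        if url in visited:
            raise RuntimeError("redirect loop at " + url)
        visited.append(url)
    raise RuntimeError("too many redirects")

# Canned responses standing in for a real server (made-up example hosts).
fake_site = {
    "https://example.test/site/": (302, {"Location": "https://login.example.test/start"}),
    "https://login.example.test/start": (302, {"Location": "/form"}),
    "https://login.example.test/form": (200, {}),
}

chain = follow_redirects(fake_site.__getitem__, "https://example.test/site/")
print(chain)
```

Passing the fetch function in as a parameter keeps the loop itself independent of any particular HTTP library, so the same logic can be exercised against canned responses like the ones above.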

When posting a form WebCopy will follow redirects (well, it will follow the first at least, something else to check - I know it will follow a chain when crawling, but I can't remember offhand if I let it do that when doing that first post) but it won't display them in the log - technically it hasn't started crawling at that point, so the various event handlers that populate the log are never called.

Manual authentication is a horrible hack. And yet, it might just solve problems until I can figure out a better solution. And besides, manual authentication is already possible for 401 challenges, so why not forms too. I'll mull that one over as well.

Thanks again for the feedback. I'll try and take another look at WebCopy's login handling over the weekend if all goes well and will report back when I have some better news.

Regards;
Richard Moss

Richard Moss

Just to add a short update, I managed to log in to a test website I created from a new Google Account (as my own account uses 2 factor auth, WebCopy will never be able to log in to that without some manual intervention!) and get a copy of the website, without adding any extra config beyond setting the URI.

Lucky for me, even though Google's login is a two step process (page 1 enter username, page 2 enter password), sticking both parameters into one request works nicely. But that's something else I now have to think about: while there's nothing to stop you from posting two forms in a chain, you can't link them, e.g. so that the second post can read values from the first post's response.
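
As a rough illustration of what "linking" two form posts could mean, the sketch below reads hidden input values out of the first response and merges them into the payload for the second post. This is not how WebCopy works today and not a description of Google's real login pages; the HTML, the session_token field name and the build_second_post helper are all made up for the example.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

def build_second_post(first_response_html, extra_fields):
    """Merge hidden fields from the first response with user-supplied fields.

    Hypothetical helper for illustration only.
    """
    parser = HiddenInputParser()
    parser.feed(first_response_html)
    payload = dict(parser.fields)
    payload.update(extra_fields)  # user-supplied values win on a name clash
    return payload

# Made-up response to a first (username) post; the token name is hypothetical.
step_one_response = """
<form action="/password" method="post">
  <input type="hidden" name="session_token" value="abc123">
  <input type="password" name="password">
</form>
"""

payload = build_second_post(step_one_response, {"password": "secret"})
print(payload)
```

Only hidden inputs are carried forward here; visible fields such as the password still have to come from the user's own configuration.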

However, it meant tinkering with WebCopy's code to relax some restrictions, which I need to think about carefully before such changes even go into a nightly. That's not to mention about half a dozen other issues I found that need addressing. Wheesh!

So, there's light at the end of the tunnel, but there's a substantial amount of work to be done, and I suspect more bugs to come as a result of these changes and some consolidation work I'm going to have to do.