Anti-scraping mechanisms: fetching Amazon's server-side request cookie to configure crawler parameters automatically

Published 2016-10-19


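Background:

Amazon's anti-scraping mechanism is quite strict; for details see http://www.zhihu.com/question/27768393

When a request carries only an IP (no cookie), the mechanism blocks by IP. When a cookie is present, it blocks by IP + cookie. In other words, for a given IP, if one cookie gets blocked you can switch to another cookie. Preliminary testing suggests a cookie stays valid for roughly 5 to 7 days.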
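The rotation idea above (same IP, swap the cookie once it is blocked or too old) can be sketched as a small per-IP pool. This is only an illustration, not code from the crawler itself: the class name `CookiePool`, its methods, and the idea of passing the maximum age in as a parameter are all assumptions, and the 5-day limit in the usage note is just the lower end of the 5 to 7 day range observed above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Hypothetical per-IP cookie pool: moves on to the next cookie when one is blocked or too old. */
public class CookiePool {
    /** A Cookie header value together with the time it was obtained. */
    private static final class Entry {
        final String cookieHeader;
        final Instant obtainedAt;

        Entry(String cookieHeader, Instant obtainedAt) {
            this.cookieHeader = cookieHeader;
            this.obtainedAt = obtainedAt;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final Duration maxAge;

    public CookiePool(Duration maxAge) {
        this.maxAge = maxAge;
    }

    /** Appends a cookie that was obtained at the given time. */
    public void add(String cookieHeader, Instant obtainedAt) {
        entries.addLast(new Entry(cookieHeader, obtainedAt));
    }

    /** Returns the cookie at the front of the queue, first dropping front entries older than maxAge. */
    public Optional<String> current(Instant now) {
        while (!entries.isEmpty()
                && Duration.between(entries.peekFirst().obtainedAt, now).compareTo(maxAge) > 0) {
            entries.removeFirst();
        }
        return entries.isEmpty()
                ? Optional.empty()
                : Optional.of(entries.peekFirst().cookieHeader);
    }

    /** Discards the front cookie, e.g. after the server has started blocking it. */
    public void markBlocked() {
        entries.pollFirst();
    }
}
```

A crawler node would create one pool per outgoing IP, for example `new CookiePool(Duration.ofDays(5))`, call `current(Instant.now())` before each request, and call `markBlocked()` when a response looks like a block page. How a block is detected is left out here on purpose.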

 

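Task description:

There are two kinds of cookie:

1. The request cookie, i.e. the cookie sent with the request (the key anti-scraping parameters live here).

2. The response cookie, i.e. the cookie returned with the response (available whenever the server responds).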
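To make the distinction concrete: at the HTTP level a response cookie arrives as a `Set-Cookie` header, while a request cookie is sent back as `name=value` pairs in a single `Cookie` header. The sketch below converts one form into the other with the JDK's `java.net.HttpCookie`. The class name `CookieHeaders` and the cookie names in the usage note are made up for illustration, and this does not by itself recover cookies that the page sets by other means.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

public class CookieHeaders {
    /** Turns Set-Cookie response header values into one Cookie request header value. */
    public static String toRequestHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // HttpCookie.parse accepts the value with or without the "Set-Cookie:" prefix.
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                // Attributes such as Path or Domain are not sent back, only name=value.
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        return String.join("; ", pairs);
    }
}
```

For example, passing the two values `sid=abc123; Domain=.example.com; Path=/` and `Set-Cookie: token=xyz; Path=/` yields a string that can be set as the `Cookie` header of the next request.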

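Preliminary testing suggests cookies are valid for 5 to 7 days. At present the request cookie can only be obtained manually.

To get past the anti-scraping mechanism, Amazon's request cookie needs to be cached on the crawler nodes. The request cookie has to be read from the document that the server returns.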


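The candidate approaches are roughly as follows:

1. Request the page directly over an HTTP URL connection and read the cookie (tested: only the response cookie is returned; not viable).

2. Load the page in an iframe and read the cookie from the child document (tested: Amazon forbids embedding in an iframe served from a different origin: Refused to display 'https://www.amazon.com/' in a frame because it set 'X-Frame-Options' to 'SAMEORIGIN'. Not viable).

3. Load the page into a div with jQuery load (tested: Amazon forbids cross-origin requests: XMLHttpRequest cannot load https://www.amazon.com/. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:808' is therefore not allowed access. Not viable).

4. Use HtmlUnit on the response page: run JavaScript and return document.cookie (no value comes back, and the script may not even run; HtmlUnit's JavaScript support is too weak).

5. Use HtmlUnit's cookie manager; it is not certain that what it returns is the request cookie (viable; code in the appendix below).

6. Drive Chrome through the Java Selenium framework and run JavaScript to read the cookie (not yet tested).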
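For reference, approach 1 can be sketched with the JDK alone. A plain `HttpURLConnection` runs no page JavaScript, so all it can see is what the server sends in `Set-Cookie` headers, which is consistent with the result reported for that approach. The class name `ResponseCookieProbe`, its method names, and the shortened User-Agent are illustrative assumptions, not the code that was actually tested.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.HttpCookie;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class ResponseCookieProbe {
    /** Feeds response headers into a fresh CookieManager and returns the cookies it accepted. */
    public static List<HttpCookie> cookiesFrom(URI uri, Map<String, List<String>> responseHeaders) {
        CookieManager manager = new CookieManager();
        try {
            // Only Set-Cookie / Set-Cookie2 entries are used; other headers are ignored.
            manager.put(uri, responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return manager.getCookieStore().getCookies();
    }

    /** Issues a plain GET and returns the cookies set by the response headers. */
    public static List<HttpCookie> fetch(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        try {
            connection.setRequestProperty("User-Agent", "Mozilla/5.0");
            connection.getResponseCode(); // forces the request to be sent
            return cookiesFrom(URI.create(url), connection.getHeaderFields());
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling `ResponseCookieProbe.fetch("http://www.amazon.com/")` and printing the result shows which cookies a script-free client receives; comparing that list with what a real browser sends is one way to see what is missing.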

 


 

Appendix code:

package com.focusorder.mirror.amazon;

import java.util.Set;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class AmazonCookieTest {
	public static void main(String[] args) {
		// try-with-resources closes the client and its windows when done
		try (WebClient webClient = new WebClient()) {
			webClient.getCookieManager().setCookiesEnabled(true);
			webClient.getOptions().setJavaScriptEnabled(true);
			webClient.getOptions().setCssEnabled(true);
			webClient.getOptions().setRedirectEnabled(true);
			// Keep going when page scripts fail or the server returns an error status
			webClient.getOptions().setThrowExceptionOnScriptError(false);
			webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
			webClient.getOptions().setTimeout(60000);

			// 1. Open amazon.com with a clean cookie store and browser-like headers
			webClient.getCookieManager().clearCookies();
			webClient.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");
			webClient.addRequestHeader("Accept", "application/json, text/javascript, */*; q=0.01");
			webClient.addRequestHeader("Accept-Encoding", "gzip, deflate");
			webClient.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.8");
			HtmlPage page = webClient.getPage("http://www.amazon.com/");

			// 2. Give background JavaScript time to finish
			webClient.waitForBackgroundJavaScript(10000);
			System.err.println(page.getReadyState());
			Thread.sleep(10000);

			// 3. Print every cookie now held by the cookie manager
			Set<Cookie> cookies = webClient.getCookieManager().getCookies();
			for (Cookie cookie : cookies) {
				System.err.println(cookie);
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}