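Background:
Amazon's anti-crawler mechanism is quite strict; for details see http://www.zhihu.com/question/27768393
When a request carries only an IP (no cookie), the mechanism blocks by IP. When a cookie is present, it blocks by the IP + cookie pair. In other words, for a given IP, once one cookie has been blocked you can switch to another cookie. Preliminary testing suggests that a cookie stays valid for roughly 5-7 days.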
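The rotation idea above (one IP, several cookies, each valid for a few days) can be sketched as a small in-memory pool. This is only an illustration: the class name `CookiePool`, its methods, and the 5-day lifetime (the lower bound of the observed 5-7 days) are assumptions, not something taken from Amazon or from the crawler itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: keeps several cookie strings per IP and rotates on block.
final class CookiePool {
    // Assumed lifetime: the lower bound of the 5-7 days observed in testing.
    static final Duration MAX_AGE = Duration.ofDays(5);

    // One stored cookie header plus the time it was captured.
    private static final class Entry {
        final String cookieHeader;
        final Instant capturedAt;

        Entry(String cookieHeader, Instant capturedAt) {
            this.cookieHeader = cookieHeader;
            this.capturedAt = capturedAt;
        }
    }

    private final Map<String, Deque<Entry>> byIp = new HashMap<>();

    // Registers a cookie that was captured for the given IP.
    void add(String ip, String cookieHeader, Instant capturedAt) {
        byIp.computeIfAbsent(ip, k -> new ArrayDeque<>())
            .addLast(new Entry(cookieHeader, capturedAt));
    }

    // Returns the cookie currently in use for this IP, dropping expired ones first.
    Optional<String> current(String ip, Instant now) {
        Deque<Entry> queue = byIp.get(ip);
        if (queue == null) {
            return Optional.empty();
        }
        queue.removeIf(e -> e.capturedAt.plus(MAX_AGE).isBefore(now));
        return queue.isEmpty()
            ? Optional.empty()
            : Optional.of(queue.peekFirst().cookieHeader);
    }

    // Call when the current IP + cookie pair gets blocked: the next cookie takes over.
    void markBlocked(String ip) {
        Deque<Entry> queue = byIp.get(ip);
        if (queue != null && !queue.isEmpty()) {
            queue.removeFirst();
        }
    }
}
```

A crawler node would call `current(ip, Instant.now())` before each request and `markBlocked(ip)` when it sees a block page. An empty result means a fresh cookie has to be captured.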
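Task description:
Cookies come in two kinds:
1. Request cookies, i.e. the cookies sent with a request (the parameters that matter for the anti-crawler check live here).
2. Response cookies, i.e. the cookies returned with a response (available whenever the server responds).
Preliminary testing suggests cookies stay valid for 5-7 days. At present the request cookie can only be obtained manually.
The request cookie for the Amazon servers needs to be cached on the crawler nodes in order to bypass the anti-crawler mechanism. It has to be read from the document that is produced after the server has responded.
The approaches considered are roughly as follows: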
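1. Request the page directly over an HTTP URL connection and read the cookies (tested: only response cookies can be obtained; not feasible).
2. Load the page in an iframe and read the cookie through the child document (tested: Amazon forbids embedding in an iframe from another origin: Refused to display 'https://www.amazon.com/' in a frame because it set 'X-Frame-Options' to 'SAMEORIGIN'. Not feasible).
3. Load the page into a div with jQuery load (tested: Amazon disallows cross-origin requests: XMLHttpRequest cannot load https://www.amazon.com/. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:808' is therefore not allowed access. Not feasible).
4. Use HtmlUnit to operate on the response page, run JavaScript, and return document.cookie (no value comes back, and running the JavaScript appears to have no effect; HtmlUnit's JavaScript support is too weak).
5. Use HtmlUnit's cookie manager; it is not yet certain whether what it returns is the request cookie (works, code below).
6. Drive Chrome through the Java Selenium framework and run JavaScript to read the cookie (not yet tested).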
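To make approach 1 concrete, here is a minimal sketch of what a plain HTTP connection exposes: only the `Set-Cookie` response headers. It uses the JDK's `java.net.HttpCookie` parser. The class name `ResponseCookies` and the sample header values in the usage note are made up for illustration.

```java
import java.net.HttpCookie;
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: collects the response cookies of a plain HTTP request.
final class ResponseCookies {
    // Parses raw "Set-Cookie" header values into HttpCookie objects.
    static List<HttpCookie> parse(List<String> setCookieHeaders) {
        List<HttpCookie> cookies = new ArrayList<>();
        for (String header : setCookieHeaders) {
            cookies.addAll(HttpCookie.parse(header));
        }
        return cookies;
    }

    // Reads every Set-Cookie header from a connection that has already been opened.
    static List<HttpCookie> fromConnection(HttpURLConnection conn) {
        List<String> headers = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            // The status line is stored under a null key, so compare from the literal.
            if ("Set-Cookie".equalsIgnoreCase(e.getKey())) {
                headers.addAll(e.getValue());
            }
        }
        return parse(headers);
    }
}
```

For example, `ResponseCookies.parse(Arrays.asList("session-id=123-456; Path=/"))` yields one cookie named `session-id`. Cookies that the page sets later from JavaScript never appear in these headers, which is consistent with approach 1 only returning response cookies.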
Appendix code:
package com.focusorder.mirror.amazon;

import java.util.Set;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.Cookie;

public class test {
    public static void main(String[] args) {
        // Configure a browser-like client: cookies, JavaScript, CSS and redirects enabled.
        WebClient webClient = new WebClient();
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        // Do not abort on script errors or non-2xx status codes.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(60000);

        /** 1. Open amazon.com */
        try {
            // Start from an empty cookie store so only cookies from this visit are printed.
            webClient.getCookieManager().clearCookies();
            webClient.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");
            webClient.addRequestHeader("Accept", "application/json, text/javascript, */*; q=0.01");
            webClient.addRequestHeader("Accept-Encoding", "gzip, deflate");
            webClient.addRequestHeader("Accept-Language", "zh-CN,zh;q=0.8");

            HtmlPage page = webClient.getPage("http://www.amazon.com/");
            // Give background JavaScript time to run, since it may set further cookies.
            webClient.waitForBackgroundJavaScript(10000);
            System.err.println(page.getReadyState());
            Thread.sleep(10000);

            // Earlier attempts (approach 4), kept for reference:
            // ScriptResult result = page.executeJavaScript("javascript:function(){document.body.innerHTML='';}");
            // System.err.println(page.getHead());
            // System.err.println(page.asText());
            // System.err.println(result.getNewPage().getWebResponse().getWebRequest().getAdditionalHeaders());
            // System.err.println(result.getNewPage().getWebResponse().getContentAsString());
            // Set set = webClient.getCookies(newpage.getBaseURL());
            // for (Iterator iterator = set.iterator(); iterator.hasNext();) {
            //     Object object = (Object) iterator.next();
            //     System.err.println(object);
            // }

            // Approach 5: print every cookie held by the cookie manager.
            Set<Cookie> cookies = webClient.getCookieManager().getCookies();
            for (Cookie cookie : cookies) {
                System.err.println(cookie);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
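Once the cookies have been printed, a crawler node still has to send them back as a single `Cookie` request header. A minimal sketch of that step, assuming the name/value pairs have already been copied into a map (with HtmlUnit they could come from `Cookie.getName()` and `Cookie.getValue()`). The class name `CookieHeader` and the sample cookie names in the usage note are illustrative only.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns stored name/value pairs into a Cookie request header value.
final class CookieHeader {
    // Joins the pairs as "name1=value1; name2=value2", in the map's iteration order.
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }
}
```

With a `LinkedHashMap` holding `session-id=123` and `ubid-main=456`, `build` returns `session-id=123; ubid-main=456`, which a node could set through `URLConnection.setRequestProperty("Cookie", ...)`.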