专栏名称: 狗厂

为爬虫框架构建Selenium模块、DSL模块(Kotlin实现)

狗厂 · 掘金 · · 2018-06-13 02:16

正文

冲浪.jpg

NetDiscover 是一款基于Vert.x、RxJava2实现的爬虫框架。我最近添加了两个模块：Selenium模块、DSL模块。

一. Selenium模块

添加这个模块的目的是为了让它能够模拟人的行为去操作浏览器，完成爬虫抓取的目的。

Selenium是一个用于Web应用程序测试的工具。Selenium测试直接运行在浏览器中，就像真正的用户在操作一样。支持的浏览器包括IE（7, 8, 9, 10, 11），Mozilla Firefox，Safari，Google Chrome，Opera等。这个工具的主要功能包括：测试与浏览器的兼容性——测试你的应用程序看是否能够很好得工作在不同浏览器和操作系统之上。测试系统功能——创建回归测试检验软件功能和用户需求。支持自动录制动作和自动生成 .Net、Java、Perl等不同语言的测试脚本。

Selenium包括了一组工具和API：Selenium IDE，Selenium RC，Selenium WebDriver，和Selenium Grid。

其中，Selenium WebDriver 是一个支持浏览器自动化的工具。它包括一组为不同语言提供的类库和“驱动”（drivers）可以使浏览器上的动作自动化。

1.1 适配多个浏览器

正是得益于Selenium WebDriver ，Selenium模块可以适配多款浏览器。目前在该模块中支持Chrome、Firefox、IE以及PhantomJS（PhantomJS是一个无界面的,可脚本编程的WebKit浏览器引擎）。

package com.cv4j.netdiscovery.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

/**
 * Created by tony on 2018/1/28.
 */
public enum Browser implements WebDriverInitializer {

    CHROME {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.chrome.driver", path);
            return new ChromeDriver();
        }
    },
    FIREFOX {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.gecko.driver", path);
            return new FirefoxDriver();
        }
    },
    IE {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.ie.driver", path);
            return new InternetExplorerDriver();
        }
    },
    PHANTOMJS {
        @Override
        public WebDriver init(String path) {

            DesiredCapabilities capabilities = new DesiredCapabilities();
            capabilities.setCapability("phantomjs.binary.path", path);
            capabilities.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
            capabilities.setJavascriptEnabled(true);
            capabilities.setCapability("takesScreenshot", true);
            capabilities.setCapability("cssSelectorsEnabled", true);

            return new PhantomJSDriver(capabilities);
        }
    }
}

1.2 WebDriverPool

之所以使用WebDriverPool，是因为每次打开一个WebDriver进程都比较耗费资源，所以创建一个对象池。我使用Apache的Commons Pool组件来实现对象池化。

package com.cv4j.netdiscovery.selenium.pool;

import org.apache.commons.pool2.impl.GenericObjectPool;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/9.
 */
public class WebDriverPool {

    private static GenericObjectPool<WebDriver> webDriverPool = null;

    /**
     * 如果需要使用WebDriverPool，则必须先调用这个init()方法
     *
     * @param config
     */
    public static void init(WebDriverPoolConfig config) {

        webDriverPool = new GenericObjectPool<>(new WebDriverPooledFactory(config));
        webDriverPool.setMaxTotal(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.total", "20"))); // 最多能放多少个对象
        webDriverPool.setMinIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.min.idle", "1")));   // 最少有几个闲置对象
        webDriverPool.setMaxIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.idle", "20"))); // 最多允许多少个闲置对象

        try {
            webDriverPool.preparePool();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static WebDriver borrowOne() {

        if (webDriverPool!=null) {

            try {
                return webDriverPool.borrowObject();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        return null;
    }

    public static void returnOne(WebDriver driver) {

        if (webDriverPool!=null) {

            webDriverPool.returnObject(driver);
        }
    }

    public static void destory() {

        if (webDriverPool!=null) {

            webDriverPool.clear();
            webDriverPool.close();
        }
    }

    public static boolean hasWebDriverPool() {

        return webDriverPool!=null;
    }
}

1.3 SeleniumAction

Selenium 可以模拟浏览器的行为，例如点击、滑动、返回等等。这里抽象出一个SeleniumAction类，用于表示模拟的事件。

package com.cv4j.netdiscovery.selenium.action;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/3.
 */
public abstract class SeleniumAction {

    public abstract SeleniumAction perform(WebDriver driver);

    public SeleniumAction doIt(WebDriver driver) {

        return perform(driver);
    }

    public static SeleniumAction clickOn(By by) {
        return new ClickOn(by);
    }

    public static SeleniumAction getUrl(String url) {
        return new GetURL(url);
    }

    public static SeleniumAction goBack() {
        return new GoBack();
    }

    public static SeleniumAction closeTabs() {
        return new CloseTab();
    }
}

1.4 SeleniumDownloader

Downloader是爬虫框架的下载器组件，例如可以使用vert.x的webclient、okhttp3等实现网络请求的功能。如果需要使用Selenium，必须要使用SeleniumDownloader来完成网络请求。

SeleniumDownloader类可以添加一个或者多个SeleniumAction。如果是多个SeleniumAction会按照顺序执行。

尤为重要的是，SeleniumDownloader类中webDriver是从WebDriverPool中获取，每次使用完了会将webDriver返回到连接池。

package com.cv4j.netdiscovery.selenium.downloader;

import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.Request;
import com.cv4j.netdiscovery.core.domain.Response;
import com.cv4j.netdiscovery.core.downloader.Downloader;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool;
import com.safframework.tony.common.utils.Preconditions;
import io.reactivex.Maybe;
import io.reactivex.MaybeEmitter;
import io.reactivex.MaybeOnSubscribe;
import io.reactivex.functions.Function;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

import java.util.LinkedList;
import java.util.List;

/**
 * Created by tony on 2018/1/28.
 */
public class SeleniumDownloader implements Downloader {

    private WebDriver webDriver;
    private List<SeleniumAction> actions = new LinkedList<>();

    public SeleniumDownloader() {

        this.webDriver = WebDriverPool.borrowOne(); // 从连接池中获取webDriver
    }

    public SeleniumDownloader(SeleniumAction action) {

        this.webDriver = WebDriverPool.borrowOne(); // 从连接池中获取webDriver
        this.actions.add(action);
    }

    public SeleniumDownloader(List<SeleniumAction> actions) {

        this.webDriver = WebDriverPool.borrowOne(); // 从连接池中获取webDriver
        this.actions.addAll(actions);
    }

    @Override
    public Maybe<Response> download(Request request) {

        return Maybe.create(new MaybeOnSubscribe<String>(){

            @Override
            public void subscribe(MaybeEmitter emitter) throws Exception {

                if (webDriver!=null) {
                    webDriver.get(request.getUrl());

                    if (Preconditions.isNotBlank(actions)) {

                        actions.forEach(
                                action-> action.perform(webDriver)
                        );
                    }

                    emitter.onSuccess(webDriver.getPageSource());
                }
            }
        }).map(new Function<String, Response>() {

            @Override
            public Response apply(String html) throws Exception {

                Response response = new Response();
                response.setContent(html.getBytes());
                response.setStatusCode(Constant.OK_STATUS_CODE);
                response.setContentType(getContentType(webDriver));
                return response;
            }
        });
    }

    /**
     * @param webDriver
     * @return
     */
    private String getContentType(final WebDriver webDriver) {

        if (webDriver instanceof JavascriptExecutor) {

            final JavascriptExecutor jsExecutor = (JavascriptExecutor) webDriver;
            // TODO document.contentType does not exist.
            final Object ret = jsExecutor
                    .executeScript("return document.contentType;");
            if (ret != null) {
                return ret.toString();
            }
        }
        return "text/html";
    }


    @Override
    public void close() {

        if (webDriver!=null) {
            WebDriverPool.returnOne(webDriver); // 将webDriver返回到连接池
        }
    }
}

1.5 一些有用的工具类

此外，Selenium模块还有一个工具类。它包含了一些scrollTo、scrollBy、clickElement等浏览器的操作。

还有一些有特色的功能是对当前网页进行截幕，或者是截取某个区域。

    public static void taskScreenShot(WebDriver driver,String pathName){

        //指定了OutputType.FILE做为参数传递给getScreenshotAs()方法，其含义是将截取的屏幕以文件形式返回。
        File srcFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
        //利用IOUtils工具类的copyFile()方法保存getScreenshotAs()返回的文件对象。

        try {
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void taskScreenShot(WebDriver driver,WebElement element,String pathName) {

        //指定了OutputType.FILE做为参数传递给getScreenshotAs()方法，其含义是将截取的屏幕以文件形式返回。
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        //利用IOUtils工具类的copyFile()方法保存getScreenshotAs()返回的文件对象。

        try {
            //获取元素在所处frame中位置对象
            Point p = element.getLocation();
            //获取元素的宽与高
            int width = element.getSize().getWidth();
            int height = element.getSize().getHeight();
            //矩形图像对象
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(p.getX(), p.getY(), rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * 截取某个区域的截图
     * @param driver
     * @param x
     * @param y
     * @param width
     * @param height
     * @param pathName
     */
    public static void taskScreenShot(WebDriver driver,int x,int y,int width,int height,String pathName) {

        //指定了OutputType.FILE做为参数传递给getScreenshotAs()方法，其含义是将截取的屏幕以文件形式返回。
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        //利用IOUtils工具类的copyFile()方法保存getScreenshotAs()返回的文件对象。

        try {
            //矩形图像对象
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(x, y, rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

1.6 使用Selenium模块的实例

在京东上搜索我的新书《RxJava 2.x 实战》，并按照销量进行排序，然后获取前十个商品的信息。