

Using the Spring Boot + WebMagic + MyBatis crawler framework

Date: 2023-02-18 18:25:39
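Contents
1. Add the Maven dependencies
2. Project configuration file application.properties
3. Database table structure
4. Entity class
5. Mapper interface
6. CrawlerMapper.xml file
7. Zhihu page processing class ZhihuPageProcessor
8. Zhihu data processing class ZhihuPipeline
9. Zhihu crawler task class ZhihuTask
10. Spring Boot application class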

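WebMagic is an open-source Java crawler framework. How to use WebMagic itself is not the focus of this article; for details, see the official documentation: http://webmagic.io/docs/.

This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database. The source code provided here can serve as scaffolding for a Java crawler project.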


1. Add the Maven dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>

2. Project configuration file application.properties

Configure the MySQL data source, the Druid database connection pool, and the location of the MyBatis mapper files.

# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis configuration
mybatis.mapperLocations=classpath:mapper/**/*.xml

3. Database table structure

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'Content ID',
  `title` varchar(150) NOT NULL COMMENT 'Title',
  `content` longtext COMMENT 'Article content',
  `releaseDate` datetime NOT NULL COMMENT 'Release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

4. Entity class

import java.util.Date;

// Persistent object mapped to one row of the cms_content table.
public class CmsContentPO {
    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}

5. Mapper interface

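// Package name taken from the parameterType used in CrawlerMapper.xml.
import com.hyzx.qbasic.model.CmsContentPO;

public interface CrawlerMapper {
    // Inserts one row into cms_content and returns the number of affected rows.
    int addCmsContent(CmsContentPO record);
}

6. CrawlerMapper.xml file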

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <!-- The id must match the mapper interface method name. -->
    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentId, title, releaseDate, content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>

7. Zhihu page processing class ZhihuPageProcessor

This class parses the Zhihu HTML pages that have been crawled.
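The link pattern used to discover answer pages can be checked on its own before running a crawl. The sketch below is a minimal, standard-library-only illustration. The class name AnswerLinkPattern and the sample URLs are hypothetical, and the escaped form of the pattern (with `\d+` for the numeric IDs) is an assumption about what the listing intends. It tests whole-string matches with java.util.regex and does not claim to reproduce how WebMagic applies the pattern internally.

```java
import java.util.regex.Pattern;

/** Standalone check of the answer-page link pattern (hypothetical helper, not part of WebMagic). */
final class AnswerLinkPattern {

    // Assumed intended pattern: literal dots escaped, \d+ for the question and answer IDs.
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*");

    /** Returns true when the whole URL matches the answer-page pattern. */
    static boolean isAnswerUrl(String url) {
        return url != null && ANSWER_URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Sample URLs are made up for illustration.
        System.out.println(isAnswerUrl("https://www.zhihu.com/question/123/answer/456"));
        System.out.println(isAnswerUrl("https://www.zhihu.com/explore"));
    }
}
```

Note that an unescaped `d+` would match the letter "d" rather than digits, so a pattern written that way would not match real answer URLs.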

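import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class ZhihuPageProcessor implements PageProcessor {

    // Retry failed downloads up to 3 times and wait 1 second between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that looks like an answer page.
        page.addTargetRequests(page.getHtml().links()
                .regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        page.putField("title",
                page.getHtml().xpath("//h1[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer",
                page.getHtml().xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // A list page has no title: skip it so the pipeline does no further processing.
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}

8. Zhihu data processing class ZhihuPipeline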

This class stores the data parsed from the Zhihu HTML pages in the MySQL database.
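Since the cms_content table declares title as NOT NULL varchar(150) and contentId as varchar(40), it may help to validate the extracted fields before handing them to the mapper. The following is a minimal sketch under those assumptions. ContentRecords, ContentRecord, and fromFields are hypothetical names that do not appear in the original project, and truncating over-long titles is just one possible policy.

```java
import java.util.Date;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical helper that prepares crawled fields for the cms_content table. */
final class ContentRecords {

    // Matches the varchar(150) limit on the title column.
    static final int TITLE_MAX_LENGTH = 150;

    /** Immutable stand-in for the article's CmsContentPO. */
    static final class ContentRecord {
        final String contentId;
        final String title;
        final String content;
        final Date releaseDate;

        ContentRecord(String contentId, String title, String content, Date releaseDate) {
            this.contentId = contentId;
            this.title = title;
            this.content = content;
            this.releaseDate = releaseDate;
        }
    }

    /** Returns an empty Optional when there is no usable title, since the column is NOT NULL. */
    static Optional<ContentRecord> fromFields(String title, String answer) {
        if (title == null || title.trim().isEmpty()) {
            return Optional.empty();
        }
        String trimmed = title.trim();
        // Truncate rather than let the insert fail on an over-long title (an assumed policy).
        String safeTitle = trimmed.length() > TITLE_MAX_LENGTH
                ? trimmed.substring(0, TITLE_MAX_LENGTH)
                : trimmed;
        // A random UUID string is 36 characters long, which fits varchar(40).
        String id = UUID.randomUUID().toString();
        return Optional.of(new ContentRecord(id, safeTitle, answer, new Date()));
    }
}
```

A pipeline could call fromFields first and only invoke the mapper when a record is present, which keeps database constraint failures out of the crawl loop.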

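import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Date;
import java.util.UUID;

@Component
public class ZhihuPipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuPipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");

        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);

        try {
            // Report success only when the insert actually affected a row.
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Saved Zhihu article: {}", title);
            } else {
                LOGGER.warn("Zhihu article was not saved: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save Zhihu article", ex);
        }
    }
}

9. Zhihu crawler task class ZhihuTask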

This class starts the crawler once every ten minutes.
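The periodic start relies on the JDK's ScheduledExecutorService. As a small standalone illustration of how scheduleWithFixedDelay behaves, the sketch below runs a stand-in job a few times with a millisecond delay. FixedDelayDemo and runJob are hypothetical names, and the counter stands in for the real crawl. With a fixed delay, the wait is measured from the end of one run to the start of the next. According to the JDK documentation, an exception that escapes the task suppresses all later runs, which is why wrapping the task body in try/catch matters.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of fixed-delay scheduling; the counter stands in for a crawl. */
final class FixedDelayDemo {

    /** Runs a stand-in job `runs` times and returns the number of completed runs, or -1 on timeout. */
    static int runJob(int runs, long delayMillis) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger counter = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        try {
            timer.scheduleWithFixedDelay(() -> {
                try {
                    // Only count while runs are still outstanding.
                    if (done.getCount() > 0) {
                        counter.incrementAndGet();
                        done.countDown();
                    }
                } catch (Exception ex) {
                    // Swallow and report: an escaping exception would cancel all future runs.
                    System.err.println("job failed: " + ex);
                }
            }, 0, delayMillis, TimeUnit.MILLISECONDS);

            // Wait until the job has run the requested number of times.
            boolean finished = done.await(5, TimeUnit.SECONDS);
            return finished ? counter.get() : -1;
        } finally {
            // Stop the scheduler so the JVM can exit.
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed runs: " + runJob(3, 10));
    }
}
```

One design point to keep in mind: when the scheduled task only triggers an asynchronous start and returns immediately, the fixed delay does not wait for the crawl itself to finish, so a crawl that lasts longer than the interval could overlap with the next one.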

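import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class ZhihuTask {

    // Use this class for the logger (the original listing passed ZhihuPipeline.class).
    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuTask.class);

    @Autowired
    private ZhihuPipeline zhihuPipeline;

    @Autowired
    private ZhihuPageProcessor zhihuPageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes.
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("zhihuCrawlerThread");

            try {
                Spider.create(zhihuPageProcessor)
                        // Start crawling from https://www.zhihu.com/explore
                        .addUrl("https://www.zhihu.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(zhihuPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled Zhihu crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}

10. Spring Boot application class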

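import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private ZhihuTask zhihuTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Start crawling Zhihu once the application context is ready.
        zhihuTask.crawl();
    }
}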
