一区二区三区在线-一区二区三区亚洲视频-一区二区三区亚洲-一区二区三区午夜-一区二区三区四区在线视频-一区二区三区四区在线免费观看

服務器之家:專注于服務器技術及軟件下載分享
分類導航

PHP教程|ASP.NET教程|Java教程|ASP教程|編程技術|正則表達式|C/C++|IOS|C#|Swift|Android|VB|R語言|JavaScript|易語言|vb.net|

服務器之家 - 編程語言 - Java教程 - SpringBoot+WebMagic+MyBaties實現爬蟲和數據入庫的示例

SpringBoot+WebMagic+MyBaties實現爬蟲和數據入庫的示例

2022-02-21 13:23非空子集 Java教程

WebMagic是一個開源爬蟲框架,本項目通過在SpringBoot項目中使用WebMagic去抓取數據,最后使用MyBatis將數據入庫。具有一定的參考價值,感興趣的小伙伴們可以參考一下

WebMagic是一個開源爬蟲框架,本項目通過在SpringBoot項目中使用WebMagic去抓取數據,最后使用MyBatis將數據入庫

本項目代碼地址:ArticleCrawler: SrpingBoot+WebMagic+MyBaties實現爬蟲和數據入庫 (gitee.com)

創建數據庫:

本示例中庫名為article,表名為cms_content,表中包含contentId、title、date三個字段。

?
1
2
3
4
5
6
CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT '內容ID',
  `title` varchar(150) NOT NULL COMMENT '標題',
  `date` varchar(150) NOT NULL COMMENT '發布日期',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS內容表';

新建SpringBoot項目:

1、配置依賴pom.xml

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.5.5</version>
        <relativePath/>
    </parent>
    <groupId>com.example</groupId>
    <artifactId>Article</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Article</name>
    <description>Article</description>
    <properties>
        <java.version>1.8</java.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
 
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.5</webmagic.core.version>
    </properties>
 
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
 
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
 
 
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
 
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
 
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
 
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
 
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
 
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
 
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
 
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
 
    </dependencies>
 
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
 
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
 
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
 
    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>
 
    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
 
</project>

2、創建CmsContentPO.java

數據實體,和表中3個字段對應。

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
package site.exciter.article.model;
 
public class CmsContentPO {
    private String contentId;
 
    private String title;
 
    private String date;
 
    public String getContentId() {
        return contentId;
    }
 
    public void setContentId(String contentId) {
        this.contentId = contentId;
    }
 
    public String getTitle() {
        return title;
    }
 
    public void setTitle(String title) {
        this.title = title;
    }
 
    public String getDate() {
        return date;
    }
 
    public void setDate(String date) {
        this.date = date;
    }
}

3、創建CrawlerMapper.java

?
1
2
3
4
5
6
7
8
9
package site.exciter.article.dao;
 
import org.apache.ibatis.annotations.Mapper;
import site.exciter.article.model.CmsContentPO;
 
@Mapper
public interface CrawlerMapper {
    int addCmsContent(CmsContentPO record);
}

4、配置映射文件CrawlerMapper.xml

在resources下新建mapper文件夾,在mapper下創建CrawlerMapper.xml

?
1
2
3
4
5
6
7
8
9
10
11
12
13
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="site.exciter.article.dao.CrawlerMapper">
 
    <insert id="addCmsContent" parameterType="site.exciter.article.model.CmsContentPO">
        insert into cms_content (contentId,
        title,
        date)
        values (#{contentId,jdbcType=VARCHAR},
        #{title,jdbcType=VARCHAR},
        #{date,jdbcType=VARCHAR})
    </insert>
</mapper>

5、配置application.properties

配置數據庫和mybatis映射關系。

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# mysql
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://10.201.61.184:3306/article?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root
 
# druid
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000
 
# mybatis
mybatis.mapperLocations=classpath:mapper/CrawlerMapper.xml

6、創建ArticlePageProcessor.java

解析html的邏輯。

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
package site.exciter.article;
 
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
 
@Component
public class ArticlePageProcessor implements PageProcessor {
 
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
 
    @Override
    public void process(Page page) {
        String detail_urls_Xpath = "//*[@class='postTitle']/a[@class='postTitle2']/@href";
        String next_page_xpath = "//*[@id='nav_next_page']/a/@href";
        String next_page_css = "#homepage_top_pager > div:nth-child(1) > a:nth-child(7)";
        String title_xpath = "//h1[@class='postTitle']/a/span/text()";
        String date_xpath = "//span[@id='post-date']/text()";
        page.putField("title", page.getHtml().xpath(title_xpath).toString());
        if (page.getResultItems().get("title") == null) {
            page.setSkip(true);
        }
        page.putField("date", page.getHtml().xpath(date_xpath).toString());
 
        if (page.getHtml().xpath(detail_urls_Xpath).match()) {
            Selectable detailUrls = page.getHtml().xpath(detail_urls_Xpath);
            page.addTargetRequests(detailUrls.all());
        }
 
        if (page.getHtml().xpath(next_page_xpath).match()) {
            Selectable nextPageUrl = page.getHtml().xpath(next_page_xpath);
            page.addTargetRequests(nextPageUrl.all());
 
        } else if (page.getHtml().css(next_page_css).match()) {
            Selectable nextPageUrl = page.getHtml().css(next_page_css).links();
            page.addTargetRequests(nextPageUrl.all());
        }
    }
 
    @Override
    public Site getSite() {
        return site;
    }
}

7、創建ArticlePipeline.java

處理數據的持久化。

  1. package site.exciter.article; 
  2.  
  3. import org.slf4j.Logger; 
  4. import org.slf4j.LoggerFactory; 
  5. import org.springframework.beans.factory.annotation.Autowired; 
  6. import org.springframework.stereotype.Component; 
  7. import site.exciter.article.model.CmsContentPO; 
  8. import site.exciter.article.dao.CrawlerMapper; 
  9. import us.codecraft.webmagic.ResultItems; 
  10. import us.codecraft.webmagic.Task; 
  11. import us.codecraft.webmagic.pipeline.Pipeline; 
  12.  
  13. import java.util.UUID; 
  14.  
  15. @Component 
  16. public class ArticlePipeline implements Pipeline { 
  17.  
  18.     private static final Logger LOGGER = LoggerFactory.getLogger(ArticlePipeline.class); 
  19.  
  20.     @Autowired 
  21.     private CrawlerMapper crawlerMapper; 
  22.  
  23.     public void process(ResultItems resultItems, Task task) { 
  24.         String title = resultItems.get("title"); 
  25.         String date = resultItems.get("date"); 
  26.  
  27.         CmsContentPO contentPO = new CmsContentPO(); 
  28.         contentPO.setContentId(UUID.randomUUID().toString()); 
  29.         contentPO.setTitle(title); 
  30.         contentPO.setDate(date); 
  31.  
  32.         try { 
  33.             boolean success = crawlerMapper.addCmsContent(contentPO) > 0; 
  34.             LOGGER.info("保存成功:{}", title); 
  35.         } catch (Exception ex) { 
  36.             LOGGER.error("保存失敗", ex); 
  37.         } 
  38.     } 

8、創建ArticleTask.java

執行抓取任務。

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
package site.exciter.article;
 
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;
 
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
 
@Component
public class ArticleTask {
    private static final Logger LOGGER = LoggerFactory.getLogger(ArticlePipeline.class);
 
    @Autowired
    private ArticlePipeline articlePipeline;
 
    @Autowired
    private ArticlePageProcessor articlePageProcessor;
 
    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
 
    public void crawl() {
        // 定時任務,每10分鐘爬取一次
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("ArticleCrawlerThread");
 
            try {
                Spider.create(articlePageProcessor)
                        .addUrl("http://www.cnblogs.com/dick159/default.html?page=2")
                        // 抓取到的數據存數據庫
                        .addPipeline(articlePipeline)
                        // 開啟5個線程抓取
                        .thread(5)
                        // 異步啟動爬蟲
                        .start();
            } catch (Exception ex) {
                LOGGER.error("定時抓取數據線程執行異常", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}

9、修改Application

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
package site.exciter.article;
 
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
 
@SpringBootApplication
@MapperScan(basePackages = "site.exciter.article.interface")
public class ArticleApplication implements CommandLineRunner {
 
    @Autowired
    private ArticleTask articleTask;
 
    public static void main(String[] args) {
        SpringApplication.run(ArticleApplication.class, args);
    }
 
    @Override
    public void run(String... args) throws Exception {
        articleTask.crawl();
    }
}

10、執行application,開始抓數據并入庫

SpringBoot+WebMagic+MyBaties實現爬蟲和數據入庫的示例

SpringBoot+WebMagic+MyBaties實現爬蟲和數據入庫的示例

到此這篇關于SrpingBoot+WebMagic+MyBaties實現爬蟲和數據入庫的示例的文章就介紹到這了,更多相關SrpingBoot+WebMagic+MyBaties爬蟲和數據入庫內容請搜索服務器之家以前的文章或繼續瀏覽下面的相關文章希望大家以后多多支持服務器之家!

原文鏈接:https://juejin.cn/post/7018897037219332104

延伸 · 閱讀

精彩推薦
  • Java教程MyBatis攔截器:給參數對象屬性賦值的實例

    MyBatis攔截器:給參數對象屬性賦值的實例

    下面小編就為大家帶來一篇MyBatis攔截器:給參數對象屬性賦值的實例。小編覺得挺不錯的,現在就分享給大家,也給大家做個參考。一起跟隨小編過來看看...

    Java之家5632020-09-10
  • Java教程使用Java編寫一個簡單的Web的監控系統

    使用Java編寫一個簡單的Web的監控系統

    這篇文章主要介紹了使用Java編寫一個簡單的Web的監控系統的例子,并且將重要信息轉為XML通過網頁前端顯示,非常之實用,需要的朋友可以參考下 ...

    snoopy77135162020-01-21
  • Java教程IKAnalyzer結合Lucene實現中文分詞(示例講解)

    IKAnalyzer結合Lucene實現中文分詞(示例講解)

    下面小編就為大家帶來一篇IKAnalyzer結合Lucene實現中文分詞(示例講解)。小編覺得挺不錯的,現在就分享給大家,也給大家做個參考。一起跟隨小編過來看看...

    funnyboy012811312021-01-17
  • Java教程一篇文章教你如何在SpringCloud項目中使用OpenFeign

    一篇文章教你如何在SpringCloud項目中使用OpenFeign

    這篇文章主要介紹了SpringCloud 使用Open feign 優化詳解,具有很好的參考價值,希望對大家有所幫助。一起跟隨小編過來看看吧...

    小小張自由—>張有博10542021-11-19
  • Java教程MyBatisPlus PaginationInterceptor分頁插件的使用詳解

    MyBatisPlus PaginationInterceptor分頁插件的使用詳解

    這篇文章主要介紹了MyBatisPlus PaginationInterceptor分頁插件的使用詳解,文中通過示例代碼介紹的非常詳細,對大家的學習或者工作具有一定的參考學習價值,...

    BADAO_LIUMANG_QIZHI8312021-08-29
  • Java教程Java集合之Map接口的實現類精解

    Java集合之Map接口的實現類精解

    Map提供了一種映射關系,其中的元素是以鍵值對(key-value)的形式存儲的,能夠實現根據key快速查找value;Map中的鍵值對以Entry類型的對象實例形式存在;鍵...

    葉綠體不忘呼吸7072022-01-22
  • Java教程Spring Boot的Profile配置詳解

    Spring Boot的Profile配置詳解

    本篇文章主要介紹了Spring Boot的Profile配置詳解,小編覺得挺不錯的,現在分享給大家,也給大家做個參考。一起跟隨小編過來看看吧...

    DT部落6012020-09-27
  • Java教程關于集合和字符串的互轉實現方法

    關于集合和字符串的互轉實現方法

    下面小編就為大家帶來一篇關于集合和字符串的互轉實現方法。小編覺得挺不錯的,現在就分享給大家,也給大家做個參考。一起跟隨小編過來看看吧 ...

    jingxian2872020-06-07
主站蜘蛛池模板: 香蕉久久网 | 精品国产自在在线在线观看 | 色综合九九 | 九九热在线视频观看这里只有精品 | 成人高清网站 | 国产一区二区视频在线观看 | 手机在线观看国产精选免费 | 成人影院在线观看免费 | 狠狠色成人综合网图片区 | 精品国产美女AV久久久久 | 亚洲AV久久无码精品九号 | www.青草视频| china精品对白普通话 | 日韩在线天堂免费观看 | 青青草原在线免费 | 欧美免赞性视频 | 啪哆哆 | 国产东北三老头伦一肥婆 | 花房乱爱在线观看 | 好姑娘在线完整版视频 | 日韩精品国产自在欧美 | 欧美a在线| 国产精品嫩草影院一二三区 | 办公室操秘书 | 青春草在线观看精品免费视频 | 爽好舒服把腿张小说 | 精品视频99 | www四虎影院 | 91精品国产91久久 | 狠狠的撞进去嗯啊h女强男视频 | 色一情一区二区三区四区 | 国产精品欧美日韩一区二区 | 暖暖日本在线观看免费 | 高清视频在线观看+免费 | 国产亚洲精品日韩香蕉网 | 国产亚洲精品一区二区在线播放 | 99九九精品免费视频观看 | 欧美美女一区二区三区 | 亚洲AV久久无码精品九号软件 | 色婷婷综合久久久 | 极品ts赵恩静和直男激战啪啪 |