TypeScript: Let's Build a Crawler for Bilibili

admin • 2021-12-06 20:27 • Frontend

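Preface

Okay, the title is a bit clickbait. I stumbled across a package on npm called superagent. Its website describes it as a very convenient client-side request proxy module for Node.js (link: Chinese-language site). That did not tell me much, so I searched around. According to what I found, it is a small, readable, and powerful Ajax API; in other words, an HTTP library (ah, so it is an HTTP library). It lets you set any request header field, which is very handy for writing crawlers. Wait, crawlers? This thing can be used to write a crawler? Suddenly I was wide awake. I kept searching, and it really can. So I had a bold idea: let's crawl Bilibili!

Code

The code for this article has been uploaded to CSDN (see the CSDN resource download page). If the resource has not passed review yet or cannot be downloaded, contact the author and leave an email address; the author will send the code when online.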

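Creating the project

Since I have been blogging about TS lately, let's use TS. The first step is to initialize package.json and tsconfig.json.

Create a new folder as the project root, then initialize package.json: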

npm init

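Next, initialize tsconfig.json, which configures TypeScript (this assumes tsc is installed globally; otherwise install typescript first and run npx tsc --init):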

tsc --init

Then install typescript:

npm install typescript --save

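To make compiling and running the code easier, you can optionally install ts-node:

npm install ts-node -D

If you installed ts-node, add a command for it to package.json while you are at it: open package.json and add a dev command under scripts: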

{
    "name": "bilibilicrawler",
    "version": "1.0.0",
    "description": "",
    "main": "index.js",
    "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1",
        "dev": "ts-node ./app.ts"
    },
    "author": "",
    "license": "ISC",
    "dependencies": {
        "superagent": "^6.1.0",
        "typescript": "^4.5.2"
    },
    "devDependencies": {
        "@types/superagent": "^4.1.13",
        "ts-node": "^10.4.0"
    }
}

Install superagent. Because this is a TS project, its type definitions need to be installed as well:

npm install superagent --save
npm install @types/superagent -D

That's it, the project is initialized.

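Main content

Goal

First, of course, we need a goal. The program's goal: crawl the information on the Bilibili home page and process it to see, for each recommended video, who the author is, what the title is, how many plays it has, the cover image URL, and so on.

Fetching the HTML

With the goal set, the first step is to fetch the HTML of the entire Bilibili home page. According to the official documentation, superagent has a series of methods that correspond to the various request types, such as get, post, and delete. We only need get. Open the project folder, create a file named app.ts, and write the code: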

/**
 * @description Main program
 */

import superagent from "superagent";

export default class BC {
    private static Bilibili: BC;
    // url is the address of the site to crawl
    constructor(private url: string) {
        this.start();
    }
    private async start() {
        // Fetch the HTML
        const BHtml = await superagent
                .get(this.url);

        // Print the result
        console.log(BHtml);
    }
    // The constructor only takes a url at this stage, so init does too
    static init(url: string) {
        if (!BC.Bilibili) {
            BC.Bilibili = new BC(url);
        }
        return BC.Bilibili;
    }
}


BC.init("https://www.bilibili.com/")

With that written, run the dev command:

npm run dev

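Well, the request was rejected outright with a 403. Awkward. So I opened Bilibili, opened the developer console and the Network tab, and looked at the actual request to see what was missing. Something was indeed missing: the cookie. The request shows that a cookie is always sent whether or not you are logged in, and it has a value even when you are not logged in; I just don't know what that value means. Now that we have found it, let's add it: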

/**
 * @description Main program
 */

import superagent from "superagent";

export default class BC {
    private static Bilibili: BC;
    constructor(private url: string, private cookie: string) {
        this.start();
    }
    private async start() {
        // Fetch the HTML, sending the cookie copied from the browser
        const BHtml = await superagent
                .get(this.url)
                .set("cookie", this.cookie);

        // Print the result
        console.log(BHtml);
    }
    static init(url: string, cookie: string) {
        if (!BC.Bilibili) {
            BC.Bilibili = new BC(url, cookie);
        }
        return BC.Bilibili;
    }
}

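Run the command again, and this time we really do get the whole HTML. The raw HTML is practically unreadable, but that does not matter: at this point we know we are heading in the right direction, and what remains is refinement.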
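The 403 above surfaced as a rejected promise: superagent rejects for non-2xx responses and sets a status field on the error. A minimal sketch of turning that into a readable hint might look like the following; explainFetchError and HttpLikeError are hypothetical names, not part of superagent, and the messages are only illustrative.

```typescript
// Minimal shape of the error field read below. superagent sets `status`
// on the error object when the server answers with a non-2xx status.
interface HttpLikeError {
    status?: number;
}

// Hypothetical helper: turns a failed request into a readable hint.
function explainFetchError(err: unknown): string {
    const status =
        typeof err === "object" && err !== null
            ? (err as HttpLikeError).status
            : undefined;

    if (status === 403) {
        return "403 Forbidden: the site refused the request; check the cookie header";
    }
    if (typeof status === "number") {
        return `Request failed with HTTP status ${status}`;
    }
    return "Request failed before a response arrived (network error?)";
}

// Intended use inside start(), shown as a comment because it needs network access:
//   try {
//       const BHtml = await superagent.get(this.url).set("cookie", this.cookie);
//   } catch (err) {
//       console.error(explainFetchError(err));
//   }
```

Wrapping the request this way also avoids an unhandled promise rejection, since start() is called from the constructor without being awaited.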

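Parsing the HTML

This brings us to another tool that I strongly recommend: cheerio (see the cheerio website). It implements the core of jQuery, namely DOM manipulation, and it can operate directly on a string of HTML on the server side. Let's look at an example first: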

var cheerio = require('cheerio'),
    $ = cheerio.load('<h2 class = "title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();

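This example comes from the web. Simple as it is, it shows to some extent how powerful this library is: with it, you can happily manipulate the DOM on the server side.

Naturally, this library also needs to be installed, and because we are using TS, so do its type definitions. The install command is: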

npm install cheerio @types/cheerio -D

Once it is installed, write the corresponding code:

/**
 * @description Main program
 */

import superagent from "superagent";
import cheerio from "cheerio";

export default class BC {
    private static Bilibili: BC;
    constructor(private url: string, private cookie: string) {
        this.start();
    }
    private async start() {
        // Fetch the HTML
        const BHtml = await superagent.get(this.url).set("cookie", this.cookie);
        if (BHtml.text === "") return false;

        // Load the whole HTML document
        const bc = cheerio.load(BHtml.text);

        // First pass over the HTML: select the region we care about
        const infoBox = this._handleBHtml(bc);
        console.log(infoBox);
    }
    private _handleBHtml(bc: cheerio.Root) {
        return bc(".info-box");
    }
    static init(url: string, cookie: string) {
        if (!BC.Bilibili) {
            BC.Bilibili = new BC(url, cookie);
        }
        return BC.Bilibili;
    }
}

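The code is simple: cheerio's load method loads the HTML, and then jQuery syntax extracts the part we want to analyze further. Note that ".info-box" comes from inspecting the page source on Bilibili and picking the class name of the section we need. Anyone who knows jQuery syntax will find this familiar. You could even write it like this: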

private _handleBHtml(html: string) {
    const $ = cheerio.load(html);

    return $(".info-box");
}

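Doesn't that look even more like jQuery…

Converting the HTML into data

The main goal of this part is to process the selected DOM region further. The processing is simple: iterate over the matched elements, extract the text from the child elements we need, and collect the results into an array. Add a timestamp to that array and we have a snapshot of the home page at this moment. Enough talk, here is the code: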

/**
 * @description Main program
 */

import superagent from "superagent";
import cheerio from "cheerio";

// Node core modules for file handling and path handling
import fs from "fs";
import path from "path";

type typeInfoItem = {
    title: string;
    author: string;
    img: string | undefined;
    url: string | undefined;
    play: string | undefined;
};
type InfoData = {
    time: number;
    data: typeInfoItem[];
};

export default class BC {
    private static Bilibili: BC;
    constructor(private url: string, private cookie: string) {
        this.start();
    }
    private async start() {
        // Fetch the HTML
        const BHtml = await superagent.get(this.url).set("cookie", this.cookie);
        if (BHtml.text === "") return false;

        // Load the whole HTML document
        const bc = cheerio.load(BHtml.text);

        // First pass over the HTML: select the region we care about
        const infoBox = this._handleBHtml(bc);

        // Second pass: convert infoBox into an array of items
        const infoArray: InfoData = this._handleBArray(bc, infoBox);
        console.log(infoArray);
    }
    private _handleBHtml(bc: cheerio.Root) {
        return bc(".info-box");
    }
    private _handleBArray(bc: cheerio.Root, info: cheerio.Cheerio) {
        const data: typeInfoItem[] = [];

        // each() is used because the callback only collects data and returns nothing
        info.each((index, element) => {
            // Link
            const infoUrl = bc(element)
                .find("a")
                .attr("href");

            // Image URL
            const imageSrc = bc(element)
                .find("img")
                .attr("src");

            // Title
            const title = bc(element)
                .find(".info")
                .find("p.title")
                .text();

            // Uploader
            const author = bc(element)
                .find(".info")
                .find("p.up")
                .text();

            // Play count
            const playNumber = bc(element)
                .find(".info")
                .find("p.play")
                .text();

            data.push({
                title,
                author,
                img: imageSrc,
                url: infoUrl,
                play: playNumber
            });
        });

        return {
            time: new Date().getTime(),
            data
        };
    }
    static init(url: string, cookie: string) {
        if (!BC.Bilibili) {
            BC.Bilibili = new BC(url, cookie);
        }
        return BC.Bilibili;
    }
}
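Text and attributes pulled straight out of markup often carry stray whitespace and relative or protocol-relative links. The sketch below shows one way the extracted fields could be tidied before storage. All names here (RawItem, cleanText, toAbsoluteUrl, parsePlayCount, normalizeItem) are hypothetical, and the play-count format ("356", "12.3万", "1.2亿") is an assumption about what the page shows, not something confirmed by the code above.

```typescript
// Same shape as the items collected in _handleBArray.
type RawItem = {
    title: string;
    author: string;
    img: string | undefined;
    url: string | undefined;
    play: string | undefined;
};

// Collapse runs of whitespace (newlines, indentation) left over from the markup.
function cleanText(text: string | undefined): string {
    return (text ?? "").replace(/\s+/g, " ").trim();
}

// Resolve relative or protocol-relative links ("//host/path") against the page URL.
function toAbsoluteUrl(raw: string | undefined, base: string): string | undefined {
    if (!raw) return undefined;
    try {
        return new URL(raw, base).href;
    } catch {
        return undefined; // not a usable URL
    }
}

// Assumption: counts look like "356", "12.3万" (x 10,000) or "1.2亿" (x 100,000,000).
function parsePlayCount(text: string | undefined): number | undefined {
    const match = cleanText(text).match(/^(\d+(?:\.\d+)?)(万|亿)?$/);
    if (!match) return undefined;
    const [, digits = "", unit] = match;
    const factor = unit === "万" ? 1e4 : unit === "亿" ? 1e8 : 1;
    // Math.round guards against floating-point noise such as 123000.00000000001
    return Math.round(parseFloat(digits) * factor);
}

function normalizeItem(item: RawItem, pageUrl: string) {
    return {
        title: cleanText(item.title),
        author: cleanText(item.author),
        img: toAbsoluteUrl(item.img, pageUrl),
        url: toAbsoluteUrl(item.url, pageUrl),
        play: parsePlayCount(item.play)
    };
}
```

A call such as normalizeItem(item, "https://www.bilibili.com/") could be applied to each entry before it is pushed into data; storing the play count as a number makes later comparisons simpler.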

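Storing the data

At this point we already have the data, but once fetched it still has to be stored; otherwise the effort is wasted. Storing it requires two Node core modules, so a little Node knowledge is needed here. The two modules are fs and path: put simply, fs operates on files and path operates on paths. With these two modules we can create a file locally and save the data.
Before writing the code, we also need a simple design for the data structure. The structure we want to store looks like this: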

{
    "<timestamp>": [
        {
            "title": "xxx",
            "author": "xxx",
            "img": "xxx",
            "url": "xxx",
            "play": "xxx"
        }
    ],
    "<another timestamp>": [
        {
            "title": "xxx",
            "author": "xxx",
            "img": "xxx",
            "url": "xxx",
            "play": "xxx"
        }
    ]
}

This way, we can tell from the timestamp which videos were recommended at a given time. The code is as follows:

/**
 * @description Main program
 */

import superagent from "superagent";
import cheerio from "cheerio";

// Node core modules for file handling and path handling
import fs from "fs";
import path from "path";

type typeInfoItem = {
    title: string;
    author: string;
    img: string | undefined;
    url: string | undefined;
    play: string | undefined;
};
type InfoData = {
    time: number;
    data: typeInfoItem[];
};
type JSONData = {
    [propName: number]: typeInfoItem[];
};

export default class BC {
    private static Bilibili: BC;
    constructor(private url: string, private cookie: string) {
        this.start();
    }
    private async start() {
        // Fetch the HTML
        const BHtml = await superagent.get(this.url).set("cookie", this.cookie);
        if (BHtml.text === "") return false;

        // Load the whole HTML document
        const bc = cheerio.load(BHtml.text);

        // First pass over the HTML: select the region we care about
        const infoBox = this._handleBHtml(bc);

        // Second pass: convert infoBox into an array of items
        const infoArray: InfoData = this._handleBArray(bc, infoBox);

        // Store the data
        const saveStatus = this._saveBArray(infoArray);
        console.log(saveStatus);
    }
    private _handleBHtml(bc: cheerio.Root) {
        return bc(".info-box");
    }
    private _handleBArray(bc: cheerio.Root, info: cheerio.Cheerio) {
        const data: typeInfoItem[] = [];

        // each() is used because the callback only collects data and returns nothing
        info.each((index, element) => {
            // Link
            const infoUrl = bc(element)
                .find("a")
                .attr("href");

            // Image URL
            const imageSrc = bc(element)
                .find("img")
                .attr("src");

            // Title
            const title = bc(element)
                .find(".info")
                .find("p.title")
                .text();

            // Uploader
            const author = bc(element)
                .find(".info")
                .find("p.up")
                .text();

            // Play count
            const playNumber = bc(element)
                .find(".info")
                .find("p.play")
                .text();

            data.push({
                title,
                author,
                img: imageSrc,
                url: infoUrl,
                play: playNumber
            });
        });

        return {
            time: new Date().getTime(),
            data
        };
    }
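    private _saveBArray(data: InfoData) {
        // Path of the storage file, resolved relative to this file's directory
        const filePath = path.resolve(__dirname, "../data/bilibiliData.json");

        let jsonData: JSONData = {};
        // If the file already exists, read it first so the new snapshot is merged in
        if (fs.existsSync(filePath)) {
            jsonData = JSON.parse(fs.readFileSync(filePath, "utf-8"));
        } else {
            // Otherwise make sure the target directory exists, or the write below fails
            fs.mkdirSync(path.dirname(filePath), { recursive: true });
        }
        jsonData[data.time] = data.data;

        // Write the merged data back
        fs.writeFileSync(filePath, JSON.stringify(jsonData));
        return "Write succeeded";
    }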
    static init(url: string, cookie: string) {
        if (!BC.Bilibili) {
            BC.Bilibili = new BC(url, cookie);
        }
        return BC.Bilibili;
    }
}
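The read, merge, and write steps in _saveBArray can also be separated so that the merge itself does not touch the file system. The following is only a sketch under that assumption; StoredItem, Snapshot, Store, parseStore, and mergeSnapshot are hypothetical names, and the item type is trimmed to two fields to keep the example short.

```typescript
type StoredItem = { title: string; author: string };
type Snapshot = { time: number; data: StoredItem[] };
type Store = { [time: number]: StoredItem[] };

// Parse previously saved JSON, falling back to an empty store when the
// text is empty, corrupt, or not a plain object.
function parseStore(text: string): Store {
    try {
        const parsed: unknown = JSON.parse(text);
        return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
            ? (parsed as Store)
            : {};
    } catch {
        return {};
    }
}

// Pure merge step: returns a new store with the snapshot added under its timestamp.
function mergeSnapshot(store: Store, snapshot: Snapshot): Store {
    return { ...store, [snapshot.time]: snapshot.data };
}
```

With this split, _saveBArray would only read the file, call the two helpers, and write the result, and a damaged JSON file would no longer make JSON.parse throw in the middle of a run.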

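At this point the basic flow of the code is complete. What remains is optimization: for example, extracting some of the concrete implementation code, and then adding a timer so that the program runs periodically. Once it has run round after round, the collected data can be analyzed.

Summary

This article recorded how to build a crawler with Node + TS: superagent fetches the page, cheerio selects the part we want to analyze, a loop breaks the content down into the format we need, and the Node core modules fs and path store the crawled data locally.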
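As one possible first analysis of the stored snapshots, the sketch below counts in how many snapshots each title appears. It assumes the JSON file has been parsed into the timestamp-keyed structure described earlier; countAppearances and the type names are hypothetical.

```typescript
type TitledItem = { title: string };
type SnapshotStore = { [time: string]: TitledItem[] };

// Count, for every title, the number of snapshots it shows up in.
function countAppearances(store: SnapshotStore): { [title: string]: number } {
    const counts: { [title: string]: number } = {};
    for (const time of Object.keys(store)) {
        // Track titles already seen in this snapshot so a title that is
        // listed twice at the same time is only counted once.
        const seen: { [title: string]: boolean } = {};
        for (const item of store[time] ?? []) {
            if (seen[item.title]) continue;
            seen[item.title] = true;
            counts[item.title] = (counts[item.title] ?? 0) + 1;
        }
    }
    return counts;
}
```

Titles with a high count stayed on the home page across many runs, which is the kind of question the periodic crawl is meant to answer.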

