Fixing and proofreading the citation order of a paper with Python


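Problem description

A senior labmate asked me to help fix and proofread the order of the citations in his paper. After marking up and changing a dozen or so by hand I could no longer keep track, which is how this post came about.

Some journals and conferences require references to be numbered in order of citation: the first reference cited in the text is [1], the second is [2], and so on, and a reference that is cited again later keeps the number it was first given. When revising a paper, though, we often add new references, for example a few more papers in the Introduction or Related Works. These are appended to the end of the reference list, so the newly added references break the citation order of the whole paper.

Below is one conforming example and one non-conforming example.

Conforming (suppose this is the first paragraph of the paper)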

Global community detection attempts to divide the overall network into multiple communities[1]. The main detection methods include label propagation algorithm[2], non-negative matrix decomposition, deep learning, and evolutionary clustering[3]. Local and global structure information are vital in community detection [4,5,6]. However, with the increase of data scale, the running time and memory required for global community detection increase [3], especially the evolutionary algorithms [5]. Many times, the overall community structure may often not be necessary [7,8].

Non-conforming (suppose this is the first paragraph of the paper)

Global community detection attempts to divide the overall network into multiple communities[1]. The main detection methods include label propagation algorithm[2], non-negative matrix decomposition, deep learning, and evolutionary clustering[9]. Local and global structure information are vital in community detection [10, 11, 12, 13, 14]. However, with the increase of data scale, the running time and memory required for global community detection increase [3], especially the evolutionary algorithms [4, 5]. Many times, the overall community structure may often not be necessary [6].

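Suppose we now fix the citation order by hand. With twenty or thirty references that is manageable (if you wrote the paper yourself, you may even remember which references are cited where), but once the count reaches several dozen or more than a hundred, you will lose track partway through.

The rest of this post shows how to use code to proofread and fix the citation order.

Approach

Having done a lot of coding exercises, I saw right away that this is just a reference mapping table. Suppose there are 100 references: after the additions, the reference list is numbered [1-100], but the order in which the text cites them is scrambled. What has to change is the numbering of the in-text citations, so that it follows the order of first appearance. Finally, the reference list has to be re-sorted into that same order.

The solution is to pattern-match the numeric citation labels, assign new numbers starting from 1 in order of first appearance, and build a table that maps each old reference number to its correct number. With the map in hand, the script prints, paragraph by paragraph, each original number next to its correct number, so the edits are quick to locate.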

The dictionary's key-value pairs are:

{key : value} → {old reference number : correct number}
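To make the idea concrete, here is a minimal sketch that builds such a map from a plain list of citation numbers in reading order. The function name `build_map` is illustrative and not part of the script below; the sample sequence is taken from the non-conforming paragraph shown earlier.

```python
def build_map(cited_sequence):
    """Map each old reference number to its rank by first appearance."""
    paper_map = {}
    for old in cited_sequence:
        # setdefault only stores a new number the first time `old` is seen,
        # so repeated citations keep the number they were first given
        paper_map.setdefault(old, len(paper_map) + 1)
    return paper_map


# Citation numbers in reading order from the non-conforming paragraph
sequence = [1, 2, 9, 10, 11, 12, 13, 14, 3, 4, 5, 6]
paper_map = build_map(sequence)
print(paper_map)
```

In this sample, old reference 9 becomes 3 and old reference 3 becomes 9, which is the kind of swap that is easy to get wrong by hand.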

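Code implementation

1. Required libraries

from docx import Document  # read and write Word documents (python-docx)
import re                  # pattern matching
import xlwt                # write xls spreadsheets (optional, only used in step 5)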

2. Read the Word document and get the text of every paragraph

def get_paragraphs(docx_path):
    # open the Word document
    document = Document(docx_path)
    # collect the text of every body paragraph, in document order
    paragraph_texts = []
    for paragraph in document.paragraphs:
        paragraph_texts.append(paragraph.text)
    return paragraph_texts

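3. Build the reference map

The key point is the pattern that matches citations such as [1], [2,3,4,5], and [4,7,8].

The pattern is pattern = r"(\[[1-9][0-9]?(,\s*[1-9][0-9]?)*\])". The square brackets must be escaped with backslashes; otherwise [[1-9] is read as a character class and two-digit citations such as [12] are not matched correctly. The \s* also accepts a space after each comma, as in [10, 11]. As written, the pattern covers reference numbers 1 to 99.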
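As a quick check of what the pattern returns, the sketch below runs `re.findall` on a shortened sample sentence. It assumes the escaped form of the pattern with optional spaces after commas; the sample text is made up for illustration.

```python
import re

# Assumed pattern: brackets escaped, optional whitespace after each comma
pattern = r"(\[[1-9][0-9]?(,\s*[1-9][0-9]?)*\])"

text = "propagation[2], clustering[9]. Structure matters [10, 11, 12]."
matches = re.findall(pattern, text)

# The pattern has two capture groups, so findall returns one tuple per match.
# Element 0 is the whole bracketed citation; element 1 is only the last
# ",number" repetition (or an empty string) and can be ignored.
citations = [m[0] for m in matches]
print(citations)
```

This is why the functions in this post index each result with `res[0]` before splitting out the numbers.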

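def get_order(paragraph_texts):
    """
    Build the map from old reference numbers to their correct numbers.
    :param paragraph_texts: list of paragraph strings
    :return: paper_map {old number: new number}; -1 means never cited
    """
    paper_nums = 73  # number of references + 1; adjust for your own paper
    paper_map = {}
    for i in range(1, paper_nums):
        paper_map[i] = -1
    index = 1

    # matches citations such as [1] or [1,2,3]
    pattern = r"(\[[1-9][0-9]?(,\s*[1-9][0-9]?)*\])"

    for texts in paragraph_texts:
        res_list = re.findall(pattern=pattern, string=texts)
        for res in res_list:
            # res is a tuple; res[0] is the whole citation, e.g. "[1,2,3]"
            nums = get_nums(res[0])

            for num in nums:
                key = int(num)
                # only the first appearance assigns a new number
                if paper_map[key] == -1:
                    paper_map[key] = index
                    index += 1
    return paper_map


def get_nums(sres: str):
    """
    Turn a string such as "[1]" or "[1,2,3,4,5]" into a list of numbers.
    :param sres: the bracketed citation string
    :return: list [str], one numeric string per reference
    """
    ssub = sres[1:-1]  # drop the surrounding brackets
    return [part.strip() for part in ssub.split(',')]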
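Once the numbers have been changed, it helps to confirm that the text really is in order. The following is a minimal sketch of such a check, not part of the original script: the helper name `first_out_of_order` is hypothetical, and the pattern is an assumed variant with escaped brackets and a non-capturing inner group.

```python
import re

# Assumed pattern: escaped brackets, optional spaces, numbers 1 to 99
CITATION = re.compile(r"\[([1-9][0-9]?(?:,\s*[1-9][0-9]?)*)\]")


def first_out_of_order(paragraph_texts):
    """Return the first citation number that breaks sequential order, or None."""
    seen = set()
    for text in paragraph_texts:
        for match in CITATION.finditer(text):
            for part in match.group(1).split(","):
                num = int(part)  # int() tolerates surrounding spaces
                if num in seen:
                    continue  # a repeated citation keeps its number
                if num != len(seen) + 1:
                    return num  # a new reference must be the next integer
                seen.add(num)
    return None


good = ["a[1] b[2] c[3] d [4,5,6] e [3] f [5] g [7,8]"]
bad = ["a[1] b[2] c[9] d [10, 11]"]
print(first_out_of_order(good), first_out_of_order(bad))
```

For the conforming sample the check finds nothing, while for the non-conforming one it stops at 9, the first number that should have been 3.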

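4. Print what needs to change

def print_location(paragraph_texts, paper_map):
    """
    Print, for each paragraph, the citations and the numbers they map to.
    :param paragraph_texts: list of paragraph strings
    :param paper_map: {old number: new number}
    :return:
    """
    # matches citations such as [1] or [1,2,3]
    pattern = r"(\[[1-9][0-9]?(,\s*[1-9][0-9]?)*\])"

    for texts in paragraph_texts:
        res_list = re.findall(pattern=pattern, string=texts)

        # print the start of the paragraph so it is easy to find in Word
        if len(res_list) != 0:
            print(texts[0:100])

        for res in res_list:
            # res is a tuple such as ('[1,2,3,4,5]', ',5');
            # the first element is the whole citation
            nums = get_nums(res[0])
            to_nums = []

            for num in nums:
                key = int(num)
                to_nums.append(paper_map[key])
            print(nums, "  --->  ", to_nums)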

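Here we print the first 100 characters of each paragraph that contains citations (to make it easy to locate later), followed by each original citation group and its correct numbers. A test screenshot follows.

[Screenshot: sample output of print_location]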
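The script only reports what to change, and the edits are then made by hand. If the text is available as plain strings, the substitution itself could be sketched as below. The helper `renumber` is hypothetical, the pattern is an assumed variant with escaped brackets, and sorting the numbers inside one bracket is a style choice rather than something the original script does. Writing the result back into a Word file while keeping its formatting is not covered here.

```python
import re

# Assumed pattern: escaped brackets, optional spaces, numbers 1 to 99
CITATION = re.compile(r"\[([1-9][0-9]?(?:,\s*[1-9][0-9]?)*)\]")


def renumber(text, paper_map):
    """Rewrite every bracketed citation in `text` using paper_map."""
    def replace(match):
        old_nums = [int(n) for n in match.group(1).split(",")]
        # Sort so that a group stays ascending after mapping
        new_nums = sorted(paper_map[n] for n in old_nums)
        return "[" + ",".join(str(n) for n in new_nums) + "]"

    # sub() makes a single pass, so a number that was just rewritten
    # is never mapped a second time
    return CITATION.sub(replace, text)


sample_map = {1: 1, 2: 2, 9: 3, 3: 4}
sample = "communities[1], propagation[2], clustering[9], increase [3]."
print(renumber(sample, sample_map))
```

Doing all replacements in one pass matters: replacing 9 with 3 and then 3 with 4 in separate steps would corrupt the numbers.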

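5. Write the reference map to an xls file (optional)

What gets saved is the list of tuples obtained by sorting the map by value, that is, the old reference numbers listed in their new order.

def write_to_excel(sorted_data):
    """
    Write the map data to an Excel sheet.
    :param sorted_data: (old number, new number) tuples sorted by new number,
                        e.g. [(12, 1), (213, 2)]
    :return:
    """
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('paper-map', cell_overwrite_ok=True)

    # Skip uncited references (new number -1) instead of always dropping
    # the first tuple, so no cited entry is lost
    cited = [pair for pair in sorted_data if pair[1] != -1]
    for row, (old, new) in enumerate(cited):
        sheet.write(row, 0, new)  # column 0: new number
        sheet.write(row, 1, old)  # column 1: old number
    book.save("./paper-map.xls")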
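If xlwt is not installed, the same table can be saved with the standard library instead. This is a sketch of that alternative under my own assumptions: the function name `write_map_csv` and the column headers are made up, and entries mapped to -1 are treated as uncited and skipped.

```python
import csv
import io


def write_map_csv(sorted_data, handle):
    """Write (new number, old number) rows, skipping uncited entries (-1)."""
    writer = csv.writer(handle)
    writer.writerow(["new_number", "old_number"])
    for old, new in sorted_data:
        if new != -1:
            writer.writerow([new, old])


# An in-memory buffer stands in for a real file in this example
buffer = io.StringIO()
write_map_csv([(44, -1), (63, 1), (62, 2)], buffer)
print(buffer.getvalue())
```

For a real file, open it with `open("paper-map.csv", "w", newline="")` and pass the handle in place of the buffer.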

6. Main function

if __name__ == "__main__":
    docx_path = "paper-name.docx"
    paragraph_texts = get_paragraphs(docx_path)
    paper_map = get_order(paragraph_texts)
    print(paper_map)

    print_location(paragraph_texts, paper_map)

    # sort the map by new number: a list of (old number, new number) tuples
    sort_reverse_order = sorted(paper_map.items(), key=lambda x: x[1])
    print(sort_reverse_order)

    # save to an Excel sheet
    write_to_excel(sort_reverse_order)

7. Save the correctly ordered references to Word

First copy over the map produced above, then write the references in their final order to a Word document. Each entry is also tagged with its original number so the result can be checked afterwards.

import docx

# Paste the map printed by the main function here.
# Reference 44 was mapped to -1 by get_order, so it is given the last
# number (72) by hand.
paper_map = {1: 5, 2: 6, 3: 7, 4: 8, 5: 9, 6: 10, 7: 11, 8: 12, 9: 13, 10: 14,
             11: 15, 12: 16, 13: 17, 14: 55, 15: 60, 16: 61, 17: 34, 18: 37, 19: 35, 20: 38,
             21: 39, 22: 58, 23: 62, 24: 40, 25: 63, 26: 64, 27: 41, 28: 42, 29: 43, 30: 44,
             31: 18, 32: 36, 33: 59, 34: 45, 35: 28, 36: 46, 37: 19, 38: 23, 39: 67, 40: 70,
             41: 71, 42: 26, 43: 47, 45: 29, 46: 30, 47: 31, 48: 24, 49: 25, 50: 20,
             51: 27, 52: 32, 53: 33, 54: 56, 55: 57, 56: 66, 57: 48, 58: 54, 59: 49, 60: 68,
             61: 69, 62: 2, 63: 1, 64: 4, 65: 3, 66: 21, 67: 22, 68: 50, 69: 51, 70: 52,
             71: 53, 72: 65, 44: 72}

# list of (old number, new number) tuples sorted by new number
sort_reverse_order = sorted(paper_map.items(), key=lambda x: x[1])
print(paper_map)
print(sort_reverse_order)

# paper.docx is expected to hold only the reference list,
# one reference per paragraph, in the original order
file = docx.Document("paper.docx")
paras = file.paragraphs

print(len(paras))  # should equal the number of references

# write the new file toPaper.docx
to_file = docx.Document()
for old, new in sort_reverse_order:
    # "[new number] (old number) reference text"
    text = f"[{new}] ({old}) {paras[old - 1].text}"
    print(text)
    to_file.add_paragraph(text)

to_file.save("./toPaper.docx")
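The reordering step can also be tried on plain strings before touching any Word file. The sketch below assumes the reference list is a Python list in which `references[i]` is the entry originally numbered i + 1; the name `reorder_references` is illustrative, and uncited entries (mapped to -1) are left out rather than appended.

```python
def reorder_references(references, paper_map):
    """Return reference strings sorted by new number, tagged with the old one."""
    # Keep only cited entries; -1 marks a reference that was never cited
    cited = sorted((new, old) for old, new in paper_map.items() if new != -1)
    return [f"[{new}] ({old}) {references[old - 1]}" for new, old in cited]


refs = ["Paper A", "Paper B", "Paper C"]
result = reorder_references(refs, {1: 2, 2: -1, 3: 1})
print(result)
```

Keeping the old number in parentheses mirrors the Word output and makes it easy to spot-check a few entries against the original list.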

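After running the code, part of the resulting Word document looks like this:

[Screenshot: excerpt of the generated toPaper.docx]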
