您好, 欢迎来到 !    登录 | 注册 | | 设为首页 | 收藏本站

从pdf提取引用-Python

从pdf提取引用-Python

PDF非常复杂,我不是专家,但是我使用了extractText()的代码来查看其工作方式,并使用它print('>>>', operator, operands)可以查看它在PDF中找到的值。

在本文档中,它用于"Tm"将位置移动到新行,因此更改了原始代码extractText(),我曾经"Tm"添加\n并在行中得到了文本

Arto Anttila. 1995. How to recognise subjects in 
English. In Karlsson et al., chapt. 9, pp. 315-358. 
Dekang Lin. 1996. Evaluation of Principar with the 
Susanne corpus. In John Carroll, editor, Work- 
shop on Robust Parsing, pages 54-69, Prague. 
Jason M. Eisner. 1996. Three new probabilistic 
models for dependency parsing: An exploration. 
In The 16th International Conference on Compu- 
tational Linguistics, pages 340-345. Copenhagen. 
David G. Hays. 1964. Dependency theory: A 
formalism and some observations. Language, 
40(4):511-525.

或与---线之间

---
Arto Anttila. 1995. How to recognise subjects in 
---
English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the 
---
Susanne corpus. In John Carroll, editor, Work- 
---
shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic 
---
models for dependency parsing: An exploration. 
---
In The 16th International Conference on Compu- 
---
tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A 
---
formalism and some observations. Language, 
---
40(4):511-525.

但是它仍然不是那么有用,但是现在我用来获得这个结果的代码

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText(self):  
    # code from original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

挖掘之后,似乎Tm也有值,并且有一个新位置x, y可以用来计算文本行之间的距离,并且\n当距离大于某个值时可以添加。我测试了不同的价值,从价值中17我得到了预期的结果

---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. 
---

这里的代码

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText2(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        elif operator == b_("Tm"):
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y

            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'

            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n' # to add empty line between elements

            prev_x = x
            prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

它适用于此PDF,但其他文件可能具有不同的结构或彼此之间的距离,references并且可能需要其他更改。

更通用的版本-它有第二个论点

如果没有第二个参数运行

 text += myExtractText(page)

那么它的工作原理就像原始的一样,extractText()并且您将所有内容集中在一个字符串中。

如果第二个参数是 True

 text += myExtractText(page, True)

然后每次添加新行Tm-就像我的第一个版本一样。

如果第二个参数是整数-即。 17

 text += myExtractText(page, 17)

然后它会在距离更大时添加新行17-就像我的第二个版本一样。

import PyPDF2
from PyPDF2.pdf import *  # to import function used in origimal `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)

    prev_x = 0
    prev_y = 0

    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        if operator == b_("Tm"):

            if distance is True: 
                text += '\n'

            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'

                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements

                prev_x = x
                prev_y = y

    return text

# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function

    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger then `17`)

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]

# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

它不仅对References文本而且对其余文本都是有用的-似乎将段落分割了。

PDF开始的结果

---
A non-projective dependency parser 
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. @R_845_2419@ 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i 
---
Abstract 
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated. 
---
1 Introduction 
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner. 
---
see bi i ~ d'~b~ bird 
---
figure 1: Dependencies for sentence I see a bird. 
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated. 
---
64 
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t . 
---
2 Background 
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule 
---
python 2022/1/1 18:44:38 有321人围观

撰写回答


你尚未登录,登录后可以

和开发者交流问题的细节

关注并接收问题和回答的更新提醒

参与内容的编辑和改进,让解决方法与时俱进

请先登录

推荐问题


联系我
置顶