Getting the coordinates of a string using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

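This is a very, very simple version of an implementation.

Before implementing it, it is _very_ important to know that PDFs have zero concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as: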

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as

Draw Hello World at (10,10)

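The ITextExtractionStrategy interface that you need to implement has a method called RenderText, which gets called once for every chunk of text within the PDF. Notice I said "chunk" and not "word". In the first example above, the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words, and because of this iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.

Also, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph break, you are actually seeing a brand new text drawing command whose y coordinate differs from the previous line's. See this for further discussion.

The code below is a very simple implementation. For it I'm subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I'm using this simple helper class to store these chunks and rectangles: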
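Since lines and words have to be rebuilt from coordinates, here is a minimal sketch of one way to group chunks into lines. It is not part of iTextSharp: `Chunk`, `LineGrouper`, and the `yTolerance` value are hypothetical names and numbers chosen for illustration, and the sketch assumes unrotated, horizontal text where chunks on the same line share roughly the same y coordinate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for what a strategy might record on each RenderText call
public class Chunk {
    public string Text;
    public float X;
    public float Y;
    public Chunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class LineGrouper {
    // Groups chunks whose y coordinates lie within yTolerance of the first chunk
    // of a line, orders lines top to bottom (PDF y grows upward) and the chunks
    // within each line left to right, then joins each line's text.
    public static List<string> ToLines(IEnumerable<Chunk> chunks, float yTolerance = 2f) {
        var lines = new List<List<Chunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y)) {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - chunk.Y) <= yTolerance) {
                current.Add(chunk);
            } else {
                lines.Add(new List<Chunk> { chunk });
            }
        }
        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Feeding it chunks in arbitrary drawing order, for example "rld" at (90, 700), "Second line" at (10, 686), "H" at (10, 700), "o Wo" at (50, 700) and "ell" at (20, 700), yields "Hello World" followed by "Second line". Splitting each line into words (for instance by looking at the gaps between chunks) is still left to you.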

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

This is the subclass:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

And finally, an implementation of the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

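I can't stress enough that the above does _not_ take "words" into account; that is up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.

(I had a great lunch, so I'm feeling a little more helpful.)

Here is an updated version of MyLocationTextExtractionStrategy that does what I describe in my comments below: it takes a string to search for and searches each chunk for that string. For all of the reasons listed above, this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk, it will also only return the first instance. Ligatures and diacritics could also interfere with this.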

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();


        //Get the bounding box for the matched text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You would use this the same way as before, but the constructor now has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
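As noted, the strategy above only reports the first match inside a chunk. The following is a minimal sketch of how every match position could be collected instead. `ChunkSearch` and `AllIndexesOf` are hypothetical names, not part of iTextSharp; the only library call used is the .NET `CompareInfo.IndexOf` overload that takes a start index. The sketch assumes non-overlapping matches and, like the code above, that the matched text is as long as the search string, which culture-sensitive comparisons do not always guarantee.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class ChunkSearch {
    // Returns the start index of every non-overlapping match of value in source
    public static List<int> AllIndexesOf(string source, string value,
                                         CompareOptions options = CompareOptions.None) {
        var result = new List<int>();
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(value)) {
            return result;
        }

        var compareInfo = CultureInfo.CurrentCulture.CompareInfo;
        var start = 0;
        while (start < source.Length) {
            //Search again, starting just past the previous match
            var index = compareInfo.IndexOf(source, value, start, options);
            if (index < 0) {
                break;
            }
            result.Add(index);
            start = index + value.Length;
        }
        return result;
    }
}
```

Inside RenderText, each returned index could take the place of startPosition, with the Skip/Take step and the rectangle construction repeated once per index. For example, searching "sample text, Sample file" for "sample" with CompareOptions.IgnoreCase gives the positions 0 and 13.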
