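pdftotext: An Efficient PDF Text Extraction Tool

pdftotext is an open-source command-line tool for extracting text from PDF files. It originated in the Xpdf tool suite developed by Glyph & Cog, LLC. The poppler-utils and poppler packages used in the installation steps ship Poppler's fork of the same tool, so a few options and defaults differ between the two builds.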

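Key Features

1. Plain-text extraction

Extracts only the text content of a PDF
Does not preserve formatting such as fonts and colors; layout is kept only approximately, and only with -layout
Suited to cases where only the plain text is needed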

2. Cross-platform support

Linux
macOS
Windows
Other Unix-like systems

3. Fast and lightweight

Written in C++, so it runs quickly
Small memory footprint
Can batch-process large numbers of PDF files

4. Free and open source

Licensed under the GNU GPL
Free to use
Source code is publicly available

Installation

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install poppler-utils

Linux (CentOS/RHEL)

sudo yum install poppler-utils

macOS (with Homebrew)

brew install poppler

Windows

Download precompiled binaries (for example, the command-line tool builds published by the Xpdf project)
Or use WSL (Windows Subsystem for Linux)
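
After installing, it helps to confirm which build ended up on the PATH, since the Xpdf and Poppler versions differ slightly. The following is a minimal sketch; check_pdftotext is a hypothetical helper name, and it assumes the version banner is written to stderr, which is why the two streams are merged.

```shell
# check_pdftotext: report whether pdftotext is on PATH and print its version banner.
# (Hypothetical helper name; not part of Xpdf or Poppler.)
check_pdftotext() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 1
    fi
    # -v prints version information; merge stderr into stdout so the banner is captured.
    pdftotext -v 2>&1 | head -n 1
}
```

Calling check_pdftotext should print a single line naming the installed version, or return a non-zero status if the tool is missing.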

Basic Usage

Simplest form

pdftotext input.pdf output.txt

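Reading from standard input

pdftotext - - < input.pdf > output.txt
# Poppler builds accept - as the input name but then require an explicit output name; the second - sends the text to stdout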

Writing to standard output

pdftotext input.pdf -

Common Options

`-enc` - set the output text encoding

pdftotext -enc UTF-8 input.pdf output.txt

`-layout` - keep the original physical layout as closely as possible

pdftotext -layout input.pdf output.txt

`-eol` - set the end-of-line convention (unix, dos, or mac)

pdftotext -eol unix input.pdf output.txt

`-nopgbrk` - do not insert page-break (form feed) characters between pages

pdftotext -nopgbrk input.pdf output.txt

`-fixed` - assume fixed-pitch (tabular) text with the given character width in points

pdftotext -fixed 5 input.pdf output.txt

Practical Examples

Batch-process multiple PDFs

for file in *.pdf; do
    pdftotext "$file" "${file%.pdf}.txt"
done

Extract a page range

pdftotext -f 1 -l 5 input.pdf output.txt
# Extracts pages 1 through 5

Process an encrypted PDF (-upw supplies the user password, -opw the owner password)

pdftotext -upw password input.pdf output.txt

Combine with other commands

pdftotext input.pdf - | grep "keyword"
# Search the extracted text for a keyword directly

Set the output encoding to UTF-8

pdftotext -enc UTF-8 input.pdf output.txt

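Comparison with Other Tools

pdftotext vs pdf2text

pdftotext: faster, with simpler output
pdf2text: more features, with more of the formatting preserved

pdftotext vs online tools

pdftotext: runs locally, so files never leave your machine, and supports batch processing
Online tools: nothing to install, but files must be uploaded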

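Limitations

1. Formatting is not preserved

Cannot extract images
Fonts, colors, sizes, and other formatting information are discarded

2. Layout may be scrambled

Complex layouts (multiple columns, tables) may come out in the wrong order
The text order may need manual correction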
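
Before correcting the text order by hand, it can be worth checking whether a different extraction mode already gives a usable order. The sketch below writes three variants next to the source file for comparison: the default mode, -layout, and -raw (content-stream order). dump_layout_variants and the output file names are hypothetical, and which mode reads best depends on how the PDF was produced.

```shell
# dump_layout_variants FILE.pdf: write default, -layout and -raw extractions
# side by side so they can be compared. (Hypothetical helper name.)
dump_layout_variants() {
    local file="$1"
    local base="${file%.pdf}"
    pdftotext         "$file" "$base.default.txt" &&
    pdftotext -layout "$file" "$base.layout.txt"  &&
    pdftotext -raw    "$file" "$base.raw.txt"
}
```

For example, dump_layout_variants report.pdf would produce report.default.txt, report.layout.txt, and report.raw.txt, which can then be compared with diff or read side by side.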

3. Scanned PDFs are not supported

Only "native" (embedded) text can be extracted
Scanned images must go through OCR first
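
A quick way to tell whether a file needs OCR is to check whether pdftotext produces any non-whitespace output at all. This is a rough heuristic sketched under that assumption; has_text_layer is a hypothetical helper name, and a file that fails to open is also reported as having no text.

```shell
# has_text_layer FILE: succeed if pdftotext extracts at least one
# non-whitespace character. (Hypothetical helper name.)
has_text_layer() {
    # Page breaks (form feeds) count as whitespace, so an image-only PDF
    # that yields only form feeds is reported as having no text.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

It can be used as a guard in scripts, for example: if has_text_layer scan.pdf; then pdftotext scan.pdf scan.txt; else echo "no text layer, run OCR first"; fi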

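4. Chinese text

Chinese is supported, but the encoding sometimes has to be set explicitly
Passing -enc UTF-8 is recommended (Poppler builds default to UTF-8, while Xpdf builds default to Latin1)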

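Use Cases

✅ Good fit:

Quickly extracting the text of a PDF
Text analysis and processing
Indexing and search
Scripted, automated processing

❌ Poor fit:

Cases where the typeset layout must be preserved
Extracting images
Scanned PDFs
Converting PDF to Word or HTML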

Worked Examples

Case 1: Extracting an academic paper

pdftotext -enc UTF-8 paper.pdf paper.txt

After extraction, text-analysis tools can be applied for keyword extraction, summarization, and similar tasks.

Case 2: Batch-processing contract documents

#!/bin/bash
mkdir -p output
for pdf in *.pdf; do
    txt="output/${pdf%.pdf}.txt"
    pdftotext -enc UTF-8 "$pdf" "$txt"
    echo "Processed: $pdf -> $txt"
done
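
For larger batches it is useful to know which files failed rather than only which succeeded. The variant below is a sketch that relies on pdftotext returning a non-zero exit status on error (the man page lists 1 for a PDF that cannot be opened, 2 for an output file error, 3 for a permissions error, and 99 for other errors). convert_dir and failed.log are hypothetical names chosen for illustration.

```shell
# convert_dir SRC_DIR OUT_DIR: convert every *.pdf in SRC_DIR into OUT_DIR,
# recording failures in OUT_DIR/failed.log. (Hypothetical helper name.)
convert_dir() {
    local src="$1" out="$2"
    local pdf txt status failed=0
    mkdir -p "$out"
    for pdf in "$src"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        txt="$out/$(basename "${pdf%.pdf}").txt"
        if pdftotext -enc UTF-8 "$pdf" "$txt"; then
            echo "Processed: $pdf -> $txt"
        else
            status=$?
            echo "FAILED (exit $status): $pdf" >> "$out/failed.log"
            failed=$((failed + 1))
        fi
    done
    echo "$failed file(s) failed"
    # Return success only if every file converted.
    [ "$failed" -eq 0 ]
}
```

Running convert_dir . output mirrors the script above, with the difference that one damaged or password-protected file no longer goes unnoticed.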

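Case 3: Integrating with a text-processing pipeline

# Extract the PDF text
pdftotext input.pdf - |
  # Convert to lowercase
  tr '[:upper:]' '[:lower:]' |
  # Strip punctuation
  tr -d '[:punct:]' |
  # Split on whitespace into one word per line (suits space-delimited languages)
  tr -s '[:space:]' '\n' |
  # Count occurrences and sort by frequency, highest first
  sort | uniq -c | sort -rn > wordfreq.txt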

Performance Tips

1. Run batch jobs in parallel

# GNU parallel runs several pdftotext processes at once (separate processes, not threads)
find . -name "*.pdf" -print0 | parallel -0 -j 4 pdftotext {} {.}.txt
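
Where GNU parallel is not installed, xargs -P offers similar behavior and is available in both the GNU and BSD versions of xargs. The following is a sketch under that assumption; convert_tree_parallel is a hypothetical helper name, and the default of 4 jobs is only an example value.

```shell
# convert_tree_parallel DIR [JOBS]: convert every PDF under DIR,
# running up to JOBS pdftotext processes at once. (Hypothetical helper name.)
convert_tree_parallel() {
    local dir="${1:-.}" jobs="${2:-4}"
    # -print0 / -0 keep file names with spaces or newlines intact.
    find "$dir" -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c '
            # Some xargs versions run the command once on empty input; skip that case.
            if [ -n "$1" ]; then pdftotext "$1" "${1%.pdf}.txt"; fi
        ' _
}
```

Each output file is written next to its source PDF, as in the parallel example. Setting JOBS close to the number of CPU cores is a reasonable starting point.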

2. Extract only the pages you need

# If only the first few pages are needed, use -f and -l
pdftotext -f 1 -l 10 input.pdf output.txt

3. Narrow down large PDFs first

For very large PDF files:

Work out which page range is needed first
Use -f and -l to extract only that part
Avoid processing the whole file
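
The steps above can be sketched with pdfinfo, a companion tool that ships alongside pdftotext in both Xpdf and Poppler and prints a "Pages:" line. The example extracts only the last N pages of a file; page_count and extract_last_pages are hypothetical helper names.

```shell
# page_count FILE: print the page count reported by pdfinfo.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2; exit}'
}

# extract_last_pages FILE N OUT: extract only the final N pages of FILE.
# (Both helper names are hypothetical.)
extract_last_pages() {
    local file="$1" n="$2" out="$3"
    local total first
    total=$(page_count "$file")
    if [ -z "$total" ]; then
        echo "could not read page count: $file" >&2
        return 1
    fi
    # Clamp to page 1 when the file has fewer than N pages.
    first=$(( total > n ? total - n + 1 : 1 ))
    pdftotext -f "$first" -l "$total" "$file" "$out"
}
```

For example, extract_last_pages big.pdf 20 tail.txt reads the page count once and then converts only the final 20 pages.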

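Summary

pdftotext is an efficient, fast, and free PDF text extraction tool. Its feature set is small, but it performs well wherever plain text is all that is needed.

For developers, knowing pdftotext makes it easy to add PDF text processing to automation scripts.

If your needs are simple (text content only), pdftotext is a very good choice. If you need more (preserved formatting, table extraction, and so on), consider a more capable tool or library.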

龙鳞