图片型PDF转html的步骤

1、使用WPS或百度会员将PDF文件OCR成docx文件。

2、使用google doc打开保存在drive中的docx文件，另存为html压缩文件。

3、打开该html文件，为方便处理，可以改为txt后缀，使用程序去除head部分及body部分的空格，再保存为html文件。注意有时文字乱码就需要将txt另存为utf8格式。

4、在浏览器中打开该html文件，同时wordpress打开编辑新文件，复制粘贴，再做少量修改即成。

5、在本网xycost.com每个页面打开后点阅读模式可以得到纯净版的html。去除了无必要的很多标签，体积可以降为原PDF的10%以下且为文字版。

去除空格的代码：

   private void button31_Click(object sender, EventArgs e)
    {
        OpenFileDialog openFileDialog = new OpenFileDialog
        {
            Filter = "Text and HTML Files|*.txt;*.html",
            Title = "请选择一个文本或HTML文件"
        };

        if (openFileDialog.ShowDialog() == DialogResult.OK)
        {
            string filePath = openFileDialog.FileName;
            string fileContent;

            try
            {
                // 读取文件内容
                fileContent = File.ReadAllText(filePath);
                string contentWithoutHead = RemoveHeadSection(fileContent); // 去除 <head> 部分
                string contentWithoutSpan = RemoveSpanTags(RemoveOtags(contentWithoutHead)); // 去除所有 <span> 标签
                string contentCleaned = RemoveEmptyHtmlTags(contentWithoutSpan); // 移除空白 <span> 标签及多余空白字符

                // 处理文件内容：去除汉字之间的空格
                string processedContent = RemoveSpacesBetweenChineseCharacters(contentWithoutSpan);

                SaveFileDialog saveFileDialog = new SaveFileDialog
                {
                    Filter = "Text Files|*.txt|HTML Files|*.html",
                    Title = "保存处理后的文件",
                    FileName = Path.GetFileNameWithoutExtension(filePath) + "_processed" + Path.GetExtension(filePath)
                };

                if (saveFileDialog.ShowDialog() == DialogResult.OK)
                {
                    // 保存处理后的内容
                    File.WriteAllText(saveFileDialog.FileName, processedContent);
                    MessageBox.Show("文件已成功保存！", "保存成功", MessageBoxButtons.OK, MessageBoxIcon.Information);
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show($"发生错误：{ex.Message}", "错误", MessageBoxButtons.OK, MessageBoxIcon.Error);
            }
        }
    }
    static string RemoveSpacesBetweenChineseCharacters(string input)
    {
        // 匹配汉字之间的空格
        string pattern = @"(?<=[\u4e00-\u9fa5])\s+(?=[\u4e00-\u9fa5])";
        return Regex.Replace(input, pattern, "");
    }
    // 去除 <span> 标签，但保留内容，包括带属性的复杂 <span> 标签
    static string RemoveSpanTags(string input)
    {
        string pattern = @"<span[^>]*>(.*?)<\/span>";
        return Regex.Replace(input, pattern, "$1", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
    // 去除其他类似空白字符的情况
    static string RemoveEmptyHtmlTags(string input)
    {
        // 匹配 <span> 标签
        // 保留包含 &nbsp; 的 <span> 标签
        // 移除其他 <span> 标签，仅保留内容
        string spanPattern = @"<span[^>]*>(.*?)<\/span>";

        return Regex.Replace(input, spanPattern, match =>
        {
            string content = match.Groups[1].Value; // 获取标签内容
            return content.Contains("&nbsp;") ? match.Value : content; // 如果包含 &nbsp;，保留标签；否则仅保留内容
        }, RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
    static string RemoveOtags(string input)
    {
        // 匹配 <o:p> 标签并移除
        string oTagPattern = @"<o:p[^>]*>.*?<\/o:p>";
        return Regex.Replace(input, oTagPattern, "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
    // 去除 <head> 标签及其内容
    static string RemoveHeadSection(string input)
    {
        string pattern = @"<head[\s\S]*?>[\s\S]*?</head>";
        return Regex.Replace(input, pattern, "", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

图片型PDF转html的步骤

评论0

在线客服

升级VIP

返回顶部

图片型PDF转html的步骤

猜你喜欢

IIS 环境下使用 Redis 缓存的部署方式，本站已部署WPRocket+Redis，速度比原来上了一个台阶

网站数据库从access升级到sql express

如何使用 Elementor 在 WordPress 中显示阅读进度条

如何使用 Elementor 在 WordPress 中创建时间线

一次服务器证书ERR_SSL_KEY_USAGE_INCOMPATIBLE问题的解决

如何使用 Elementor 在 WordPress 中创建功能框

评论0

在线客服

升级VIP

返回顶部

社交账号快速登录

社交账号快速登录