Forem: jelizaveta

Merge Word Documents in C# (No More Manual Copy-Paste)

jelizaveta — Sat, 09 May 2026 01:38:47 +0000

Many projects need to combine multiple Word documents into a single finished file, for example assembling chapters into one report or merging submissions from several sources for delivery. Whether the formatting survives the merge, whether pagination matches expectations, and whether the code stays clean often determine how smoothly the solution can be adopted.

Below are two common merging strategies, both implemented using Spire.Doc for .NET:

Method 1: Append the entire second document to the end of the first (usually creates a "start on a new page" style result)
Method 2: Iterate through the second document's Sections, clone their content objects, and append them to the last Section of the first (ideal for more fine-grained structure control)

Spire.Doc: Purpose & Installation

Spire.Doc is a .NET component library for reading, editing, and generating Word documents (DOC/DOCX). It provides a structure-oriented API—for example, loading documents, inserting content, traversing Section/Body/object collections, copying document elements, and saving as DOCX. Compared with manually parsing the DOCX zip package and XML, using Spire.Doc can significantly reduce development effort.

Installation (recommended via NuGet):

Open your project in Visual Studio
Right-click the project → Manage NuGet Packages
Search for and install Spire.Doc for .NET
Add the namespace reference in your code: using Spire.Doc;

After installation, you can load and manipulate Word files directly through the Document class.

Method 1: Append Entire Document (New Page Effect)

When your goal is to treat Doc2 as a whole and attach it to Doc1 afterward, and you want the merged result to feel close to Word's intuitive "Insert Document / Append Content" experience, you can use InsertTextFromFile.

Approach:

Create a Document object to hold the merged document
Load the main document (Doc1)
Insert another Word document entirely into the main document
Save the result as a new merged file

Sample code:

using Spire.Doc;

namespace MergeWord
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Document instance
            Document document = new Document();

            // Load the original Word document
            document.LoadFromFile("Doc1.docx", FileFormat.Docx);

            // Insert another Word document entirely to the original document
            document.InsertTextFromFile("Doc2.docx", FileFormat.Docx);

            // Save the result document
            document.SaveToFile("MergedWord.docx", FileFormat.Docx);
        }
    }
}

When to Use Method 1

Less code, faster development
Best for "overall append" merges by document boundaries
Less control at the Section level, but usually sufficient for common consolidation needs

Method 2: Append to Last Section (Precise Control)

If the merge requires more precision—e.g., keeping content continuous within the same Section as much as possible, or appending the second document's objects one by one to the end of the first—you can use the second approach: iterate through doc2.Sections, clone each section's Body.ChildObjects, and add them to doc1.LastSection.Body.ChildObjects.

Approach:

Load doc1 and doc2
Iterate through doc2.Sections
For each section, iterate through section.Body.ChildObjects
Add cloned objects to doc1.LastSection.Body.ChildObjects
Save doc1 as the merged result

Sample code:

using Spire.Doc;

namespace MergeWord
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load two Word documents
            Document doc1 = new Document("Doc1.docx");
            Document doc2 = new Document("Doc2.docx");

            // Loop through the second document to get all the sections
            foreach (Section section in doc2.Sections)
            {
                // Loop through child objects in the section body
                foreach (DocumentObject obj in section.Body.ChildObjects)
                {
                    // Get the last section of the first document
                    Section lastSection = doc1.LastSection;

                    // Add cloned objects to the last section of the first document
                    lastSection.Body.ChildObjects.Add(obj.Clone());
                }
            }

            // Save the result document
            doc1.SaveToFile("MergeDocuments.docx", FileFormat.Docx);
        }
    }
}

When to Use Method 2

Closest to "concatenate content objects directly into the last section"
Uses Clone() to copy objects safely and avoid direct-reference issues
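Gives finer control in complex cases (multi-section documents, custom pagination rules). Note that only body content is cloned, so doc2's own headers/footers and page setup are not carried over; final results still depend on your templates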

Method 1 vs. Method 2: How to Choose

If you want a quick merge and an intuitive "append starting on a new page" effect: Method 1 is easier.
If you need stronger structure control and want to append doc2 content into the last Section of doc1: Method 2 is a better match.
If your templates include headers/footers, different page orientations (landscape/portrait), or complex section configurations: It's recommended to test with Method 2 first.
For ordinary text and table merges: Method 1 is typically faster and more reliable.

Downloading a Word Document from a URL Using C#

jelizaveta — Thu, 07 May 2026 02:49:00 +0000

When developing desktop or server-side applications, it’s often necessary to fetch a Word document from a network address and then process it or save it. This article explains how to use Free Spire.Doc for .NET with C# to implement the complete workflow of downloading a Word document from a specified URL and saving it locally.

Prerequisites

First, you need to add the Free Spire.Doc component to your project. You can search for "FreeSpire.Doc" in the NuGet package manager and install it, or download the DLL from the official website and add a reference manually.

In addition, you must import the required namespaces at the top of the code file: Spire.Doc, System.IO, and System.Net.

Implementation

The core idea is to use the WebClient class to download the remote document as a binary data stream, then load that memory stream into a Spire.Doc Document object, and finally save it as a local Word file.

Example code:

using Spire.Doc;
using System.IO;
using System.Net;

namespace DownloadfromURL
{
    class Program
    {
        static void Main(string[] args)
        {
            Document doc = new Document();
            WebClient webClient = new WebClient();

            using (MemoryStream ms = new MemoryStream(webClient.DownloadData("http://www.example.com/sample.docx")))
            {
                doc.LoadFromStream(ms, FileFormat.Docx);
            }

            doc.SaveToFile("result.docx", FileFormat.Docx);
        }
    }
}

Key Steps Explained

Create a Document object

The Document class is the core class of Spire.Doc and represents a Word document instance.

Download data using WebClient

WebClient.DownloadData retrieves the remote resource from the specified URL and returns it as a byte[] binary array.

Wrap bytes in a memory stream and load the document

Use MemoryStream to wrap the byte array into a readable stream, then load it into the Document object using LoadFromStream, specifying the file format as Docx. The using statement ensures the memory stream is disposed properly after use.

Save to a local file

Call SaveToFile to write the document content to the local file system, again selecting Docx as the format.

Notes

Network exception handling:

In production, it’s recommended to add a try-catch block around DownloadData to handle possible WebException (e.g., network interruptions, invalid URL, etc.).

File format recognition:

LoadFromStream requires you to explicitly specify the file format. In this example, the URL points to a .docx file. If the remote file is an older .doc format, you should use FileFormat.Doc.

Memory and performance:

For large Word files, using MemoryStream directly can consume a lot of memory. Consider downloading to a temporary file first and then loading it.

HTTPS support:

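WebClient supports HTTPS out of the box. If the connection fails during the TLS handshake (for example, because the server requires TLS 1.2 and an older .NET Framework defaults to an earlier protocol), you can set ServicePointManager.SecurityProtocol. Certificate validation errors are a separate problem and are better solved by fixing the certificate or its trust chain.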

Extended Usage

This method is not limited to saving files. After loading the document, you can also edit its content, convert it to other formats (such as PDF or HTML), or extract text. Spire.Doc provides a rich API for handling elements like paragraphs, tables, and images in Word documents, so you can further expand functionality based on your needs.

Summary

By combining Free Spire.Doc for .NET with C# WebClient, you can elegantly download and save a Word document from a URL using only a small amount of code. This approach is stable and concise, making it suitable for scenarios such as data collection and document automation.

Automate PDF Difference Checks with Python (No More Manual Proofreading)

jelizaveta — Thu, 30 Apr 2026 01:47:36 +0000

In scenarios such as document version control, contract review, and report proofreading, accurately identifying differences between two PDF files is a common need. Traditional manual page-by-page comparison is inefficient and prone to missing changes. This article explains how to use the Spire.PDF for Python library to automate PDF document difference comparison through programming.

Install the Required Library

First, install the Spire.PDF library via pip:

pip install Spire.PDF

This library provides full PDF processing capabilities. The PdfComparer class is specifically designed for document comparison. Note that this is a commercial product, but it offers a free version with basic functionality so developers can evaluate it.

Full Document Comparison

When you need to compare all contents of two PDF documents, you can use the following approach:

from spire.pdf.common import *
from spire.pdf import *

# Load the first document
doc_one = PdfDocument("PDF_ONE.pdf")       

# Load the second document
doc_two = PdfDocument("PDF_TWO.pdf")  

# Create a PdfComparer object, using doc_two as the base document and doc_one as the target document
comparer = PdfComparer(doc_two, doc_one)

# Run the comparison and save the results to a new PDF file
comparer.Compare("ComparisonResults.pdf") 

# Release document resources
doc_one.Dispose()
doc_two.Dispose()

After running the code above, the program will generate a difference report named ComparisonResults.pdf. In the report, differences between documents are highlighted with different colors, making it easy for users to quickly find the changed sections.

Parameter Explanation: In the PdfComparer constructor, the first parameter is the base version, and the second parameter is the version to be compared. The output difference report is annotated with the base version as the reference.

Compare Specific Pages

In real-world applications, users may only care about certain pages of the documents. The following code demonstrates how to limit the comparison to a specified page range:

from spire.pdf.common import *
from spire.pdf import *

# Load two PDF documents
doc_one = PdfDocument("PDF_ONE.pdf")       
doc_two = PdfDocument("PDF_TWO.pdf")  

# Create a PdfComparer instance
comparer = PdfComparer(doc_two, doc_one)

# Set page ranges: compare pages 1 to 3 of the first document with pages 1 to 3 of the second document
comparer.PdfCompareOptions.SetPageRanges(1, 3, 1, 3)

# Execute the comparison for the specified page range
comparer.Compare("ComparePageRanges.pdf") 

# Release resources
doc_one.Dispose()
doc_two.Dispose()

SetPageRanges(start1, end1, start2, end2) uses the first two parameters to specify the starting and ending page numbers of the base document, and the last two parameters to specify the starting and ending page numbers of the document to compare. This method supports cases where the page ranges on both sides are not identical; the system will strictly compare pages according to the ranges you set, page by page.
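
Before calling SetPageRanges, it can help to confirm that each range actually fits its document. The helper below is a minimal sketch and not part of Spire.PDF: the name check_page_range is hypothetical, and it assumes page numbers are 1-based, as the "pages 1 to 3" example above suggests.

```python
def check_page_range(start: int, end: int, page_count: int) -> None:
    """Raise ValueError if the 1-based range [start, end] does not fit the document."""
    if page_count < 1:
        raise ValueError("document has no pages")
    # The range must start at page 1 or later, run forward, and stay inside the document
    if start < 1 or end < start or end > page_count:
        raise ValueError(
            f"invalid page range {start}-{end} for a {page_count}-page document"
        )
```

For instance, check_page_range(1, 3, doc_two.Pages.Count) could run just before comparer.PdfCompareOptions.SetPageRanges(1, 3, 1, 3), so that a bad range fails early with a readable message.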

Interpreting the Difference Report

The generated comparison results PDF follows these marking conventions:

Yellow highlight: indicates newly added content
Red highlight: indicates deleted content

By using a side-by-side viewing mode, users can clearly identify the exact differences between the two versions.

Typical Use Cases

Legal contract review: quickly identify revisions to contract clauses
Academic paper proofreading: locate text changes between different versions
Technical document version management: track changes in product manual updates
Financial statement reconciliation: verify numerical changes in data reports

Notes

The free version has a page limit (typically the first 10 pages). Full functionality requires a commercial license.
This comparison feature works for text-based PDF documents. For PDFs stored as images (scanned documents), the comparison results may be limited.
After completing the comparison, be sure to call Dispose() to release document objects and free system resources to prevent memory leaks.
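
One way to make sure Dispose() always runs, even when the comparison raises an exception, is a small context manager. This is a sketch using only the standard library; disposing is a hypothetical helper name, and it only assumes that each object passed in has a Dispose() method.

```python
from contextlib import contextmanager


@contextmanager
def disposing(*resources):
    """Yield the given objects, then call Dispose() on each one afterwards."""
    try:
        yield resources
    finally:
        # Release in reverse order; this also runs if the with-body raised
        for resource in reversed(resources):
            resource.Dispose()
```

With this in place, the two documents could be opened as "with disposing(PdfDocument("PDF_ONE.pdf"), PdfDocument("PDF_TWO.pdf")) as (doc_one, doc_two):" and the comparer created inside the block, so the explicit Dispose() calls at the end are no longer needed.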

Summary

Spire.PDF for Python provides a simple yet powerful way to compare PDF documents. With just a small amount of code, developers can automate difference analysis. Whether comparing an entire document or only specific pages, this library can effectively improve the efficiency and accuracy of document review workflows.

Read PDFs in Python: Extract Text and Images

jelizaveta — Tue, 28 Apr 2026 07:12:15 +0000

In daily work and study, we often need to batch-extract text or images from PDF files. For example, organizing clauses from a scanned contract, or collecting all the images from a product manual.

Dealing with PDFs used to be a headache, but with the right library it becomes much simpler. Today, we’ll introduce Spire.PDF for Python, a library that can extract text and images from PDFs with just a few lines of code.

Before you start, make sure you have installed the Spire.PDF library:

pip install Spire.PDF

1. Load the PDF Document

Before doing anything else, we need to load the PDF file into our code. Spire.PDF is very flexible and supports loading from a file path as well as loading from a data stream (Stream).

Method 1: Load from a file

This is the most direct approach for fixed files on your local disk.

from spire.pdf import PdfDocument

# Create a PdfDocument instance
pdf = PdfDocument()
# Load a local PDF document
pdf.LoadFromFile("sample.pdf")

Method 2: Load from a data stream

If your PDF data is received from a network interface or generated in memory as byte data, this method is very useful.

from spire.pdf import PdfDocument, Stream

# Read the file as a byte array (demo: read from file; it can also come from a network)
with open("sample.pdf", "rb") as f:
    byte_data = f.read()

# Create a stream object
pdfStream = Stream(byte_data)
# Load the PDF from the stream
pdf = PdfDocument(pdfStream)
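
If the bytes really do come from a network address, the standard library is enough to fetch them before handing them to Stream. The following is a minimal sketch: fetch_bytes is a hypothetical helper name, the 30-second timeout is an arbitrary assumption, and production code would add error handling.

```python
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a resource and return its raw bytes."""
    # The response object is a context manager, so the connection is closed for us
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

The result can replace byte_data in the example above, for instance byte_data = fetch_bytes("https://www.example.com/sample.pdf"), where the URL is only a placeholder.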

2. Extract Text

Text extraction is one of the most common tasks when processing documents. The following code demonstrates how to iterate through all pages in a PDF and concatenate the text from each page.

It mainly uses two helper classes: PdfTextExtractor and PdfTextExtractOptions. Setting IsExtractAllText = True helps ensure that most visible text on the page is extracted.

# Assume the pdf object has already been loaded using one of the methods above
from spire.pdf import PdfTextExtractor, PdfTextExtractOptions

all_text = ""

# Loop through each page
for pageIndex in range(pdf.Pages.Count):
    # Get the current page by index
    page = pdf.Pages.get_Item(pageIndex)

    # Create a text extractor
    text_extractor = PdfTextExtractor(page)

    # Configure extraction options
    options = PdfTextExtractOptions()
    options.IsExtractAllText = True
    options.IsSimpleExtraction = True

    # Extract and accumulate
    all_text += text_extractor.ExtractText(options)

# Print the result
print(all_text)

3. Extract Images

In many cases, key information in a PDF is actually hidden in illustrations or charts. Spire.PDF also provides a very convenient image extraction solution.

Using the PdfImageHelper helper class, we can directly get image information from a page, and then save each image as an image file (such as .png).

from spire.pdf import PdfImageHelper

# Get the first page (index is 0)
page = pdf.Pages.get_Item(0)

# Create an image helper object
image_helper = PdfImageHelper()
# Get all image information on the page
images_info = image_helper.GetImagesInfo(page)

# Loop through and save each image
for i in range(len(images_info)):
    # Save as PNG format
    images_info[i].Image.Save(f"output/Images/image_{i}.png")

print(f"Successfully extracted {len(images_info)} images")

Note: If it’s a scanned (image-based) PDF, what you extract is essentially the full-page scan. If it’s an electronically generated PDF, embedded icons and photos can be extracted individually.

4. Advanced Tips

Although the code above covers the basics, there are a few things worth paying attention to in real applications:

Page handling: The example extracts all text for demonstration purposes. If you want to process page by page, just control pageIndex in the loop.
Chinese support: The library supports Chinese well. When extracting Chinese PDFs, just ensure your encoding environment is UTF-8.
Free edition limitations: If you are using the free version of Spire.PDF, note that it usually has a limit on the number of pages it can process (for example, only the first 10 pages). If you need to handle many pages, you may need to evaluate the commercial version.
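
To illustrate the page-handling tip, the sketch below turns a human-friendly selection such as "1-3,5" into the zero-based indexes that pdf.Pages.get_Item expects. It uses only the standard library; parse_page_selection and its selection syntax are hypothetical and not part of Spire.PDF.

```python
def parse_page_selection(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based selection such as "1-3,5" into sorted zero-based indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # "2-4" is a range; a bare "5" is treated as the single page 5
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        indexes.update(range(start - 1, end))
    return sorted(indexes)
```

The extraction loop could then read "for pageIndex in parse_page_selection("1-3,5", pdf.Pages.Count):" instead of iterating over every page.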

Summary

With Spire.PDF for Python, you’ll find that processing PDF files is surprisingly easy. Whether it’s reading a file, analyzing text page by page, or saving illustrations, you can get it done with just a handful of lines of code. This improves document processing efficiency and lets you focus on the next steps, such as data analysis or business logic.

Try it now and let code free your hands!

Python Tutorial: Extracting Images and Text from PPT

jelizaveta — Thu, 23 Apr 2026 02:13:33 +0000

When we need to grab materials like images and text from a PowerPoint presentation, doing it manually—copying and pasting one by one—is not only time-consuming but also easy to miss things or make mistakes. Today, I'll share a simple way to batch extract images and text from PPT using Python.

Preparation

First, you need to install Spire.Presentation for Python. You can install it via the pip command:

pip install Spire.Presentation

Once the installation is complete, you can start writing the code.

Extracting Images from PPT

Often, the images in a PPT are the materials we need. The following code demonstrates how to batch extract all images from a PPT and save them locally:

from spire.presentation.common import *
from spire.presentation import *

# Create a Presentation instance
ppt = Presentation()

# Load the PowerPoint document
ppt.LoadFromFile("sample.pptx")

# Iterate through all images in the document
for i, image in enumerate(ppt.Images):
    # Extract and save the image
    ImageName = "ExtractImage/Images_" + str(i) + ".png"
    image.Image.Save(ImageName)

ppt.Dispose()

How it works:

Presentation(): Creates a PPT document object
LoadFromFile(): Loads the PPT file to be processed
ppt.Images: Gets the collection of all images in the document
image.Image.Save(): Saves the image in PNG format

After running, all images will be saved sequentially to the ExtractImage folder, named Images_0.png, Images_1.png, and so on.

Extracting Text from PPT

Besides images, extracting text content is also a common requirement. The following code iterates through each slide and extracts text from all shapes:

from spire.presentation import *
from spire.presentation.common import *

# Create a Presentation object
pres = Presentation()

# Load the PowerPoint presentation
pres.LoadFromFile("Sample.pptx")

text = []
# Iterate through each slide
for slide in pres.Slides:
    # Iterate through each shape
    for shape in slide.Shapes:
        # Check if the shape is of IAutoShape type (can contain text)
        if isinstance(shape, IAutoShape):
            # Extract text from the shape
            for paragraph in shape.TextFrame.Paragraphs:
                text.append(paragraph.Text)

# Write the extracted text to a file
with open("output/SlideText.txt", "w", encoding='utf-8') as f:
    for s in text:
        f.write(s + "\n")

pres.Dispose()

How it works:

pres.Slides: Gets the collection of all slides
slide.Shapes: Gets all shapes in each slide
IAutoShape: Represents the auto-shape type that can contain text
shape.TextFrame.Paragraphs: Gets the collection of paragraphs in the shape
Finally, all text is written to the SlideText.txt file, with one paragraph per line

Important Notes

Resource Release: After using the Presentation object, be sure to call the Dispose() method to release resources and avoid memory leaks.
File Paths: Ensure the PPT file path is correct. The directories for saving images and text need to be created in advance or created automatically using code.
Text Encoding: Use utf-8 encoding when writing to text files to properly handle non-English characters such as Chinese.
Image Format: The Save() method saves images in PNG format by default. Refer to the official documentation if you need other formats.
Shape Types: The text extraction only handles the IAutoShape type. If text is located in other shape types like tables or charts, additional processing is required.
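
The File Paths note can be handled in code rather than by hand. The helper below is a small standard-library sketch; prepare_output_path is a hypothetical name, not part of Spire.Presentation.

```python
from pathlib import Path


def prepare_output_path(path: str) -> str:
    """Create the parent folder of `path` if it is missing and return `path` unchanged."""
    # exist_ok=True makes repeated calls harmless
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    return path
```

It can wrap any output path in the examples above, for instance image.Image.Save(prepare_output_path(ImageName)) or open(prepare_output_path("output/SlideText.txt"), "w", encoding="utf-8").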

Summary

With Spire.Presentation for Python, you can batch extract images and text from PPT with just a dozen lines of code. This library is powerful and easy to use, making it ideal for office automation scenarios. I hope this article helps you improve your work efficiency!

If you have more requirements for PPT automation processing, such as creating PPTs, modifying content, adding charts, etc., Spire.Presentation offers many more rich features waiting for you to explore.

Convert Markdown to HTML Using Python (3 Methods)

jelizaveta — Tue, 21 Apr 2026 06:18:02 +0000

In day-to-day technical writing and document management, Markdown—thanks to its concise syntax—has become the preferred choice for many developers. However, when we need to publish content to the web, HTML is still the irreplaceable presentation format. This article introduces three methods to convert Markdown to HTML using Python, each suited to different use cases.

Method 1: Use markdown2 (a lightweight open-source solution)

If you prefer an open-source approach, markdown2 is an excellent choice. It claims to be a “fast and complete Python Markdown implementation,” with support for many extension features.

First, install it via pip:

pip install markdown2

Then use the following code to perform the conversion:

import markdown2

# Read the Markdown file
with open("example.md", "r", encoding="utf-8") as f:
    md_content = f.read()

# Convert to HTML
html_content = markdown2.markdown(md_content)

# Save the result
with open("example.html", "w", encoding="utf-8") as f:
    f.write(html_content)

markdown2 supports a wide range of extended syntax, such as fenced code blocks, tables, footnotes, table-of-contents generation, and more. You can enable these via the extras parameter:

html = markdown2.markdown(md_content, extras=["fenced-code-blocks", "tables", "toc"])

Pros: Open-source and free, easy to install, rich extensions, excellent performance.

Cons: Functionality is relatively basic, and it has limited ability to preserve formatting in complex documents.

Method 2: Use Python-Markdown, the markdown package (the most versatile option)

The most commonly used Markdown conversion library in the Python community is the markdown module. It is also open-source and easy to use.

Installation:

pip install markdown

Example:

import markdown

with open("example.md", "r", encoding="utf-8") as f:
    md_content = f.read()

# Support extension features
html = markdown.markdown(md_content, extensions=['extra', 'codehilite', 'tables'])

with open("example.html", "w", encoding="utf-8") as f:
    f.write(html)

The markdown module also supports many extensions. The extra extension bundles commonly used features such as tables, fenced code blocks, footnotes, and definition lists, so listing tables separately is optional.

Pros: Most active community, well-documented, and a rich extension ecosystem.

Cons: Performance is slightly lower than markdown2.
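
Both markdown2.markdown and markdown.markdown return an HTML fragment rather than a complete page, so the saved file has no doctype, head, or charset declaration. If a standalone page is needed, the fragment can be wrapped. This is a minimal sketch using only the standard library; wrap_html and its default title are hypothetical.

```python
from html import escape


def wrap_html(fragment: str, title: str = "Document") -> str:
    """Wrap a converted HTML fragment in a minimal standalone HTML5 page."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        # The title is plain text, so escape characters such as & and <
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )
```

In the examples above, f.write(html) would become f.write(wrap_html(html, title="Example")).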

Method 3: Use Spire.Doc for Python (an enterprise-grade solution)

Spire.Doc for Python is a powerful document processing library. It supports converting Markdown files directly to HTML while perfectly preserving the original format and structure.

Installation:

pip install spire.doc

Example:

from spire.doc import *

# Create a Document object
doc = Document()

# Load the Markdown file
doc.LoadFromFile("example.md", FileFormat.Markdown)

# Save as an HTML file
doc.SaveToFile("example.html", FileFormat.Html)

# Close the document to release resources
doc.Close()

This method is especially suitable for scenarios that require batch processing or higher conversion-quality requirements. You can also easily extend it into a batch conversion script—iterate through all .md files in a folder and automatically generate the corresponding HTML files.
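
The batch idea can be sketched with the standard library alone. In the sketch below, convert_folder and its parameters are hypothetical names; the convert argument is any function that maps Markdown text to HTML text, such as markdown.markdown or markdown2.markdown. Spire.Doc works on file paths rather than strings, so with it the loop body would call LoadFromFile and SaveToFile instead.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src_dir: str, out_dir: str, convert: Callable[[str], str]) -> list[Path]:
    """Convert every .md file in src_dir and write matching .html files into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order stable between runs
    for md_path in sorted(Path(src_dir).glob("*.md")):
        html_text = convert(md_path.read_text(encoding="utf-8"))
        html_path = out / (md_path.stem + ".html")
        html_path.write_text(html_text, encoding="utf-8")
        written.append(html_path)
    return written
```

A call such as convert_folder("docs", "site", markdown.markdown) would then produce one HTML file per Markdown file; the folder names here are placeholders.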

Pros: Complete format preservation, supports image embedding, simple and easy-to-use APIs, supports batch processing.

Cons: Requires installing a commercial library (a free version is provided, but with limitations such as watermarks).

Comparison and recommendations

Method	Open-source	Format Preservation	Performance	Suitable for
markdown2	Yes	Good	Excellent	Personal projects, quick conversion
markdown	Yes	Good	Medium	General use cases, community support
Spire.Doc	No	Excellent	Good	Enterprise applications, batch processing

Recommendations:

Prefer open-source and need high performance → choose markdown2
Need the widest community support and extension ecosystem → choose markdown
Prioritize conversion quality and perfect formatting → choose Spire.Doc

No matter which method you choose, you can set up a Markdown-to-HTML conversion workflow in just a few minutes, creating a seamless connection between content creation and web publishing.

How to Download a PDF from a URL in C#

jelizaveta — Fri, 17 Apr 2026 06:15:22 +0000

In everyday development, we often need to retrieve resources from the internet, especially PDF documents. Whether it is automatically backing up online reports, batch-downloading electronic invoices, or fetching dynamically generated contract files, efficiently and reliably saving remote PDFs locally is a very practical skill.

This article explains how to use the Spire.PDF for .NET library with C# to download a PDF document from a specified URL and save it locally. Spire.PDF provides a rich set of PDF processing features beyond just downloading and saving files.

Prerequisites

First, you need to install Spire.PDF for .NET in your project. You can do this via the NuGet Package Manager Console:

Install-Package Spire.PDF

Or via the .NET CLI:

dotnet add package Spire.PDF

This library supports .NET Framework 4.0 and above, .NET Core 3.1, .NET 5.0, and later versions.

Implementation Code

Below is the complete code example:

using System.IO;
using System.Net;
using Spire.Pdf;

namespace DownloadPdfFromUrl
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a PdfDocument object
            PdfDocument doc = new PdfDocument();

            // Create a WebClient object for downloading web resources
            WebClient webClient = new WebClient();

            // Download PDF data from the URL into a memory stream
            using (MemoryStream ms = new MemoryStream(
                webClient.DownloadData("http://www.example.com/sample.pdf")))
            {
                // Load PDF data from the stream into the PdfDocument object
                doc.LoadFromStream(ms);
            }

            // Save the PDF document to a local file
            doc.SaveToFile("result.pdf", FileFormat.PDF);

            // Release resources
            webClient.Dispose();
            doc.Close();
        }
    }
}

Code Explanation

1. Creating a PdfDocument Object

PdfDocument is the core class of Spire.PDF, representing a PDF document instance. It is used to hold and manipulate the PDF data downloaded from the internet.

2. Using WebClient to Download Data

WebClient is a simple HTTP download class in .NET. The DownloadData method returns a byte[], which represents the raw binary content of the PDF file.

3. Using MemoryStream as a Bridge

Wrapping the byte array into a MemoryStream allows us to use the doc.LoadFromStream(ms) method. This avoids the inefficient process of saving the file to disk before reading it again, enabling in-memory processing.

4. Loading and Saving the PDF

The LoadFromStream method parses the memory stream into a usable PDF document. Finally, SaveToFile persists the document to local storage with the filename result.pdf.

Notes

Exception Handling: In production environments, it is recommended to add try-catch blocks to handle network timeouts, invalid URLs, PDF format errors, and other exceptions.
Memory Management: Both WebClient and PdfDocument implement the IDisposable interface, so resources should be properly released. In the example, MemoryStream is handled with a using statement, but it is also recommended to explicitly dispose of webClient and doc, or wrap them in using blocks as well.
Asynchronous Version: For large files, consider using WebClient.DownloadDataTaskAsync or switching to HttpClient with async methods to avoid blocking the UI thread.
URL Validity: Ensure the URL directly points to a PDF file rather than a redirect page.

Extended Applications

With Spire.PDF, you can perform additional operations immediately after downloading a PDF, such as:

Extracting text or images
Merging multiple PDF files
Adding watermarks or headers/footers
Converting PDFs to images or Word format

Summary

This article demonstrated how to download a PDF from a URL and save it locally using C# and Spire.PDF for .NET. The entire process is simple and efficient, requiring only a few lines of core code.

Spire.PDF is not only a document loading and saving tool but also a powerful PDF processing library worth exploring further.

Can’t Copy Text from a PDF? Here Are 3 Ways to Fix It

jelizaveta — Tue, 14 Apr 2026 02:32:15 +0000

Have you ever run into this frustrating situation: after finally finding an important PDF report or academic paper, you realize it’s “protected”—your cursor turns into a blocked symbol, the right-click menu is grayed out, and you can’t even copy a few words.

That “so close, yet untouchable” feeling is incredibly annoying. The good news is that PDF protection isn’t always as solid as it seems. Today, let’s walk through three practical methods—and share a few behind-the-scenes insights you might not know.

Method 1: Google Docs — A Free “Icebreaker”

This method may sound like a workaround, but the underlying idea is clever: when Google Docs opens a PDF, it tries to reconstruct the document structure—and in the process, it often ignores the original copy restrictions.

Steps:

Open Google Drive and sign in
Upload the protected PDF file
Right-click the file and choose Open with → Google Docs
Wait for the conversion to complete, then copy the text

This works because most PDF “protection” is just a permission flag rather than true encryption. When Google Docs converts the file, it creates a brand-new document structure, so the original restriction flags don’t carry over.

However, note that this won’t work if the PDF is a scanned image rather than text-based content.

Method 2: PDF24 Online Converter — Simple but Mind the Privacy

PDF24 is a free toolkit provided by a German company, known for being reliable, with no annoying watermarks or file size limits.

Steps:

Visit the PDF24 website and open the PDF to TXT tool
Upload the protected PDF file
Click convert and wait for processing
Download the TXT file and freely copy the text

Behind the convenience of online tools lies an often-overlooked issue—privacy. Your files are processed on third-party servers. If your document contains contracts, internal reports, or sensitive personal data, think twice before uploading.

A practical tip: upload a harmless test file first to evaluate processing speed and review the site’s privacy policy before using it for important documents.

Method 3: Python Automation — Add an Engine for Batch Processing

When dealing with dozens or even hundreds of protected PDFs, manual methods become inefficient. That’s where Python scripts come in.

Install the required library:

pip install spire.pdf.free

Code Example:

from spire.pdf import *
import os

# Make sure the output folder exists before writing to it
os.makedirs("output", exist_ok=True)

doc = PdfDocument()
doc.LoadFromFile("Secured.pdf")

# The same extraction options apply to every page
extractOptions = PdfTextExtractOptions()
extractOptions.IsExtractAllText = True

for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    textExtractor = PdfTextExtractor(page)
    text = textExtractor.ExtractText(extractOptions)

    with open(f'output/TextOfPage-{i+1}.txt', 'w', encoding='utf-8') as file:
        # Write each non-empty line followed by a newline so lines stay separate
        for line in text.splitlines():
            if line.strip():
                file.write(line + '\n')

doc.Close()

The real value of this approach lies not just in extraction, but in integration. You can embed this script into a data processing pipeline—for example, automatically monitoring a folder and extracting text from newly added protected PDFs into a database.
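
As a rough illustration of the pipeline idea, the sketch below scans an inbox folder and processes only PDFs that do not yet have a matching text file. The function name, the folder layout, and the extract_text callback are all hypothetical; in practice the callback would wrap the Spire.PDF extraction shown earlier, and the results could go to a database instead of .txt files.

```python
from pathlib import Path
from typing import Callable, List


def process_new_pdfs(inbox: Path, outbox: Path,
                     extract_text: Callable[[Path], str]) -> List[Path]:
    """Extract text from every PDF in inbox that has no .txt file in outbox yet."""
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(inbox.glob("*.pdf")):
        target = outbox / (pdf_path.stem + ".txt")
        if target.exists():
            continue  # already processed on an earlier run
        target.write_text(extract_text(pdf_path), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Placeholder extractor; swap in a function that calls your PDF library
    done = process_new_pdfs(Path("inbox"), Path("output"),
                            lambda path: f"text of {path.name}")
    print(f"Processed {len(done)} new file(s)")
```

Running the function on a schedule (for example from cron or a task scheduler) gives a simple form of folder monitoring without extra dependencies.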

Also, note the easily overlooked parameter: IsExtractAllText = True. It forces extraction of text marked as “non-copyable,” effectively bypassing the permission checks enforced by PDF readers.

Note:

The free version of Spire.PDF for Python only supports documents with up to 10 pages. For larger files, you can split them into smaller parts or use alternative libraries.
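
If you go the splitting route, the page ranges are easy to work out up front. The helper below is a small sketch (the function name is made up for illustration) that returns zero-based, inclusive page ranges of at most 10 pages each; how you actually split the file depends on the tool you use.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int = 10) -> list:
    """Return (first, last) zero-based inclusive page ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


if __name__ == "__main__":
    # A 23-page document needs three parts
    print(chunk_page_ranges(23))
```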

Final Thoughts

These three methods serve different needs:

For occasional use, Google Docs is the easiest
For quick results (if privacy isn’t a concern), online tools are convenient
For batch processing or automation, Python is the best choice

One last point: while technology can solve whether you can copy text, it doesn’t answer whether you should. Before extracting content, always check the document’s copyright and usage terms. After all, tools themselves are neutral; it’s how we use them that matters.

Convert Excel to High-Quality JPG Using C#

jelizaveta — Fri, 10 Apr 2026 02:15:06 +0000

In everyday office development, there is often a need to convert Excel spreadsheets into images. Whether for report previews, data presentation, or preventing formatting issues, converting Excel to JPG is a practical solution. Today, we’ll show how to use the Spire.XLS library with C# to achieve high-quality Excel-to-JPG conversion.

Why High-Quality Conversion Matters

Taking screenshots or using basic conversion methods often results in blurry images and unclear text. This becomes especially problematic when printing or zooming in, as low-resolution images cannot meet quality requirements. By setting the resolution to 300 DPI, you can ensure the generated JPG images reach print-level clarity.

Implementation Steps

First, install the Spire.XLS library. You can search for Spire.XLS in the NuGet Package Manager and install it.

The core process consists of three parts:

Load the Excel file: Use the Workbook class to load the target worksheet
Convert to EMF stream: Export the specified range as an EMF memory stream
Adjust resolution and save: Set the resolution to 300 DPI using the ResetResolution method, then save as JPG

Complete Code

using Spire.Xls;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

namespace Convert
{
    class Program
    {
        static void Main(string[] args)
        {
            Workbook workbook = new Workbook();
            workbook.LoadFromFile("Input.xlsx", ExcelVersion.Version2013);
            Worksheet worksheet = workbook.Worksheets[0];

            using (MemoryStream ms = new MemoryStream())
            {
                worksheet.ToEMFStream(ms, 1, 1, worksheet.LastRow, worksheet.LastColumn);
                // Dispose the GDI+ objects once the JPG has been written
                using (Image image = Image.FromStream(ms))
                using (Bitmap bitmap = ResetResolution((Metafile)image, 300))
                {
                    bitmap.Save("Result.jpg", ImageFormat.Jpeg);
                }
            }
        }

        private static Bitmap ResetResolution(Metafile mf, float resolution)
        {
            int width = (int)(mf.Width * resolution / mf.HorizontalResolution);
            int height = (int)(mf.Height * resolution / mf.VerticalResolution);
            Bitmap bmp = new Bitmap(width, height);
            bmp.SetResolution(resolution, resolution);
            using (Graphics g = Graphics.FromImage(bmp))
            {
                // JPEG has no alpha channel, so paint a white background first
                g.Clear(Color.White);
                g.DrawImage(mf, 0, 0);
            }
            return bmp;
        }
    }
}

Key Code Explanation

The ToEMFStream method exports a specified worksheet range as an EMF (Enhanced Metafile) format, which is a vector format that preserves quality when scaled
The ResetResolution method takes a Metafile object and a target resolution, returning a resized Bitmap
Using MemoryStream avoids creating temporary files and allows the entire process to run in memory
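
The size calculation inside ResetResolution can be checked on its own. The following Python sketch mirrors the same arithmetic (new pixels = original size × target DPI ÷ source DPI, truncated to an integer); the function name and the sample numbers are illustrative and not taken from a real worksheet.

```python
def scaled_pixel_size(width: float, height: float,
                      source_dpi_x: float, source_dpi_y: float,
                      target_dpi: float) -> tuple:
    """Pixel dimensions after re-rendering at target_dpi, truncated like a C# (int) cast."""
    new_width = int(width * target_dpi / source_dpi_x)
    new_height = int(height * target_dpi / source_dpi_y)
    return new_width, new_height


if __name__ == "__main__":
    # An 800 x 600 metafile reported at 96 DPI, re-rendered at 300 DPI
    print(scaled_pixel_size(800, 600, 96.0, 96.0, 300.0))
```

Going from 96 to 300 DPI multiplies each side by roughly 3.1, so memory use grows by almost a factor of ten; that is worth keeping in mind for very large ranges.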

Use Cases

Reporting systems: Convert data tables into images for embedding in Word, PowerPoint, or web pages
Data presentation: Ensure accessibility even when users don’t have Excel installed
Archiving and backup: Save important spreadsheets as images for long-term preservation without format changes

With this approach, you can easily convert Excel spreadsheets into high-resolution JPG images suitable for most office scenarios. If you need batch conversion, simply iterate through multiple worksheets in the workbook.

C#: Generate Word Documents Rapidly from a Template

jelizaveta — Wed, 08 Apr 2026 03:40:05 +0000

In daily development, we often encounter scenarios where we need to generate Word documents in bulk, such as contracts, notices, and reports. The most elegant approach is to prepare a template file and then use code to replace placeholders, quickly producing the final documents. In this article, we’ll show how to easily achieve this using Free Spire.Doc.

Why Choose Free Spire.Doc?

Free Spire.Doc is a free and easy-to-use Word processing library that allows you to create, read, edit, and save documents without installing Microsoft Office. It supports both .NET Framework and .NET Core, making it ideal for server-side batch processing.

Install via NuGet:

PM> Install-Package FreeSpire.Doc

Implementation Steps

Design a Word template in advance (e.g., template.docx) and mark placeholders for dynamic content
Load the template in code and replace placeholders with actual data
Support both text replacement and image insertion (e.g., profile photos)
Save the result as a new Word document

Complete Code

using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;
using System;
using System.Collections.Generic;
using System.Drawing;

namespace CreateWordByReplacingPlaceholders
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize a new Document object
            Document document = new Document();

            // Load the template Word file
            document.LoadFromFile("C:\\Users\\Administrator\\Desktop\\template.docx");

            // Dictionary to hold placeholders and their replacements
            Dictionary<string, string> replaceDict = new Dictionary<string, string>
            {
                { "#name#", "Michael Johnson" },
                { "#gender#", "Male" },
                { "#birthdate#", "March 20, 1990" },
                { "#address#", "1234 Maple Street" },
                { "#city#", "Los Angeles" },
                { "#province#", "California" },
                { "#postal#", "90001" },
                { "#country#", "United States" }
            };

            // Replace placeholders in the document with corresponding values
            foreach (KeyValuePair<string, string> kvp in replaceDict)
            {
                document.Replace(kvp.Key, kvp.Value, true, true);
            }

            // Path to the image file
            String imagePath = "C:\\Users\\Administrator\\Desktop\\portrait.png";

            // Replace the placeholder for the photograph with an image
            ReplaceTextWithImage(document, "#photo#", imagePath);

            // Save the modified document
            document.SaveToFile("ReplacePlaceholders.docx", FileFormat.Docx);

            // Release resources
            document.Dispose();
        }

        // Method to replace a placeholder in the document with an image
        static void ReplaceTextWithImage(Document document, String stringToReplace, String imagePath)
        {
            // Load the image from the specified path
            Image image = Image.FromFile(imagePath);
            DocPicture pic = new DocPicture(document);
            pic.LoadImage(image);
            pic.Width = 130;

            // Find the placeholder in the document
            TextSelection selection = document.FindString(stringToReplace, false, true);
            if (selection == null)
            {
                // Placeholder not present in the template; nothing to replace
                return;
            }

            // Get the range of the found text
            TextRange range = selection.GetAsOneRange();
            int index = range.OwnerParagraph.ChildObjects.IndexOf(range);

            // Insert the image and remove the placeholder text
            range.OwnerParagraph.ChildObjects.Insert(index, pic);
            range.OwnerParagraph.ChildObjects.Remove(range);
        }
    }
}

Code Explanation

1. Text Replacement

First, prepare a dictionary that maps placeholders to their replacement values:

Dictionary<string, string> replaceDict = new Dictionary<string, string>
{
    { "#name#", "Michael Johnson" },
    { "#gender#", "Male" },
    // ... other fields
};

Then iterate through the dictionary and call the document.Replace method. The last two parameters indicate whether the replacement is case-sensitive and whether to match whole words only.
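
The dictionary-driven loop is easy to try out on plain text before touching a real template. The Python sketch below is only an illustration of the pattern, not of the Spire.Doc API: fill_template and leftover_placeholders are made-up helper names, and the second one assumes placeholders always look like #word#.

```python
import re

# Assumes placeholders follow the #fieldname# convention used in this article
PLACEHOLDER = re.compile(r"#\w+#")


def fill_template(text: str, values: dict) -> str:
    """Replace every placeholder key in text with its value (case-sensitive)."""
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    return text


def leftover_placeholders(text: str) -> list:
    """List placeholders that are still present after replacement."""
    return sorted(set(PLACEHOLDER.findall(text)))


if __name__ == "__main__":
    template = "Name: #name#, City: #city#, Photo: #photo#"
    filled = fill_template(template, {"#name#": "Michael Johnson",
                                      "#city#": "Los Angeles"})
    print(filled)
    print("Still unfilled:", leftover_placeholders(filled))
```

A leftover check like this is a cheap way to catch a misspelled placeholder in the template before the document reaches a customer.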

2. Image Replacement

Replacing text with an image is slightly more complex. The key steps are:

Load the image using Image.FromFile
Create a DocPicture object, load the image, and set its width
Locate the placeholder using FindString
Get the paragraph and index of the placeholder
Insert the image at the same position and remove the placeholder text

3. Save the Document

Finally, call SaveToFile to save the new document and release resources.

Template Preparation Tips

In your Word template, mark dynamic fields with placeholders, for example:

Field	Placeholder
Name	#name#
Gender	#gender#
Birth Date	#birthdate#
Photo	#photo#

Notes

Ensure that the template file path and image path are correct
Use unique placeholder patterns (e.g., #fieldname#) to avoid accidental replacements
Adjust Width and Height when inserting images to control display size
Always call Dispose() to release resources after processing

Summary

With Free Spire.Doc, you only need to maintain a single template file to generate thousands of personalized documents efficiently. The library also supports advanced features such as merging table cells, setting font styles, and adding headers and footers. Feel free to explore more!

Convert Images to a PDF Using Python (Including Merging)

jelizaveta — Thu, 02 Apr 2026 06:39:21 +0000

In everyday office or document work, we often need to merge multiple images into a single PDF file. Whether organizing scans, creating an e-book, or archiving materials, converting images to PDF is a very practical task. This article shows how to use Python and the Spire.PDF for Python library to easily convert and merge images into a PDF.

Why Use Spire.PDF for Python?

Spire.PDF for Python is a powerful PDF manipulation library that not only supports creating, reading, and editing PDF documents but also provides rich image-handling features. Compared with other libraries, Spire.PDF’s API is simple and intuitive, enabling easy image-to-PDF conversion and allowing precise control of page size and image layout.

Install it via PyPI:

pip install spire.pdf

Complete Code Example

The following code demonstrates how to merge all JPG/JPEG images in a specified folder into a single PDF file:

from spire.pdf import *
import os

# Folder path containing images
image_folder = r"C:\Users\Administrator\Desktop\Images"

# Output PDF file path
output_file = "output/CombinedImages.pdf"

# Ensure the output directory exists
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# Create a PDF document object
doc = PdfDocument()

# Remove page margins so images fill the whole page
doc.PageSettings.SetMargins(0.0)

# Get all JPG/JPEG files and sort them
image_files = sorted([
    f for f in os.listdir(image_folder)
    if f.lower().endswith((".jpg", ".jpeg"))
])

# Add each image to the PDF
for image_name in image_files:
    image_path = os.path.join(image_folder, image_name)

    # Load the image
    image = PdfImage.FromFile(image_path)

    # Get image dimensions
    width = image.PhysicalDimension.Width
    height = image.PhysicalDimension.Height

    # Create a page with the same size as the image
    page = doc.Pages.Add(SizeF(width, height))

    # Draw the image on the page
    page.Canvas.DrawImage(image, 0.0, 0.0, width, height)

# Save the merged PDF file
doc.SaveToFile(output_file)
doc.Dispose()

Code Explanation

Import libraries and set paths: Import Spire.PDF and the os module, and define the image folder path and output file path.
Create the PDF document: Create an empty PDF document with PdfDocument() and remove page margins with SetMargins(0.0) so images can fill the page completely.
Read image files: Use os.listdir() to get files in the folder, filter for JPG and JPEG using endswith(), and sort with sorted() to ensure images are merged in filename order.
Add images one by one: For each image, load it with PdfImage.FromFile(), get its original dimensions, create a PDF page with matching size, and draw the image on the page using DrawImage().
Save and release resources: Save the PDF with SaveToFile() and call Dispose() to free document resources.
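
One thing to watch with sorted(): it compares filenames as text, so page10.jpg sorts before page2.jpg. If your scans are numbered without leading zeros, a natural sort key keeps them in page order. The helper names below are illustrative; the result could be used in place of the plain sorted() call in the script above.

```python
import re


def natural_key(name: str) -> list:
    """Split a filename into text and number parts so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def natural_sorted(names: list) -> list:
    """Sort filenames so that page2.jpg comes before page10.jpg."""
    return sorted(names, key=natural_key)


if __name__ == "__main__":
    files = ["page10.jpg", "page2.jpg", "page1.jpg"]
    print(sorted(files))          # plain text order
    print(natural_sorted(files))  # numeric-aware order
```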

Output

After running the code above, the program will automatically generate CombinedImages.pdf in the output folder. Each page of the PDF corresponds to one original image, and the page size matches the image dimensions, ensuring optimal display.

Extensions

Based on the code above, you can easily extend functionality:

Support more image formats: Add .png, .bmp, etc., to the filter.
Custom page size: Use a fixed page size instead of matching the image size.
Add image compression: Adjust image quality to control the PDF file size.
Batch processing: Generate separate PDFs for multiple folders.
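
For the first extension, the filter can be pulled into a small helper that takes the accepted extensions as a parameter. This is a sketch using only the standard library; the function name and the default extension list are assumptions, and you should confirm that the PDF library can load each format you add.

```python
import os

# Assumed default set; adjust to the formats your PDF library can load
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def list_images(folder: str, extensions=IMAGE_EXTENSIONS) -> list:
    """Return sorted filenames in folder whose extension is in extensions (case-insensitive)."""
    wanted = tuple(ext.lower() for ext in extensions)
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(wanted)
        and os.path.isfile(os.path.join(folder, name))
    )


if __name__ == "__main__":
    for name in list_images("."):
        print(name)
```

For batch processing, the same helper can be called once per folder, producing one file list (and one output PDF) for each.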

Summary

Using Spire.PDF for Python to convert images to PDF results in concise, easy-to-understand code without requiring additional dependencies. Whether for personal or enterprise use, this feature can be quickly integrated. I hope this helps you improve efficiency in document handling and makes image management and sharing more convenient.

C# Tutorial: Easily Extract Text from PDF Files

jelizaveta — Tue, 31 Mar 2026 02:05:43 +0000

In daily office and data-processing work, PDF files are widely used because they are cross-platform and have stable formatting. However, extracting text from PDFs can be troublesome. Whether you're organizing materials, analyzing data, or building a text-retrieval system, efficient and accurate PDF text extraction is a fundamental need. This article shows how to use the powerful Spire.PDF for .NET component to easily extract PDF text using C# code.

Introduction to Spire.PDF for .NET

Spire.PDF for .NET is a professional PDF component that lets developers create, read, edit, and convert PDF files on the .NET platform—without installing Adobe Acrobat or other external dependencies.

Key features include:

Rich API for comprehensive PDF manipulation
Practical text-extraction capabilities
Support for extracting entire pages or text from specified regions

Install via NuGet:

Install-Package Spire.PDF

Extract All Text from a Specified Page

A common requirement is to extract all the text from a particular page of a PDF. Spire.PDF makes this straightforward.

Complete C# code:

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;

namespace ExtractTextFromIndividualPages
{
    internal class Program  
    {
        static void Main(string[] args)
        {
            // Create a PDF document instance
            PdfDocument pdf = new PdfDocument();
            // Load the PDF file
            pdf.LoadFromFile("Input.pdf");

            // Get the page to extract text from (index 1 = second page; index starts at 0)
            PdfPageBase page = pdf.Pages[1];

            // Create a PdfTextExtractor for the selected page
            PdfTextExtractor extractor = new PdfTextExtractor(page);
            // Set extraction options
            PdfTextExtractOptions option = new PdfTextExtractOptions
            {
                IsExtractAllText = true
            };
            // Extract text from the specified page
            string text = extractor.ExtractText(option);

            // Save the extracted text to a text file
            File.WriteAllText("Extracted.txt", text);
            // Close the PDF document
            pdf.Close();
        }
    }
}

Code flow:

Create a PdfDocument object and load the target PDF
Retrieve the specified page from the Pages collection
Create a PdfTextExtractor with the page instance
Set IsExtractAllText = true in PdfTextExtractOptions to ensure no text is omitted
Call ExtractText with those options
Write the extracted text to a local file and close the document

The process is simple—only a few core lines of code to convert a PDF page to plain text.

Extract Text from a Specified Area

In some scenarios you don't need the entire page, but only text from a specific region—for example:

A column in a table
A header area
A signature block

Spire.PDF provides a flexible solution for region-based extraction.

Complete C# code:

using Spire.Pdf;
using Spire.Pdf.Texts;
using System.IO;
using System.Drawing;

namespace ExtractTextFromDefinedArea
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Create a PDF document instance
            PdfDocument pdf = new PdfDocument();
            // Load the PDF file
            pdf.LoadFromFile("Input.pdf");

            // Get the second page (index 1 corresponds to the second page)
            PdfPageBase page = pdf.Pages[1];

            // Create a PdfTextExtractor for the selected page
            PdfTextExtractor textExtractor = new PdfTextExtractor(page);
            // Set extraction options (specify a rectangular area)
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions
            {
                // Rectangle parameters: X, Y, width, height
                ExtractArea = new RectangleF(0, 0, 595, 300)
            };

            // Extract text from the specified rectangle
            string text = textExtractor.ExtractText(extractOptions);

            // Save the extracted text to a text file
            File.WriteAllText("Extracted.txt", text);

            // Close the PDF document
            pdf.Close();
        }
    }
}

Key differences from full-page extraction:

Load the PDF and get the target page (same as before)
Define the extraction area using the ExtractArea property
Set a rectangle with coordinates (X, Y), width, and height (units: points)
Extract only text within that region
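
Because the rectangle is expressed in points, it helps to convert from a measurement you can read off a ruler or a PDF viewer. The Python sketch below shows the unit conversion only (1 point = 1/72 inch, 1 inch = 25.4 mm); the helper names are illustrative. It also shows why 595 appears in the example above: that is the width of an A4 page in points.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def rect_mm_to_points(x: float, y: float, width: float, height: float) -> tuple:
    """Convert a rectangle given in millimetres to points, rounded to two decimals."""
    return tuple(round(mm_to_points(value), 2) for value in (x, y, width, height))


if __name__ == "__main__":
    # Full A4 page (210 mm x 297 mm) expressed in points
    print(rect_mm_to_points(0, 0, 210, 297))
```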

This method is especially useful for structured PDFs like:

Financial statements
Invoices
Forms

It allows precise targeting of needed fields, greatly improving information retrieval efficiency and accuracy.

Practical Use and Notes

Common applications in real development:

Data collection – Extract contract clauses into a database
Content analysis – Pull abstracts from research paper PDFs for search and indexing
Document archiving – Convert PDF content to searchable plain text

Important notes when using Spire.PDF:

Ensure rectangle coordinates and dimensions are accurate—use preview or measurement tools for positioning
For complex PDFs (multi-column layouts or special fonts), consider enabling full extraction mode (IsExtractAllText = true) and compare the output against the source page
Always call Close() after extraction to release document resources and avoid memory issues

Conclusion

With Spire.PDF for .NET, C# developers can implement high-quality PDF text extraction with minimal code. Whether extracting full pages or specific regions, the component provides intuitive and reliable solutions.

For .NET projects that need to process PDF text, Spire.PDF is a highly efficient option worth considering.