How to Extract Content from a PDF in C# with Xceed PdfLibrary for .NET

Get text from every page or from a rectangle, and read form control values (text boxes, checkboxes). Use PdfDocument.Load, page.Text, GetTextFromArea, and doc.FormFields in C#.

Extracting content from a PDF means reading both the visible text on each page and the values inside form controls (text boxes, checkboxes, and so on). Such extraction is useful for search indexing, compliance, migrating data, or processing filled forms. Doing it in code gives you structured access without manual copy-paste or external tools.

This guide shows how to extract content from a PDF in C# step by step: you’ll get all text from a document page by page, read text from a specific area, and read form field values using Xceed PdfLibrary for .NET. The setup (NuGet, license) is the same as in our how to create a PDF in C# post.

If you’re new to the product, check out our first guide: How to Create a PDF in C#.

What you need

  • .NET 5 through .NET 10
  • Xceed PdfLibrary for .NET (trial or license). Add it via NuGet or reference the DLL.
  • A PDF to read from (or create one in code using the create-PDF or split-PDF snippets from the other posts)

Set up the project and license

Create a console or web project and add the library. From the command line (.NET CLI):

dotnet add package Xceed.PdfLibrary.NET

Licensing: Set your license key once at startup, before any PdfDocument calls. Get a free trial key or a full license from the product page. See the create a PDF in C# post for a full setup snippet.

Extract text from each page

To extract content from a PDF in C#, start with the text on each page. Load the document with PdfDocument.Load(path) (or a Stream), then loop over doc.Pages and use the Text property on each page. That property returns the page text in the PDF’s native order, with spacing preserved.

using (var doc = PdfDocument.Load("document.pdf"))
{
    for (int i = 0; i < doc.Pages.Count; i++)
    {
        string text = doc.Pages[i].Text;
        Console.WriteLine($"--- Page {i + 1} ---");
        Console.WriteLine(text?.Trim() ?? "");
    }
}

On empty pages or those with no extractable text, Text may be null or an empty string. Trimming and null-checking (as above) keeps output clean. This approach is ideal for building a full-text index, dumping content to a .txt file, or feeding text into another process.
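To make the "dump content to a .txt file" case concrete, here is a minimal sketch. PageTextDump, Build, and WriteTo are hypothetical helper names, not part of Xceed PdfLibrary. The helper works on plain strings so it can be tried without a PDF; the trailing comment shows how it might be fed from doc.Pages using only the Text property described above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Xceed PdfLibrary): formats per-page text
// into one string, tolerating null or empty pages.
public static class PageTextDump
{
    public static string Build(IEnumerable<string?> pageTexts)
    {
        var sb = new StringBuilder();
        int pageNumber = 1;
        foreach (var text in pageTexts)
        {
            // Same header and null handling as the console loop above.
            sb.Append("--- Page ").Append(pageNumber).Append(" ---\n");
            sb.Append(text?.Trim() ?? "").Append('\n');
            pageNumber++;
        }
        return sb.ToString();
    }

    // Writes the formatted dump to a text file (UTF-8 by default).
    public static void WriteTo(string outputPath, IEnumerable<string?> pageTexts)
    {
        File.WriteAllText(outputPath, Build(pageTexts));
    }
}

// Possible usage (assumes a loaded PdfDocument named doc, as in the snippet above):
//   var texts = new List<string?>();
//   for (int i = 0; i < doc.Pages.Count; i++) texts.Add(doc.Pages[i].Text);
//   PageTextDump.WriteTo("document.txt", texts);
```

Keeping the formatting separate from the PDF loading makes the null and whitespace handling easy to check on its own.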

Extract text from a specific area

Sometimes you only need the text inside a given rectangle (e.g. one column of a multi-column layout, a header strip, or a form-like zone). In that case, call GetTextFromArea(rectangle) on the page. Pass (x, y, width, height) to the Rectangle constructor, in page coordinates (origin top-left, Y downward).

using (var doc = PdfDocument.Load("document.pdf"))
{
    var page = doc.Pages[0];
    var area = new Rectangle(50, 100, 400, 200); // left, top, width, height
    string textInArea = page.GetTextFromArea(area);
    Console.WriteLine(textInArea ?? "");
}

This is useful for template-based PDFs where you know the approximate position of the content you care about, or when you want to ignore headers and footers by only extracting the main body rectangle.
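As an illustration of the "main body only" idea, the sketch below computes a body rectangle by subtracting header, footer, and side margins from the page size. BodyArea and Compute are hypothetical names, and the page size and margins in the usage comment are illustrative values only (612 x 792 is US Letter in PDF points), not library defaults. It assumes the top-left origin described above.

```csharp
using System;

// Hypothetical helper (not part of Xceed PdfLibrary): computes the main-body
// rectangle of a page by removing header, footer, and side margins.
// Coordinates follow the convention above: origin top-left, Y downward.
public static class BodyArea
{
    public static (int X, int Y, int Width, int Height) Compute(
        int pageWidth, int pageHeight, int headerHeight, int footerHeight, int sideMargin)
    {
        int width = pageWidth - 2 * sideMargin;
        int height = pageHeight - headerHeight - footerHeight;
        if (width <= 0 || height <= 0)
            throw new ArgumentException("Margins leave no body area on the page.");

        // The body starts just below the header and just right of the left margin.
        return (sideMargin, headerHeight, width, height);
    }
}

// Possible usage (assumes a page variable as in the snippet above):
//   var (x, y, w, h) = BodyArea.Compute(612, 792, 72, 72, 50);
//   string body = page.GetTextFromArea(new Rectangle(x, y, w, h));
```

With the illustrative values, the rectangle starts at (50, 72) and measures 512 by 648, so text in the top and bottom 72 units is skipped.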

Words and reading order

Text returns a single string per page in the order it appears in the PDF. If you need word boundaries or a reading-order view, the library also exposes Words (a collection of Word objects with bounds and text) and OrderedText (text in top-left to bottom-right order). For many use cases (search, logging, or simple export), Text is enough.

Extract form field values (form control extraction)

When a PDF contains form controls (AcroForm fields), you can extract their values programmatically. After loading the document, doc.FormFields gives you all form fields in the document; page.FormFields limits the result to a given page. Each field has a Name, and its type determines how to read the value.

Text box fields: Cast to TextBoxFormField and use the Text property. Checkbox fields: Cast to CheckBoxFormField and use the IsChecked property. You can handle both in one loop:

using (var doc = PdfDocument.Load("form.pdf"))
{
    foreach (var field in doc.FormFields)
    {
        if (field is TextBoxFormField textBox)
            Console.WriteLine($"{textBox.Name}: {textBox.Text}");
        else if (field is CheckBoxFormField checkBox)
            Console.WriteLine($"{checkBox.Name}: {(checkBox.IsChecked ? "checked" : "unchecked")}");
    }
}

Other control types: The library supports other form field types (e.g. ComboBoxFormField, ListBoxFormField, RadioButtonGroupFormField). The approach is the same: iterate doc.FormFields or page.FormFields, check the type, and read the appropriate property. For full details and properties per type, see the Xceed PdfLibrary for .NET documentation.

Combining text and form extraction

You can extract content from a PDF in C# by combining text and form extraction in one pass. Load the document once, then loop pages for Text (or GetTextFromArea where needed) and loop doc.FormFields for form values. As a result, you get both the visible text and the form data for indexing, validation, or export.
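One way to hold the result of such a single pass is a small container that can be serialized for indexing or export. The sketch below is an assumption about how you might structure it: ExtractionResult is a hypothetical class, not a library type, and it uses System.Text.Json from the .NET base class library. The usage comment relies only on the members shown earlier in this post (Pages, Text, FormFields, Name, IsChecked).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical container (not part of Xceed PdfLibrary) for one extraction pass:
// the text of each page plus form field values keyed by field name.
public sealed class ExtractionResult
{
    public List<string> Pages { get; } = new List<string>();
    public Dictionary<string, string> Fields { get; } = new Dictionary<string, string>();

    // Serializes both collections to a JSON object with "Pages" and "Fields".
    public string ToJson() => JsonSerializer.Serialize(this);
}

// Possible usage (assumes a loaded PdfDocument named doc, as in the snippets above):
//   var result = new ExtractionResult();
//   for (int i = 0; i < doc.Pages.Count; i++)
//       result.Pages.Add(doc.Pages[i].Text?.Trim() ?? "");
//   foreach (var field in doc.FormFields)
//   {
//       if (field is TextBoxFormField textBox)
//           result.Fields[textBox.Name] = textBox.Text ?? "";
//       else if (field is CheckBoxFormField checkBox)
//           result.Fields[checkBox.Name] = checkBox.IsChecked ? "checked" : "unchecked";
//   }
//   Console.WriteLine(result.ToJson());
```

Storing checkbox state as a string keeps the dictionary uniform; if you need typed values, a Dictionary<string, object> or separate collections per field type would work as well.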

Run the code

Paste the snippets above into a console or web project, set your license key, and ensure you have a PDF at the path you pass to PdfDocument.Load (e.g. document.pdf or form.pdf).

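Relative paths are resolved against the current working directory. When you run a console app from Visual Studio, that is usually the build output folder (for example bin\Debug\net8.0) rather than the project folder, while ASP.NET Core projects typically start in the project folder. From the command line it is the folder from which you invoked the app.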
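Because the working directory depends on how the app is launched, it can help to resolve the path explicitly and fail early with a clear message. The following is a minimal sketch using only System.IO; PdfPath and Resolve are hypothetical helper names, not part of the library.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Xceed PdfLibrary): turns a relative file name
// into an absolute path and fails early if the file is missing.
public static class PdfPath
{
    public static string Resolve(string fileName)
    {
        // Path.GetFullPath resolves relative names against the current directory.
        string fullPath = Path.GetFullPath(fileName);
        if (!File.Exists(fullPath))
        {
            throw new FileNotFoundException(
                $"PDF not found. Current directory: {Directory.GetCurrentDirectory()}",
                fullPath);
        }
        return fullPath;
    }
}

// Possible usage with the loading code shown earlier:
//   using (var doc = PdfDocument.Load(PdfPath.Resolve("document.pdf")))
//   {
//       ...
//   }
```

The exception message includes the current directory, which usually makes it obvious where the file was expected.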

If you don’t have a PDF yet, use the create a PDF in C# snippet to generate one, or create a form with FormFields (see the documentation) and then run the extraction code on it. Finally, open the console output or any file you write to and confirm the extracted text and form values match the PDF.

Common questions

Do I need to dispose the document? Yes. Prefer using (as in the snippets) or call Dispose so file handles and streams are released.

Can I load from a stream instead of a file? Yes. The overloads of PdfDocument.Load that accept a Stream are useful when the PDF comes from memory or a web request.

What if Text or OrderedText is null? For pages with no extractable text, the library may return null. Rely on null-coalescing (e.g. text ?? "") or explicit null checks before using the value.

How do I know a form field’s type before casting? Rely on is checks (e.g. if (field is TextBoxFormField)) or field.GetType(). The documentation lists all form field types and their properties.

Where can I see more samples? See the Xceed PdfLibrary for .NET documentation for advanced topics, more form field types, and extraction in reading order.

Can I use this in an ASP.NET Core app? Yes. The same code runs in any .NET host. Set the license key at startup, then load and extract in your controllers or services. For user-uploaded PDFs, load from the upload stream and extract text or form values to store or return as JSON.

Need help? Visit our support page.

Summary: You’ve seen how to extract content from a PDF in C#: load with PdfDocument.Load, use Pages[i].Text or GetTextFromArea for text, and use FormFields with TextBoxFormField.Text and CheckBoxFormField.IsChecked for form control values. Dispose the document when done.

The same API works on Windows, macOS, and Linux.

What else can you do with Xceed PdfLibrary?

With Xceed PdfLibrary for .NET you can create PDFs, add form fields, sign documents, split by page or bookmark, and add watermarks. The same patterns (load, read or modify, save) apply.

Once you’re comfortable extracting content from a PDF in C#, try creating or filling form fields, or splitting and extracting in one workflow. The documentation has samples for each of these.

Ready to extract content from PDFs in .NET?

Download Xceed PdfLibrary for .NET and try it free for 45 days, no commitment.

Get it now – 45-day free trial

Next steps

dotnet add package Xceed.PdfLibrary.NET

Set your license key at startup, then load a PDF and use Pages[i].Text and doc.FormFields as shown above to extract content from a PDF in C#.



