EXPath

PDF Module 1.0

EXPath Candidate Module 9 April 2013

This version:
http://expath.org/spec/pdf/20111108
Latest version:
http://expath.org/spec/pdf
Editor:
Claudius Teodorescu, XML Consultant

Abstract

This proposal provides a module comprising a set of functions for manipulation of PDF documents. It has been designed to be compatible with XQuery 3.0 and XSLT 3.0, as well as any other standard based on XPath 3.0.

Table of Contents

1 Introduction
1.1 Namespace conventions
1.2 Error management
2 Generalities
2.1 Page ranges
3 Creation
3.1 The pdf:create() Function
3.2 The pdf:create-page() Function
4 Metadata
4.1 Overall metadata
4.1.1 Basic overall metadata
4.1.2 Custom overall metadata
4.2 The pdf:get-document-metadata() Function
4.3 The pdf:set-document-metadata() Function
4.4 Component metadata
5 Content navigation
6 Content manipulation
6.1 The pdf:merge() Function
6.2 The pdf:split() Function
6.2.1 The pdf:options element
6.3 The pdf:extract() Function
6.4 The pdf:insert() Function
6.5 The pdf:delete() Function
6.6 The pdf:rotate() Function
6.7 The pdf:reverse() Function
7 Links and bookmarks
7.1 The pdf:create-bookmark() Function
7.2 The pdf:edit-bookmark() Function
7.3 The pdf:delete-bookmark() Function
7.4 The pdf:import-bookmarks() Function
7.5 The pdf:export-bookmarks() Function
7.6 The pdf:create-link() Function
7.7 The pdf:edit-link() Function
7.8 The pdf:delete-link() Function
7.9 The pdf:import-links() Function
7.10 The pdf:export-links() Function
8 Form controls
8.1 The pdf:get-text-fields() Function
8.2 The pdf:set-text-fields() Function
9 Stamping, commenting, annotating, and marking
9.1 The pdf:stamp() Function
10 Contents rasterization
10.1 The pdf:to-image() Function
11 Attachments
11.1 The pdf:list-file-attachments() Function
11.2 The pdf:import-attachment() Function
11.3 The pdf:export-attachment() Function
12 Validation
12.1 The pdf:validate() Function
12.2 The pdf:validate-links() Function
12.3 The pdf:validate-bookmarks() Function
13 Optimization
13.1 The pdf:optimize() Function
13.2 The pdf:linearize() Function
13.3 The pdf:compress() Function
13.4 The pdf:uncompress() Function
14 Security
14.1 The pdf:encrypt() Function
14.2 The pdf:decrypt() Function
14.3 The pdf:add-signature() Function
14.4 The pdf:remove-signature() Function
15 Repairing
15.1 The pdf:repair() Function
16 Scenarios of usage
16.1 Insert a (blank) page every nth page
16.2 Delete broken links
16.3 Audit Bookmarks and Links
16.4 Validate PDF document
16.5 Delete pages
16.6 Rotate pages

Appendices

A References
B Summary of Error Conditions


1 Introduction

This module allows manipulation of PDF documents.

1.1 Namespace conventions

This module defines functions and elements in the namespace http://expath.org/ns/pdf. In this document, the pdf prefix, when used, is bound to this namespace URI.

Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.

1.2 Error management

Error conditions are identified by a code (a QName). When such an error condition is reached during the execution of the function, a dynamic error is thrown, with the corresponding error code (as if the standard XPath function error had been called).
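As an illustration of how a caller might handle such an error, here is a minimal XQuery 3.0 sketch. It does not call any function of this module: the failure is simulated with fn:error, and the function name local:call-with-fallback and the prefix xqerr are assumptions made for this example only. The xqerr prefix is bound to the standard XQuery error-variable namespace because this document binds err to the EXPath error namespace.

```xquery
xquery version "3.0";

(: Prefix bindings: err as used in this document; xqerr is a hypothetical
   prefix chosen here for the standard try/catch error variables. :)
declare namespace err = "http://expath.org/ns/error";
declare namespace xqerr = "http://www.w3.org/2005/xqt-errors";

(: Hypothetical wrapper: simulates a module function failing with err:PDF002
   and turns the dynamic error into a readable message. :)
declare function local:call-with-fallback() as xs:string {
  try {
    (: Stand-in for a failing call into the module. :)
    error(xs:QName('err:PDF002'), 'The remote resource does not exist.')
  } catch err:PDF002 {
    concat('Caught ', local-name-from-QName($xqerr:code), ': ', $xqerr:description)
  }
};

local:call-with-fallback()
```

The catch clause matches on the error code QName, so other error conditions would still propagate to the caller.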

2 Generalities

This section contains general information related to the functions detailed below.

2.1 Page ranges

The syntax for page ranges is as follows: for a single page, use the page number; for a range of pages, use a hyphen between the start page and the end page.

Single page: 17.
				
Page range: 17 - 22.
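
The syntax above could be interpreted by an implementation along the following lines. This is a sketch only: the helper local:expand-page-range and its error code are hypothetical and not part of this module, and the sketch assumes that whitespace around the hyphen is insignificant and that the end page is included in the range.

```xquery
xquery version "3.0";

(: Hypothetical helper, not part of this module: expands one page-range
   string ("17" or "17 - 22") into the page numbers it covers. :)
declare function local:expand-page-range($range as xs:string) as xs:integer* {
  (: Strip all whitespace so that "17 - 22" and "17-22" are treated alike. :)
  let $compact := replace($range, '\s+', '')
  return
    if (matches($compact, '^[0-9]+$')) then
      xs:integer($compact)
    else if (matches($compact, '^[0-9]+-[0-9]+$')) then
      let $start := xs:integer(substring-before($compact, '-'))
      let $end := xs:integer(substring-after($compact, '-'))
      (: An inverted range such as "22 - 17" yields the empty sequence. :)
      return $start to $end
    else
      error(xs:QName('local:bad-range'), concat('Invalid page range: ', $range))
};

(: Expand each example from this section into a space-separated list. :)
for $range in ('17', '17 - 22')
return string-join(for $page in local:expand-page-range($range) return string($page), ' ')
```

Whether an inverted range should be rejected instead of ignored is left open here; the sketch simply returns no pages for it.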
				

3 Creation

These functions are for creation of PDF documents.

3.1 The pdf:create() Function

This function is used for creating a PDF document.

pdf:create($contents as xs:base64Binary*) as xs:base64Binary
				

3.2 The pdf:create-page() Function

This function is used for creating a PDF page.

pdf:create-page($contents as xs:base64Binary?) as xs:base64Binary
				

4 Metadata

These functions are for getting and setting metadata about a PDF document and its contents.

4.1 Overall metadata

This is metadata about the document itself.

4.1.1 Basic overall metadata

The basic metadata is given in the table below (this is an excerpt from [PDF Reference 1.7]):

Basic overall PDF metadata
Key (as xs:string) Value Meaning
Title xs:string The document’s title. (Optional)
Author xs:string The name of the person who created the document. (Optional)
Subject xs:string The subject of the document. (Optional)
Keywords xs:string Keywords associated with the document. (Optional)
Creator xs:string If the document was converted to PDF from another format, the name of the application (for example, Adobe FrameMaker®) that created the original document from which it was converted. (Optional)
Producer xs:string If the document was converted to PDF from another format, the name of the application (for example, Acrobat Distiller) that converted it to PDF. (Optional)
CreationDate xs:date The date and time the document was created. (Optional)
ModDate xs:date The date and time the document was most recently modified. (Required if PieceInfo is present in the document catalog; otherwise optional)
Trapped xs:string A string indicating whether the document has been modified to include trapping information (see [PDF Reference 1.7], section 10.10.5, “Trapping Support”). The legal values are: "True", "False", and "Unknown". The default value is "Unknown". (Optional)

4.1.2 Custom overall metadata

This is custom metadata about the document itself. Each custom metadata item is read or written as an entry in the map(xs:string, xs:string) that constitutes the document's overall metadata.

4.2 The pdf:get-document-metadata() Function

This function is used to get the overall metadata of a PDF document, as defined in [PDF Reference 1.7], section 10.2 Metadata, together with any custom metadata. See 4.1 Overall metadata for details about the overall metadata.

pdf:get-document-metadata($contents as xs:base64Binary?) as map(xs:string, xs:string)?
				
  • $contents is the PDF contents to get the metadata from.

4.3 The pdf:set-document-metadata() Function

This function is used to set the overall metadata for a PDF document.

pdf:set-document-metadata($contents as xs:base64Binary?, $metadata as item()) as xs:base64Binary*
				
  • $contents is the PDF contents to apply the metadata to.

  • $metadata is the overall metadata to be applied. See 4.1 Overall metadata for details about the overall metadata.

4.4 Component metadata

This is metadata about individual components of a document.

5 Content navigation

These functions are for navigation among the PDF objects of a PDF document.

6 Content manipulation

These functions are for manipulating the contents of PDF documents.

6.1 The pdf:merge() Function

This function is used for merging multiple PDF documents or subsets of them (groups of pages).

pdf:merge($contents as xs:base64Binary,
	$new-resource-metadata as element(pdf:resource-metadata)) as xs:base64Binary
				

6.2 The pdf:split() Function

This function is used for splitting a PDF document into a number of sections. It returns a sequence of sections.

There can be many options for splitting: page splitting, bookmarks splitting, extract until text.

pdf:split($contents as xs:base64Binary?,
	$options as element(pdf:options)) as xs:base64Binary*
				
  • $contents is the PDF contents to be split.

  • $options are the options for the current operation.

6.2.1 The pdf:options element

	<pdf:options>
		(pdf:split-delimiter)
	</pdf:options>
					
  • the pdf:split-delimiter child element specifies the delimiter used for splitting the input PDF document.

6.3 The pdf:extract() Function

This function is used for extracting pages from a PDF document. A parameter controls whether the extracted pages are also deleted from the document after extraction. It returns the extracted pages.

6.4 The pdf:insert() Function

This function is used for inserting pages into a PDF document.

6.5 The pdf:delete() Function

This function is used for deleting pages from a PDF document. It returns the PDF document without the deleted pages.

pdf:delete($contents as xs:base64Binary?, $page-ranges as xs:string*) as xs:base64Binary?

  • $contents is the PDF contents from which pages are to be deleted.

  • $page-ranges is the specification of the page ranges to be deleted. The syntax for page ranges can be found at 2.1 Page ranges. For an example of usage, see scenario 16.5 Delete pages.

6.6 The pdf:rotate() Function

This function is used for rotating indicated pages of a PDF document. The rest of the pages remain unchanged and the page order is maintained. It returns the modified PDF document.

pdf:rotate($contents as xs:base64Binary?, $page-ranges-and-directions as map(xs:string, xs:string)?) as xs:base64Binary?

  • $contents is the PDF contents in which pages are to be rotated.

  • $page-ranges-and-directions is the specification of the page ranges to be rotated and of the direction of each rotation. This argument is a map, having as keys the page ranges and as values the corresponding directions. The syntax for page ranges can be found at 2.1 Page ranges. The legal values for rotation direction are: "right 90" (clockwise by 90 degrees), "180" (rotation by 180 degrees), and "left 90" (counterclockwise by 90 degrees). For an example of usage, see scenario 16.6 Rotate pages.
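
An implementation has to translate the direction strings above into rotation angles. The following sketch shows one possible mapping to clockwise degrees, which is how the Rotate entry of a page is expressed in [PDF Reference 1.7]. The helper local:rotation-degrees and its error code are hypothetical and not part of this module.

```xquery
xquery version "3.0";

(: Hypothetical helper, not part of this module: maps a rotation direction
   to the equivalent clockwise rotation in degrees. :)
declare function local:rotation-degrees($direction as xs:string) as xs:integer {
  if ($direction eq 'right 90') then 90
  else if ($direction eq '180') then 180
  (: A counterclockwise quarter turn equals a clockwise turn of 270 degrees. :)
  else if ($direction eq 'left 90') then 270
  else error(xs:QName('local:bad-direction'),
             concat('Unknown rotation direction: ', $direction))
};

(: Evaluate the mapping for the three legal values. :)
for $direction in ('right 90', '180', 'left 90')
return local:rotation-degrees($direction)
```

Any other value raises an error in this sketch; the module itself does not yet define an error code for this case.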

6.7 The pdf:reverse() Function

This function is used for reversing pages of a PDF document. It returns the PDF document with its pages in reverse order.

pdf:reverse($contents as xs:base64Binary?) as xs:base64Binary?
				
  • $contents is the PDF contents for which to reverse page order.

7 Links and bookmarks

These functions are for manipulation of links and bookmarks of a PDF document.

7.1 The pdf:create-bookmark() Function

This function is used to create bookmarks in a PDF document.

7.2 The pdf:edit-bookmark() Function

This function is used to edit bookmarks of a PDF document.

7.3 The pdf:delete-bookmark() Function

This function is used to delete bookmarks from a PDF document.

7.4 The pdf:import-bookmarks() Function

This function is used to import bookmarks to a PDF document.

7.5 The pdf:export-bookmarks() Function

This function is used to export bookmarks from a PDF document.

7.6 The pdf:create-link() Function

This function is used to create links in a PDF document.

7.7 The pdf:edit-link() Function

This function is used to edit links of a PDF document.

7.8 The pdf:delete-link() Function

This function is used to delete links from a PDF document.

7.9 The pdf:import-links() Function

This function is used to import links to a PDF document.

7.10 The pdf:export-links() Function

This function is used to export links from a PDF document.

8 Form controls

These functions are designated for gathering information about and interacting with form controls of a PDF document.

8.1 The pdf:get-text-fields() Function

This function gets all the text fields from the PDF contents. It returns a map containing pairs of fully qualified name and value for each text field.

pdf:get-text-fields($contents as xs:base64Binary?) as map(xs:string, xs:string)?

  • $contents is the PDF contents from which to get the text fields.

8.2 The pdf:set-text-fields() Function

This function sets the text fields of the PDF contents. It returns the updated PDF contents.

pdf:set-text-fields($contents as xs:base64Binary?, $text-fields as map(xs:string, xs:string)?) as xs:base64Binary?

  • $contents is the PDF contents in which to set the text fields.

  • $text-fields are the information sets about the text fields, namely a map containing pairs of fully qualified name and value for each text field to be set.

9 Stamping, commenting, annotating, and marking

These functions associate various objects with a PDF document.

9.1 The pdf:stamp() Function

This function is used for applying a stamp to a PDF document. It has two signatures, the first one without a CSS selector for the stamp.

When using the first signature, the content of the $stamp-styling parameter should consist of a set of CSS declarations that will be applied only to the respective stamp. Also, the respective stamp will be applied on every page of the PDF contents.

pdf:stamp($contents as xs:base64Binary?, $stamp as item(), $stamp-styling as xs:string) as xs:base64Binary?

pdf:stamp($contents as xs:base64Binary?, $stamp as item(), $stamp-selector as xs:string, $stamp-styling as xs:string) as xs:base64Binary?

  • $contents is the PDF contents whose pages are to be stamped.

  • $stamp is the stamp to be applied to the PDF contents. The stamp can be a text, an image, a PDF document, an HTML + JavaScript + CSS document, an SVG document, etc. (implementations should define which formats are supported).

  • $stamp-selector is the CSS selector used to match the current stamp. Such a selector is needed for further operations on the stamp, such as updating and deleting.

  • $stamp-styling is the CSS styling for the current stamp.

10 Contents rasterization

These functions are for rendering PDF documents.

10.1 The pdf:to-image() Function

This function is used for converting pages of a PDF document to images. It returns a sequence of the generated images.

pdf:to-image($contents as xs:base64Binary?, $format as xs:string, $scaling as xs:string) as xs:base64Binary*

  • $contents is the PDF contents whose pages are to be converted to images.

  • $format is the format of the output images.

  • $scaling is the scaling of the output images.

11 Attachments

These functions are for manipulation of PDF file attachments.

11.1 The pdf:list-file-attachments() Function

This function is used to list the file attachments of a PDF document.

11.2 The pdf:import-attachment() Function

This function is used for importing an attachment into a PDF document.

11.3 The pdf:export-attachment() Function

This function is used for exporting an attachment from a PDF document.

12 Validation

These functions are designated for validating a PDF document.

12.1 The pdf:validate() Function

This function is used for validating a PDF document.

12.2 The pdf:validate-links() Function

This function is used for validating the links contained by a PDF document.

12.3 The pdf:validate-bookmarks() Function

This function is used for validating the bookmarks contained by a PDF document.

13 Optimization

These functions are designated for optimizing a PDF document.

13.1 The pdf:optimize() Function

This function is used for optimizing a PDF document, by reducing the file size without affecting quality.

13.2 The pdf:linearize() Function

This function is used for linearizing a PDF document, for fast delivery over a network. With a linearized document, the first page can be displayed while the rest is downloaded in the background.

13.3 The pdf:compress() Function

This function is used for compressing a PDF document.

13.4 The pdf:uncompress() Function

This function is used for uncompressing a PDF document.

14 Security

These functions are for protecting PDF documents.

14.1 The pdf:encrypt() Function

This function is used for encrypting a PDF document by using a digital certificate.

14.2 The pdf:decrypt() Function

This function is used for decrypting a PDF document.

14.3 The pdf:add-signature() Function

This function is used for adding an electronic signature to a PDF document.

14.4 The pdf:remove-signature() Function

This function is used for removing an electronic signature from a PDF document.

15 Repairing

These functions are for repairing and recovering PDF documents.

15.1 The pdf:repair() Function

This function is used for repairing a PDF document.

16 Scenarios of usage

This section presents usage scenarios for the functions of this module.

16.1 Insert a (blank) page every nth page

pdf:insert(pdf:create-page(), $pattern)

16.2 Delete broken links

for $broken-link in pdf:validate-links($pdf-document)
return if () then () else ()
				

16.3 Audit Bookmarks and Links

One can validate bookmarks and links, and export those found to be broken. They can be included in a report and/or fixed and imported back into the document.

				

16.4 Validate PDF document

One can validate a PDF document against certain validation criteria, such as: the dimensions of all or certain pages should be A4 (297 mm * 210 mm); there should be no contents within a certain rectangular area of a page (for instance, the left margin where the print shop inserts a bar code); the number of pages should be less than N; the PDF version used.

				

16.5 Delete pages

pdf:delete($pdf-document, ("3", "5 - 7", "13 - 17"))
				

16.6 Rotate pages

pdf:rotate($pdf-document, map {
	"3" := "right 90",
	"5 - 7" := "180",
	"13 - 17" := "left 90"
})
				

A References

XPath 3.0
XML Path Language (XPath) 3.0. Jonathan Robie, Don Chamberlin, Michael Dyck, John Snelson, editors. W3C Working Draft, 13 December 2011.
XSLT 3.0
XSL Transformations (XSLT) Version 3.0. Michael Kay, editor. W3C Working Draft, 10 July 2012.
XQuery 3.0
XQuery 3.0: An XML Query Language. Jonathan Robie, Don Chamberlin, Michael Dyck, John Snelson, editors. W3C Working Draft, 13 December 2011.
XPath and XQuery Functions and Operators 3.0
XPath and XQuery Functions and Operators 3.0. Michael Kay, editor. W3C Working Draft, 13 December 2011.
XQuery and XPath Data Model 3.0
XQuery and XPath Data Model 3.0. Norman Walsh, Anders Berglund, John Snelson, editors. W3C Working Draft, 13 December 2011.
XFA Specification 3.3
XML Forms Architecture (XFA) Specification, version 3.3. Adobe Systems Incorporated, 09 January 2012.
XFDF Specification 3.0
XML Forms Data Format Specification, version 3.0. Adobe Systems Incorporated, August 2009.
PDF Reference 1.7
Adobe® Portable Document Format, version 1.7. Adobe Systems Incorporated, November 2006.
CSS Print Profile
CSS Print Profile. Elika J. Etemad, Mozilla Corporation, and Melinda Grant, Hewlett-Packard Company, editors. W3C Working Group Note, 14 March 2013.

B Summary of Error Conditions

err:PDF001
err:PDF001: The transformation formula is not supported.
err:PDF002
err:PDF002: The remote resource does not exist.
err:PDF003
err:PDF003: The user has no rights to access the remote resource.
err:PDF004
err:PDF004: The syntax of the transformation formula is wrong.