Copyright © 2013 Claudius Teodorescu, published by the EXPath Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This specification was published by the EXPath Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
This proposal provides a module comprising a set of functions for manipulation of PDF documents. It has been designed to be compatible with XQuery 3.0 and XSLT 3.0, as well as any other standard based on XPath 3.0.
1 Introduction
1.1 Namespace conventions
1.2 Error management
2 Generalities
2.1 Page ranges
3 Creation
3.1 The pdf:create() Function
3.2 The pdf:create-page() Function
4 Metadata
4.1 Overall metadata
4.1.1 Basic overall metadata
4.1.2 Custom overall metadata
4.2 The pdf:get-document-metadata() Function
4.3 The pdf:set-document-metadata() Function
4.4 Component metadata
5 Content navigation
6 Content manipulation
6.1 The pdf:merge() Function
6.2 The pdf:split() Function
6.2.1 The pdf:options element
6.3 The pdf:extract() Function
6.4 The pdf:insert() Function
6.5 The pdf:delete() Function
6.6 The pdf:rotate() Function
6.7 The pdf:reverse() Function
7 Links and bookmarks
7.1 The pdf:create-bookmark() Function
7.2 The pdf:edit-bookmark() Function
7.3 The pdf:delete-bookmark() Function
7.4 The pdf:import-bookmarks() Function
7.5 The pdf:export-bookmarks() Function
7.6 The pdf:create-link() Function
7.7 The pdf:edit-link() Function
7.8 The pdf:delete-link() Function
7.9 The pdf:import-links() Function
7.10 The pdf:export-links() Function
8 Form controls
8.1 The pdf:get-text-fields() Function
8.2 The pdf:set-text-fields() Function
9 Stamping, commenting, annotating, and marking
9.1 The pdf:stamp() Function
10 Contents rasterization
10.1 The pdf:to-image() Function
11 Attachments
11.1 The pdf:list-file-attachments() Function
11.2 The pdf:import-attachment() Function
11.3 The pdf:export-attachment() Function
12 Validation
12.1 The pdf:validate() Function
12.2 The pdf:validate-links() Function
12.3 The pdf:validate-bookmarks() Function
13 Optimization
13.1 The pdf:optimize() Function
13.2 The pdf:linearize() Function
13.3 The pdf:compress() Function
13.4 The pdf:uncompress() Function
14 Security
14.1 The pdf:encrypt() Function
14.2 The pdf:decrypt() Function
14.3 The pdf:add-signature() Function
14.4 The pdf:remove-signature() Function
15 Repairing
15.1 The pdf:repair() Function
16 Scenarios of usage
16.1 Insert a (blank) page every nth page
16.2 Delete broken links
16.3 Audit Bookmarks and Links
16.4 Validate PDF document
16.5 Delete pages
16.6 Rotate pages
This module allows manipulation of PDF documents.
The module defined by this document defines functions and elements in the namespace http://expath.org/ns/pdf. In this document, the pdf prefix, when used, is bound to this namespace URI.
Error codes are defined in the namespace http://expath.org/ns/error. In this document, the err prefix, when used, is bound to this namespace URI.
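As a minimal sketch, a query might bind both prefixes as shown here. The `import module` form and the external variable `$pdf` are assumptions, since how a processor exposes the module is implementation-defined. The `xqt` prefix is a hypothetical name, introduced only to reach the standard XQuery 3.0 error variables while `err` is bound to the EXPath error namespace.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Prefix convention used in this document for error codes. :)
declare namespace err = "http://expath.org/ns/error";

(: Hypothetical prefix for the standard error variables of XQuery 3.0 try/catch. :)
declare namespace xqt = "http://www.w3.org/2005/xqt-errors";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

try {
  pdf:get-document-metadata($pdf)
} catch err:* {
  (: Any error whose code is in the EXPath error namespace is caught here. :)
  <error code="{ $xqt:code }">{ $xqt:description }</error>
}
```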
These functions are for creation of PDF documents.
These functions are for getting and setting metadata about a PDF document and its contents.
This is metadata about the document itself.
The basic metadata is given in the table below (this is an excerpt from [PDF Reference 1.7]):
Key (as xs:string) | Value | Meaning |
---|---|---|
Title | xs:string | The document’s title. (Optional) |
Author | xs:string | The name of the person who created the document. (Optional) |
Subject | xs:string | The subject of the document. (Optional) |
Keywords | xs:string | Keywords associated with the document. (Optional) |
Creator | xs:string | If the document was converted to PDF from another format, the name of the application (for example, Adobe FrameMaker®) that created the original document from which it was converted. (Optional) |
Producer | xs:string | If the document was converted to PDF from another format, the name of the application (for example, Acrobat Distiller) that converted it to PDF. (Optional) |
CreationDate | xs:date | The date and time the document was created. (Optional) |
ModDate | xs:date | The date and time the document was most recently modified. (Required if PieceInfo is present in the document catalog; otherwise optional) |
Trapped | xs:string | A string indicating whether the document has been modified to include trapping information (see [PDF Reference 1.7], section 10.10.5, “Trapping Support”). The legal values are: "True", "False", and "Unknown". The default value is "Unknown". (Optional) |
4.2 The pdf:get-document-metadata() Function
This function returns the overall metadata of a PDF document, as defined in [PDF Reference 1.7], section 10.2 Metadata, including any custom metadata. See [overall-metadata] for details about the overall metadata.
pdf:get-document-metadata($contents as xs:base64Binary?) as map(xs:string, xs:string)?
$contents is the PDF contents to get the metadata from.
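A minimal sketch of reading one entry from the returned map. It assumes an implementation that exposes the module through `import module`, an external variable `$pdf` holding the document (a name chosen for illustration), and map lookup by function call as in XQuery 3.1.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

let $metadata := pdf:get-document-metadata($pdf)
return
  if (empty($metadata)) then
    (: The declared return type allows an empty result. :)
    "(no metadata)"
  else
    (: A map is a function, so a key is looked up by calling the map;
       fall back to a placeholder when the Title key is absent. :)
    ($metadata("Title"), "(untitled)")[1]
```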
4.3 The pdf:set-document-metadata() Function
This function is used to set the overall metadata for a PDF document.
pdf:set-document-metadata($contents as xs:base64Binary?, $metadata as item()) as xs:base64Binary*
$contents is the PDF contents to apply the metadata to.
$metadata is the overall metadata to be applied. See [overall-metadata] for details about the overall metadata.
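Since `$metadata` is typed as `item()`, the exact shape of the argument is left open. The sketch below assumes it can be passed as a map keyed by the names from the table of basic overall metadata, and uses the XQuery 3.1 map constructor syntax. The variable `$pdf` and the sample values are illustrative only.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

(: Assumption: a map of key/value pairs is an acceptable $metadata argument. :)
let $metadata := map {
  "Title": "Quarterly report",
  "Author": "Jane Doe",
  "Keywords": "report, finance"
}
return pdf:set-document-metadata($pdf, $metadata)
```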
These functions are for manipulation of contents of the PDF documents.
6.1 The pdf:merge() Function
This function is used for merging multiple PDF documents or subsets of them (groups of pages).
pdf:merge($contents as xs:base64Binary, $new-resource-metadata as element(pdf:resource-metadata)) as xs:base64Binary
6.2 The pdf:split() Function
This function is used for splitting a PDF document into a number of sections. It returns a sequence of sections.
Several splitting modes are possible, for example splitting by page, splitting by bookmark, or extracting up to a given text.
pdf:split($contents as xs:base64Binary?, $options as element(pdf:options)) as xs:base64Binary*
$contents is the PDF contents to be split.
$options are the options for the current operation.
6.3 The pdf:extract() Function
This function is used for extracting pages from a PDF document. A parameter controls whether the extracted pages are deleted from the document after extraction. It returns the extracted pages.
6.5 The pdf:delete() Function
This function is used for deleting pages from a PDF document. It returns the PDF document without the deleted pages.
pdf:delete($contents as xs:base64Binary?, $page-ranges as xs:string*) as xs:base64Binary?
$contents is the PDF contents from which pages are to be deleted.
$page-ranges is the specification of the page ranges to be deleted. The syntax for page ranges can be found at 2.1 Page ranges. For an example of usage, see scenario 16.5 Delete pages.
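A minimal sketch of a call with several page ranges. The strings "2" and "5-7" are placeholders that assume a single-page and a from-to notation; the authoritative syntax is the one given in 2.1 Page ranges. The `import module` form and the variable `$pdf` are likewise assumptions.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

(: Assumption: "2" and "5-7" are valid page range strings;
   see 2.1 Page ranges for the actual syntax. :)
pdf:delete($pdf, ("2", "5-7"))
```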
6.6 The pdf:rotate() Function
This function is used for rotating indicated pages of a PDF document. The rest of the pages remain unchanged and the page order is maintained. It returns the modified PDF document.
pdf:rotate($contents as xs:base64Binary?, $page-ranges-and-directions as map(xs:string, xs:string)?) as xs:base64Binary?
$contents is the PDF contents in which pages are to be rotated.
$page-ranges-and-directions is the specification of the page ranges to be rotated and their directions. This argument is a map, having the page ranges as keys and the corresponding directions as values. The syntax for page ranges can be found at 2.1 Page ranges. The legal values for the rotation direction are: "right 90" (clockwise by 90 degrees), "180" (rotation by 180 degrees), and "left 90" (counterclockwise by 90 degrees). For an example of usage, see scenario 16.6 Rotate pages.
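A minimal sketch of building the map argument, using the XQuery 3.1 map constructor syntax. The direction values are the legal ones listed above; the page range keys are placeholders that assume a from-to notation (see 2.1 Page ranges for the actual syntax), and the `import module` form and the variable `$pdf` are assumptions.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

(: Keys are page ranges (placeholder syntax), values are rotation directions. :)
let $rotations := map {
  "1-3": "right 90",
  "4": "180",
  "10-12": "left 90"
}
return pdf:rotate($pdf, $rotations)
```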
These functions are for manipulation of links and bookmarks of a PDF document.
7.3 The pdf:delete-bookmark() Function
This function is used to delete bookmarks from a PDF document.
7.4 The pdf:import-bookmarks() Function
This function is used to import bookmarks into a PDF document.
These functions are for gathering information about, and interacting with, the form controls of a PDF document.
8.1 The pdf:get-text-fields() Function
This function gets all the text fields of a PDF document. It returns a map containing, for each text field, a pair of fully qualified name and value.
pdf:get-text-fields($contents as xs:base64Binary?) as map(xs:string, xs:string)?
$contents is the PDF contents from which to get the text fields.
8.2 The pdf:set-text-fields() Function
This function sets the text fields of a PDF document. It returns the updated PDF contents.
pdf:set-text-fields($contents as xs:base64Binary?, $text-fields as map(xs:string, xs:string)?) as xs:base64Binary?
$contents is the PDF contents in which to set the text fields.
$text-fields is the information about the text fields, namely a map containing, for each text field to be set, a pair of fully qualified name and value.
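A minimal sketch that lists the current text fields and then sets one of them. The field name "applicant.name" and its value are hypothetical, as are the `import module` form and the variable `$pdf`. The map constructor and `map:keys` follow XQuery 3.1.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

let $fields := pdf:get-text-fields($pdf)
return (
  (: One "name = value" string per text field; yields nothing if $fields is empty. :)
  for $name in $fields ! map:keys(.)
  return $name || " = " || $fields($name),

  (: The updated PDF contents; "applicant.name" is a hypothetical field name. :)
  pdf:set-text-fields($pdf, map { "applicant.name": "Jane Doe" })
)
```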
These functions associate various objects with a PDF document.
9.1 The pdf:stamp() Function
This function is used for applying a stamp to a PDF document. It has two signatures, the first one without a CSS selector for the stamp.
When using the first signature, the content of the $stamp-styling parameter should consist of a set of CSS declarations that will be applied only to the respective stamp. Also, the respective stamp will be applied on every page of the PDF contents.
pdf:stamp($contents as xs:base64Binary?, $stamp as item(), $stamp-styling as xs:string) as xs:base64Binary?
pdf:stamp($contents as xs:base64Binary?, $stamp as item(), $stamp-selector as xs:string, $stamp-styling as xs:string) as xs:base64Binary?
$contents is the PDF contents whose pages are to be stamped.
$stamp is the stamp to be applied to the PDF contents. The stamp can be a text, an image, a PDF document, an HTML + JavaScript + CSS document, an SVG document, etc. (implementations should define which formats are supported).
$stamp-selector is the CSS selector used to match the current stamp. Such a selector is needed for further operations on the stamp, such as updating and deleting.
$stamp-styling is the CSS styling for the current stamp.
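A minimal sketch of both signatures with a text stamp. The stamp texts, the selector "#confidential", and the CSS declarations are illustrative; which stamp formats and CSS properties are honoured is left to implementations. The `import module` form and the variable `$pdf` are assumptions.

```xquery
(: Assumption: the processor makes the module available under this namespace URI. :)
import module namespace pdf = "http://expath.org/ns/pdf";

(: Assumption: the PDF contents are supplied by the caller. :)
declare variable $pdf as xs:base64Binary external;

(: First signature: no selector, so the stamp goes on every page. :)
let $draft := pdf:stamp($pdf, "DRAFT", "color: red; font-size: 48pt; opacity: 0.3")

(: Second signature: the selector names the stamp for later updating or deleting. :)
return
  pdf:stamp($draft, "Confidential", "#confidential", "color: gray; font-size: 10pt")
```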
These functions are for rendering the PDF documents.
10.1 The pdf:to-image() Function
This function is used for converting pages of a PDF document to images. It returns a sequence of the generated images.
pdf:to-image($contents as xs:base64Binary?, $format as xs:string, $scaling as xs:string) as xs:base64Binary*
$contents is the PDF contents whose pages are to be converted to images.
$format is the format of the output images.
$scaling is the scaling of the output images.
These functions are for manipulation of PDF file attachments.
11.1 The pdf:list-file-attachments() Function
This function is used to list the file attachments of a PDF document.
These functions are for validating a PDF document.
These functions are for optimizing a PDF document.
13.1 The pdf:optimize() Function
This function is used for optimizing a PDF document by reducing the file size without affecting quality.
These functions are for protection of the PDF documents.
14.1 The pdf:encrypt() Function
This function is used for encrypting a PDF document by using a digital certificate.
This section presents scenarios of usage for the functions of this module.
for $broken-link in pdf:validate-links($pdf-document) return if () then () else ()
One can validate bookmarks and links, and export those found to be broken. These can be included in a report and/or fixed and imported back into the document.
One can validate a PDF document against certain validation criteria, such as: the dimensions of all or certain pages should be A4 (210 mm × 297 mm); there should be no contents within a certain rectangular area of a page (for instance, the left margin where the print shop inserts a bar code); the number of pages should be less than N; and the document should use a given PDF version.