The DocumentCloud API

All APIs besides the authentication endpoints are served from https://api.www.documentcloud.org/api.
If you develop in Python, check out python-documentcloud which is our Python wrapper for the DocumentCloud API and its corresponding documentation.

Overview

The API endpoints are generally organized as /api/<resource>/ representing the entirety of the resource, and /api/<resource>/<id>/ representing a single resource identified by its ID. Not all REST actions are available on every endpoint, and some resources may have additional endpoints, but the following is how HTTP verbs generally map to REST operations:

/api/<resource>/

HTTP Verb	REST Operation	Parameters
GET	List the resources	May support parameters for filtering
POST	Create a new resource	Must supply all `required` fields, and may supply all non-`read only` fields

/api/<resource>/<id>/

HTTP Verb	REST Operation	Parameters
GET	Display the resource
PUT	Update the resource	Same as for creating - all required fields must be present. For updating resources `PATCH` is usually preferred, as it allows you to only update the fields needed. `PUT` support is included for completeness
PATCH	Partially update the resource	Same as for creating, but all fields are optional
DELETE	Destroy the resource

A select few of the resources support some bulk operations on the /api/<resource>/ route:

HTTP Verb	REST Operation	Parameters
POST	Bulk create	A list of objects, where each object is what you would `POST` for a single object
PUT	Bulk update	A list of objects, where each object is what you would `PUT` for a single object — except it must also include the ID
PATCH	Bulk partial update	A list of objects, where each object is what you would `PATCH` for a single object — except it must also include the ID
DELETE	Bulk destroy	Bulk destroys will have a filtering parameter, often required, to specify which resources to delete

Responses

List responses will be of the form

{
    "next": <next url if applicable>,
    "previous": <previous url if applicable>,
    "results": <list of results>
}

with a 200 status code. The document search route will also include a count key, with a total count of all documents returned by the search.

Getting a single resource, creating and updating will return just the object. Create uses a 201 status code and get and update will return 200.

Delete will have an empty response with a 204 status code.

Batch updates will contain a list of objects updated with a 200 status code.

Specifying invalid parameters will generally return a 400 error code with a JSON object with a single "error" key, whose value will be an error message. Specifying an ID that does not exist or that you do not have access to view will return status 404. Trying to create or update a resource you do not have permission to will return status 403.
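
As a rough illustration, client code could map the status codes described above to outcomes along these lines. The function name `describe_response` and the returned labels are hypothetical and not part of the API.

```python
def describe_response(status, body=None):
    """Map a status code described above to a short outcome label."""
    if status in (200, 201):
        return "ok"
    if status == 204:
        return "deleted"
    if status == 400:
        # Invalid parameters generally come back as {"error": "<message>"}.
        message = (body or {}).get("error", "unknown error")
        return f"invalid request: {message}"
    if status == 403:
        return "permission denied"
    if status == 404:
        return "not found or not visible to you"
    return f"unexpected status {status}"
```

For example, `describe_response(400, {"error": "bad value"})` yields a label that carries the server's error message.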

Pagination

All list views accept a per_page parameter, which specifies how many resources to list per page. It is 25 by default and may be set up to 100 for authenticated users. For anonymous users it is restricted to 25. You may register for a free account at https://accounts.muckrock.com/ to use the 100 limit. You may view subsequent pages by using the next URL.

Cursor Based Pagination

Page offset pagination does not scale well to a large number of pages. For improved performance, DocumentCloud uses a cursor based pagination system. Instead of a page parameter, there is a cursor parameter, which accepts an opaque cursor which specifies the last value seen. To use this system, you must use the next and previous links as returned by the API, as random access is not available. This system also restricts arbitrary ordering of the results, except for the document search route, which will still allow re-ordering with cursor based pagination.
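
Following the next links can be sketched as below. This is a minimal sketch: `fetch_json` is a caller-supplied function (an assumption of this example, not part of the API) that takes a URL and returns the decoded JSON body, and the last page is assumed to have a missing or null next value.

```python
def iter_results(first_url, fetch_json):
    """Yield every result from a list view by following its `next` links."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page["results"]
        # Assumed: `next` is missing or null once the last page is reached.
        url = page.get("next")
```

Because the cursor is opaque, the sketch never builds page URLs itself; it only reuses the URLs the API hands back.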

If the cursor based pagination breaks your workflow, you may continue to use the old page-offset based pagination system for now. In the future, this will be disabled completely, and you will be forced to use the cursor based pagination. To use the page-offset based pagination, which also has a top level count key with a total count of the objects returned for all list queries, add a version=1.0 query parameter to your API queries. Be aware that this will make your queries less performant, possibly to the point of them being unusable. This should only be used as a stop-gap solution while you update your workflow to use the new cursor based pagination. Please reach out to info@documentcloud.org if you need assistance moving to the new version.
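
If you do need the stop-gap, one way to add the version=1.0 parameter to an existing query URL is sketched here using only the Python standard library. The helper name `with_legacy_pagination` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_legacy_pagination(url):
    """Return `url` with version=1.0 set, keeping its other query parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "version"
    ]
    query.append(("version", "1.0"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```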

Sub Resources

Some resources also support sub resources, which are resources that belong to another resource. The general format is:

/api/<resource>/<id>/<subresource>/

/api/<resource>/<id>/<subresource>/<subresource_id>/

It generally works the same as a resource, except scoped to the parent resource.
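
As an illustration, resource and sub resource URLs can be assembled from path segments. The helper `resource_url` is hypothetical, and the IDs used with it are placeholders.

```python
BASE_URL = "https://api.www.documentcloud.org/api"


def resource_url(*segments):
    """Join path segments into an API URL ending in the trailing slash the routes use."""
    path = "/".join(str(segment) for segment in segments)
    return f"{BASE_URL}/{path}/"
```

With this helper, `resource_url("documents", 123, "notes")` addresses the notes of document 123, and adding a fourth segment addresses a single note.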

Filters

Filters on list views which have choices generally allow you to specify multiple values, and will return all resources that match at least one of the choices. To specify multiple values you may either supply a comma separated list (?parameter=1,2) or specify the parameter multiple times (?parameter=1&parameter=2).
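
Both styles can be produced with the standard library, as in this small sketch. The function names are hypothetical.

```python
from urllib.parse import urlencode


def comma_separated(name, values):
    """Encode several values as one parameter, e.g. parameter=1,2."""
    joined = ",".join(str(value) for value in values)
    # safe="," keeps the commas readable instead of percent-encoding them.
    return urlencode({name: joined}, safe=",")


def repeated(name, values):
    """Encode several values by repeating the parameter, e.g. parameter=1&parameter=2."""
    return urlencode([(name, value) for value in values])
```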

Rate Limits

The DocumentCloud API is rate limited to 10 requests per second. It also allows bursts of up to 20 requests. This means that if you exceed the limit of 10 requests per second, it will serve you up to 20 requests more quickly, while keeping track of your average rate. After the 20 requests are served, additional requests will be rejected with an HTTP status of 503 until you again fall under an average of 10 requests per second. If you use the Python DocumentCloud library, it will automatically throttle your requests to 10 per second to avoid going over the rate limit. If you are writing custom code, please be mindful of the rate limits.
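
For custom code, a simple client-side throttle might look like the following. This is only a sketch of one possible policy (evenly spacing calls) and is deliberately stricter than the server's burst allowance. The class name and its injectable `clock` and `sleep` arguments are assumptions made for illustration.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate=10, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `throttle.wait()` immediately before each request keeps the request rate at or below the configured value.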

There is also a secondary limit of 500 requests per day for anonymous users. If you exceed this limit, you will start receiving errors with an HTTP status of 429. In order to avoid this, please register for a free account at https://accounts.muckrock.com/. Currently, there are no daily limits for registered accounts, although this may change in the future.

Authentication

Authentication happens at the MuckRock accounts server located at https://accounts.muckrock.com/. The API provided there will supply you with a JWT access token and refresh token in exchange for your username and password. The access token should be placed in the Authorization header preceded by Bearer, for example {'Authorization': 'Bearer <access token>'}. The access token is valid for 5 minutes, after which you will receive a 403 forbidden error if you continue trying to use it. At this point you may use the refresh token to obtain a new access token and refresh token. The refresh token is valid for one day.

POST /api/token/

Param	Type	Description
username	String	Your username
password	String	Your password

Response

{'access': <access token>, 'refresh': <refresh token>}

POST /api/refresh/

Param	Type	Description
refresh	String	Refresh token

Response

{'access': <access token>, 'refresh': <refresh token>}
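
The token exchange could be scripted roughly as follows with the Python standard library. This is a sketch under one notable assumption: the parameters are sent as a JSON body, which the tables above do not specify. All function names are hypothetical.

```python
import json
import urllib.request

ACCOUNTS_URL = "https://accounts.muckrock.com"


def _json_post(path, payload):
    """Build (but do not send) a POST request to the accounts server.

    Assumption: the endpoint accepts a JSON body.
    """
    return urllib.request.Request(
        ACCOUNTS_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_request(username, password):
    """Request that exchanges a username and password for tokens."""
    return _json_post("/api/token/", {"username": username, "password": password})


def refresh_request(refresh_token):
    """Request that exchanges a refresh token for a new pair of tokens."""
    return _json_post("/api/refresh/", {"refresh": refresh_token})


def auth_header(access_token):
    """Header to attach to DocumentCloud API requests."""
    return {"Authorization": f"Bearer {access_token}"}


def fetch_tokens(request):
    """Send a prepared request and return the decoded response dict."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A client would call `fetch_tokens(token_request(...))` once, then `fetch_tokens(refresh_request(...))` when the access token expires after 5 minutes.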

Documents

The documents API allows you to upload, browse and edit documents. To add or remove documents from a project, please see project documents.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the document
access	String	Default: `private`	The access level for the document
asset_url	String	Read Only	The base URL to load this document's static assets from
canonical_url	URL	Read Only	The canonical URL to view this document
created_at	Date Time	Read Only	Time stamp when this document was created
data	JSON	Not Required	Custom metadata
description	String	Not Required	A brief description of the document
edit_access	Bool	Read Only	Does the current user have edit access to this document
file_url	URL	Create Only	A URL to a publicly accessible document for the URL Upload Flow
file_hash	String	Read Only	A sha1 hash representation of the raw PDF data as a hexadecimal string.
force_ocr	Bool	Create Only	Force OCR even if the PDF contains embedded text - only include if `file_url` is set, otherwise should set `force_ocr` on the call to the processing endpoint. This operation clears underlying metadata about the document like authorship, creation date, etc. If this is a concern, make sure to keep a copy of the original document.
language	String	Default: `eng`	The language the document is in
ocr_engine	String	Not Required	Specifies which OCR engine to use on documents. Use with force_ocr set to true. Accepted values: tess4 for Tesseract and textract for Amazon Textract (which requires AI credits).
noindex	Bool	Not Required	Ask search engines and DocumentCloud search to not index this document
organization	Integer	Read Only	The ID for the organization this document belongs to
original_extension	String	Default: `pdf`	The original file extension of the document you are seeking to upload. It must be a supported file type
page_count	Integer	Read Only	The number of pages in this document
page_spec	Integer	Read Only	The dimensions for all pages in the document
pages	JSON	Write Only	Allows you to set page text via the API. See set page text for more information.
presigned_url	URL	Read Only	The pre-signed URL to directly `PUT` the PDF file to
projects	List:Integer	Create Only	The IDs of the projects this document belongs to - this may be set on creation, but may not be updated. See project documents
publish_at	Date Time	Not Required	A timestamp when to automatically make this document public
published_url	URL	Not Required	The URL where this document is embedded
related_article	URL	Not Required	The URL for the article about this document
remaining	JSON	Read Only	The number of pages left for text and image processing - only included if `remaining` is included as a `GET` parameter
slug	String	Read Only	The slug is a URL safe version of the title
source	String	Not Required	The source who produced the document
status	String	Read Only	The status for the document
title	String	Required	The document's title
updated_at	Date Time	Read Only	Time stamp when the document was last updated
user	Integer	Read Only	The ID for the user this document belongs to

Expandable fields: user, organization, projects, sections, notes

Uploading a Document

There are two supported ways to upload documents — directly uploading the file to our storage servers or by providing a URL to a publicly available PDF or other supported file type. To upload another supported file type you will need to include the original_extension field documented above.

Direct File Upload Flow

POST /api/documents/

To initiate an upload, you will first create the document. You may specify all writable document fields (besides file_url). The response will contain all the fields for the document, with two being of note for this flow: presigned_url and id.

If you would like to upload files in bulk, you may POST a list of JSON objects to /api/documents/ instead of a single object. The response will contain a list of document objects.

PUT <presigned_url>

Next, you will PUT the binary data for the file to the given presigned_url. The presigned URL is valid for 5 minutes. You may obtain a new URL by issuing a GET request to /api/documents/<id>/.

If you are bulk uploading, you will still need to issue a single PUT to the corresponding presigned_url for each file.

POST /api/documents/<id>/process/

Finally, you will begin processing of the document. Note that this endpoint accepts only one optional parameter — force_ocr which, if set to true, will OCR the document even if it contains embedded text.

If you are uploading in bulk you can issue a single POST to /api/documents/process/ which will begin processing in bulk. You should pass a list of objects containing the document IDs of the documents you would like to begin processing. You may optionally specify force_ocr for each document.
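
The three steps of the direct file upload flow for a single document can be sketched as follows. The `send` argument is a hypothetical transport supplied by the caller: it is assumed to take a method, a URL, and optional `json` or `data` keyword arguments, to attach the Authorization header only for API URLs, and to return the decoded JSON body (or None).

```python
API = "https://api.www.documentcloud.org/api"


def upload_document(send, title, file_bytes, force_ocr=False):
    """Create, upload and process one document; returns the new document's ID."""
    # Step 1: create the document record to obtain `id` and `presigned_url`.
    document = send("POST", f"{API}/documents/", json={"title": title})
    # Step 2: PUT the raw file bytes to the presigned URL (valid for 5 minutes).
    send("PUT", document["presigned_url"], data=file_bytes)
    # Step 3: start processing the uploaded file.
    send(
        "POST",
        f"{API}/documents/{document['id']}/process/",
        json={"force_ocr": force_ocr},
    )
    return document["id"]
```

Keeping the transport separate makes it easy to reuse the same flow with any HTTP library, or to exercise it with a fake transport in tests.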

URL Upload Flow

POST /api/documents/

If you set file_url to a URL pointing to a publicly accessible PDF, our servers will fetch the PDF and begin processing it automatically.

You may also send a list of document objects with file_url set to bulk upload files using this flow.

Endpoints

GET /api/documents/ — List documents
POST /api/documents/ — Create document
PUT /api/documents/ — Bulk update documents
PATCH /api/documents/ — Bulk partial update documents
DELETE /api/documents/ — Bulk delete documents
- Bulk delete will not allow you to indiscriminately delete all of your documents. You must specify which document IDs you want to delete using the id__in filter.
POST /api/documents/process/ — Bulk process documents
- This will allow you to process multiple documents with a single API call. It expects a list of objects, where each object contains the ID of the document to process and an optional boolean, force_ocr, which, if set to true, will OCR the document even if it contains embedded text. Example: [{"id": 1, "force_ocr": true}, {"id": 2}]
GET /api/documents/search/ — Search documents
GET /api/documents/<id>/ — Get document
PUT /api/documents/<id>/ — Update document
PATCH /api/documents/<id>/ — Partial update document
DELETE /api/documents/<id>/ — Delete document
POST /api/documents/<id>/process/ — Process document
- This will process a document. It is used after uploading the file in the direct file upload flow or to reprocess a document, which you may want to do in the case of an error. It accepts one optional boolean parameter, force_ocr, which will OCR the document even if it contains embedded text if it is set to true. Note that it is an error to try to process a document that is already processing.
DELETE /api/documents/<id>/process/ — Cancel processing document
- This will cancel the processing of a document. Note that it is an error to try to cancel the processing if the document is not processing.
GET /api/documents/<id>/search/ — Search within a document

Filters

ordering - Sort the results. Valid options include: created_at, page_count, title, and source. You may prefix any valid option with - to sort it in reverse order.
user - Filter by the ID of the owner of the document.
organization - Filter by the ID of the organization of the document.
project - Filter by the ID of a project the document is in.
access - Filter by the access level.
status - Filter by status.
created_at__lt, created_at__gt - Filter by documents created either before or after a given date. You may specify both to find documents created between two dates. This may be a date or date time, in the following formats: YYYY-MM-DD or YYYY-MM-DD+HH:MM:SS.
page_count, page_count__lt, page_count__gt - Filter by documents with a specified number of pages, or with fewer or more pages than a given amount.
id__in - Filter by specific document IDs, passed in as comma separated values.
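
As an example of combining these filters, the following sketch builds a list URL from a date range, an ordering, and any other filters. The helper name `documents_query` and its argument names are hypothetical; only the query parameter names come from the list above.

```python
from datetime import date
from urllib.parse import urlencode

DOCUMENTS_URL = "https://api.www.documentcloud.org/api/documents/"


def documents_query(created_after=None, created_before=None, ordering=None, **filters):
    """Build a document list URL from the filters described above."""
    params = dict(filters)
    if created_after:
        params["created_at__gt"] = created_after.isoformat()  # YYYY-MM-DD
    if created_before:
        params["created_at__lt"] = created_before.isoformat()
    if ordering:
        params["ordering"] = ordering  # e.g. "-page_count" for reverse order
    return DOCUMENTS_URL + "?" + urlencode(params)
```

For instance, `documents_query(created_after=date(2024, 1, 1), created_before=date(2024, 2, 1), ordering="-page_count")` asks for documents created in January 2024, longest first.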

Notes

Notes can be left on documents for yourself, or to be shared with other users. They may contain HTML for formatting.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the note
access	String	Default: `private`	The access level for the note
content	String	Not Required	Content for the note, which may include HTML
created_at	Date Time	Read Only	Time stamp when this note was created
edit_access	Bool	Read Only	Does the current user have edit access to this note
organization	Integer	Read Only	The ID for the organization this note belongs to
page_number	Integer	Required	The page of the document this note appears on
title	String	Required	Title for the note
updated_at	Date Time	Read Only	Time stamp when this note was last updated
user	ID	Read Only	The ID for the user this note belongs to
x1	Float	Not Required	Left most coordinate of the note, as a percentage of page size
x2	Float	Not Required	Right most coordinate of the note, as a percentage of page size
y1	Float	Not Required	Top most coordinate of the note, as a percentage of page size
y2	Float	Not Required	Bottom most coordinate of the note, as a percentage of page size

Expandable fields: user, organization

The coordinates must either all be present or absent — absent represents a page level note which is displayed between pages.
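
A client could check this all-or-nothing rule before sending a note, as in this small sketch. The function name is hypothetical.

```python
COORDINATE_FIELDS = ("x1", "x2", "y1", "y2")


def coordinates_consistent(note):
    """Return True if the note has all four coordinates or none of them."""
    present = [field in note for field in COORDINATE_FIELDS]
    return all(present) or not any(present)
```

A note with only a title and page number passes the check and would be a page level note, while a note with just x1 set does not pass.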

Endpoints

GET /api/documents/<document_id>/notes/ - List notes
POST /api/documents/<document_id>/notes/ - Create note
GET /api/documents/<document_id>/notes/<id>/ - Get note
PUT /api/documents/<document_id>/notes/<id>/ - Update note
PATCH /api/documents/<document_id>/notes/<id>/ - Partial update note
DELETE /api/documents/<document_id>/notes/<id>/ - Delete note

Sections

Sections can mark certain pages of your document — the viewer will show an outline of the sections allowing for quick access to those pages.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the section
page_number	Integer	Required	The page of the document this section appears on
title	String	Required	Title for the section

Endpoints

GET /api/documents/<document_id>/sections/ - List sections
POST /api/documents/<document_id>/sections/ - Create section
GET /api/documents/<document_id>/sections/<id>/ - Get section
PUT /api/documents/<document_id>/sections/<id>/ - Update section
PATCH /api/documents/<document_id>/sections/<id>/ - Partial update section
DELETE /api/documents/<document_id>/sections/<id>/ - Delete section

Errors

Sometimes errors happen — if you find one of your documents in an error state, you may check the errors here to see a log of the latest, as well as all previous errors. If the message is cryptic, please contact us — we are happy to help figure out what went wrong.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the error
created_at	Date Time	Read Only	Time stamp when this error was created
message	String	Required	The error message

Endpoints

GET /api/documents/<document_id>/errors/ - List errors

Data

Documents may contain user supplied metadata. You may assign multiple values to arbitrary keys. This is represented as a JSON object, where each key has a list of strings as a value. The special key _tag is used by the front end to represent tags. These values are useful for searching and organizing documents. You may directly set or update the data from the document endpoints, but these additional endpoints are supplied to add or remove data on a per key basis.

Fields

Field	Type	Options	Description
values	List:String	Required	The values associated with a key
remove	List:String	Not Required	Values to be removed

remove is only used for PATCHing. values is not required when PATCHing.

Endpoints

GET /api/documents/<document_id>/data/ - List values for all keys
- The response for this is a JSON object with a property for each key, which will always be a list of strings, corresponding to the values associated with that key. Example:
```
{
  "_tag": ["important"],
  "location": ["boston", "new york"]
}
```
GET /api/documents/<document_id>/data/<key>/ - Get values for the given key
- The response for this is a JSON list of strings. Example: ["one", "two"]
PUT /api/documents/<document_id>/data/<key>/ - Set values for the given key
- This will override all values currently under key
PATCH /api/documents/<document_id>/data/<key>/ - Add and/or remove values for the given key
DELETE /api/documents/<document_id>/data/<key>/ - Delete all values for a given key
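
A PATCH body for a single key can be assembled as in the sketch below, leaving out whichever of values and remove is not needed. The helper name `data_patch` is hypothetical.

```python
def data_patch(add=(), remove=()):
    """Build a PATCH body for /data/<key>/ that adds and/or removes values."""
    body = {}
    if add:
        body["values"] = list(add)
    if remove:
        body["remove"] = list(remove)
    return body
```

For example, `data_patch(add=["boston"], remove=["new york"])` produces a body with both a values list and a remove list.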

Redactions

Redactions allow you to obscure parts of the document which are confidential before publishing them. The pages which are redacted will be fully flattened and reprocessed, so that the original content is not present in lower levels of the image or as text data. Redactions are not reversible, and may only be created, not retrieved or edited. Redacting a document strips available metadata from a document about authorship, creation date, etc. If this is a concern to you, you may want to hold onto an original copy before redaction, as it is irreversible.

Fields

Field	Type	Options	Description
page_number	Integer	Required	The page of the document this redaction appears on
x1	Float	Required	Left most coordinate of the redaction, as a percentage of page size
x2	Float	Required	Right most coordinate of the redaction, as a percentage of page size
y1	Float	Required	Top most coordinate of the redaction, as a percentage of page size
y2	Float	Required	Bottom most coordinate of the redaction, as a percentage of page size

Endpoints

POST /api/documents/<document_id>/redactions/ - Create redaction

Modifications

Modifications allow you to perform page modification operations on a document, including moving pages, rotating pages, copying pages, deleting pages, and inserting pages from other documents. Applying modifications effectively shuffles, removes, and copies pages, preserving and duplicating page information as needed (this includes page text and any annotations and sections attached to the page). No page text needs to be reprocessed or re-OCR'd. After successfully applying modifications, the document cannot be reverted.

Modification Specification

To support a flexible host of potential modifications, you must pass in the modifications as a JSON array that lists the operations to take place. The modification specification defines the pages that should compose the document post-modification and any operations such as rotation to apply to the pages. Each element of the modification array can have the following fields (instructive examples will be listed after the official specification):

Field	Description
page	A comma-separated string of page ranges, which can include individual pages or hyphenated inclusive runs of pages. Page numbers are 0-based (the first page of the document is page `0`, and `0-9` refers to the first through the 10th page of the document). Valid examples of page ranges include `"7"`, `"0-499"`, `"0-5,8,11-13"`, and `"0,0,0"` (page numbers can be repeated to duplicate them).
id	If unspecified, pull pages from the current document. Otherwise, pull pages from the document with the specified id.
modifications	An array of JSON objects defining modifications to take place. The only currently defined page modification operation is `rotate`, which rotates pages clockwise, counterclockwise, or halfway. Rotation is specified as `{"type": "rotate", "angle": <angle>}`, where `<angle>` is one of `cc`, `ccw`, or `hw` (corresponding to clockwise, counterclockwise, and halfway, respectively).

Example Specifications

The following examples assume you are modifying the Mueller Report, a 448-page document.

Example	Description
[{ "page": "0-447" }]	Leave the Mueller Report unchanged
[{ "page": "0-23,424-447" }]	Remove the middle 400 pages of the Mueller Report
[{ "page": "0-447,0-49" }]	Duplicate the first 50 pages of the Mueller Report at the end of the document
[{ "page": "0-447", "modifications": [{ "type": "rotate", "angle": "ccw" }] }]	Rotate all the pages of the Mueller Report counter-clockwise
[ { "page": "0-49", "modifications": [{ "type": "rotate", "angle": "hw" }] }, { "page": "50-447" } ]	Rotate just the first 50 pages of the Mueller Report 180 degrees
[ { "page": "0-447" }, { "page": "0-49", "id": "2000000" } ]	Import the first 50 pages of another document with id `2000000` at the end of the Mueller Report
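
To check a specification before submitting it, the page strings can be expanded locally. The following is a minimal sketch of the page range rules described above (0-based pages, inclusive runs, repeats allowed). The function names are hypothetical, and the sketch does no validation of malformed input.

```python
def expand_pages(spec):
    """Expand a page string such as "0-5,8,11-13" into a list of 0-based pages."""
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))  # runs are inclusive
        else:
            pages.append(int(part))
    return pages


def resulting_page_count(modifications):
    """Count the pages the document will have after the modifications apply."""
    return sum(len(expand_pages(item["page"])) for item in modifications)
```

For the import example, `resulting_page_count([{"page": "0-447"}, {"page": "0-49", "id": "2000000"}])` gives 498, which is the 448 original pages plus the 50 imported ones.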

Endpoints

POST /api/documents/<document_id>/modifications/ - Create modifications

Entities

Entities can be extracted using Google Cloud's Natural Language API. Entity extraction must be initialized manually per document, and entities are read-only.

Fields

Top level fields

Field	Type	Description
entity	Object	Object containing information about this particular entity
relevance	Float	An estimate as to how relevant this entity is to this document
occurences	List	A list of occurrence objects specifying where in the document this entity was found

Fields for the entity object

Field	Type	Description
name	String	The name of the entity
kind	String	The kind of entity
description	String	A short description of the entity
mid	String	The Knowledge Graph ID
wikipedia_url	URL	The Wikipedia URL for this entity
metadata	Object	Additional metadata for the entity, based on its kind

Fields for the occurrence objects

Field	Type	Description
page	Integer	The page of the document this occurs on
offset	Integer	The character offset into the document this occurs on
content	String	The content of this occurrence (the occurrence may not match the entity name)
page_offset	Integer	The character offset into the page this occurs on
kind	String	`proper` for proper nouns, `common` for common nouns or `unknown`

Kind

Entity kinds include

unknown
person
location
organization
event
work_of_art
consumer_good
other
phone_number — metadata may include number, national_prefix, area_code and extension
address — metadata may include street_number, locality, street_name, postal_code, country, broad_region, narrow_region, and sublocality
date — metadata may include year, month and day
price — metadata may include value and currency

Endpoints

GET /api/documents/<document_id>/entities/ - List entities for this document
POST /api/documents/<document_id>/entities/ - Begin extracting entities for this document (POST body is empty)
DELETE /api/documents/<document_id>/entities/ - Delete all entities for this document

Filters

kind - Filter for entities with the given kind (may give multiple, comma separated)
occurences - Filter for entities with the given occurrence kind (proper or common)
relevance__gt - Filter for entities with the given relevance or higher
mid - Boolean filter for entities which do or do not have a MID
wikipedia_url - Boolean filter for entities which do or do not have a Wikipedia URL
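
The same kind of narrowing can also be applied client side to a list of entities that has already been fetched. This sketch relies only on the entity and relevance fields documented above; the function name and argument names are hypothetical.

```python
def filter_entities(entities, min_relevance=0.0, kinds=None):
    """Keep entities at or above `min_relevance`, optionally limited to some kinds."""
    return [
        item
        for item in entities
        if item["relevance"] >= min_relevance
        and (kinds is None or item["entity"]["kind"] in kinds)
    ]
```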

Projects

Projects are collections of documents. They can be used for organizing groups of documents, or for collaborating with other users by sharing access to private documents.

Projects may be used for sharing documents. When you add a collaborator to a project, you may select one of three access levels:

view - This gives the collaborator permission to view your documents that you have added to the project
edit - This gives the collaborator permission to view or edit your documents you have added to the project
admin - This gives the collaborator both view and edit permissions, as well as the ability to add their own documents and invite other collaborators to the project

Additionally, you may add public documents to a project for organizational purposes. No permissions are granted to you or your collaborators when you add documents you do not own to your project. This is tracked by the edit_access field on the project membership. When you add documents you or your organization do own, they will be added with edit_access enabled by default. You may override this using the API if you would like to add your documents to a project, but not extend permissions to any of your collaborators. Also note that documents shared with you for editing via another project may not be added to your own project with edit_access enabled. This means the original owner of a document may revoke any access they have granted to others via projects at any time.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the project
created_at	Date Time	Read Only	Time stamp when this project was created
description	String	Not Required	A brief description of the project
edit_access	Bool	Read Only	Does the current user have edit access to this project
add_remove_access	Bool	Read Only	Does the current user have permission to add and remove documents to this project
private	Bool	Default: `false`	Private projects may only be viewed by their collaborators
slug	String	Read Only	The slug is a URL safe version of the title
title	String	Required	Title for the project
updated_at	Date Time	Read Only	Time stamp when this project was last updated
user	ID	Read Only	The ID for the user who created this project

Endpoints

GET /api/projects/ - List projects
POST /api/projects/ - Create project
GET /api/projects/<id>/ - Get project
PUT /api/projects/<id>/ - Update project
PATCH /api/projects/<id>/ - Partial update project
DELETE /api/projects/<id>/ - Delete project

Filters

user — Filter by projects where this user is a collaborator
document — Filter by projects which contain the given document
private — Filter by private or public projects. Specify either true or false.
slug — Filter by projects with the given slug.
title — Filter by projects with the given title.

Project Documents

These endpoints allow you to browse, add and remove documents from a project.

Fields

Field	Type	Options	Description
document	Integer	Required	The ID for the document in the project
edit_access	Bool	Default: `true` if you have access	If collaborators of this project should be granted edit access to this document

Expandable fields: document

Endpoints

GET /api/projects/<project_id>/documents/ - List documents in the project
POST /api/projects/<project_id>/documents/ - Add a document to the project
PUT /api/projects/<project_id>/documents/ - Bulk update documents in the project
- This will set the documents in the project to exactly match the list you pass in. This means any documents currently in the project not in the list will be removed, and any in the list not currently in the project will be added.
PATCH /api/projects/<project_id>/documents/ - Bulk partial update documents in the project
- This endpoint will not create or delete any documents in the project. It will simply update the metadata for each document passed in. It expects every document in the list to already be included in the project.
DELETE /api/projects/<project_id>/documents/ - Bulk remove documents from the project
- You should specify which document IDs you want to delete using the document_id__in filter. This endpoint will allow you to remove all documents in the project if you call it with no filter specified.
GET /api/projects/<project_id>/documents/<document_id>/ - Get a document in the project
PUT /api/projects/<project_id>/documents/<document_id>/ - Update document in the project
PATCH /api/projects/<project_id>/documents/<document_id>/ - Partial update document in the project
DELETE /api/projects/<project_id>/documents/<document_id>/ - Remove document from the project
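
Because the bulk PUT replaces the project's contents with exactly the list you send, it can be worth previewing what would change first. The sketch below computes that difference locally and builds the request body from the document field described above. Both function names are hypothetical.

```python
def plan_bulk_put(current_ids, desired_ids):
    """Return (added, removed): the changes a bulk PUT of `desired_ids` would make."""
    current, desired = set(current_ids), set(desired_ids)
    return sorted(desired - current), sorted(current - desired)


def bulk_put_body(desired_ids):
    """Build the list of objects to PUT, one per document that should remain."""
    return [{"document": document_id} for document_id in desired_ids]
```

For a project holding documents 1, 2 and 3, `plan_bulk_put([1, 2, 3], [2, 3, 4])` reports that document 4 would be added and document 1 removed.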

Filters

document_id__in — Filter by specific document IDs, passed in as comma separated values.

Collaborators

Collaborators are other users with whom you would like to share this project. See Sharing Documents.

Fields

Field	Type	Options	Description
access	String	Default: `view`	The access level for this collaborator
email	Email	Create Only	Email address of user to add as a collaborator to this project
user	Integer	Read Only	The ID for the user who is collaborating on this project

Expandable fields: user

Endpoints

GET /api/projects/<project_id>/users/ - List collaborators on the project
POST /api/projects/<project_id>/users/ - Add a collaborator to the project — you must know the email address of a user with a DocumentCloud account in order to add them as a collaborator on your project
GET /api/projects/<project_id>/users/<user_id>/ - Get a collaborator in the project
PUT /api/projects/<project_id>/users/<user_id>/ - Update collaborator in the project
PATCH /api/projects/<project_id>/users/<user_id>/ - Partial update collaborator in the project
DELETE /api/projects/<project_id>/users/<user_id>/ - Remove collaborator from the project

Organizations

Organizations represent a group of users. They may share a paid plan and resources with each other. Organizations can be managed and edited from the MuckRock accounts site. You may only view organizations through the DocumentCloud API.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the organization
avatar_url	URL	Read Only	A URL pointing to an avatar for the organization — normally a logo for the company
individual	Bool	Read Only	Is this organization for the sole use of an individual
name	String	Read Only	The name of the organization
slug	String	Read Only	The slug is a URL safe version of the name
uuid	UUID	Read Only	UUID which links this organization to the corresponding organization on the MuckRock Accounts Site

Endpoints

GET /api/organizations/ - List organizations
GET /api/organizations/<id>/ - Get an organization

Users

Users can be managed and edited from the MuckRock accounts site. You may view users and change your own active organization from the DocumentCloud API.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the user
avatar_url	URL	Read Only	A URL pointing to an avatar for the user
name	String	Read Only	The user's full name
organization	Integer	Required	The user's active organization
organizations	List:Integer	Read Only	A list of the IDs of the organizations this user belongs to
username	String	Read Only	The user's username
uuid	UUID	Read Only	UUID which links this user to the corresponding user on the MuckRock Accounts Site

Expandable fields: organization

Endpoints

GET /api/users/ - List users
GET /api/users/<id>/ - Get a user
PUT /api/users/<id>/ - Update a user
PATCH /api/users/<id>/ - Partial update a user
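
Since organization is the only writable user field, changing your active organization can be sketched as a one-field PATCH. The function name and the IDs are hypothetical, the bearer-token header is an assumption, and the request is only prepared, not sent.

```python
import json
from urllib.request import Request

API_BASE = "https://api.www.documentcloud.org/api"


def set_active_organization_request(user_id, organization_id, token="<access token>"):
    """Prepare (but do not send) a PATCH that switches a user's active organization."""
    body = json.dumps({"organization": organization_id}).encode("utf-8")
    return Request(
        f"{API_BASE}/users/{user_id}/",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer access token
            "Authorization": f"Bearer {token}",
        },
    )


req = set_active_organization_request(12, 42)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```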

Add-Ons

Add-Ons allow you to easily add custom features to DocumentCloud. Learn more about Add-Ons. Add-Ons are added by installing the GitHub App in the repository you would like to use as an add-on. The API allows you to view, edit and run your add-ons.

Fields

Field	Type	Options	Description
ID	Integer	Read Only	The ID for the add-on
access	String	Read Only	The access level for the add-on (will be settable in the future)
active	Bool	Default: `false`	Whether this add-on is active for you
created_at	Date Time	Read Only	Time stamp when this add-on was created
name	String	Read Only	The name of the add-on (set in the configuration)
organization	Integer	Not Required	The ID for the organization this add-on belongs to
parameters	JSON	Read Only	The contents of the config.yaml file from the repository, converted to JSON
repository	String	Read Only	The full name of the GitHub repository, including the account name
updated_at	Date Time	Read Only	Time stamp when the add-on was last updated
user	Integer	Read Only	The ID for the user this add-on belongs to

Your active add-ons are shown to you in the web interface.

Endpoints

GET /api/addons/ - List add-ons
GET /api/addons/<id>/ - Get an add-on
PUT /api/addons/<id>/ - Update an add-on
PATCH /api/addons/<id>/ - Partial update an add-on

Filters

active — Filter by only your active or inactive add-ons
query — Searches for add-ons which contain the query in their name or description

Add-On Runs

Add-on runs represent an invocation of an add-on. You create one to run the add-on. The add-on itself can then update the add-on run as a means of supplying feedback to the caller.

Fields

Field	Type	Options	Description
UUID	UUID	Read Only	The ID for the add-on run
addon	Integer	Required	The ID of the add-on that is being run
created_at	Date Time	Read Only	Time stamp when this add-on run was created
dismissed	Bool	Default: `false`	Add-on runs are shown to the user until they are dismissed
file_name	String	Write Only	The add-on must set this to the name of the file supplied to `presigned_url` after uploading the file to make it accessible to the user
file_url	URL	Read Only	The URL of a file uploaded via `presigned_url`
message	String	Not Required	Add-ons may set informational messages for the user while running
parameters	JSON	Write Only	The add-on specific data
presigned_url	URL	Read Only	Only included if you set the `upload_file` query parameter to the name of the file to upload. This is a URL the add-on can directly `PUT` a file to in order to return it to the user
progress	Integer	Not Required	Long running add-ons may set this as a percentage of their progress
status	String	Read Only	The status of the run - `queued`, `in_progress`, `success`, or `failure`
updated_at	Date Time	Read Only	Time stamp when the add-on run was last updated
user	Integer	Read Only	The ID for the user who ran the add-on

Endpoints

POST /api/addon_runs/ - Create a new add-on run - this will start the run using GitHub actions
GET /api/addon_runs/ - List add-on runs
GET /api/addon_runs/<uuid>/ - Get an add-on run
PUT /api/addon_runs/<uuid>/ - Update an add-on run
PATCH /api/addon_runs/<uuid>/ - Partial update an add-on run

Filters

dismissed — Filter by dismissed or not dismissed add-on runs
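
The life cycle of a run can be sketched as a series of requests, shown here as plain (method, URL, body) tuples rather than real network calls. The function names are hypothetical. Passing upload_file on the GET for the run itself is an assumption; the fields table only says it is a query parameter.

```python
from urllib.parse import urlencode

API_BASE = "https://api.www.documentcloud.org/api"


def create_run(addon_id, parameters):
    """Request that starts a run: POST the add-on ID and its parameters."""
    return ("POST", f"{API_BASE}/addon_runs/", {"addon": addon_id, "parameters": parameters})


def report_progress(run_uuid, progress, message):
    """Request an add-on could send to give feedback on its own run."""
    body = {"progress": progress, "message": message}
    return ("PATCH", f"{API_BASE}/addon_runs/{run_uuid}/", body)


def file_upload_steps(run_uuid, file_name):
    """The three requests involved in returning a file to the user."""
    run_url = f"{API_BASE}/addon_runs/{run_uuid}/"
    query = urlencode({"upload_file": file_name})
    return [
        # 1. Name the file in upload_file so the response includes presigned_url
        ("GET", f"{run_url}?{query}", None),
        # 2. PUT the file contents to the presigned_url from step 1
        ("PUT", "<presigned_url from step 1>", "<file contents>"),
        # 3. Set file_name on the run so the file is accessible to the user
        ("PATCH", run_url, {"file_name": file_name}),
    ]


for step in file_upload_steps("<uuid>", "results.csv"):
    print(step[0], step[1])
```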

oEmbed

Generate an embed code for a document, page, or annotation using our oEmbed service.

Fields

Field	Type	Options	Description
url	URL	Required	The URL for the document, page or annotation to get an embed code for
maxwidth	Integer		The maximum width of the embedded resource
maxheight	Integer		The maximum height of the embedded resource

Endpoints

GET /api/oembed/ - Get an embed code for a given URL

Examples

Note: The hash symbol (#) in the URL will need to be encoded, which converts it to %23.

Generate an embed code for a page:

GET /api/oembed?url=https://www.documentcloud.org/documents/23745991-gpo-j6-video-exh-vc9mp4%23document/p1

Generate an embed code for an annotation:

GET /api/oembed?url=https://www.documentcloud.org/documents/23745991-gpo-j6-video-exh-vc9mp4%23document/p1/a2242636
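
The encoding step can be handled by a URL library rather than by hand. The sketch below uses Python's urllib.parse.quote; the helper name is hypothetical, and leaving ":" and "/" unencoded is a readability choice that matches the examples above.

```python
from urllib.parse import quote

API_BASE = "https://api.www.documentcloud.org/api"


def oembed_url(resource_url, maxwidth=None, maxheight=None):
    """Build a GET URL for the oEmbed endpoint."""
    # ":" and "/" stay readable; other reserved characters, including "#", are percent-encoded
    params = ["url=" + quote(resource_url, safe=":/")]
    if maxwidth is not None:
        params.append(f"maxwidth={int(maxwidth)}")
    if maxheight is not None:
        params.append(f"maxheight={int(maxheight)}")
    return f"{API_BASE}/oembed/?" + "&".join(params)


page = "https://www.documentcloud.org/documents/23745991-gpo-j6-video-exh-vc9mp4#document/p1"
print(oembed_url(page, maxwidth=600))
```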

Appendix

Access Levels

The access level allows you to control who has access to your document by default. You may also explicitly share a document with additional users by collaborating with them on a project.

public – Anyone on the internet can search for and view the document
private – Only people with explicit permission (via collaboration) have access
organization – Only the people in your organization have access

For notes, the organization access level will extend access to all users with edit access to the document — this includes project collaborators.

Statuses

The status tells you the current processing state of your document.

success – The document has been successfully processed
readable – The document is currently processing, but is readable during the operation
pending – The document is processing and not currently readable
error – There was an error during processing
nofile – The document was created, but no file was uploaded yet

Supported File Types

Format	Extension	Type	Notes
AbiWord	ABW, ZABW	Document
Adobe PageMaker	PMD, PM3, PM4, PM5, PM6, P65	Document, DTP
AppleWorks word processing	CWK	Document	Formerly called ClarisWorks
Adobe FreeHand	AGD, FHD	Graphics / Vector
Apple Keynote	KTH, KEY	Presentation
Apple Numbers	Numbers	Spreadsheet
Apple Pages	Pages	Document
BMP file format	BMP	Graphics / Raster
Comma-separated values	CSV, TXT	Text
CorelDRAW 6-X7	CDR, CMX	Graphics / Vector
Computer Graphics Metafile	CGM	Graphics	Binary-encoded only; not those using clear-text or character-based encoding
Data Interchange Format	DIF	Spreadsheet
DBase, Clipper, VP-Info, FoxPro	DBF	Database
DocBook	XML	XML
Encapsulated PostScript	EPS	Graphics
Enhanced Metafile	EMF	Graphics / Vector / Text
FictionBook	FB2	eBook
Gnumeric	GNM, GNUMERIC	Spreadsheet
Graphics Interchange Format	GIF	Graphics / Raster
Hangul WP 97	HWP	Document	Newer "5.x" documents are not supported
HPGL plotting file	PLT	Graphics
HTML	HTML, HTM	Document, text
Ichitaro 8/9/10/11	JTD, JTT	Document
JPEG	JPG, JPEG	Graphics
Lotus 1-2-3	WK1, WKS, 123, WK3, WK4	Spreadsheet
Macintosh Picture File	PCT	Graphics
MathML	MML	Math
Microsoft Excel 2003 XML	XML	Spreadsheet
Microsoft Excel 4/5/95	XLS, XLW, XLT	Spreadsheet
Microsoft Excel 97–2003	XLS, XLW, XLT	Spreadsheet
Microsoft Excel 2007-2016	XLSX	Spreadsheet
Microsoft Office 2007-2016 Office Open XML	DOCX, XLSX, PPTX	Multiple formats
Microsoft PowerPoint 97–2003	PPT, PPS, POT	Presentation
Microsoft PowerPoint 2007-2016	PPTX	Presentation
Microsoft Publisher	PUB	Document, DTP
Microsoft RTF	RTF	Document
Microsoft Word 2003 XML (WordprocessingML)	XML	Document
Microsoft Word	DOC, DOT, DOCX	Document
Microsoft Works	WPS, WKS, WDB	Multiple	Microsoft Works for Mac formats since 4.1
Microsoft Write	WRI	Document
Microsoft Visio	VSD	Graphics / Vector
Netpbm format	PGM, PBM, PPM	Graphics / Raster
OpenDocument	ODT, FODT, ODS, FODS, ODP, FODP, ODG, FODG, ODF	Multiple formats
Open Office Base	ODB	Database forms, data
OpenOffice.org XML	SXW, STW, SXC, STC, SXI, STI, SXD, STD, SXM	Multiple formats
PCX	PCX	Graphics
Photo CD	PCD	Presentation
PhotoShop	PSD	Graphics
Plain text	TXT	Text	Various encodings supported
Portable Document Format	PDF	Document	Including hybrid PDF

Languages

ara – Arabic
zho – Chinese (Simplified)
tra – Chinese (Traditional)
hrv – Croatian
dan – Danish
nld – Dutch
eng – English
fra – French
deu – German
heb – Hebrew
hun – Hungarian
ind – Indonesian
ita – Italian
jpn – Japanese
kor – Korean
nor – Norwegian
por – Portuguese
ron – Romanian
rus – Russian
spa – Spanish
swe – Swedish
ukr – Ukrainian

Page Spec

The page spec is a compressed string that lists dimensions in pixels for every page in a document. Refer to ListCrunch for the compression format. For example, 612.0x792.0:0-447
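
A minimal sketch of expanding a page spec into per-page dimensions. It assumes entries have the form `<width>x<height>:<pages>`, that multiple entries are separated by semicolons, and that pages are comma-separated indices or inclusive ranges. Only the single-entry form appears in the example above, so consult ListCrunch for the authoritative format. The function name is hypothetical.

```python
def parse_page_spec(page_spec):
    """Expand a page spec into a list of (width, height) tuples, one per page."""
    sizes = {}
    for entry in page_spec.split(";"):
        dimensions, _, indices = entry.partition(":")
        width, height = (float(value) for value in dimensions.split("x"))
        for part in indices.split(","):
            # "0-447" is an inclusive range; a bare "3" is a single page
            first, _, last = part.partition("-")
            last = last or first
            for page in range(int(first), int(last) + 1):
                sizes[page] = (width, height)
    return [sizes[page] for page in sorted(sizes)]


pages = parse_page_spec("612.0x792.0:0-447")
print(len(pages), pages[0])
```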

Static Assets

The static assets for a document are loaded from different URLs depending on its access level. Append the following to the asset_url returned to load the static asset:

Asset	Path	Description
Document	documents/<id>/<slug>.pdf	The original document
Full Text	documents/<id>/<slug>.txt	The full text of the document, obtained from the PDF or via OCR
JSON Text	documents/<id>/<slug>.txt.json	The text of the document, in a custom JSON format (see below)
Page Text	documents/<id>/pages/<slug>-p<page number>.txt	The text for each page in the document
Page Positions	documents/<id>/pages/<slug>-p<page number>.position.json	The position of text on each page, in a custom JSON format
Page Image	documents/<id>/pages/<slug>-p<page number>-<size>.gif	An image of each page in the document, in various sizes

<size> may be one of large, normal, small, or thumbnail
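
The per-page paths in the table can be assembled with a small helper. This is a sketch only: the function name and the example asset_url are hypothetical, the page number is passed through exactly as given, and the trailing-slash handling is an assumption about how asset_url is returned.

```python
SIZES = {"large", "normal", "small", "thumbnail"}


def page_asset_urls(asset_url, doc_id, slug, page_number, size="normal"):
    """Build the per-page static asset URLs for one page of a document."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    # Assumption: asset_url may or may not end in "/", so normalize it
    page = f"{asset_url.rstrip('/')}/documents/{doc_id}/pages/{slug}-p{page_number}"
    return {
        "text": f"{page}.txt",
        "positions": f"{page}.position.json",
        "image": f"{page}-{size}.gif",
    }


urls = page_asset_urls("https://assets.example.org/", 123, "my-doc", 1, size="small")
print(urls["image"])
```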

TXT JSON Format

The TXT JSON file is a single file containing all of the text, broken out per page. This is useful if you need the text for every page, as you only have to download a single file. The top level object has the following keys:

updated - A Unix time stamp of when the file was last updated
is_import - Optional; set to true if this document was imported from legacy DocumentCloud
pages - A list of objects, one per page, containing the per page info

Each page object has the following keys:

page - The 0-indexed page number
contents - The text for the page
ocr - The version of OCR software used to obtain the text
updated - A Unix time stamp of when this page was last updated
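
To make the layout concrete, here is a sketch that reads a TXT JSON file into a page-number-to-text mapping. The sample values (time stamps, OCR version string, text) are made up for illustration, and the helper name is hypothetical.

```python
import json

# Made-up sample following the keys described above
sample = json.loads("""
{
  "updated": 1700000000,
  "pages": [
    {"page": 0, "contents": "First page text", "ocr": "example-ocr-1.0", "updated": 1700000000},
    {"page": 1, "contents": "Second page text", "ocr": "example-ocr-1.0", "updated": 1700000000}
  ]
}
""")


def text_by_page(txt_json):
    """Map each 0-indexed page number to that page's text."""
    return {page["page"]: page["contents"] for page in txt_json["pages"]}


# is_import is optional, so read it with a default
print(sample.get("is_import", False), text_by_page(sample)[1])
```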

Position JSON Format

The position JSON file contains position information for each word of text on the page. It is an optional file, which may be generated depending on the type of OCR run on the document. If it exists, it will be a JSON array, which contains a JSON object for each word of text. The object for each word will have the following fields:

text - The text for the current word
x1, x2, y1, y2 - The coordinates of the bounding box for this word on the page. Each value will be between 0 and 1 and represents a fraction of the width (x1, x2) or height (y1, y2) of the page.
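
Because the coordinates are relative to the page, converting a word's box to absolute units takes the page dimensions as well, for example from the page spec. A minimal sketch, with a hypothetical function name and made-up sample values:

```python
def word_box(word, page_width, page_height):
    """Scale a word's relative bounding box to absolute (left, top, right, bottom)."""
    return (
        word["x1"] * page_width,
        word["y1"] * page_height,
        word["x2"] * page_width,
        word["y2"] * page_height,
    )


# Made-up word object using the fields described above
word = {"text": "Example", "x1": 0.25, "x2": 0.5, "y1": 0.125, "y2": 0.25}
print(word_box(word, 612.0, 792.0))
```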

Set Page Text

The format to set the page text is similar to the text formats described above. The pages field may be set to a JSON array of page objects, with the following fields:

Field	Type	Options	Description
page_number	Integer	Required	The page number you would like to set the page text for, zero indexed
text	String	Required	The updated text for the given page
ocr	String	Not Required	An optional identifier for the OCR engine used to generate this text
positions	Array of JSON Objects	Not Required	Optionally set the position of each word of text, see next table for details

The positions field in each page object is a JSON array of position objects, with the following fields:

Field	Type	Options	Description
text	String	Required	A single word on the page
x1	Float	Required	Left most coordinate of the word, as a percentage of page size
x2	Float	Required	Right most coordinate of the word, as a percentage of page size
y1	Float	Required	Top most coordinate of the word, as a percentage of page size
y2	Float	Required	Bottom most coordinate of the word, as a percentage of page size
metadata	JSON	Not Required	Any extra metadata that you would like to store with this word

Example JSON setting just the page text:

[
    {"page_number": 0, "text": "Page 1 text"},
    {"page_number": 1, "text": "Page 2 text"}
]

Example JSON setting the page text and word positions:

[
    {
        "page_number": 0,
        "text": "Page 1 text",
        "ocr": "my-ocr-engine",
        "positions": [
            {
                "text": "Page",
                "x1": 0.1,
                "x2": 0.2,
                "y1": 0.1,
                "y2": 0.2,
                "metadata": {"type": "word"}
            },
            {
                "text": "1",
                "x1": 0.3,
                "x2": 0.4,
                "y1": 0.1,
                "y2": 0.2,
                "metadata": {"type": "word"}
            },
            {
                "text": "text",
                "x1": 0.5,
                "x2": 0.6,
                "y1": 0.1,
                "y2": 0.2,
                "metadata": {"type": "word"}
            }
        ]
    }
]
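
A payload like the ones above can be built programmatically, which makes it easier to catch a missing required field before sending. The sketch below only constructs the JSON; the helper name is hypothetical, and the checks mirror the Required column of the two tables.

```python
import json

REQUIRED_POSITION_KEYS = {"text", "x1", "x2", "y1", "y2"}


def make_page(page_number, text, ocr=None, positions=None):
    """Build one page object for the pages field, checking required position keys."""
    page = {"page_number": page_number, "text": text}
    if ocr is not None:
        page["ocr"] = ocr
    if positions is not None:
        for position in positions:
            missing = REQUIRED_POSITION_KEYS - position.keys()
            if missing:
                raise ValueError(f"position is missing {sorted(missing)}")
        page["positions"] = positions
    return page


pages = [make_page(0, "Page 1 text"), make_page(1, "Page 2 text")]
print(json.dumps(pages))
```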

Expandable Fields

The API uses expandable fields in a few places, which are implemented by the Django REST - FlexFields package. It allows related fields, which would normally be returned by ID, to be expanded into their fully nested representation. This saves you additional requests to the server when you need the related information, and spares the server from serving that information when you do not.

To expand one of the expandable fields, which are documented in the fields section for each resource, add the expand query parameter to your request:

?expand=user

To expand multiple fields, separate them with a comma:

?expand=user,organization

You may also expand nested fields if the expanded field has its own expandable fields:

?expand=user.organization

To expand all fields:

?expand=~all
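
A small helper can attach the expand parameter to any endpoint URL. This is a sketch with a hypothetical function name; the example IDs are made up.

```python
from urllib.parse import urlencode

API_BASE = "https://api.www.documentcloud.org/api"


def with_expand(url, *fields):
    """Append an expand query parameter listing the given fields."""
    # Keep "," and "~" literal so the result matches the forms shown above
    query = urlencode({"expand": ",".join(fields)}, safe=",~")
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{query}"


print(with_expand(f"{API_BASE}/users/12/", "organization"))
print(with_expand(f"{API_BASE}/projects/7/users/", "user.organization"))
print(with_expand(f"{API_BASE}/addon_runs/?dismissed=false", "~all"))
```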