Captur: document data extraction
Captur reads documents as they’re crawled and pulls structured data out of them. The extracted values are written into custom fields on the file, which then drive search filters, MaSH automations, and reporting views, without anyone having to type any of it in.
The headline mode is intelligent extraction: Captur reads the document directly with an LLM, with no template required. Drop an invoice into a Captur-enabled share and it’ll come back tagged with vendor, date, total, currency, and line items, even if the layout has never been seen before.
For high-volume, fixed-layout document classes you can also define templates: zones drawn on a representative page, against which subsequent matching documents are bound. In the combined mode, templates run before intelligent extraction. This is useful when you need predictable extraction from a known layout (e.g. your in-house invoice form) and want intelligent extraction as a safety net for everything else.
How extraction runs
Captur runs against a file in this order:
- Native PDF text path. If the file is a text-bearing PDF (born digital, with a real text layer), Captur reads it directly. This is the fast and cheap default: no OCR, no rasterisation.
- OCR fallback. If the PDF has no usable text layer, or the file is a raster image (a scan, a phone photo of a receipt), Captur runs OCR first to recover the text, then extracts. The OCR provider is whichever is configured in Foldr Settings > Search (Tesseract, AWS Textract, Google Vision, or Google Document AI).
- Template match (if templates exist on the share). When the share is in Templates + Intelligent mode, Captur tries to match the document against the configured templates. If one binds, the template’s zone definitions drive the extraction.
- Intelligent extraction (LLM). If no template matched, or the share is in Intelligent mode, the document text plus a prompt describing the fields to extract is sent to the configured LLM. The model returns structured values that Captur then writes back to the file’s custom fields.
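The ordering above can be sketched as a small planning function. This is an illustrative model only, not Captur’s implementation: the names CapturMode, Document, plan_steps and the step labels are all hypothetical, and whether a template binds is passed in as a flag rather than computed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class CapturMode(Enum):
    # Hypothetical names mirroring the three per-share modes.
    OFF = "off"
    INTELLIGENT = "intelligent"
    TEMPLATES_AND_INTELLIGENT = "templates+intelligent"


@dataclass
class Document:
    is_pdf: bool
    has_text_layer: bool  # True for born-digital PDFs with a real text layer


def plan_steps(doc: Document, mode: CapturMode, template_binds: bool) -> List[str]:
    """Return the ordered pipeline steps for one document (illustrative)."""
    if mode is CapturMode.OFF:
        return []  # no extraction runs on this share

    steps = []
    # Text acquisition: read the text layer if there is one, otherwise OCR.
    if doc.is_pdf and doc.has_text_layer:
        steps.append("native_pdf_text")
    else:
        steps.append("ocr")

    # Templates are only consulted in the combined mode.
    if mode is CapturMode.TEMPLATES_AND_INTELLIGENT and template_binds:
        steps.append("template_extraction")
    else:
        # The LLM handles anything no template claimed.
        steps.append("intelligent_extraction")

    steps.append("write_custom_fields")
    return steps
```

For example, a scanned receipt on a share in the combined mode with no binding template would plan OCR first, then intelligent extraction, then the write-back to custom fields.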
At the end of the pipeline, the file’s custom field values reflect what Captur found. From that point on the file is searchable by those values. Querying for “invoices over £1000 from last month” is a search index query against the populated fields, not a re-read of the document.
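To make the point about querying concrete, here is a minimal sketch of a filter that runs purely over already-populated field values. The rows, paths, and the invoices_over helper are invented for illustration; the real search index and its query syntax are not shown here.

```python
from datetime import date
from decimal import Decimal

# Hypothetical index rows: one dict of custom field values per file.
index = [
    {"path": "/finance/a.pdf", "vendor": "Acme", "total": Decimal("1250.00"),
     "currency": "GBP", "date": date(2024, 5, 14)},
    {"path": "/finance/b.pdf", "vendor": "Globex", "total": Decimal("980.00"),
     "currency": "GBP", "date": date(2024, 5, 20)},
    {"path": "/finance/c.pdf", "vendor": "Initech", "total": Decimal("4100.00"),
     "currency": "GBP", "date": date(2024, 4, 2)},
]


def invoices_over(rows, minimum, currency, start, end):
    """Filter on field values only; no document is opened or re-read."""
    return [
        row["path"]
        for row in rows
        if row["currency"] == currency
        and row["total"] > minimum
        and start <= row["date"] <= end
    ]


# "Invoices over £1000 from last month", with May 2024 standing in for last month.
matches = invoices_over(index, Decimal("1000"), "GBP", date(2024, 5, 1), date(2024, 5, 31))
```

The filter touches only the stored values, which is why such queries stay cheap regardless of how expensive the original extraction was.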
Per-share Captur mode
Captur is enabled per share. The administrator picks one of three modes for each share in Foldr Settings > Files & Storage > Edit share > Search & Data:
| Mode | Behaviour |
|---|---|
| Off (default) | No extraction runs on this share. |
| Intelligent | Every crawled document is sent through the intelligent-extraction pipeline. No templates considered. |
| Templates + Intelligent | Each document is matched against any defined templates first. If a template binds, its zones drive extraction. If nothing binds, intelligent extraction runs as the fallback. |
Picking Intelligent is the right default for most use cases. Reach for Templates + Intelligent when you have a small number of high-volume document layouts you want extracted with absolute precision (e.g. your supplier always sends the same invoice template, or your custody chain documents follow a fixed form).
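The fallback behaviour in the table can be sketched as follows. This is a model under stated assumptions, not product code: the mode strings, the extract_for_share function, and the convention that a template extractor returns None when it does not bind are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Fields = Dict[str, str]
# Assumption: a template extractor returns fields when it binds, None otherwise.
TemplateExtractor = Callable[[str], Optional[Fields]]


def extract_for_share(
    text: str,
    mode: str,
    templates: List[TemplateExtractor],
    intelligent: Callable[[str], Fields],
) -> Optional[Fields]:
    """Apply the per-share mode to one document's text (illustrative)."""
    if mode == "off":
        return None  # no extraction runs on this share

    if mode == "templates+intelligent":
        # Try templates in their assigned order; the first one that binds wins.
        for template in templates:
            fields = template(text)
            if fields is not None:
                return fields

    # Intelligent mode, or no template bound: fall back to the LLM path.
    return intelligent(text)
```

In Intelligent mode the templates list is never consulted, which matches the "No templates considered" row of the table.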
Setting up extraction targets
Captur extracts into custom fields, so the fields have to exist before extraction can populate them. The flow is:
- Define the custom fields you want populated on the share: vendor, total, due date, line items, etc. See Custom Fields for field types and configuration.
- Enable Captur on the share by setting Captur mode to Intelligent or Templates + Intelligent.
- (Optional) For Templates + Intelligent, author one or more templates in Foldr Settings > Captur > Templates and assign them to the share.
- Run a crawl. Captur runs as documents are processed; values land on the file as the crawl completes.
When fields are populated, they show in the file’s preview pane in the web app, in search results, and on the file picker. They’re also available to MaSH scripts via mash.event.file.fields on Indexed and Fields Changed events.
Templates
Templates define a small set of named zones on a reference document, with each zone telling Captur which custom field to populate from that area. After a template is defined, Captur tries to bind incoming documents against it: if the layout matches (anchor text, page geometry), the template binds and its zones drive extraction.
The template editor lives in Foldr Settings > Captur > Templates. From there you upload a reference document, draw zones on it, link each zone to a custom field, and save. Templates can be assigned to one or many shares; a share with templates assigned and Captur mode set to Templates + Intelligent will try them in order.
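The idea of binding on anchor text and page geometry, then reading values out of zones, can be sketched roughly like this. Everything here is an assumption made for illustration: the Word, Template, binds, and extract_zones names, the coordinate convention, and the size tolerance are not taken from Captur, whose actual matching rules are not documented in this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in page units


@dataclass
class Word:
    text: str
    x: float  # position of the word on the page
    y: float


@dataclass
class Template:
    anchor_text: str                 # text that must appear for the template to bind
    page_size: Tuple[float, float]   # (width, height) of the reference page
    zones: Dict[str, Box]            # custom field name -> zone on the page


def binds(template: Template, page_size: Tuple[float, float],
          words: List[Word], tolerance: float = 2.0) -> bool:
    """Bind only if page geometry and anchor text both match (assumed rule)."""
    same_geometry = all(
        abs(expected - actual) <= tolerance
        for expected, actual in zip(template.page_size, page_size)
    )
    has_anchor = template.anchor_text in " ".join(w.text for w in words)
    return same_geometry and has_anchor


def extract_zones(template: Template, words: List[Word]) -> Dict[str, str]:
    """Collect the words that fall inside each zone, in reading order as given."""
    result = {}
    for field_name, (left, top, right, bottom) in template.zones.items():
        inside = [w.text for w in words
                  if left <= w.x <= right and top <= w.y <= bottom]
        result[field_name] = " ".join(inside)
    return result
```

Because the zones are fixed rectangles, the same layout always yields values from the same places, which is the deterministic behaviour templates are chosen for.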
Templates are the right tool when you have:
- A repeating document layout (in-house forms, supplier invoices, government returns).
- High volume of that layout (so the template author time pays back).
- Fields that LLM extraction sometimes gets wrong on this specific layout (so you want deterministic anchoring rather than reasoning).
For everything else (varied formats, occasional documents, anything you didn’t see coming), leave it to intelligent extraction.
Licensing
Captur is a licensed feature, gated by a feature flag.
- Foldr appliance: licensed add-on. Contact sales / support to enable on an existing appliance.
- Foldr SaaS: included in the Enterprise tier.
When Captur is licensed, the per-share Captur mode dropdown becomes available, the Templates page appears in Foldr Settings, and the extraction pipeline runs against any share whose Captur mode is set to anything other than Off.
Captur and MaSH
Captur extraction populates custom fields, and MaSH can read and act on those fields:
- The Indexed event mashlet fires after Captur has run, with the populated fields available on mash.event.file.fields.
- mash.captur.extract(file) runs intelligent extraction synchronously inside a mashlet (with a short-circuit if the file’s already been extracted).
- mash.captur.queue(file) queues a ProcessAndExtractDocument job, useful when the synchronous path can’t run (e.g. inside the lean web image that doesn’t carry the OCR / Tika subprocess binaries).
Both routes return the same field-shaped result; pick extract() for in-script use and queue() for fire-and-forget.
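The difference between the two routes can be modelled in a few lines. Note that this is a Python stand-in written for illustration, not MaSH code: the CapturModel class, its run_queued method, and the in-memory job list are hypothetical, and only the short-circuit and the ProcessAndExtractDocument job name come from the description above.

```python
from typing import Callable, Dict, List, Tuple

Fields = Dict[str, str]


class CapturModel:
    """Toy model of synchronous extract() versus fire-and-forget queue()."""

    def __init__(self, run_extraction: Callable[[str], Fields]) -> None:
        self._run = run_extraction
        self._extracted: Dict[str, Fields] = {}
        self.jobs: List[Tuple[str, str]] = []

    def extract(self, path: str) -> Fields:
        # Short-circuit: a file that was already extracted is not processed again.
        if path in self._extracted:
            return self._extracted[path]
        fields = self._run(path)
        self._extracted[path] = fields
        return fields

    def queue(self, path: str) -> None:
        # Record a job and return immediately; the work happens elsewhere, later.
        self.jobs.append(("ProcessAndExtractDocument", path))

    def run_queued(self) -> None:
        # Stand-in for a background worker draining the queue (assumption).
        while self.jobs:
            _, path = self.jobs.pop(0)
            self.extract(path)
```

In this model the caller of extract() waits for the fields, while the caller of queue() carries on and the fields appear once the job has been processed.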
See also
- Custom Fields: defining and configuring the fields Captur populates
- Foldr Search setup: search index configuration, including OCR provider selection
- Indexing cloud storage with delta queries: keeping Captur-populated fields current as documents change in cloud storage
- Enabling OCR with AWS Textract: one of the cloud OCR options Captur uses for raster documents