PDF Parsing

PDF was designed as an append-only file format because write operations were slow in 1993: new data was continually added to the end of the file. These days, however, many systems build the entire file in memory and then write it out. The format is also designed for random-access reading, with cross-reference tables located at the end of the file. This encourages an implementation in which the whole file is read into memory and then processed.
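To make the "read from the end" design concrete, here is a minimal sketch in Python of how a conventional parser locates the cross-reference section. It looks in the tail of the file for the `startxref` keyword, which is followed by the byte offset of the cross-reference data and the `%%EOF` marker. The function name `find_startxref` and the 1024-byte tail size are our own assumptions for illustration, not taken from any particular library.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the cross-reference offset recorded near the end of a PDF.

    Hypothetical helper for illustration; tail_size is an assumed default.
    """
    # Only the tail is inspected, but note that the tail is the LAST thing
    # to arrive when the file is being received sequentially.
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near the end of the file")
    # Appended updates add newer trailers, so the last match is the current one.
    return int(matches[-1])


sample = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)
offset = find_startxref(sample)
print(offset, sample[offset:offset + 4])
```

The point of the sketch is the dependency it exposes: nothing useful can be looked up this way until the final bytes of the file are available.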

When a web application receives a PDF, the data arrives sequentially, so a parser that starts from the cross-reference tables at the end has to wait for all of the data before it can process the file. (PDF does support a ‘web optimised’ format, but this is not adopted in the majority of cases and goes against the initial append-only design.)

Sometimes you need to read the entire PDF file to process it, e.g. to compare the contents against a digital signature checksum, but in many cases you can start processing the file as soon as you receive the data. This gives a lower-latency response in a web application. Looking around, it seems that many, if not all, parsing libraries read the file into memory before processing it. This is not what we want, so we are going to have to develop our own PDF parser.
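As a rough illustration of the idea, the sketch below accepts chunks as they arrive and hands back each indirect object (`N G obj ... endobj`) as soon as it is complete, without waiting for the rest of the file. The class name `IncrementalObjectScanner` and its `feed` method are hypothetical names of our own. It is deliberately simplified: it assumes the `endobj` keyword never appears inside stream data or strings, which a real parser would have to handle by tokenising properly.

```python
import re
from typing import List, Tuple

# "<number> <generation> obj <body> endobj"; DOTALL lets bodies span lines.
_OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


class IncrementalObjectScanner:
    """Hypothetical sketch: yields complete objects as chunks are fed in."""

    def __init__(self) -> None:
        # Holds only the bytes after the last complete object.
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> List[Tuple[int, int, bytes]]:
        """Add a chunk and return (number, generation, body) for each
        object that is now complete."""
        self._buffer.extend(chunk)
        found = []
        consumed = 0
        for match in _OBJ_RE.finditer(bytes(self._buffer)):
            number = int(match.group(1))
            generation = int(match.group(2))
            found.append((number, generation, match.group(3).strip()))
            consumed = match.end()
        # Drop what has been emitted; keep any partial object for next time.
        del self._buffer[:consumed]
        return found


data = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
scanner = IncrementalObjectScanner()
for start in range(0, len(data), 30):
    # Each 30-byte slice stands in for a network read.
    for obj in scanner.feed(data[start:start + 30]):
        print(obj)
```

Objects split across chunk boundaries stay in the buffer until their `endobj` arrives, so the caller can act on the first object while the second is still in flight.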
