PdfParserProject | RecentChanges | Preferences

In a PDF file, the cross-reference table starts with the 'xref' keyword. The next line holds two numbers: the index of the first entry in this cross-reference table and the number of entries in the table. Note that there is no terminator telling you when to stop parsing the xref table; the second number on line 2 tells you how many lines to read.

Each line contains three things: the file offset, the generation number, and a flag saying whether this entry is free ('f') or in use ('n'). -- Ivan

Here's an example XrefTable:

 0 18
 0000000000 65535 f      <--- this is line # 0
 0000038377 00000 n      <--- this is line # 1
 0000038247 00000 n 
 0000038745 00000 n 
 0000014447 00000 n 
 0000038278 00000 n 
 0000035529 00000 n 
 0000036618 00000 n 
 0000036878 00000 n            [...etc...]
 0000037977 00000 n 
 0000000019 00000 n 
 0000014425 00000 n 
 0000015561 00000 n 
 0000014577 00000 n 
 0000015541 00000 n 
 0000035398 00000 n 
 0000015681 00000 n 
 0000035376 00000 n      <--- this is line # 17
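
As a rough illustration of the reading rule described above, here is a minimal Python sketch. The names XrefEntry and parse_xref_subsection are made up for this example and are not taken from our code; it also assumes the lines after the 'xref' keyword have already been split out of the file.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int      # object number, implied by the entry's position
    offset: int       # byte offset ('f' entries store the next free object number here)
    generation: int
    in_use: bool      # True for 'n', False for 'f'


def parse_xref_subsection(lines: List[str]) -> List[XrefEntry]:
    """Parse a 'first count' header line followed by count entry lines."""
    first, count = (int(tok) for tok in lines[0].split())
    entries = []
    # The count from the header, not a terminator, decides how many lines to read.
    for i, line in enumerate(lines[1:1 + count]):
        offset, generation, flag = line.split()[:3]
        if flag not in ("n", "f"):
            raise ValueError(f"bad xref flag {flag!r} for object {first + i}")
        entries.append(XrefEntry(first + i, int(offset), int(generation), flag == "n"))
    if len(entries) != count:
        raise ValueError(f"expected {count} entries, found {len(entries)}")
    return entries


sample = ["0 3", "0000000000 65535 f", "0000038377 00000 n", "0000038247 00000 n"]
entries = parse_xref_subsection(sample)
```

Anything after the third field on a line (such as the arrows in the example above) is ignored by this sketch.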

Have we already decided on the write algorithm for this? I was wondering if it would be easier to forget about our in-memory byte offsets when writing out the file and just record offsets into the XrefTable wherever objects "land" as we write to disk (instead of calculating our file changes, computing the new byte offsets in memory and then writing to disk)? -- Patty

I agree, let's do it that way. After all, this is the flexibility that references provide us: we don't have to print the objects out in the same order we read them in. So the procedure would go something like this: The only tricky part will be writing the PdfXrefTable out to disk, since we will have to compute its length and add that to the current offset so that the trailer prints out the correct offset for the XrefTable. But even this shouldn't be very difficult. -- Ivan

For the offset calculation, is there a reason we couldn't just record our write position in the output file before writing each object to disk? Shouldn't this write position equal the cumulative offset at that point? Let me know if I'm missing something, or maybe this is too low-level. -- Patty
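
The "record the write position" idea can be sketched in Python roughly as follows. This is only an illustration under some loud assumptions: write_pdf and its parameters are hypothetical names, objects are numbered 1..N without gaps, every object has generation 0, and object bodies are already serialized to bytes. In this sketch the offset printed after 'startxref' is simply the write position recorded just before the table is written.

```python
import io
from typing import BinaryIO, Dict


def write_pdf(out: BinaryIO, objects: Dict[int, bytes], root_num: int) -> Dict[int, int]:
    """Write objects in whatever order they arrive, recording where each lands."""
    offsets: Dict[int, int] = {}
    out.write(b"%PDF-1.3\n")
    for num, body in objects.items():
        offsets[num] = out.tell()            # write position == cumulative offset
        out.write(b"%d 0 obj\n" % num + body + b"\nendobj\n")

    xref_start = out.tell()                  # where the 'xref' keyword lands
    size = max(offsets) + 1                  # assumes objects 1..N with no gaps
    out.write(b"xref\n0 %d\n" % size)
    out.write(b"0000000000 65535 f \n")      # entry 0: head of the free list
    for num in range(1, size):
        out.write(b"%010d 00000 n \n" % offsets[num])
    out.write(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num))
    out.write(b"startxref\n%d\n%%%%EOF\n" % xref_start)
    return offsets


buf = io.BytesIO()
objs = {
    2: b"<< /Type /Pages /Kids [] /Count 0 >>",   # written first, out of order
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
}
offs = write_pdf(buf, objs, root_num=1)
```

Because the table is emitted in object-number order from the recorded offsets, the objects themselves can land anywhere in the file.
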
Note that our software assumes this format in the xref table of a PDF document:


 0 18
 0000000000 65535 f      <--- this is line # 0
 0000038377 00000 n      <--- this is line # 1
 0000038247 00000 n 

This means that our software only handles PDF files that are optimized (that is, files containing only one xref table, starting at entry 0).

Our software always assumes that the first element of the table is a reference with reference number = 0 and generation number = 65535. It has worked for all optimized PDF files we have tested with. But it's still troubling!
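
One way to make this assumption less troubling would be to check it explicitly instead of relying on it silently. The following Python sketch is only a suggestion; check_xref_head is a hypothetical helper, not something in our code.

```python
def check_xref_head(header: str, first_entry: str) -> None:
    """Raise ValueError unless the table starts at 0 with a '65535 f' entry."""
    first_obj, _count = (int(tok) for tok in header.split())
    _offset, generation, flag = first_entry.split()[:3]
    if first_obj != 0:
        raise ValueError(f"xref table starts at object {first_obj}, expected 0")
    if int(generation) != 65535 or flag != "f":
        raise ValueError(f"first entry is '{generation} {flag}', expected '65535 f'")


# Passes silently for the table shown above.
check_xref_head("0 18", "0000000000 65535 f")
```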

Another quirk: since our table maps "refnum:gennum" to a PdfReference in order to guarantee uniqueness, our write code, which needed to print out each reference and offset in reference order, had to guess at the generation number. The generation number is not obvious when iterating over this hash.

The correct way to do this might have been to have an ordered collection listing the refnum:gennum combinations in numerical order. It would have looked like:

 0   0:65535
 1   1:0
 2   2:0
 3   3:1

This would have helped us support multiple xref sections. This is ironic because the "refnum:gennum" scheme was originally chosen for storing unique references among many xref sections!
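
A minimal Python sketch of such an ordered collection is shown below. It is an assumption-laden illustration, not our actual data structure: XrefIndex and its methods are invented names, and it keys on (refnum, gennum) tuples rather than "refnum:gennum" strings so that sorting is numerical.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (refnum, gennum)


class XrefIndex:
    """Maps (refnum, gennum) to an offset and lists the keys in numerical order."""

    def __init__(self) -> None:
        self._offsets: Dict[Key, int] = {}

    def add(self, refnum: int, gennum: int, offset: int) -> None:
        self._offsets[(refnum, gennum)] = offset

    def offset_of(self, refnum: int, gennum: int) -> int:
        return self._offsets[(refnum, gennum)]

    def in_order(self) -> List[Key]:
        # Tuples sort by refnum first, then gennum, so the writer never
        # has to guess a generation number while iterating.
        return sorted(self._offsets)


index = XrefIndex()
index.add(3, 1, 38745)
index.add(0, 65535, 0)
index.add(1, 0, 38377)
index.add(2, 0, 38247)
ordered = index.in_order()
```

Iterating over ordered yields each refnum together with its gennum, which is the information the write code was missing.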

-- Patty

This page is read-only (last edited December 14, 2000 9:51 pm)