CRYX's note about the JPEG decoding algorithm.
Copyright 1999 Cristi Cuturicu.

DISCLAIMER
..........
You get this file for free, so you cannot have any legal claims against me.
If you don't agree, read no more. No warranty is provided with this doc;
there might be bugs or errors in it (although I've tried to avoid them), so
use the information contained in this file at your own risk. This is NOT an
official documentation; for further information please refer to the JPEG
ISO standard. All product names mentioned in this file are trademarks or
registered trademarks of their respective owners. Not for reproduction
(electronic or hardcopy) except for personal use.

THE JPEG COMPRESSION and THE JPG FILE FORMAT
.............................................

Long ago I was looking on the net for a good doc which could explain to me
the JPEG compression, and particularly the JPG file format. Recently I found
the ISO-ITU JPEG standard in a file called itu-1150.ps (JPEG standard = ISO
standard 10918-1 or CCITT recommendation T.81: "Information Technology -
Digital compression and coding of continuous-tone still images -
Requirements and guidelines").

Though this standard is quite complete, it has a lot of uninteresting parts
in its 186 pages, and I had to dig through it, and then write my own JPG
viewer, to extract from it the main stuff a programmer needs: the Baseline
Sequential DCT JPG compression.

First a note: mainly because the majority of JPG files are Baseline
Sequential JPGs, this doc covers only the Baseline Sequential JPG
compression, and particularly the JFIF implementation of it. It does NOT
cover Progressive or Hierarchical JPG compression. (For more details about
these, read the itu-1150 standard. It can be found at www.wotsit.org or
somewhere at www.jpeg.org/jpeg)

I thought it would be easier for the reader to understand the JPG
compression if I explain the steps of the JPG encoder. (The decoder performs
the inverse of the encoder's steps, in reverse order, of course.)

THE JPEG ENCODER STEPS
----------------------

1) The affine transformation in colour space: [R G B] -> [Y Cb Cr]
---------------------------------------------------------------------
(It is defined in the CCIR Recommendation 601)
(R, G, B are 8-bit unsigned values)

  | Y  |   |  0.299    0.587    0.114  |   | R |   |  0 |
  | Cb | = | -0.1687  -0.3313   0.5    | * | G | + |128 |
  | Cr |   |  0.5     -0.4187  -0.0813 |   | B |   |128 |

The new value Y = 0.299*R + 0.587*G + 0.114*B is called the luminance. It
is the value used by monochrome monitors to represent an RGB colour.
Physiologically, it represents the intensity of an RGB colour as perceived
by the eye. You can see that the formula for Y is a weighted filter with
different weights for each spectral component: the eye is most sensitive to
the Green component, then to the Red component, and least to the Blue
component.

The values
  Cb = -0.1687*R - 0.3313*G + 0.5   *B + 128
  Cr =  0.5   *R - 0.4187*G - 0.0813*B + 128
are called the chrominance values and represent two coordinates in a system
which measures the nuance and saturation of the colour (approximately,
these values indicate how much blue and how much red is in that colour).
These two coordinates are shortly called the chrominance.
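(A minimal C sketch of this forward transform, assuming 8-bit samples and
rounding of the result into the 0..255 range; the function and variable
names are mine, not something taken from the standard:)

  #include <stdint.h>

  /* Clamp an intermediate result to the 8-bit range 0..255. */
  static uint8_t clamp(double x)
  {
      if (x < 0.0)   return 0;
      if (x > 255.0) return 255;
      return (uint8_t)(x + 0.5);    /* round to nearest */
  }

  /* Forward colour transform for one pixel: RGB -> YCbCr (CCIR 601). */
  void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                    uint8_t *y, uint8_t *cb, uint8_t *cr)
  {
      *y  = clamp( 0.299  * r + 0.587  * g + 0.114  * b);
      *cb = clamp(-0.1687 * r - 0.3313 * g + 0.5    * b + 128.0);
      *cr = clamp( 0.5    * r - 0.4187 * g - 0.0813 * b + 128.0);
  }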
[Y,Cb,Cr] to [R,G,B] conversion (the inverse of the previous transform)
--------------------------------
RGB can be computed directly from YCbCr (8-bit unsigned values) as follows:

  R = Y + 1.402  *(Cr-128)
  G = Y - 0.34414*(Cb-128) - 0.71414*(Cr-128)
  B = Y + 1.772  *(Cb-128)

A note relating Y,Cb,Cr to the human visual system
---------------------------------------------------
The eye, particularly the retina, has two kinds of cells as visual
analyzers: cells for night vision, which perceive only shades of gray
ranging from intense white to the darkest black, and cells for day vision,
which perceive the colour nuance. The first kind of cells, given an RGB
colour, detect a gray level similar to that given by the luminance value.
The second kind of cells, responsible for the perception of the colour
nuance, detect a value related to that of the chrominance.

2) Sampling
------------
The JPEG standard takes into account the fact that the eye is more
sensitive to the luminance of a colour than to the nuance of that colour
(the black-and-white cells have more influence than the day-vision cells).
So, in most JPGs, the luminance is taken for every pixel while the
chrominance is taken as an average value for a 2x2 block of pixels. Note
that it is not mandatory for the chrominance to be taken as an average over
a 2x2 block -- it could be taken for every pixel -- but good compression
results are achieved this way, with almost no loss in the visual perception
of the newly sampled image.

A note: the JPEG standard specifies that for every image component (like,
for example, Y) two sampling coefficients must be defined: one for
horizontal sampling and one for vertical sampling. These sampling
coefficients are defined in the JPG file relative to the maximum sampling
coefficient (more on this later).

3) Level shift
--------------
All 8-bit unsigned values (Y,Cb,Cr) in the image are "level shifted": they
are converted to an 8-bit signed representation by subtracting 128 from
their value.

4) The 8x8 Discrete Cosine Transform (DCT)
------------------------------------------
The image is broken into 8x8 blocks of pixels, then the DCT transform is
applied to each 8x8 block. Note that if the X dimension of the original
image is not divisible by 8, the encoder should make it divisible by
filling the remaining columns (until X becomes a multiple of 8) with the
right-most column of the original image. Similarly, if the Y dimension is
not divisible by 8, the encoder should fill the remaining lines with the
bottom-most line of the original image. The 8x8 blocks are processed from
left to right and from top to bottom.

A note: since a pixel in the 8x8 block has 3 components (Y,Cb,Cr), the DCT
is applied separately to three 8x8 blocks:
  The first 8x8 block contains the luminance (Y) of the pixels in the
  original 8x8 block;
  The second 8x8 block contains the Cb values of the pixels in the original
  8x8 block;
  And, similarly, the third 8x8 block contains the Cr values.

The purpose of the DCT transform is that instead of processing the original
samples, you work with the spatial frequencies present in the original
image. These spatial frequencies are closely related to the level of detail
present in an image: high spatial frequencies correspond to high levels of
detail, while lower frequencies correspond to lower levels of detail.
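(Before getting to the DCT formulas, here is a minimal C sketch of steps 2
and 3 above -- averaging a chrominance plane over 2x2 pixel blocks and level
shifting an 8x8 block. The buffer layout and the names are my own
assumptions, not something required by the standard:)

  #include <stdint.h>

  /* Step 2 (sketch): average one chrominance plane over 2x2 pixel blocks.
     'src' is w x h (w and h assumed even here), 'dst' is (w/2) x (h/2). */
  void subsample_2x2(const uint8_t *src, int w, int h, uint8_t *dst)
  {
      for (int y = 0; y < h; y += 2)
          for (int x = 0; x < w; x += 2) {
              int sum = src[y * w + x]       + src[y * w + x + 1] +
                        src[(y + 1) * w + x] + src[(y + 1) * w + x + 1];
              dst[(y / 2) * (w / 2) + x / 2] = (uint8_t)((sum + 2) / 4);
          }
  }

  /* Step 3: level shift an 8x8 block from unsigned 0..255 to signed
     -128..127. */
  void level_shift(const uint8_t in[64], int8_t out[64])
  {
      for (int i = 0; i < 64; i++)
          out[i] = (int8_t)(in[i] - 128);
  }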
The DCT transform is very similar to the 2D Fourier transform: it shifts
from the spatial domain (the original 8x8 block) to the frequency domain
(the new 8x8 = 64 coefficients which represent the amplitudes of the
analyzed spatial frequencies).

The mathematical definition of the Forward DCT (FDCT) and the Inverse DCT
(IDCT) is:

FDCT:
          C(u)*C(v)    7   7                2*x+1               2*y+1
F(u,v) = ----------- * sum sum f(x,y) * cos(----- *u*PI) * cos (----- *v*PI)
              4        x=0 y=0                16                  16

  u,v = 0,1,...,7

         { 1/sqrt(2)  when k=0
  C(k) = {
         { 1          otherwise

IDCT:
          1    7   7                             2*x+1               2*y+1
f(x,y) = --- * sum sum C(u)*C(v)*F(u,v) * cos (----- *u*PI) * cos (----- *v*PI)
          4    u=0 v=0                            16                  16

  x,y = 0,1,...,7

Applying these formulas directly is computationally expensive, especially
since much faster algorithms have been developed for the forward and
inverse DCT. A notable one, called AA&N, needs only 5 multiplies and 29
adds in the DCT itself. More info, and an implementation of it, can be
found in the free JPEG encoder/decoder software made by the Independent
JPEG Group (IJG); their C source can be found at www.ijg.org.

5) The zig-zag reordering of the 64 DCT coefficients
-----------------------------------------------------
So, after we performed the DCT transform over an 8x8 block of values, we
have a new 8x8 block. This 8x8 block is then traversed in zig-zag, like
this (the numbers in the 8x8 block indicate the order in which we traverse
the bidimensional 8x8 matrix):

   0, 1, 5, 6,14,15,27,28,
   2, 4, 7,13,16,26,29,42,
   3, 8,12,17,25,30,41,43,
   9,11,18,24,31,40,44,53,
  10,19,23,32,39,45,52,54,
  20,22,33,38,46,51,55,60,
  21,34,37,47,50,56,59,61,
  35,36,48,49,57,58,62,63

As you see, first comes the upper-left corner (0,0), then the value at
(0,1), then (1,0), then (2,0), (1,1), (0,2), (0,3), (1,2), (2,1), (3,0)
etc.

After we are done traversing the 8x8 matrix in zig-zag, we have a vector
with 64 coefficients (0..63). The reason for this zig-zag traversal is that
we scan the 8x8 DCT coefficients in order of increasing spatial frequency.
So we get a vector sorted by the criterion of spatial frequency: the first
value in the vector (at index 0) corresponds to the lowest spatial
frequency present in the image -- it's called the DC term. As we increase
the index in the vector, we get values corresponding to higher frequencies
(the value at index 63 corresponds to the amplitude of the highest spatial
frequency present in the 8x8 block). The remaining 63 DCT coefficients are
called the AC terms.

6) Quantization
----------------
At this stage we have a sorted vector with 64 values corresponding to the
amplitudes of the 64 spatial frequencies present in the 8x8 block. These 64
values are quantized: each value is divided by a divisor specified in a
vector with 64 values -- the quantization table -- and then rounded to the
nearest integer:

  for (i = 0; i <= 63; i++)
      vector[i] = (int) floor(vector[i] / quantization_table[i] + 0.5);

Here is the example quantization table for luminance (Y) given in an annex
of the JPEG standard. (It is given in the form of an 8x8 block; in order to
obtain a 64-element vector it should be zig-zag reordered.)

  16  11  10  16  24  40  51  61
  12  12  14  19  26  58  60  55
  14  13  16  24  40  57  69  56
  14  17  22  29  51  87  80  62
  18  22  37  56  68 109 103  77
  24  35  55  64  81 104 113  92
  49  64  78  87 103 121 120 101
  72  92  95  98 112 100 103  99

This table is based upon "psychovisual thresholding"; it has "been used
with good results on 8-bit per sample luminance and chrominance images".
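(A minimal C sketch of steps 5 and 6 together: zig-zag reordering the 8x8
DCT output into a 64-vector and quantizing it. The zigzag[] array is the
inverse of the order table above -- it gives, for each index of the
64-vector, the row-major position in the 8x8 block -- and the quantization
table is assumed to be already in zig-zag order; the names are mine:)

  #include <math.h>

  /* zigzag[i] = row-major position in the 8x8 block of the i-th element of
     the 64-vector (derived from the order table above). */
  static const int zigzag[64] = {
       0,  1,  8, 16,  9,  2,  3, 10,
      17, 24, 32, 25, 18, 11,  4,  5,
      12, 19, 26, 33, 40, 48, 41, 34,
      27, 20, 13,  6,  7, 14, 21, 28,
      35, 42, 49, 56, 57, 50, 43, 36,
      29, 22, 15, 23, 30, 37, 44, 51,
      58, 59, 52, 45, 38, 31, 39, 46,
      53, 60, 61, 54, 47, 55, 62, 63
  };

  /* Steps 5+6: reorder the DCT output (row-major) into a 64-vector and
     quantize it with a table that is already in zig-zag order. */
  void zigzag_and_quantize(const double block[64], const int qtable[64],
                           int vector[64])
  {
      for (int i = 0; i < 64; i++)
          vector[i] = (int) floor(block[zigzag[i]] / qtable[i] + 0.5);
  }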
Most existing encoders use simple multiples of this example table, but the
values are not claimed to be optimal: an encoder can use ANY OTHER
quantization table. The table is specified in the JPEG file with the DQT
(Define Quantization Table) marker. Most commonly there is one table for Y
and another one for the chrominance (Cb and Cr).

The quantization process plays the key role in the JPEG compression. It is
the process which removes the high frequencies present in the original
image -- and in consequence the fine detail. We do this because the eye is
much more sensitive to lower spatial frequencies than to higher ones, so we
can remove the higher frequencies with very little visual loss. This is
done by dividing the values at high indexes in the vector (the amplitudes
of the higher frequencies) by larger values than the values by which the
amplitudes of the lower frequencies are divided. The bigger the values in
the quantization table, the bigger the error (and in consequence the visual
error) introduced by this lossy process, and the lower the visual quality.

Another important fact is that in most images the colour varies slowly from
one pixel to another, so most images have only a small quantity of high
detail -> a small amount (small amplitudes) of high spatial frequencies --
but a lot of image information contained in the low spatial frequencies. In
consequence, in the new quantized vector, at the high spatial frequencies
we'll have a lot of consecutive zeroes.

7) The Zero Run Length Coding (RLC)
-------------------------------
Now we have the quantized vector with a lot of consecutive zeroes. We can
exploit this by run length coding the consecutive zeroes.

IMPORTANT: You'll see later why, but here we skip the encoding of the first
coefficient of the vector (the DC coefficient), which is coded a bit
differently. (I'll present its coding later in this doc.)

Let's consider the original 64-element vector as a 63-element vector (it's
the 64 vector without the first coefficient). Say that we have:

  57, 45, 0,0,0,0, 23, 0, -30, -8, 0,0, 1, 0,0,0,0,0,0, only zeroes till
  the end

Here is how the RLC JPEG compression is done for this example:

  (0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; EOB

As you see, for each value different from 0 we encode the number of
consecutive zeroes PRECEDING that value, and then we add the value itself.

Another note: EOB is short for End Of Block; it's a special coded value (a
marker). If we've reached a position in the vector from which there are
only zeroes till the end of the vector, we mark that position with EOB and
finish the RLC compression of the quantized vector. [Note that if the
quantized vector does not end with zeroes (its last element is not 0), we
will not have the EOB marker.]

ACTUALLY, EOB is equivalent to (0,0), and it will (later) be Huffman coded
like (0,0), so we'll encode:

  (0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; (0,0)

Another MAJOR thing: say that somewhere in the quantized vector we have:

  57, eighteen zeroes, 3, 0,0,0,0, 2, thirty-three zeroes, 895, EOB

The JPG Huffman coding makes the restriction (you'll see later why) that
the number of preceding zeroes must be coded as a 4-bit value, so it cannot
exceed 15 (0xF). So the previous example would be coded as:

  (0,57) ; (15,0) ; (2,3) ; (4,2) ; (15,0) ; (15,0) ; (1,895) ; (0,0)

(15,0) is a special coded value which indicates that 16 consecutive zeroes
follow. Note: 16 zeroes, not 15 zeroes.
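(A minimal C sketch of this run-length step over the 63 AC coefficients;
the pair struct and the names are mine. It emits (15,0) for every run of 16
zeroes in front of a non-zero value, and a final (0,0) as EOB when the
vector ends in zeroes:)

  struct rlc_pair { int run; int value; };  /* (preceding zeroes, value) */

  /* Step 7: run-length code the 63 AC coefficients (vector[1..63]).
     Returns the number of pairs written into 'out'. */
  int rlc_encode_ac(const int vector[64], struct rlc_pair out[64])
  {
      int n = 0, run = 0;

      for (int i = 1; i <= 63; i++) {
          if (vector[i] == 0) {
              run++;
              continue;
          }
          while (run > 15) {               /* (15,0) stands for 16 zeroes */
              out[n].run = 15; out[n].value = 0; n++;
              run -= 16;
          }
          out[n].run = run; out[n].value = vector[i]; n++;
          run = 0;
      }
      if (run > 0) {                       /* trailing zeroes -> EOB */
          out[n].run = 0; out[n].value = 0; n++;
      }
      return n;
  }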
8) The final step === Huffman coding
-------------------------------------
First an IMPORTANT note: instead of storing the actual value, the JPEG
standard specifies that we store the minimum size in bits in which that
value can be kept (this is called the category of that value), and then a
bit-coded representation of that value, like this:

           Values                Category      Bits for the value
             0                       0                 -
           -1,1                      1                0,1
        -3,-2,2,3                    2            00,01,10,11
    -7,-6,-5,-4,4,5,6,7              3    000,001,010,011,100,101,110,111
      -15,..,-8,8,..,15              4      0000,..,0111,1000,..,1111
     -31,..,-16,16,..,31             5    00000,..,01111,10000,..,11111
     -63,..,-32,32,..,63             6                 .
   -127,..,-64,64,..,127             7                 .
  -255,..,-128,128,..,255            8                 .
  -511,..,-256,256,..,511            9                 .
 -1023,..,-512,512,..,1023          10                 .
-2047,..,-1024,1024,..,2047         11                 .
-4095,..,-2048,2048,..,4095         12                 .
-8191,..,-4096,4096,..,8191         13                 .
-16383,..,-8192,8192,..,16383       14                 .
-32767,..,-16384,16384,..,32767     15                 .

In consequence, for the previous example:

  (0,57) ; (0,45) ; (4,23) ; (1,-30) ; (0,-8) ; (2,1) ; (0,0)

let's encode ONLY the right value of these pairs, except the pairs that are
special markers like (0,0) or (if we had it) (15,0):

  57 is in category 6 and is bit-coded 111001, so we encode it as (6,111001)
  45, similarly, will be coded as (6,101101)
  23 -> (5,10111)
 -30 -> (5,00001)
  -8 -> (4,0111)
   1 -> (1,1)

And now we'll write again the string of pairs:

  (0,6),111001 ; (0,6),101101 ; (4,5),10111 ; (1,5),00001 ; (0,4),0111 ;
  (2,1),1 ; (0,0)

The pairs of 2 values enclosed in parentheses can be represented in one
byte, because each of the 2 values fits in a nibble (the count of preceding
zeroes never exceeds 15, and neither does the category of the numbers
[numbers encoded in a JPG file are in the range -32767..32767]). In this
byte, the high nibble is the number of preceding zeroes and the low nibble
is the category of the new value different from 0.

The FINAL step of the encoding consists of Huffman encoding this byte and
then writing into the JPG file, as a stream of bits, the Huffman code of
this byte, followed by the remaining bit representation of the value.

For example, let's say that for byte 6 (the equivalent of (0,6)) we have a
Huffman code = 111000, and (for example):

  69 = (4,5) --- 1111111110011001
  21 = (1,5) --- 11111110110
   4 = (0,4) --- 1011
  33 = (2,1) --- 11011
   0 = EOB = (0,0) --- 1010

The final stream of bits written in the JPG file on disk for the previous
example of 63 coefficients (remember that we've skipped the first
coefficient) is:

  111000 111001 111000 101101 1111111110011001 10111 11111110110 00001
  1011 0111 11011 1 1010

The encoding of the DC coefficient
-----------------------------------
DC is the coefficient in the quantized vector corresponding to the lowest
frequency in the image (it's the zero frequency), and (before quantization)
it is mathematically equal to (the sum of the 8x8 image samples) / 8. (It's
like an average value for that block of image samples.) It is said that it
holds a lot of the energy present in the original 8x8 image block. (Usually
it gets large values.)
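(Before going on with the DC coding, here is a small C helper for the
category / bit-code mapping above; the same mapping is used for the DC
differences described next. The convention for negative values -- value
plus 2^category minus 1, kept on 'category' bits -- is the one implied by
the table; the function names are mine:)

  /* Category of a value = minimum number of bits needed to represent it
     (see the table above). 0 has category 0. */
  int category(int value)
  {
      int v = value < 0 ? -value : value;
      int cat = 0;
      while (v) { cat++; v >>= 1; }
      return cat;
  }

  /* Bit-coded representation of 'value' on 'cat' bits: positive values are
     stored as-is, negative values as value + 2^cat - 1
     (e.g. -30 in category 5 -> 00001, -8 in category 4 -> 0111). */
  unsigned bit_code(int value, int cat)
  {
      if (value >= 0)
          return (unsigned) value;
      return (unsigned)(value + (1 << cat) - 1);
  }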
The authors of the JPEG standard noticed that there is a very close
connection between the DC coefficients of consecutive blocks, so they
decided to encode in the JPG file the DIFFERENCE between the DCs of
consecutive 8x8 blocks. (Note: consecutive 8x8 blocks of the SAME image
component, like consecutive 8x8 blocks for Y, or consecutive blocks for Cb,
or for Cr.)

  Diff = DC(i) - DC(i-1)

So the DC of the current block, DC(i), will be equal to:

  DC(i) = DC(i-1) + Diff

In JPG decoding you start from 0 -- you consider that the DC coefficient
before the first one is 0: DC(0) = 0. Then you add to the current value the
value decoded from the JPG (the Diff value).

SO, in the JPG file, the first coefficient (the DC coefficient) is actually
the difference, and it is Huffman encoded DIFFERENTLY from the AC
coefficients. Here is how it's done (remember that we now code the Diff
value):

Diff corresponds, as you've seen before, to a representation made of a
category and its bit-coded representation. In the JPG file only the
category value is Huffman encoded, like this:

  Diff = (category, bit-coded representation)

Then Diff will be coded as (Huffman_code(category), bit-coded
representation).

For example, if Diff is equal to -511, then Diff corresponds to
(9, 000000000). Say that 9 has the Huffman code 1111110. (In the JPG file
there are 2 Huffman tables for an image component: one for DC and one for
AC.) Then, in the JPG file, the bits corresponding to the DC coefficient
will be:

  1111110 000000000

And, applying this DC example to the previous example of ACs, THE FINAL
STREAM OF BITS written in the JPG file for this vector of 64 coefficients
will be:

  1111110 000000000 111000 111001 111000 101101 1111111110011001 10111
  11111110110 00001 1011 0111 11011 1 1010

(In the JPG file, the DC is encoded first, then the ACs.)

THE HUFFMAN DECODER (a brief summary) for the 64 coefficients (a Data Unit)
of an image component (for example Y)
-------------------------------------------------------------
So, when you decode a stream of bits from the image in the JPG file, you
do:

Init DC with 0.

1) The DC coefficient decode:
   a) Fetch a valid Huffman code (check that it exists in the DC Huffman
      table)
   b) See which category this Huffman code corresponds to
   c) Fetch N = category bits, and determine which value is represented by
      (category, the N bits fetched) = Diff
   d) DC += Diff
   e) Write DC into the 64 vector: "vector[0] = DC"

2) The 63 AC coefficients decode:
   FOR every AC coefficient UNTIL (EOB_encountered OR AC_counter = 64)
   a) Fetch a valid Huffman code (check it in the AC Huffman table)
   b) Decode that Huffman code: the Huffman code corresponds to
      (nr_of_previous_0, category)
      [Remember: EOB_encountered = TRUE if (nr_of_previous_0, category)
       = (0,0)]
   c) Fetch N = category bits, and determine which value is represented by
      (category, the N bits fetched) = AC_coefficient
   d) Write into the 64 vector a number of zeroes equal to nr_of_previous_0
   e) Increment AC_counter with nr_of_previous_0
   f) Write AC_coefficient into the vector:
      "vector[AC_counter] = AC_coefficient", then increment AC_counter

Next Steps
-----------
So, now we have a 64-element vector. We'll do the reverse of the steps
presented in this doc:

1) Dequantize the 64 vector: "for (i=0; i<=63; i++) vector[i] *= quant[i]"
2) Re-order the 64 vector from zig-zag into an 8x8 block
3) Apply the Inverse DCT transform to the 8x8 block

Repeat the above process [Huffman decoder, steps 1), 2) and 3)] for every
8x8 block of every image component (Y, Cb, Cr).
4) Up-sample the chrominance if needed
5) Level shift: add 128 to all the 8-bit signed values in the 8x8 blocks
   resulting from the IDCT transform
6) Transform YCbCr to RGB
7) And VOILA... the JPG image

The JPEG markers, or how the image information is organized in the JPG file
(the byte level)
--------------------------------------------------------------------------------
NOTE: The JPEG/JFIF file format uses the Motorola format for words, NOT the
Intel format, i.e. high byte first, low byte last. (Example: the word FFA0
will be written in the JPEG file in the order FF at the lower offset, A0 at
the higher offset.)

The JPG standard specifies that the JPEG file is composed mostly of pieces
called "segments". A segment is a stream of bytes with length <= 65535. The
beginning of a segment is specified by a marker. A marker = 2 bytes
beginning with 0xFF (the C hexadecimal notation for 255) and ending with a
byte different from 0 and from 0xFF. Examples: 'FFDA', 'FFC4', 'FFC0'. Each
marker has a meaning: the second byte (different from 0 and 0xFF) specifies
what that marker does. For example, there is a marker which specifies that
you should start the decoding process; it is called (in the JPG standard's
terminology) SOS = Start Of Scan = 'FFDA'. Another marker, called DQT =
Define Quantization Table = 0xFFDB, does what its name says: it specifies
that in the JPG file, after the marker (and after 3 bytes, more on this
later), the 64 bytes of a quantization table will follow.

If, during the processing of the JPG file, you encounter a 0xFF, then again
a byte different from 0 (I've told you that the second byte of a marker is
not 0), and this byte has no marker meaning (you cannot find a marker
corresponding to that byte), then the 0xFF byte you encountered must be
ignored and skipped. (In some JPGs, sequences of consecutive 0xFF bytes are
used as filler and must be skipped.) So whenever you encounter a 0xFF, you
check the next byte and see whether that 0xFF has a marker meaning or must
be skipped.

What happens if we actually need to encode the 0xFF byte in the JPG file as
an *ordinary* byte (not a marker, or a filling byte)? (Say that we need to
write a Huffman code which begins with 11111111 (8 bits of 1) at a byte
boundary.) The standard says that we simply make the next byte 0, and write
the sequence 'FF00' in the JPG file. So when your JPG decoder meets the
2-byte sequence 'FF00', it should consider it as just one byte, 0xFF, used
as an ordinary byte.

Another thing: you realise that these markers are byte aligned in the JPG
file. What happens if, during your Huffman encoding and inserting of bits
into the JPG file's bytes, you have not finished filling a byte but you
need to write a marker which is byte aligned? For the byte alignment of the
markers, you SET THE REMAINING BITS UNTIL THE BEGINNING OF THE NEXT BYTE TO
1, then you write the marker at the next byte.

A short explanation of some important markers found in a JPG file
-------------------------------------------------------------------
SOI = Start Of Image = 'FFD8'
  This marker must be present in any JPG file *once*, at the beginning of
  the file. (Any JPG file starts with the sequence FFD8.)

EOI = End Of Image = 'FFD9'
  Similar to SOI: any JPG file ends with FFD9.

RSTi = FFDi (where i is in the range 0..7) [RST0 = FFD0, RST7 = FFD7]
  = Restart markers. These restart markers are used for resync.
At regular intervals, they appear in the JPG stream of bytes during the
decoding process (after SOS). They appear in the order:

  RST0 -- interval -- RST1 -- interval -- RST2 -- ...
  ... -- RST6 -- interval -- RST7 -- interval -- RST0 -- ...

(Note: a lot of JPGs don't have restart markers.) The problem with these
markers is that they interrupt the normal bit order in the JPG's Huffman
encoded bitstream. Remember that for the byte alignment of the markers the
remaining bits are set to 1, so your decoder has to skip, at regular
intervals, the useless filling bits (those set to 1) and the RST markers.

Markers...
-------
At the end of this doc I've included a very well written technical
explanation of the JPEG/JFIF file format, written by Oliver Fromme, the
author of the QPEG viewer. There you'll find a pretty good and complete
definition of the markers. But, anyway, here is a list of markers you
should check:

  SOF0 = Start Of Frame 0 = FFC0
  SOS  = Start Of Scan = FFDA
  APP0 = the marker used to identify a JPG file which uses the JFIF
         specification = FFE0
  COM  = Comment = FFFE
  DNL  = Define Number of Lines = FFDC
  DRI  = Define Restart Interval = FFDD
  DQT  = Define Quantization Table = FFDB
  DHT  = Define Huffman Table = FFC4

The Huffman table stored in a JPG file
---------------------------------------
Here is how JPEG stores the Huffman tree: instead of a tree, it defines a
table in the JPG file after the DHT (Define Huffman Table) marker.
NOTE: The length of the Huffman codes is restricted to 16 bits.

Basically there are 2 types of Huffman tables in a JPG file: one for DC and
one for AC. (Actually there are 4 Huffman tables: 2 for the DC and AC of
the luminance, and 2 for the DC and AC of the chrominance.) They are stored
in the JPG file in the same format, which consists of:

1) 16 bytes: byte i contains the number of Huffman codes of length i
   (length in bits), i ranging from 1 to 16

2) A table with the length (in bytes) equal to the sum, for i = 1 to 16, of
   nr_codes_of_length_i, which contains at location [k][j]
   (k in 1..16, j in 0..nr_codes_with_length_k - 1) the BYTE value
   associated with the j-th Huffman code of length k. (For a fixed length
   k, the values are stored sorted by the value of the Huffman code.)

From this table you can find the actual Huffman code associated with a
particular byte. Here is an example of how the actual code values are
generated. (Note: the numbers of codes for each length are chosen just for
this particular example; they can have any other values.)

SAY that:
  For length 1 we have nr_codes[1] = 0, so we skip this length
  For length 2 we have 2 codes:  00
                                 01
  For length 3 we have 3 codes:  100
                                 101
                                 110
  For length 4 we have 1 code:   1110
  For length 5 we have 1 code:   11110
  For length 6 we have 1 code:   111110
  For length 7 we have 0 codes -- skip
     (if we had 1 code for length 7, it would have been 1111110)
  For length 8 we have 1 code:   11111100
     (you see that the code is still shifted left even though we skipped
      the code value for length 7)
  .....
  For length 16, .... (the same thing)

I've told you that the Huffman table in the JPG file stores the BYTE values
associated with these codes.
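(The code-assignment rule illustrated above -- take the codes of each
length in increasing order, and shift left by one bit when moving to the
next length -- can be sketched in C like this; the array names are mine.
The example with the byte values then continues below:)

  /* Generate the Huffman code values from the 16 length counts read after
     a DHT marker. nr_codes[i] = number of codes of length i+1 (i = 0..15).
     On return, codes[n] and lengths[n] describe the n-th symbol, in the
     same order in which the symbol bytes are stored in the file.
     Returns the total number of codes (<= 256). */
  int generate_codes(const unsigned char nr_codes[16],
                     unsigned short codes[256], unsigned char lengths[256])
  {
      unsigned short code = 0;
      int n = 0;

      for (int len = 1; len <= 16; len++) {
          for (int i = 0; i < nr_codes[len - 1]; i++) {
              codes[n] = code;
              lengths[n] = (unsigned char) len;
              code++;
              n++;
          }
          code <<= 1;        /* one more bit for the next code length */
      }
      return n;
  }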
Continuing the example: say that in the Huffman table in the JPG file on
disk we have (after the 16 bytes which contain the number of Huffman codes
of each given length):

  45 57 29 17 23 25 34 28

Given the particular lengths I gave you before, these values correspond to
the Huffman codes like this:

  there's no value for a code of length 1
  for the codes of length 2: we have 45 and 57
  for the codes of length 3: 3 values (ex: 29, 17, 23)
  for the code  of length 4: only 1 value (ex: 25)
  for the code  of length 5: 1 value (ex: 34)
  ...
  for a code of length 7, again no value; skip to the code of length 8
  for the code of length 8: 1 value, 28

IMPORTANT note:
  For the codes of length 2: the value 45 corresponds to the code 00
                             the value 57 corresponds to the code 01
  For the codes of length 3: the value 29 corresponds to the code 100
                             the value 17      ----||----        101
                             the value 23      ----||----        110
  ETC...

(I've told you that for a given length the byte values are stored in order
of increasing Huffman code value.)

Four Huffman tables, corresponding to the DC and AC tables of the luminance
and the DC and AC tables of the chrominance, are given in an annex of the
JPEG standard as a suggestion for the encoder. The standard says that these
tables have been tested with good compression results on a lot of images
and recommends them, but the encoder can use any other Huffman tables. A
lot of JPG encoders use these tables. Some of them offer you an option:
entropy optimization -- if it's enabled, they'll use Huffman tables
optimized for that particular image.

The JFIF (JPEG File Interchange Format) file
---------------------------------------------
The JPEG standard (the one in the itu-1150.ps file) is somewhat very
general; the JFIF implementation is a particular case of this standard (and
it is, of course, compatible with the standard). The JPEG standard
specifies some markers reserved for applications (by applications I mean
particular cases of implementing the standard). Those markers are called
APPn, where n ranges from 0 to 0xF; APPn = FFEn. The JFIF specification
uses the APP0 marker (FFE0) to identify a JPG file which uses this
specification.

You'll see that the JPEG standard refers to "image components". These image
components can be (Y,Cb,Cr) or (YIQ) or whatever. The JFIF implementation
uses only (Y,Cb,Cr) for a truecolor JPG, or only Y for a monochrome JPG.
You can get the JFIF specification from www.jpeg.org

The sampling factors
--------------------
Note: The following explanation covers the encoding of truecolor
(3-component) JPGs; for gray-scaled JPGs there is one component (Y) which
is usually not down-sampled at all, and which does not require any inverse
transformation like the inverse (Y,Cb,Cr) -> (R,G,B). In consequence, the
gray-scaled JPGs are the simplest and easiest to decode: for every 8x8
block in the image you do the Huffman decoding of the RLC-coded vector,
then you reorder it from zig-zag, dequantize the 64 vector and finally
apply to it the inverse DCT and add 128 (level shift) to the new 8x8
values.

I've told you that image components are sampled. Usually Y is taken for
every pixel, and Cb, Cr are taken for a block of 2x2 pixels. But there are
some JPGs in which Cb and Cr are taken for every pixel, and some JPGs where
Cb and Cr are taken every 2 pixels horizontally and every pixel vertically.
The sampling factors for an image component in a JPG file are defined
relative to the highest sampling factor.
Here are the sampling factors for the most usual example: Y taken for every
pixel, and Cb, Cr taken for a block of 2x2 pixels. (The JFIF specification
gives a formula for the sampling factors which, I think, works only when
the maximum sampling factor for each dimension, X or Y, is <= 2. The JPEG
standard does not restrict the sampling factors in this way; it's more
general.)

You see that Y will have the highest sampling rate:

  For Y :  Horizontal sampling factor = 2 = HY
           Vertical   sampling factor = 2 = VY
  For Cb:  Horizontal sampling factor = 1 = HCb
           Vertical   sampling factor = 1 = VCb
  For Cr:  Horizontal sampling factor = 1 = HCr
           Vertical   sampling factor = 1 = VCr

Actually this form of defining the sampling factors is quite useful. The
vector of 64 coefficients for an image component, Huffman encoded, is
called a DU = Data Unit (the JPEG standard's terminology). In the JPG file,
the order of encoding Data Units is:

1) encode the Data Units for the first image component:
     for (counter_y = 1; counter_y <= VY; counter_y++)
       for (counter_x = 1; counter_x <= HY; counter_x++)
         { encode Data Unit for Y }

2) encode the Data Units for the second image component:
     for (counter_y = 1; counter_y <= VCb; counter_y++)
       for (counter_x = 1; counter_x <= HCb; counter_x++)
         { encode Data Unit for Cb }

3) finally, for the third component, similarly:
     for (counter_y = 1; counter_y <= VCr; counter_y++)
       for (counter_x = 1; counter_x <= HCr; counter_x++)
         { encode Data Unit for Cr }

For the example I gave you (HY=2, VY=2; HCb=VCb=1; HCr=VCr=1), here is a
figure (I think it will clear things up for you):

  YDU YDU    CbDU    CrDU
  YDU YDU

(YDU is a Data Unit for Y, and similarly CbDU is a DU for Cb, CrDU a DU for
Cr.) This usual combination of sampling factors is referred to as 2:1:1 for
both the vertical and the horizontal sampling factors. And, of course, in
the JPG file the encoding order will be: YDU,YDU,YDU,YDU,CbDU,CrDU.

You know that a DU (64 coefficients) defines a block of 8x8 values, so here
we have specified the encoding order for a block of 16x16 image pixels (an
image pixel = an (Y,Cb,Cr) pixel [my notation]): four 8x8 blocks of Y
values (4 YDUs), one 8x8 block of Cb values (1 CbDU) and one 8x8 block of
Cr values (1 CrDU).

(Hmax = the maximum horizontal sampling factor, Vmax = the maximum vertical
sampling factor.) In consequence, for this example of sampling factors
(Hmax=2, Vmax=2), the encoder should process SEPARATELY every
16x16 = (Hmax*8 x Vmax*8) block of image pixels, in the order mentioned.
This block of image pixels with the dimensions (Hmax*8, Vmax*8) is called,
in the JPG standard's terminology, an MCU = Minimum Coded Unit. For the
previous example:

  MCU = YDU,YDU,YDU,YDU,CbDU,CrDU

Another example of sampling factors:

  HY =1, VY =1
  HCb=1, VCb=1
  HCr=1, VCr=1

Figure/order:

  YDU    CbDU    CrDU

You see that here an 8x8 image pixel block (MCU) is defined with three 8x8
blocks: one for Y, one for Cb and one for Cr (there's no down-sampling at
all). Here (Hmax=1, Vmax=1) the MCU has the dimensions (8,8), and
MCU = YDU,CbDU,CrDU.

For gray-scaled JPGs you don't have to worry about the order of encoding
Data Units in an MCU.
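(For a truecolor JPG, the Data Unit order inside one MCU described above
can be sketched in C as the loops below. decode_data_unit() is a
hypothetical helper standing for the whole per-block work: Huffman decode
the 64 vector, dequantize it, re-order it from zig-zag and apply the IDCT;
the struct and names are mine:)

  /* Per-component sampling factors, as read from SOF0 (see later). */
  struct component { int h; int v; };  /* horizontal / vertical factors */

  /* Decode one MCU: for each component, v*h Data Units in raster order. */
  void decode_mcu(const struct component comp[3],
                  void (*decode_data_unit)(int comp_index, int du_x, int du_y))
  {
      for (int c = 0; c < 3; c++)                  /* Y, then Cb, then Cr */
          for (int y = 0; y < comp[c].v; y++)
              for (int x = 0; x < comp[c].h; x++)
                  decode_data_unit(c, x, y);
  }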
For these gray-scaled JPGs, an MCU is just one Data Unit (MCU = YDU).

In the JPG file, the sampling factors for every image component are defined
after the marker SOF0 = Start Of Frame 0 = FFC0.

A brief scheme for decoding a JPG file
--------------------------------------
The decoder reads from the JPG file the sampling factors, finds out the
dimensions of an MCU (Hmax*8, Vmax*8), and hence how many MCUs there are in
the whole image. Then it decodes every MCU present in the original image (a
loop over all these blocks, or until the EOI marker is found [it should be
found when the loop finishes, otherwise you'll get an incomplete image]).
It decodes an MCU by decoding every Data Unit in the MCU, in the order
mentioned before, and finally it writes the decoded (Hmax*8 x Vmax*8)
truecolor pixel block into the (R,G,B) image buffer.

MPEG-1 video and JPEG
----------------------
The interesting part of the MPEG-1 specification (and probably MPEG-2) is
that it relies heavily on the JPEG specification; it uses a lot of the
concepts presented here. The reason is that every 15 frames, or whenever it
is needed, there is an independent frame called an I-frame (Intra frame)
which is JPEG coded. (By the way, that 16x16 image pixel block example I
gave you is called, in the MPEG standard's terminology, a macroblock.)
Except for the motion compensation algorithms, MPEG-1 video relies a lot on
the JPG specification (the DCT transform, quantization, etc.).

Hope you're ready now to start coding your JPG viewer or encoder.

About the author of this doc
----------------------------
The author of this doc is Cristi Cuturicu, student at the University
Politehnica of Bucharest (UPB), Department of Computer Science. You can
contact him by e-mail:
  cccrx@kermit.cs.pub.ro
  cryx@ulise.cs.pub.ro
And if you are a software company needing a programmer, then get in touch.

A technical explanation of the JPEG/JFIF file format,
written by Oliver Fromme, the author of the QPEG viewer
-------------------------------------------------------
Legal NOTE: The legal rules mentioned in the Disclaimer at the top of this
file also apply to the following information, so neither Oliver Fromme nor
I can be held responsible for errors or bugs in it.

The author of the following information is:
  Oliver Fromme
  Leibnizstr. 18-61
  38678 Clausthal
  GERMANY

JPEG/JFIF file format:
~~~~~~~~~~~~~~~~~~~~~~
  - header (2 bytes): $ff, $d8 (SOI)
    (these two identify a JPEG/JFIF file)
  - for JFIF files, an APP0 segment immediately follows the SOI marker,
    see below
  - any number of "segments" (similar to IFF chunks), see below
  - trailer (2 bytes): $ff, $d9 (EOI)

Segment format:
~~~~~~~~~~~~~~~
  - header (4 bytes):
      $ff     identifies a segment
       n      type of segment (one byte)
      sh, sl  size of the segment, including these two bytes, but not
              including the $ff and the type byte. Note, not Intel order:
              high byte first, low byte last!
  - contents of the segment, max. 65533 bytes.

Notes:
  - There are parameterless segments (denoted with a '*' below) that DON'T
    have a size specification (and no contents), just $ff and the type
    byte.
  - Any number of $ff bytes between segments is legal and must be skipped.
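(A minimal C sketch of walking these segments: find the next $ff, read the
type byte, then read the big-endian length and skip the payload. Reading
from an already opened FILE* is my own choice here, and the parameterless
markers are handled by returning without skipping anything:)

  #include <stdio.h>

  /* Return the type byte of the next segment and leave the file positioned
     after the segment's payload. Parameterless markers (TEM, SOI, EOI,
     RST0..RST7) carry no length word, so nothing is skipped for them.
     Returns -1 on EOF or error. */
  int next_segment(FILE *f)
  {
      int b, type;

      do {                                 /* find the next 0xFF */
          b = fgetc(f);
          if (b == EOF) return -1;
      } while (b != 0xFF);

      do {                                 /* any number of fill 0xFFs */
          type = fgetc(f);
          if (type == EOF) return -1;
      } while (type == 0xFF);

      /* parameterless markers: TEM, SOI, EOI, RST0..RST7 */
      if (type == 0x01 || type == 0xD8 || type == 0xD9 ||
          (type >= 0xD0 && type <= 0xD7))
          return type;

      {
          int hi = fgetc(f), lo = fgetc(f);  /* length, high byte first */
          if (hi == EOF || lo == EOF) return -1;
          long length = (hi << 8) | lo;      /* includes these 2 bytes */
          if (fseek(f, length - 2, SEEK_CUR) != 0) return -1;
          return type;
      }
  }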
Segment types:
~~~~~~~~~~~~~~
  *TEM   = $01   usually causes a decoding error, may be ignored
   SOF0  = $c0   Start Of Frame (baseline JPEG), for details see below
   SOF1  = $c1   ditto
   SOF2  = $c2   usually unsupported
   SOF3  = $c3   usually unsupported
   SOF5  = $c5   usually unsupported
   SOF6  = $c6   usually unsupported
   SOF7  = $c7   usually unsupported
   SOF9  = $c9   for arithmetic coding, usually unsupported
   SOF10 = $ca   usually unsupported
   SOF11 = $cb   usually unsupported
   SOF13 = $cd   usually unsupported
   SOF14 = $ce   usually unsupported
   SOF15 = $cf   usually unsupported
   DHT   = $c4   Define Huffman Table, for details see below
   JPG   = $c8   undefined/reserved (causes decoding error)
   DAC   = $cc   Define Arithmetic Table, usually unsupported
  *RST0  = $d0   RSTn are used for resync, may be ignored
  *RST1  = $d1
  *RST2  = $d2
  *RST3  = $d3
  *RST4  = $d4
  *RST5  = $d5
  *RST6  = $d6
  *RST7  = $d7
   SOI   = $d8   Start Of Image
   EOI   = $d9   End Of Image
   SOS   = $da   Start Of Scan, for details see below
   DQT   = $db   Define Quantization Table, for details see below
   DNL   = $dc   usually unsupported, ignore
   DRI   = $dd   Define Restart Interval, for details see below
   DHP   = $de   ignore (skip)
   EXP   = $df   ignore (skip)
   APP0  = $e0   JFIF APP0 segment marker, for details see below
   APP15 = $ef   ignore
   JPG0  = $f0   ignore (skip)
   JPG13 = $fd   ignore (skip)
   COM   = $fe   Comment, for details see below

All other segment types are reserved and should be ignored (skipped).

SOF0: Start Of Frame 0:
~~~~~~~~~~~~~~~~~~~~~~~
  - $ff, $c0 (SOF0)
  - length (high byte, low byte), 8+components*3
  - data precision (1 byte) in bits/sample, usually 8
    (12 and 16 not supported by most software)
  - image height (2 bytes, Hi-Lo), must be >0 if DNL not supported
  - image width (2 bytes, Hi-Lo), must be >0 if DNL not supported
  - number of components (1 byte), usually 1 = grey scaled,
    3 = color YCbCr or YIQ, 4 = color CMYK
  - for each component: 3 bytes
     - component id (1 = Y, 2 = Cb, 3 = Cr, 4 = I, 5 = Q)
     - sampling factors (bit 0-3 vert., 4-7 hor.)
     - quantization table number

Remarks:
  - JFIF uses either 1 component (Y, greyscaled) or 3 components
    (YCbCr, sometimes called YUV, colour).

APP0: JFIF segment marker:
~~~~~~~~~~~~~~~~~~~~~~~~~~
  - $ff, $e0 (APP0)
  - length (high byte, low byte), must be >= 16
  - 'JFIF'#0 ($4a, $46, $49, $46, $00), identifies JFIF
  - major revision number, should be 1 (otherwise error)
  - minor revision number, should be 0..2 (otherwise try to decode anyway)
  - units for x/y densities:
     0 = no units, x/y-density specify the aspect ratio instead
     1 = x/y-density are dots/inch
     2 = x/y-density are dots/cm
  - x-density (high byte, low byte), should be <> 0
  - y-density (high byte, low byte), should be <> 0
  - thumbnail width (1 byte)
  - thumbnail height (1 byte)
  - n bytes for thumbnail (RGB 24 bit), n = width*height*3

Remarks:
  - If there's no 'JFIF'#0, or the length is < 16, then it is probably not
    a JFIF segment and should be ignored.
  - Normally units=0, x-dens=1, y-dens=1, meaning that the aspect ratio is
    1:1 (evenly scaled).
  - JFIF files including thumbnails are very rare, the thumbnail can
    usually be ignored. If there's no thumbnail, then width=0 and height=0.
  - If the length doesn't match the thumbnail size, a warning may be
    printed, then continue decoding.
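(A minimal C sketch of pulling the image dimensions and sampling factors
out of an SOF0 payload that has already been read into memory, i.e.
everything after the 2-byte length word; the struct and names are mine:)

  struct sof0_component { int id, h, v, qtable; };

  struct sof0_info {
      int precision, height, width, nr_components;
      struct sof0_component comp[4];
  };

  /* Parse the payload of an SOF0 segment (the bytes following the length).
     Returns 0 on success, -1 if the component count is unsupported. */
  int parse_sof0(const unsigned char *p, struct sof0_info *info)
  {
      info->precision     = p[0];
      info->height        = (p[1] << 8) | p[2];   /* high byte first */
      info->width         = (p[3] << 8) | p[4];
      info->nr_components = p[5];
      if (info->nr_components < 1 || info->nr_components > 4)
          return -1;

      for (int i = 0; i < info->nr_components; i++) {
          const unsigned char *c = p + 6 + 3 * i;
          info->comp[i].id     = c[0];
          info->comp[i].h      = c[1] >> 4;       /* bits 4..7: horizontal */
          info->comp[i].v      = c[1] & 0x0F;     /* bits 0..3: vertical   */
          info->comp[i].qtable = c[2];
      }
      return 0;
  }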
DRI: Define Restart Interval:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  - $ff, $dd (DRI)
  - length (high byte, low byte), must be = 4
  - restart interval (high byte, low byte) in units of MCU blocks, meaning
    that every n MCU blocks an RSTn marker can be found. The first marker
    will be RST0, then RST1 etc., after RST7 repeating from RST0.

DQT: Define Quantization Table:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  - $ff, $db (DQT)
  - length (high byte, low byte)
  - QT information (1 byte):
     bit 0..3: number of QT (0..3, otherwise error)
     bit 4..7: precision of QT, 0 = 8 bit, otherwise 16 bit
  - n bytes QT, n = 64*(precision+1)

Remarks:
  - A single DQT segment may contain multiple QTs, each with its own
    information byte.
  - For precision=1 (16 bit), the order is high-low for each of the 64
    words.

DAC: Define Arithmetic Table:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Current software does not support arithmetic coding for legal reasons.
  JPEG files using arithmetic coding cannot be processed.

DHT: Define Huffman Table:
~~~~~~~~~~~~~~~~~~~~~~~~~~
  - $ff, $c4 (DHT)
  - length (high byte, low byte)
  - HT information (1 byte):
     bit 0..3: number of HT (0..3, otherwise error)
     bit 4   : type of HT, 0 = DC table, 1 = AC table
     bit 5..7: not used, must be 0
  - 16 bytes: number of symbols with codes of length 1..16, the sum of
    these bytes is the total number of codes, which must be <= 256
  - n bytes: table containing the symbols in order of increasing code
    length (n = total number of codes)

Remarks:
  - A single DHT segment may contain multiple HTs, each with its own
    information byte.

COM: Comment:
~~~~~~~~~~~~~
  - $ff, $fe (COM)
  - length (high byte, low byte) of the comment = L+2
  - the comment = a stream of bytes of length L

SOS: Start Of Scan:
~~~~~~~~~~~~~~~~~~~
  - $ff, $da (SOS)
  - length (high byte, low byte), must be 6+2*(number of components in
    scan)
  - number of components in scan (1 byte), must be >= 1 and <= 4
    (otherwise error), usually 1 or 3
  - for each component: 2 bytes
     - component id (1 = Y, 2 = Cb, 3 = Cr, 4 = I, 5 = Q), see SOF0
     - Huffman table to use:
        - bit 0..3: AC table (0..3)
        - bit 4..7: DC table (0..3)
  - 3 bytes to be ignored (???)

Remarks:
  - The image data (scans) immediately follows the SOS segment.
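(To close, a minimal C sketch of the bit-fetching routine the Huffman
decoder needs inside the entropy-coded data that follows SOS: it un-stuffs
the 'FF00' sequences described earlier and stops at any real marker (for
example RSTn or EOI); the struct and the names are mine:)

  #include <stdio.h>

  /* Bit reader for the entropy-coded data following an SOS segment. */
  struct bitreader {
      FILE *f;
      int byte;        /* current byte being consumed */
      int bits_left;   /* number of unread bits in 'byte' */
      int marker;      /* 0, or the marker type byte that ended the data */
  };

  /* Fetch the next bit (most significant bit first). Returns 0 or 1, or -1
     when a marker (RSTn, EOI, ...) or EOF is reached. The stuffed sequence
     0xFF 0x00 is read as a plain 0xFF data byte. */
  int next_bit(struct bitreader *br)
  {
      if (br->bits_left == 0) {
          int b = fgetc(br->f);
          if (b == EOF) return -1;
          if (b == 0xFF) {
              int b2 = fgetc(br->f);
              if (b2 != 0x00) {            /* a real marker, not stuffing */
                  br->marker = b2;
                  return -1;
              }
          }
          br->byte = b;
          br->bits_left = 8;
      }
      br->bits_left--;
      return (br->byte >> br->bits_left) & 1;
  }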