Huffman Coding implementation in Java

September 12, 2009 at 01:26:08
Specs: Windows XP, athlon xp3000+/512dd
Hi,

I'm currently working on a program in Java to compress a text file with Huffman Coding. I've got the my program to encode the characters properly. However, the output is a STRING of 1's and 0's.

I.e, if my current coding is a=10, b=110

the output file for "ab" would contain 10110 IN A STRING.

So, the output file ends up being larger than the original input, because each 1 and 0 is taking up one byte each.

Does anyone know how to convert this string of 1's and 0's to its 'true' binary format, so that each 1 and 0 takes up one bit?

Thanks in advanced


See More: Huffman Coding implementation in Java

Report •


#1
September 12, 2009 at 23:51:21
Hi echelon4, As far as I know Java doesn't allow you to write file data at bit level. It allows you to write file data at byte level though. It can significantly optimize your code as each 1 and 0 will not occupy 1 byte each.
Here is the code: -
---------------------------code-------------------------------------------------------
import java.io.*;
public class sampleJ {
public static void main(String[] argv) throws IOException {

String bin1="110";
//converting to decimal equivalent of 110(=6)
int output=Integer.parseInt(bin1,2);

FileOutputStream os = new FileOutputStream("binary.dat");
os.write(output);
os.flush();
os.close();
FileInputStream f=new FileInputStream("binary.dat");
int input=f.read();
f.close();

String bin2=Integer.toBinaryString(input);
System.out.println("binary val= "+bin2);
}

}
---------------------------------------------------------------------------------------

As per the above code, lets say you have the binary value 110 stored in a String(as you had mentioned). The code will convert it into an Integer and then write a single byte 'as an Integer' to the file.This means 110 will not have each 1 and 0 occupying 1 byte each but only 1 byte as a whole. The entire file size(containing 110 in Integer format) as seen on my system was 1 byte.
After this you can read this file back and convert the read Integer data back to binary form using the parseInt method.
It will be stored in a String that means you can access the individual bits by using the charAt methos of Character class.


Report •

#2
September 13, 2009 at 00:12:34
But do remember you can't write a binary number bigger than 11111111 (8 bits) with one call to write(). write method will only write one byte at a time even though its argument is an int of 32 bits -- It will ignore the high order bits . So you will have to call write again for writing the next byte. In this way if you write 8 bits with each call of write method then it will be equivalent to writing one bit for each 1 or zero. So with judicious use of above code you can totally achieve your objective of each 1 and zero occupying only one bit !
Hope this helps you.

Report •

#3
September 13, 2009 at 04:48:56
Hi, thanks very much for you reply.
I do have a problem implementing what you suggested though.

If I have say, several characters to encode, then there may be some characters which require more than 3 bits. E.g if i had the code "g"=111110.

if i code that with 111 and 110 where 111=7, and 110=6, then output is 76, taking up 2 bytes. This makes the encoding greater than the 1 byte used for "g" alone.

However, if i encode the that as a whole six bits, then 111110=126. taking 3 bytes, which is still greater than the 1 byte used for 'g' alone?

So it seems that when i try to huffman code input with several different characters, i will get alot of inefficiencies. Is there a way to overcome this?


Report •

Related Solutions

#4
September 13, 2009 at 05:27:31
So you really don't know about Java's Bitwise and bit shift operators?

Report •

#5
September 13, 2009 at 09:37:18
If I have say, several characters to encode, then there may be some characters which require more than 3 bits.
I never said anything about 3 bits restriction - it was just for demo that I wrote 110 in my code above.

However, if i encode the that as a whole six bits, then 111110=126. taking 3 bytes, which is still greater than the 1 byte used for 'g' alone?
Why will it take 3 bytes? 111110 is 7 bits long so when you write it, it will only occupy 1 byte. I said that all operations will be at 'byte' level - 8 bits. The file size will be only 1 byte -you can try it by putting bin1 variable in my code as 111110 and checking the size of file which is created. It will be one byte only. So as long as you are using distinct characters which have value less than 11111111 (ie: you have less then 255 distinct characters to encode ) you should not have any problem. You can read and write one character (like g=111110) at a time with read and write functions. So I didn't really understand the 3 bit and 2 bit thing you wrote above.
Even if you have more than 255 distinct characters then you can use two byte notation as per which you will have to conduct two read and write operations per character. It is still efficient and consistent approach.

I hope you get my point.It is byte level so with less than 255 distinct characters there is no problem.


Report •

#6
September 13, 2009 at 19:25:34
Thanks Suyash.

I understand now.


Report •


Ask Question