Why is my Huffman Compression making the Compressed output a larger size than my original text?

ajox3412

I am trying to write a program in Python to compress text using Huffman Compression. The problem I am having is that the compressed text ends up being larger than the original text when saved to a text file, and I am not sure why this is the case. I have implemented a class called Heapnode to help me build a priority queue and build my binary tree using Heapq. In the class HuffmanCoding I have implemented methods to get the frequency of each character, make a priority queue, use it to merge 'nodes' into a sort of binary tree, and to traverse that tree to build huffman codes for each character.



class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):  # if the frequency of one character is lower than the frequency of another one
        return self.freq < other.freq

    def __eq__(self, other):  # if two characters have the same frequencies
        if other == None:
            return False
        if not isinstance(other, HeapNode):  # checks if the character is a node or not
            return False
        return self.freq == other.freq


class HuffmanCoding:
    def __init__(self, text_to_compress):
        self.text_to_compress = text_to_compress  # text that will be compressed
        self.heap = []
        self.codes = {}  # will store the Huffman code of each character
        self.decompress_map = {}

    def get_frequency(self):  # method to find frequency of each character in text - RLE
        frequency_Dictionary = {}  # creates an empty dictionary where frequency of each character will be stored

        for character in self.text_to_compress:  # Iterates through the text to be compressed
            if character in frequency_Dictionary:
                frequency_Dictionary[character] = frequency_Dictionary[character] + 1  # if character already exists in
                # dictionary, its value is increased by 1
            else:
                frequency_Dictionary[character] = 1  # if character is not present in list, its value is set to 1

        return frequency_Dictionary

    def make_queue(self, frequency):  # creates the priority queue of each character and its associated frequency
        for key in frequency:
            node = HeapNode(key, frequency[key])  # create node (character) and store its frequency alongside it
            heapq.heappush(self.heap, node)  # Push the node into the heap

    def merge_nodes(
            self):  # creates HuffmanTree by getting the two minimum nodes and merging them together, until theres
        # only one node left
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)  # pop node from top of heap
            node2 = heapq.heappop(self.heap)  # pop next node which is now at the top of heap

            merged = HeapNode(None, node1.freq + node2.freq)  # merge the two nodes we popped out from heap
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)  # push merged node back into the heap

    def make_codes(self, root, current_code):  # Creates Huffman code for each character
        if root == None:
            return

        if root.char != None:
            self.codes[root.char] = current_code
            self.decompress_map[current_code] = root.char

        self.make_codes(root.left, current_code + "0")  # Every time you traverse left, add a 0 - Recursive Call
        self.make_codes(root.right, current_code + "1")  # Every time you traverse right, add a 1 - Recursive Call

    def assignCodes(self):  # Assigns codes to each character
        root = heapq.heappop(self.heap)  # extracts root node from heap
        current_code = ""
        self.make_codes(root, current_code)

    def get_compressed_text(self, text):  # Replaces characters in original text with codes
        compressed_text = ""
        for character in text:
            compressed_text += self.codes[character]
        return compressed_text

    def show_compressed_text(self):

        frequency = self.get_frequency()
        self.make_queue(frequency)
        self.merge_nodes()
        self.assignCodes()

        compressed_text = self.get_compressed_text(self.text_to_compress)
        return compressed_text


print(HuffmanCoding('This sentence will get compressed').show_compressed_text())
Mark Adler

Your codes are character strings containing the ASCII characters "0" and "1". Each character takes eight bits, so you are expanding your compressed data by a factor of eight.

You need to instead make variable-length binary codes, where one bit takes one bit of space. You then need to be able to concatenate those variable-length codes to make a sequence of bytes (starting with b'', not ""), filling out the last byte with zero bits as needed. Then you have a sequence of bytes, each containing eight of the bits from your codes. Each can have any of the 256 possible byte values.

You would use the bit operators on integers to construct this, specifically shift: <<, >>), or: |, and and: &. You can use bytes() to convert integers to bytes. E.g.:

>>> bytes([65,66,67])
b'ABC'

Also note that you are compressing a very short string that won't get compressed even when you're writing the output as bits correctly. Especially once you send the code along with it. You'd need to test on much more text in order for the compression to take advantage of the frequencies of different letters in the English language.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

My compressed file have larger file size than the original file

Why is my text file larger than my binary file?

Compressed (by tar gzip) size is far larger than the original folder

Why is the resident set size (RSS) of my program larger than the executable?

Why My output is printing # in Huffman coding

How do I make my div scale to larger than it's original size using css animation?

CSV file is being created larger than the size of my original data in python/pycharm?

Why my output file's size is less than origin file?

Why my compressed output using GZIPOutputStream is not matching when we decompressed the same string using any online compression tools

Why is my Java socket receiving an array size larger than what was sent?

Why is my containing span larger than my svg?

Why my text is crushed in a larger device?

Why SQLITE doesn't accept My INTEGER/TEXT data larger than 8, using Python 3?

Why Huffman's coding algorithm takes more bit than the original size?

why is my css not making the text bold and red

Why compression output larger zip file

Why is my PostgreSQL table larger (in GB) than the csv it came from?

Why is my file with 10 random bytes larger than 10 bytes?

Why does my program think that 72 is larger than 500?

Why is my frame larger than what I set it to be?

Why is my min_free_kbytes larger than the documented calculation (and larger than the documented maximum)?

Why is the serialized size of my class larger, then the sum of its variables

Why the size of base64-encoded string is larger than the original file

Why is using __slots__ in my class not making a difference in size?

Why is my JFrame size smaller than the size that I input into setPreferredSize()?

Why is one of my FontAwesome icons larger than the other two in my inline-block list?

Why does this line not output my text file?

Why is my button addon way bigger than my text input?

Why is my binary output different than the expected output?