Introduction
Hello, and welcome to my article. Sometimes, I wonder where all my ideas for my articles come from; at least I now know why I cannot sleep most evenings. My brain doesn’t switch off. It is a blessing and a curse.
Today, you will learn how to use the Shannon entropy equation to measure the information content of data in your .NET applications.
Entropy
Entropy can be defined in the context of a probabilistic model. For example: a fair coin flip has an entropy of 1 bit per flip. A source that always generates a long sequence of As has an entropy of 0, because the next character will always be an ‘A’.
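Both examples can be checked against the standard Shannon entropy formula. The notation here is an assumption added for illustration: H is the entropy in bits, and p_i is the probability of the i-th symbol.

```latex
H(X) = \sum_{i} p_i \log_2 \frac{1}{p_i} = -\sum_{i} p_i \log_2 p_i

% Fair coin: two outcomes, each with probability 1/2
H_{\text{coin}} = \tfrac{1}{2}\log_2 2 + \tfrac{1}{2}\log_2 2 = 1 \text{ bit}

% Source that only ever emits 'A': one outcome with probability 1
H_{\text{A}} = 1 \cdot \log_2 1 = 0 \text{ bits}
```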
Shannon Entropy
Claude Shannon’s entropy measures the information contained in a message; for example, it captures redundancy in language structure and the occurrence frequencies of letters or word pairs. Shannon entropy provides a way to determine the average minimum number of bits per symbol needed to encode a string, based on the frequency of the symbols inside the string.
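To make the link between symbol frequencies and bits per symbol concrete, here is a minimal sketch that computes the entropy of a string directly. The EntropySketch class and its BitsPerSymbol method are hypothetical names used only for this illustration; they are not part of the project built later in the article.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the article's ShannonEntropy class.
static class EntropySketch
{
    // Returns the Shannon entropy of the text in bits per symbol.
    public static double BitsPerSymbol(string text)
    {
        if (string.IsNullOrEmpty(text)) return 0.0;

        return text
            .GroupBy(ch => ch)                             // one group per distinct symbol
            .Select(g => (double)g.Count() / text.Length)  // relative frequency p of the symbol
            .Sum(p => p * Math.Log(1.0 / p, 2));           // p * log2(1 / p), summed over symbols
    }

    static void Main()
    {
        Console.WriteLine(BitsPerSymbol("AAAAAAAA")); // a single symbol: no information per symbol
        Console.WriteLine(BitsPerSymbol("ABABABAB")); // two equally frequent symbols
        Console.WriteLine(BitsPerSymbol("ABCDABCD")); // four equally frequent symbols
    }
}
```

A string with two equally frequent symbols needs about 1 bit per symbol, and one with four equally frequent symbols needs about 2 bits, which matches the intuition that more variety takes more bits to encode.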
Our Project
Create a new C# or Visual Basic.NET Windows Forms project. Once the default form has loaded, add one Button and one ListBox to it.
Code
Add a new Class to your project and name it ShannonEntropy; then, add the necessary NameSpaces.
C#
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
VB.NET
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Add the following fields.
C#
SortedList<byte, int> slTimeSymbolAppears;
SortedList<byte, double> slEntropy;
double dblEntropy;
bool blnUsed;
int iSize;
VB.NET
Private slTimeSymbolAppears As SortedList(Of Byte, Integer)
Private slEntropy As SortedList(Of Byte, Double)
Private dblEntropy As Double
Private blnUsed As Boolean
Private iSize As Integer
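slTimeSymbolAppears maps each byte value (symbol) to the number of times it appears in the input, and slEntropy holds each symbol's probability once the entropy has been calculated. dblEntropy will contain the result of the calculation, iSize is the total number of bytes read, and blnUsed indicates whether new data has been added since the entropy was last calculated, so that the result is recalculated only when needed. Add the Properties.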
C#
public int Size
{
    get { return iSize; }
    private set { iSize = value; }
}

public int Unique
{
    get { return slTimeSymbolAppears.Count; }
}

public double Entropy
{
    get { return GetEntropy(); }
}

public Dictionary<byte, int> Distribution
{
    get { return SortedDistribution(); }
}

public Dictionary<byte, double> Probability
{
    get { return SortedProbability(); }
}
VB.NET
Public Property Size As Integer
    Get
        Return iSize
    End Get
    Private Set(ByVal value As Integer)
        iSize = value
    End Set
End Property

Public ReadOnly Property Unique As Integer
    Get
        Return slTimeSymbolAppears.Count
    End Get
End Property

Public ReadOnly Property Entropy As Double
    Get
        Return GetEntropy()
    End Get
End Property

Public ReadOnly Property Distribution As Dictionary(Of Byte, Integer)
    Get
        Return SortedDistribution()
    End Get
End Property

Public ReadOnly Property Probability As Dictionary(Of Byte, Double)
    Get
        Return SortedProbability()
    End Get
End Property
Add the rest of the Functions and the Constructor.
C#
// Symbol that occurs most often (throws if no data has been added).
// The SortedList is ordered by key, so Keys[0] would only give the smallest byte.
public byte GreatestDistribution()
{
    return slTimeSymbolAppears.OrderByDescending(e => e.Value).First().Key;
}

// Symbol with the highest probability (throws if no data has been added).
public byte GreatestProbability()
{
    GetEntropy(); // make sure the probabilities are up to date
    return slEntropy.OrderByDescending(e => e.Value).First().Key;
}

// Number of times the given symbol appears.
public double SymbolDistribution(byte bSymbol)
{
    return slTimeSymbolAppears[bSymbol];
}

// Despite its name, slEntropy stores each symbol's probability.
public double SymbolEntropy(byte bSymbol)
{
    GetEntropy(); // make sure the probabilities are up to date
    return slEntropy[bSymbol];
}

// Symbols and their counts, most frequent first.
// Note: Dictionary does not formally guarantee enumeration order.
public Dictionary<byte, int> SortedDistribution()
{
    List<Tuple<int, byte>> lstEntries = new List<Tuple<int, byte>>();

    foreach (KeyValuePair<byte, int> e in slTimeSymbolAppears)
    {
        lstEntries.Add(new Tuple<int, byte>(e.Value, e.Key));
    }

    lstEntries.Sort();
    lstEntries.Reverse();

    Dictionary<byte, int> dicResult = new Dictionary<byte, int>();

    foreach (Tuple<int, byte> e in lstEntries)
    {
        dicResult.Add(e.Item2, e.Item1);
    }

    return dicResult;
}

// Symbols and their probabilities, most probable first.
public Dictionary<byte, double> SortedProbability()
{
    GetEntropy(); // make sure the probabilities are up to date

    List<Tuple<double, byte>> lstEntries = new List<Tuple<double, byte>>();

    foreach (KeyValuePair<byte, double> e in slEntropy)
    {
        lstEntries.Add(new Tuple<double, byte>(e.Value, e.Key));
    }

    lstEntries.Sort();
    lstEntries.Reverse();

    Dictionary<byte, double> dicResult = new Dictionary<byte, double>();

    foreach (Tuple<double, byte> e in lstEntries)
    {
        dicResult.Add(e.Item2, e.Item1);
    }

    return dicResult;
}

public double GetEntropy()
{
    // Nothing new has been added since the last calculation
    if (!blnUsed)
    {
        return dblEntropy;
    }

    dblEntropy = 0;
    slEntropy = new SortedList<byte, double>();

    // Probability of each symbol = its count / total number of bytes
    foreach (KeyValuePair<byte, int> e in slTimeSymbolAppears)
    {
        slEntropy.Add(e.Key, (double)slTimeSymbolAppears[e.Key] / (double)iSize);
    }

    // Shannon entropy: sum of p * log2(1 / p)
    foreach (KeyValuePair<byte, double> e in slEntropy)
    {
        dblEntropy += e.Value * Math.Log((1 / e.Value), 2);
    }

    blnUsed = false;

    return dblEntropy;
}

public void GetBytes(byte[] bBytes)
{
    // Check for null first; otherwise bBytes.Length throws on a null array
    if (bBytes == null || bBytes.Length < 1)
    {
        return;
    }

    blnUsed = true;
    iSize += bBytes.Length;

    foreach (byte bt in bBytes)
    {
        if (!slTimeSymbolAppears.ContainsKey(bt))
        {
            slTimeSymbolAppears.Add(bt, 1);
            continue;
        }

        slTimeSymbolAppears[bt]++;
    }
}

public void GetBytes(string strBytes)
{
    GetBytes(StringToByteArray(strBytes));
}

byte[] StringToByteArray(string strInput)
{
    // Cast<byte>() on a char[] throws InvalidCastException at run time,
    // so the string is encoded as UTF-8 bytes instead.
    return System.Text.Encoding.UTF8.GetBytes(strInput);
}

void Clear()
{
    blnUsed = true;
    dblEntropy = 0;
    iSize = 0;

    slTimeSymbolAppears = new SortedList<byte, int>();
    slEntropy = new SortedList<byte, double>();
}

public ShannonEntropy(string fileName)
{
    Clear();

    if (File.Exists(fileName))
    {
        GetBytes(File.ReadAllBytes(fileName));
        GetEntropy();
    }
}

public ShannonEntropy()
{
    Clear();
}
VB.NET
' Symbol that occurs most often (throws if no data has been added).
' The SortedList is ordered by key, so Keys(0) would only give the smallest byte.
Public Function GreatestDistribution() As Byte
    Return slTimeSymbolAppears.OrderByDescending(Function(e) e.Value).First().Key
End Function

' Symbol with the highest probability (throws if no data has been added).
Public Function GreatestProbability() As Byte
    GetEntropy() ' make sure the probabilities are up to date
    Return slEntropy.OrderByDescending(Function(e) e.Value).First().Key
End Function

' Number of times the given symbol appears.
Public Function SymbolDistribution(ByVal bSymbol As Byte) As Double
    Return slTimeSymbolAppears(bSymbol)
End Function

' Despite its name, slEntropy stores each symbol's probability.
Public Function SymbolEntropy(ByVal bSymbol As Byte) As Double
    GetEntropy() ' make sure the probabilities are up to date
    Return slEntropy(bSymbol)
End Function

' Symbols and their counts, most frequent first.
' Note: Dictionary does not formally guarantee enumeration order.
Public Function SortedDistribution() As Dictionary(Of Byte, Integer)
    Dim lstEntries As List(Of Tuple(Of Integer, Byte)) = _
        New List(Of Tuple(Of Integer, Byte))()

    For Each e As KeyValuePair(Of Byte, Integer) In slTimeSymbolAppears
        lstEntries.Add(New Tuple(Of Integer, Byte)(e.Value, e.Key))
    Next

    lstEntries.Sort()
    lstEntries.Reverse()

    Dim dicResult As Dictionary(Of Byte, Integer) = _
        New Dictionary(Of Byte, Integer)()

    For Each e As Tuple(Of Integer, Byte) In lstEntries
        dicResult.Add(e.Item2, e.Item1)
    Next

    Return dicResult
End Function

' Symbols and their probabilities, most probable first.
Public Function SortedProbability() As Dictionary(Of Byte, Double)
    GetEntropy() ' make sure the probabilities are up to date

    Dim lstEntries As List(Of Tuple(Of Double, Byte)) = _
        New List(Of Tuple(Of Double, Byte))()

    For Each e As KeyValuePair(Of Byte, Double) In slEntropy
        lstEntries.Add(New Tuple(Of Double, Byte)(e.Value, e.Key))
    Next

    lstEntries.Sort()
    lstEntries.Reverse()

    Dim dicResult As Dictionary(Of Byte, Double) = _
        New Dictionary(Of Byte, Double)()

    For Each e As Tuple(Of Double, Byte) In lstEntries
        dicResult.Add(e.Item2, e.Item1)
    Next

    Return dicResult
End Function

Public Function GetEntropy() As Double
    ' Nothing new has been added since the last calculation
    If Not blnUsed Then
        Return dblEntropy
    End If

    dblEntropy = 0
    slEntropy = New SortedList(Of Byte, Double)()

    ' Probability of each symbol = its count / total number of bytes
    For Each e As KeyValuePair(Of Byte, Integer) In slTimeSymbolAppears
        slEntropy.Add(e.Key, CDbl(slTimeSymbolAppears(e.Key)) / CDbl(iSize))
    Next

    ' Shannon entropy: sum of p * log2(1 / p)
    For Each e As KeyValuePair(Of Byte, Double) In slEntropy
        dblEntropy += e.Value * Math.Log((1 / e.Value), 2)
    Next

    blnUsed = False

    Return dblEntropy
End Function

Public Sub GetBytes(ByVal bBytes As Byte())
    ' Check for Nothing first; otherwise bBytes.Length throws on a null array
    If bBytes Is Nothing OrElse bBytes.Length < 1 Then
        Return
    End If

    blnUsed = True
    iSize += bBytes.Length

    For Each bt As Byte In bBytes
        If Not slTimeSymbolAppears.ContainsKey(bt) Then
            slTimeSymbolAppears.Add(bt, 1)
            Continue For
        End If

        slTimeSymbolAppears(bt) += 1
    Next
End Sub

Public Sub GetBytes(ByVal strBytes As String)
    GetBytes(StringToByteArray(strBytes))
End Sub

Private Function StringToByteArray(ByVal strInput As String) As Byte()
    ' Cast(Of Byte)() on a Char array throws InvalidCastException at run time,
    ' so the string is encoded as UTF-8 bytes instead.
    Return System.Text.Encoding.UTF8.GetBytes(strInput)
End Function

Private Sub Clear()
    blnUsed = True
    dblEntropy = 0
    iSize = 0

    slTimeSymbolAppears = New SortedList(Of Byte, Integer)()
    slEntropy = New SortedList(Of Byte, Double)()
End Sub

Public Sub New(ByVal fileName As String)
    Clear()

    If File.Exists(fileName) Then
        GetBytes(File.ReadAllBytes(fileName))
        GetEntropy()
    End If
End Sub

Public Sub New()
    Clear()
End Sub
Add the code for your Form.
C#
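using System;
using System.Windows.Forms;

namespace ShannonEntropy_C
{
    public partial class Form1 : Form
    {
        // In a verbatim (@) string, each backslash is written only once
        ShannonEntropy se = new ShannonEntropy(@"C:\Temp\TestFile.txt");

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Calculate the entropy of the file and show it in the ListBox
            double ge = se.GetEntropy();

            listBox1.Items.Add(ge.ToString());
        }
    }
}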
VB.NET
Public Class Form1

    Private se As ShannonEntropy = New ShannonEntropy("C:\Temp\TestFile.txt")

    Private Sub button1_Click(sender As Object, e As EventArgs) _
        Handles button1.Click

        Dim ge As Double = se.GetEntropy()

        listBox1.Items.Add(ge.ToString())
    End Sub

End Class
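When you click the button, the application calculates the entropy of the file and displays it in the ListBox. I have included the text file, but keep in mind that the path in the code must point to it, and you might not have a Temp folder on your disk. If the file cannot be found, the class reads no data and the entropy shown is 0.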
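If you do not have a Temp folder on your disk, one option is to build the path from the current user's temporary folder instead. The following is only a sketch: the TestFileSketch class, the EnsureTestFile method, and the sample file content are assumptions made for illustration and are not part of the project above.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the article's project.
static class TestFileSketch
{
    // Creates a small text file in the user's temporary folder
    // (if it is not already there) and returns its full path.
    public static string EnsureTestFile()
    {
        // Path.GetTempPath() returns the current user's temporary folder,
        // so no C:\Temp folder is needed.
        string path = Path.Combine(Path.GetTempPath(), "TestFile.txt");

        if (!File.Exists(path))
        {
            File.WriteAllText(path, "Hello, Shannon entropy!"); // assumed sample content
        }

        return path;
    }

    static void Main()
    {
        string path = EnsureTestFile();
        Console.WriteLine(path + " exists: " + File.Exists(path));
    }
}
```

The path returned by this helper could then be passed to the ShannonEntropy constructor in place of the hard-coded one.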
Figure 1 shows the result.
Figure 1: Running
Conclusion
In this article, you have learned how useful entropy can be in determining repetitive values. Until next time, happy coding!