Binary Data and File Formats

20 August 2022

Introduction

Computers work with data stored in binary formats. Reading and interpreting binary data is an important part of understanding file formats so their data can be read and interpreted

Bits and Bytes, and Binary

Binary files store data using bits

A bit can be either a 1 or 0. When working with bit-data, it's useful to group them into sets of 8, known as bytes

A byte is a sequence of 8-bits which can contain a value ranging from 0 to 255 - these are referred to as the decimal representation

Bytes consist of 8-bit, with each position representing a power of 2 from 2^0 to 2^7, as seen below:

0 0 0 0 0 0 0 0
| | | | | | | |  
| | | | | | | |__ 2^0 - 0 or 1
| | | | | | |____ 2^1 - 0 or 2
| | | | | |______ 2^2 - 0 or 4
| | | | |________ 2^3 - 0 or 8
| | | |__________ 2^4 - 0 or 16
| | |____________ 2^5 - 0 or 32
| |______________ 2^6 - 0 or 64
|________________ 2^7 - 0 or 128

total:                  0 to 255

The byte above represents the value for 0, this is because all the bits have a value of 0

Using the above explanation, the number 1 is represented using the following:

0 0 0 0 0 0 0 1
| | | | | | | |  
| | | | | | | |__ 2^0 - 1
| | | | | | |____ 2^1 - 0
| | | | | |______ 2^2 - 0
| | | | |________ 2^3 - 0
| | | |__________ 2^4 - 0
| | |____________ 2^5 - 0
| |______________ 2^6 - 0
|________________ 2^7 - 0

total:                  1

Where the position for 2^0 is the only bit with a value (1)

Similarly, 2 is represented as:

0 0 0 0 0 0 2 0
| | | | | | | |  
| | | | | | | |__ 2^0 - 0
| | | | | | |____ 2^1 - 2
| | | | | |______ 2^2 - 0
| | | | |________ 2^3 - 0
| | | |__________ 2^4 - 0
| | |____________ 2^5 - 0
| |______________ 2^6 - 0
|________________ 2^7 - 0

total:                  2

Where the bit for 2^1 has a value

Or the number 5 with bits 2^0 and 2^2 having a value:

0 0 0 0 0 1 0 1
| | | | | | | |  
| | | | | | | |__ 2^0 - 1
| | | | | | |____ 2^1 - 0
| | | | | |______ 2^2 - 4
| | | | |________ 2^3 - 0
| | | |__________ 2^4 - 0
| | |____________ 2^5 - 0
| |______________ 2^6 - 0
|________________ 2^7 - 0

total:                  5

Which is calculated by adding 2^0 + 2^2 = 1 + 4 = 5

A larger number, like 234 is:

1 1 1 0 1 0 1 0
| | | | | | | |  
| | | | | | | |__ 2^0 - 0
| | | | | | |____ 2^1 - 2
| | | | | |______ 2^2 - 0
| | | | |________ 2^3 - 8
| | | |__________ 2^4 - 0
| | |____________ 2^5 - 32
| |______________ 2^6 - 64
|________________ 2^7 - 128

total:                  234

The calculation for the above value is:

2^1 + 2^3 + 2^5 + 2^6 + 2^7 = total = 234

When substituting the powers of 2:

2 + 8 + 32 + 64 + 128 = total = 234

The numbers discussed above are all 1-byte (8-bit) numbers, which have a range between 0 and 255, adding bits to the value will allow the representation of bigger numbers, for example, a 2-byte (16-bit) number can have a value from 0 to 65,535

Hexadecimal (Hex)

In the above example, numbers are represented in binary format (e.g. 000000020), or decimal format (e.g. 2)

When looking at binary data, it can be a bit easier to navigate around by representing data in hexadecimal (hex) format - which represents every 4 bits as a value ranging from 0-15, so, similar to the byte example above, but instead using 4-bits:

0 0 0 0
| | | |  
| | | |__ 2^0 - 0 or 1
| | |____ 2^1 - 0 or 2
| |______ 2^2 - 0 or 4
|________ 2^3 - 0 or 8

Using what's already been discussed, the number 12 can be represented in bits as 1100,

Hex numbers additionally convert each of these values into a value from 0-9 or A-F, as seen in the following table:

Decimal Bits Hex
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
10 1010 A
11 1011 B
12 1100 C
13 1101 D
14 1110 E
15 1111 F

Using the binary representation, a byte can be represented using 2 Hex values which are taken by using the first 4-bits as the first hex value, and the second 4-bits as the second hex value

For example, the value for 234, represented as bits: 11101010 can be split into 2 sets of 4-bits 1110 1010, using the table above, this becomes EA in hex

Binary Files

Binary files encode data using bits, when viewing them, it's convenient to view the data in them using bytes represented as hex values as was shown above

Binary files usually require some knowledge of how their data is structured to correctly interpret the information. This is usually described in the specification for the file format

Text Files

Plain text files are usually in the UTF-8 or UTF-16 format - this means that they use 8-bits or 16-bits to represent each character, but there are lots of other formats that a text file can use

UTF-8 data can be read by converting the binary data to text data using a table which maps the byte/hex value to the character -for the UTF-8 format, the character "A" is encoded in hex as 41 and "z" is 7a

The Hexadecimal representation for a text file that contains the following UTF-8 data:

Hello World!

Would be stored as a binary file which contains:

48 65 6C 6C 6F 20 57 6F 72 6C 64 21

Putting the hex below the text content, the hex to character mapping can be seen:

H  e  l  l  o     W  o  r  l  d  !
48 65 6C 6C 6F 20 57 6F 72 6C 64 21

Binary files can be viewed in a hex editor to see the raw binary data, but interpreting these files depends on the format used and will differ between file formats

Conclusion

Computers store data using bits. Bits can be structured into sets of 8-bits, called a byte

Data can be represented using bits, bytes, decimal values, or hexadecimal values

Files store data using binary. Binary data can be represented as decimal or hex depending and can be viewed as whichever is appropriate

References