1.2. Quickstart

1.2.1. Database

To start using DigestDB you first need to create a database. Let’s create a DigestDB and tell it to use the current directory for storing any binary content.

from digestdb import DigestDB

db = DigestDB('.')

By default the DigestDB will create a file called digestdb.db and a directory called digestdb.data. The digestdb.db file is a simple SQLite database that stores the categories and digests of the blobs. Categories are used to group binary content to facilitate searching (e.g. JavaScript, CSS, images, etc.). The digestdb.data directory is the top-level directory in which all the binary blobs are stored.

When the DigestDB is instantiated it checks for a lock file. The lock file ensures that it has exclusive access to the data; otherwise there is a risk of losing synchronisation between the files on disk and those listed in the database. If the DigestDB encounters a lock file when starting up it will report the error and shut down.

Before writing or reading data from the DigestDB it must first be opened.

db.open()

Conversely, when you are finished with the database it must be closed.

db.close()

If you re-open the database it will simply continue on from where it left off.
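Because an unclean shutdown can leave the lock file behind (see above), a simple pattern is to pair open and close in a try/finally block so the database is always closed, even if an error occurs. A sketch using only the calls shown above:

db.open()
try:
    pass  # read and write data here
finally:
    db.close()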

If you want to create a new database you can explicitly specify filename and data_dir.
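For example, assuming filename and data_dir are keyword arguments to the constructor (a sketch; the exact signature may differ):

db = DigestDB('.', filename='assets.db', data_dir='assets.data')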

The DigestDB takes a number of optional arguments. The dir_depth is one of the most important settings and is discussed in detail in the following section.

1.2.1.1. Database Depth

To understand the database directory depth we need some background first.

There isn’t any hard limit on the number of files that can be stored in a directory. However, operations can become very slow as the number of files increases. The time it takes to list a directory and check for the existence of a file grows with the number of files it contains. So we need a strategy to balance the files over some number of directories to avoid this problem.

The file path at which a blob is stored is derived from the blob’s hash. As a new item is added to the database a hash (SHA-256 by default) is calculated. The default DigestDB dir_depth is 3, which means that the first three bytes of the hash digest are used to construct the directory structure.

Given the following hash:

8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d

A dir_depth of 1 would result in the data item being stored in the following location:

8f/8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d

A dir_depth of 3 would result in the data item being stored in the following location:

8f/dd/8b/8fdd8b7dfa0d7d4f761da78e76d62ec4bee3b1847a6ad48507090e13752b2d
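The mapping from digest to path is easy to illustrate in a few lines of Python. The blob_path helper below is hypothetical (it is not part of the DigestDB API); it simply reproduces the layout described above:

import hashlib

def blob_path(data: bytes, dir_depth: int = 3) -> str:
    # Hypothetical helper, not part of DigestDB: derive the on-disk
    # path for a blob from its SHA-256 hex digest.
    digest = hashlib.sha256(data).hexdigest()
    # Each directory level consumes one byte (two hex characters)
    # of the digest, e.g. depth 3 turns 8fdd8b... into 8f/dd/8b/.
    parts = [digest[2 * i:2 * i + 2] for i in range(dir_depth)]
    parts.append(digest)
    return '/'.join(parts)

print(blob_path(b'hello'))
# 2c/f2/4d/2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824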

Each directory level fans out into 256 subdirectories (x00, x01, ... xfe, xff), so the total number of directories is 256 raised to the power of the depth. With a directory depth of 1 we get 256 directories, with a depth of 2 we get 256 * 256 = 65,536, and with a depth of 3 we get 256 * 256 * 256 = 16,777,216 directories.

The chosen directory depth can significantly impact cleanup operations. Let’s assume a naive internal implementation that creates all directories up front. Without storing any data files at all, removing a depth=1 database takes about 0.03 seconds. With depth=2 it takes about 10 seconds to remove the 65 thousand directories, and with depth=3 it takes a very long time (2441 secs) to remove the 16 million directories.

For this reason, directories are created only when required. This significantly reduces the time it takes to remove transient databases, such as those used in unit tests.

The number of directories needed to balance the data depends on the total number of data items that are expected to be stored in the database. The default depth of 3 is suitable for storing very large numbers (billions) of data files.

As an example, let’s say we plan on having around 10 million files in the database. The following table shows the expected files in each directory for different directory depth settings.

depth  directories  files per dir
    0            1   10,000,000.0
    1          256       39,062.5
    2       65,536          152.5
    3   16,777,216            0.6

In this example a depth of 2 would be appropriate.
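The files-per-directory figures are simply the expected item count divided by the number of directories (256 ** depth), which is easy to verify:

# Expected files per directory for 10 million items at each depth.
expected_items = 10_000_000
for depth in range(4):
    print(depth, 256 ** depth, expected_items / 256 ** depth)

Having settled on a depth, pass it via the dir_depth argument when constructing the database (e.g. DigestDB('.', dir_depth=2)).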

The maximum number of entries in a database table whose primary key is a signed 32-bit integer is 2,147,483,647. So let’s bump the expected number of items up to 2 billion.

depth  directories    files per dir
    0            1  2,000,000,000.0
    1          256      7,812,500.0
    2       65,536         30,517.6
    3   16,777,216            119.2

In this example a depth of 3 seems more appropriate.

1.2.2. Categories

Categories provide a way to group associated data items in the database, which makes querying data by category more efficient.

The selection of what constitutes a category depends on the scenario. Below are some examples of how categories might be used to group different kinds of data:

  • when storing inter-process messages (e.g. for later analysis or replay) the categories might be the message kinds or identifiers.
  • when storing web requests the categories might be route paths.
  • when storing web server resources the categories might represent images, CSS, JavaScript, etc.

Categories must be added to the database before data items can be associated with the category.

db.put_category(
    label='js', description='JavaScript resources')

1.2.3. Blobs

Binary data can be stored, retrieved and queried.

To add a binary blob to the database use put_data:

digest = db.put_data('js', b'\x00\x01...')

To add the contents of a file to the database use put_file:

digest = db.put_file('js', '/path/to/js/file')

To check if data exists in the database use exists:

found = db.exists(digest)

To fetch data from the database use get_data:

data = db.get_data(digest)

To delete data from the database use delete_data:

db.delete_data(digest)

To query data from the database use query_data:

blobs = db.query_data(category='js')
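Putting it all together, a minimal end-to-end session might look like the following. Only the calls shown above are used; the category label and blob contents are illustrative:

from digestdb import DigestDB

db = DigestDB('.')
db.open()

# Categories must exist before data can be stored against them.
db.put_category(label='js', description='JavaScript resources')

# Store a blob and keep its digest for later retrieval.
digest = db.put_data('js', b'console.log("hello");')

if db.exists(digest):
    data = db.get_data(digest)

# Fetch all blobs stored under the 'js' category.
blobs = db.query_data(category='js')

db.close()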