A DNA-based data storage system: is it real and how does it work?

DNA-based data storage could offer a way out for a humanity that generates ever-increasing amounts of information. Compared to all other media, DNA offers phenomenal data density. Another advantage is that, when stored under the right conditions, DNA requires no energy to preserve data for centuries. After a few centuries the data can still be read without problems, provided, of course, that the appropriate technology is available.

However, DNA technology has its downsides. For example, there is no standard method of encoding information in a DNA strand. Synthesizing artificial molecules remains quite expensive, and reading the data can take days or weeks. Repeatedly accessing the DNA strands damages the structure of the molecules, so errors eventually creep in. A group of researchers has now proposed a new method that should help solve some of these problems. The new data storage system (so far only for images) is a cross between a regular file system and a metadata-based database.

More about the problems

Current DNA-based storage systems involve adding specific sequence tags to the stretches of DNA that contain data. To get the data you want, you simply add bits of DNA that can base-pair with the right tags and use them to amplify the full sequence. Think of it like tagging every image in a collection with an ID, then setting things up so that only one specific ID gets amplified.
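As a rough illustration of this kind of tag-based retrieval, the Python sketch below models strands as plain strings and "amplifies" only those whose tag base-pairs with a given primer. The function names, the 4-base tags, and the payloads are all hypothetical, and real PCR chemistry is far more involved than a string match.

```python
# Toy model of tag-addressed retrieval; this is not real PCR chemistry.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the sequence that would base-pair with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def select_by_tag(library: list[str], primer: str) -> list[str]:
    """Return the payloads of strands whose leading tag pairs with the primer."""
    tag = reverse_complement(primer)
    return [strand[len(tag):] for strand in library if strand.startswith(tag)]


# Hypothetical library: a 4-base tag followed by the data-carrying payload.
library = ["AACG" + "GGGTTT", "CCTA" + "ACACAC"]
primer = reverse_complement("AACG")  # primer designed to pair with the tag AACG
selected = select_by_tag(library, primer)
```

Only the strand tagged with AACG is selected here; the other strand is ignored because the primer does not pair with its tag.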

This method is quite effective, but it has two limitations. First, the amplification step, performed using the polymerase chain reaction (PCR), limits the size of the sequence that can be amplified. At the same time, each tag takes up some of that already limited space, so adding detailed tags reduces the space available for data.

Another limitation is that the PCR step that amplifies specific data-containing DNA consumes part of the original DNA library. In other words, every time we read some data, we destroy some of it. Scientists compare this method of searching for information to burning a haystack to find a needle. If you do this often enough, you may end up losing the entire repository. Of course, there are ways to restore the lost parts, but this is not ideal, because restoring them increases the probability of errors in the DNA and therefore in the data.
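To get a feel for why repeated reads are a problem, here is a deliberately simple sketch. It assumes that every read consumes the same fixed fraction of whatever library is left; that assumption, the function name, and the 1% figure are illustrative only and do not come from the researchers.

```python
def remaining_fraction(reads: int, consumed_per_read: float) -> float:
    """Fraction of the original library left after a number of reads.

    Assumes each read consumes the same fraction of the remaining library.
    """
    if reads < 0 or not 0.0 <= consumed_per_read <= 1.0:
        raise ValueError("reads must be >= 0 and consumed_per_read in [0, 1]")
    return (1.0 - consumed_per_read) ** reads


# Hypothetical example: each read uses up 1% of the remaining library.
left_after_100_reads = remaining_fraction(100, 0.01)
```

Under these assumed numbers, well under half of the library survives a hundred reads, which is the "burning haystack" effect in miniature.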


The new method allows us to separate the tag information from the main data storage. In addition, the researchers created a system that allows us to access only the data we are interested in and leave the rest untouched. This way the DNA molecules remain intact.

New system

The technology is based on silicon-dioxide glass beads that store individual files. Attached to each capsule are DNA tags that describe what is in the file. Each capsule is about 6 micrometers in size. With this system, the researchers were able to extract individual images with 100% accuracy. The set of files they created is not very large: there are only 20 of them. However, given the capabilities of DNA, such a system could be scaled up to a sextillion files.

These 20 files were encoded into fragments of DNA about 3000 nucleotides long, which is about 100 bytes of data. A single silicon-dioxide glass shell can hold a file of up to a gigabyte. Once the file is encapsulated, single-stranded DNA tags are placed on the shell's surface. Multiple tags can be attached to a single shell to serve as keywords, for example "red", "cat", "animal".
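Since there is no standard way to encode data in DNA, the sketch below uses the simplest possible mapping, two bits per nucleotide, purely as an illustration. It is an assumption and not the researchers' actual scheme; the function names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(seq: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def max_payload_bytes(nucleotides: int) -> int:
    """Upper bound on raw bytes at two bits per nucleotide."""
    return nucleotides * 2 // 8
```

With this mapping, 3000 nucleotides could hold at most 750 raw bytes. The figure of about 100 bytes quoted above is well below that ceiling; the article does not say how the rest of the sequence is used.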

The shells labeled in this way are combined into a single data library. It is not as compact as a pure DNA repository, but it does not damage the data.

Finding files

A group of keyword tags is used to search for files. For example, to find an image of a cat, use the labels "orange", "cat", and "home". To search for a tiger, only "orange" and "cat" are used. The search speed of such a system is still very low: something like 1 kB per second.

Another trick is that each tag was linked to a differently colored fluorescent molecule, so that any glass shells carrying the right tags would start glowing in specific colors. We already have machines that use lasers to separate things based on the color they glow, so it is technically possible to select the necessary data.

This way, the rest of the library will remain intact, which means the data will be safe. You do not need to burn a haystack to find a single needle anymore. An additional advantage is the possibility of logical search with different criteria. For example, the query conditions can be complex: true for "cat", false for "home", true for "black", and so on.
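The logical search described above can be sketched in Python as a filter over tagged capsules. The `Capsule` class, the `search` function, and the sample library are hypothetical; in the real system the selection is done physically, by fluorescence, and not in software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capsule:
    """A glass shell holding one file, labeled with keyword tags."""
    name: str
    tags: frozenset[str]


def search(library, require=(), exclude=()):
    """Return capsules that carry every required tag and no excluded tag."""
    require, exclude = set(require), set(exclude)
    return [
        capsule for capsule in library
        if require <= capsule.tags and exclude.isdisjoint(capsule.tags)
    ]


# Hypothetical library of three tagged image files.
library = [
    Capsule("house_cat.png", frozenset({"black", "cat", "home"})),
    Capsule("panther.png", frozenset({"black", "cat"})),
    Capsule("tiger.png", frozenset({"orange", "cat"})),
]

# Query: true for "cat", true for "black", false for "home".
hits = search(library, require={"cat", "black"}, exclude={"home"})
```

The query leaves the library list itself unchanged, mirroring the point that capsules which do not match are never touched.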

More than just a search

Still, finding the necessary data is only part of the task, and not even half of it. The retrieved data still needs to be sequenced. To do this, you need to open the silica glass shell, remove the DNA strand stored in the capsule, insert the DNA into a bacterium, and then read the data. This is an extremely slow process; by comparison, even tape drives are a very fast technology.

On the other hand, while DNA-based systems will not be fast, their main purpose is to store huge amounts of information that does not need to be extracted frequently. In addition, the technology will be improved over time, so the speed of reading information will hopefully increase.