If you are new to databases, there are probably many words you don’t fully understand: SQL, distribution, vertical scaling, consistency… Even a quick read about the characteristics of a product can get overwhelming. We feel you! We’ve been there, and we’ve got you covered. If you are a non-technical person or are just starting out, this blog post will introduce you to some useful terminology for database beginners.
For example, let’s take a look at this paragraph describing the main purpose of CrateDB, our database:
We already found some interesting concepts. Let’s look at them:
Machine data is all the digital information generated without direct human intervention, such as the data coming from computers, smartphones, sensors, or any other smart system. Machines can generate a lot of valuable information by constantly monitoring different metrics and processes, providing great insights into how everything is working. To paint this big picture, it is often necessary to collect and process huge amounts of data: data in different formats, coming in from multiple places, in real time. All of it needs to be read, interpreted, and stored. That’s the challenge.
The notion of IoT is directly related to the challenges of machine data.
We are increasingly surrounded by objects able to collect and exchange data, from phones to fridges to cars; this network of objects, connected to the internet and to each other, is known as the Internet of Things (IoT). Entities that adopt IoT principles are referred to as smart, as in smart cities, smart transportation, or smart factories.
When the idea of IoT is brought to the industrial sector, we get the term Industrial IoT (Industrial Internet of Things, or IIoT). Industrial IoT refers to the use of digital sensors and actuators to monitor and control the different activities taking place in industry, with a special focus on manufacturing.
A huge amount of data is produced by these sensors. If this data is processed in real time, the visibility of the operations can increase dramatically, making it possible to automate processes and control them remotely where that was not possible before. The information extracted from the data also allows industries to identify weak points and to quickly detect failures, addressing them immediately.
Data dashboards are graphical user interfaces for real-time monitoring, used to transform data into intuitive, real-time information. They display the metrics that matter for each specific process or application, using charts, tables, or any other visuals that help the user take real-time action. For successful dashboarding of complex industrial operations, the huge amounts of incoming data need to be processed very quickly.
What about machine learning?
Apart from allowing us to monitor the state of our machines in real time, the magical thing about big data analysis is that it lets us program systems in a way that allows them to learn from past experience. By storing and analyzing historical data, identifying issues, and extracting patterns, machines can recognize the conditions that led to a failure in a way that is inaccessible to humans. The result is that failures can be prevented before they even occur, with the machines getting smarter and smarter the more valuable data they ingest.
That was easy, right?
Let’s dive a little deeper:
Short but sweet! Now we’re talking.
In order to fully understand the previous lines, the following concepts are key:
A relational database organizes data into tables (relations). A table/relation represents an entity type, and each row of the table represents an instance of that entity.
SQL is the language of relational databases, particularly efficient for handling relational data (i.e., data that can be organized into tables). It is a widely used language, easy to learn and to use.
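As a minimal sketch (the table and its columns are made up for illustration), this is what defining and querying a table looks like in SQL:

```sql
-- A hypothetical "users" table: the entity type is "user",
-- and each row is one instance of it.
CREATE TABLE users (
    id      INTEGER PRIMARY KEY,
    name    TEXT,
    country TEXT
);

INSERT INTO users (id, name, country) VALUES (1, 'Ada', 'UK');
INSERT INTO users (id, name, country) VALUES (2, 'Linus', 'FI');

-- Reads almost like English: the names of all users from the UK.
SELECT name FROM users WHERE country = 'UK';
```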
In contrast, the so-called NoSQL databases work with unstructured data, or with data that cannot be organized into the tabular structure SQL expects.
Traditional relational databases can be a bit too restrictive in terms of the data types and structures they support, and they can be difficult to scale horizontally (more about horizontal scalability later). Since many modern applications demand more flexibility, NoSQL databases were born.
NoSQL databases have some disadvantages too. For example, they rarely match the power of SQL for complex queries, and they add a significant degree of complexity: SQL is not around anymore, and each database has its own particular language that has to be learned.
A best-of-both-worlds solution would maintain the ease and power of SQL while adopting the scalability and dynamism of NoSQL.
This is where NewSQL comes into play. NewSQL databases are a new generation of databases intended to combine the capabilities of both SQL and NoSQL: for example, a database that supports non-tabular data and can be scaled horizontally, while remaining fully accessible through the SQL language.
A distributed database is able to distribute its tasks across multiple “sub-databases”, or partitions, called nodes. A group of nodes is called a cluster.
You’ll understand more about distribution as we get into the next concepts.
There comes a time when your data volume grows and you need to increase the capacity of the database.
The first possible strategy is to grow the host server, increasing its capacity with more CPUs, RAM, or disk storage. The database can also be moved to a more powerful server altogether. This is called vertical scaling.
In contrast, if we are talking about a distributed database that operates across multiple nodes, its capacity can be grown by adding more nodes to the cluster. This is called horizontal scaling, and it means the database can grow more easily and almost indefinitely.
Not all distributed databases are the same. When building a distributed database, there are multiple ways to design how it operates through different nodes.
One option is to use a shared-nothing architecture, meaning that the workload is partitioned and processed across the different nodes without them sharing any memory or storage. Each node does a part of the work. This eliminates potential conflicts among nodes, making scalability very simple.
Another term for this is sharding: the data from incoming transactions is split across multiple shards running at the same time.
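As a sketch of what this looks like in practice, CrateDB, for instance, lets you declare at creation time how many shards a table’s rows should be split across (the table below is made up for illustration):

```sql
-- Hypothetical table of sensor readings, distributed across
-- 6 shards; rows are routed to a shard by their sensor_id.
CREATE TABLE sensor_readings (
    sensor_id TEXT,
    ts        TIMESTAMP WITH TIME ZONE,
    value     DOUBLE PRECISION
) CLUSTERED BY (sensor_id) INTO 6 SHARDS;
```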
How do you decide how to partition the data so that the job gets done in the most effective way? That’s the key to the kingdom, and every database has its own strategy.
Let's keep it going!
The acronym ACID (for Atomicity, Consistency, Isolation, and Durability) refers to the principles classically followed when defining the architecture of most databases. Atomicity means that a transaction either completes in full or does not happen at all. Consistency means that every transaction leaves the database in a valid state. Isolation means that concurrent transactions do not interfere with each other, as if they ran one after another. Durability means that once a transaction is committed, it survives even a system failure.
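For example, here is a classic bank-transfer transaction, sketched in generic SQL (the accounts table is made up, and not every database supports multi-statement transactions like this):

```sql
-- Atomicity: both updates succeed together or not at all.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Durability: once committed, the transfer survives a crash.
```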
If a request is made in a system with a blocking architecture, the system won’t provide a response until the answer is available. In other words, we’ll wait.
A non-blocking architecture avoids that delay by having the system provide an immediate response in any case, even if the actual answer cannot be provided yet (for example, by sending an error message). This means the request may need to be made again once the answer is available, but it ensures that possible bottlenecks are avoided.
Let’s assume that a request to write a specific value enters one of the nodes of the cluster.
To follow the consistency principle, the new value needs to be propagated to all the other nodes too, ensuring that there is no conflicting information between them and that everything is kept up to date everywhere.
Two different strategies can be followed to get there. A first option is to not allow any further write or read requests until all the nodes of the database are updated with the new information. This approach ensures consistency at all times, as interactions with the nodes only occur when everything is up to date: the values being read are guaranteed to be the latest ones.
However, that may imply delays in the response. To operate faster, there is a second approach: to allow the database to process new requests while the information is being updated among nodes. With this approach, consistency will be reached eventually: that’s why it’s called eventual consistency.
Eventual consistency relaxes ACID to prioritize the availability and speed of the system, accepting that old values will sometimes be read. In certain applications, this is not a big deal: the gains in performance might matter more than always reading the latest value, especially in applications dealing with a lot of stable or slow-changing data.
What if a database user is accessing some data that other users want to access at the same time? What if those multiple users perform different updates to the data?
A first option to avoid conflicts is to lock the data (i.e., to isolate it from other requests) while it is being read by the first user, blocking access for the others until the lock is released. This is called pessimistic concurrency; it is a safe option with respect to avoiding conflicts, but it means the data is locked even if it is only being read, tying up more resources.
A second option, called optimistic concurrency, assumes that multiple transactions can happen without necessarily causing any conflict; the data is not locked while it’s being read, but only when it’s being updated.
When an update happens, before committing the new value, the database must determine whether another user has changed the data since it was read. If so, it is considered a concurrency violation that the system must resolve, using a strategy defined in its architecture. This approach prioritizes performance (more resources are available) while accepting that some conflicting transactions will happen and will need to be rolled back and retried.
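A common way to implement optimistic concurrency is a version column, sketched below with a hypothetical thermostat table:

```sql
-- Read the current value together with its version number.
SELECT temperature, version FROM thermostat WHERE id = 1;
-- Suppose this returns temperature = 20, version = 7.

-- Write back only if nobody changed the row in the meantime.
UPDATE thermostat
SET temperature = 22, version = version + 1
WHERE id = 1 AND version = 7;
-- If another user updated the row first, its version is no longer 7:
-- zero rows match, nothing is written, and we must re-read and retry.
```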
Lastly, some quick things:
Aggregations are operations that take a group of values and return a single summary value.
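For example, reusing the hypothetical users table from above, COUNT collapses a group of rows into one number per country:

```sql
-- One summary value per group: how many users live in each country.
SELECT country, COUNT(*) AS user_count
FROM users
GROUP BY country;
```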
Joins are operations that form new result tables by combining data from multiple other tables.
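A sketch, assuming a hypothetical orders table whose user_id column points at users.id:

```sql
-- Combine two tables: each order next to the name of the user
-- who placed it.
SELECT users.name, orders.total
FROM orders
JOIN users ON users.id = orders.user_id;
```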
A subquery is a query nested within another query. The output of the subquery is used as a condition for the main query, to further restrict its response.
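For instance, with the same hypothetical orders table:

```sql
-- The inner query computes a single value (the average order total);
-- the outer query uses it as a condition to keep only the orders
-- above that average.
SELECT id, total
FROM orders
WHERE total > (SELECT AVG(total) FROM orders);
```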
Ad-hoc queries are operations outside the routine of pre-defined queries. They ask for transient information, interesting only at this precise moment and under these circumstances.
In-memory operations are those that rely on the main memory (RAM). Some databases serve queries from main memory, while traditional systems need to access the disk; this allows in-memory databases to respond quickly, as main memory is much faster.
To ensure data recovery, operations that change or update data must also be replicated to disk, in the most efficient way possible so that performance is not affected.
In an IoT network, the edge refers to the devices that collect the data.
On-premises means using your own private hardware servers.
An alternative to using private hardware is the cloud, i.e., outsourcing the hardware by subscribing to a service (usually accessible through the internet) that provides it for you.
The cloud enables you to access the resources you need without worrying about the underlying hardware, usually paying more as your capacity requirements increase. This allows users and companies to decentralize their operations and to scale easily.
Now you are ready to take a look at our website and make sense of what you read.
And if you are still hungry for database knowledge… Check out our resource library!