Data gravity and its impact on data storage infrastructure
Article by Seagate country manager for A/NZ Jeff Park.
Data gravity affects the entire IT infrastructure; it should be a major consideration when planning data management strategies. It’s important to ensure that no single data set exerts an uncontrollable force on the rest of the IT and application ecosystem.
Data is now an essential asset to businesses in every vertical, just as physical capital and intellectual property are. With ever-increasing quantities of both structured and unstructured data, data growth will continue at unprecedented rates in the coming years.
Meanwhile, data sprawl — the increasing degree to which business data no longer resides in one location but is scattered across data centers and geographies — adds complexity to the challenges of managing data’s growth, movement, and activation.
Enterprises must implement a strategy to efficiently manage mass data across cloud, edge, and endpoint environments. And it’s more critical than ever to develop a calculated plan when designing data storage infrastructure at scale.
As enterprises aim to overcome the cost and complexity of storing, moving, and activating data at scale, they should seek better economics, less friction, and a simpler experience: in short, a new approach to data.
The concept of data gravity is a vital element to consider in these efforts.
According to the new Seagate-sponsored report from IDC, as storage associated with massive data sets continues to grow, so will its gravitational force on other elements within the IT universe.
Data gravity is a consequence of data’s volume and level of activation. Basic physics provides a suitable analogy: a body with greater mass has a greater gravitational effect on the bodies surrounding it. “Workloads with the largest volumes of stored data exhibit the largest mass within their ‘universe,’ attracting applications, services, and other infrastructure resources into their orbit,” according to the IDC report.
A large and active dataset will necessarily affect the location and treatment of the smaller datasets that need to interact with it. So, data gravity reflects data lifecycle dynamics and must help inform IT architecture decisions.
Consider two datasets: one is 1 petabyte, and the other is 1 gigabyte. To integrate the two sets, it is more efficient to move the smaller dataset to the location of the larger dataset. As a result, the storage system with the 1-petabyte set now stores the 1-gigabyte set as well. Because large datasets will ‘attract’ other smaller datasets, large databases tend to accrete data, further increasing their overall data gravity.
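To put rough numbers on this comparison, the sketch below estimates how long each dataset would take to move over a network link. The 10 Gbit/s link speed and the assumption of full, sustained utilisation are illustrative assumptions, not figures from the IDC report, and decimal units are used (1 PB = 10^15 bytes).

```python
# Illustrative arithmetic only: the link speed and ideal utilisation are assumptions.
GIGABYTE = 10**9    # bytes, decimal units
PETABYTE = 10**15   # bytes, decimal units


def transfer_seconds(size_bytes: int, link_gbit_per_s: float) -> float:
    """Ideal transfer time in seconds, ignoring protocol overhead and contention."""
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8  # bits per second -> bytes per second
    return size_bytes / link_bytes_per_s


LINK_GBIT_PER_S = 10.0  # hypothetical dedicated 10 Gbit/s link

small = transfer_seconds(GIGABYTE, LINK_GBIT_PER_S)
large = transfer_seconds(PETABYTE, LINK_GBIT_PER_S)

print(f"1 GB: {small:.1f} seconds")
print(f"1 PB: {large / 86400:.1f} days")
```

Under these assumptions the gigabyte moves in under a second, while the petabyte would occupy the link for more than nine days, which is why the smaller set is the one that travels.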
Managing, analysing and activating data also relies on applications and services, whether those are provided by a private or public cloud vendor or an on-prem data management team. Applications collect and generate data, and much of the processing has to happen where that data lives. Naturally, the more massive a data set grows, the harder it is to use that data unless it is close to the applications, so applications are often moved close to the data sets. From on-premises data centers to public clouds and edge computing, data gravity is a property that spans the entire IT infrastructure.
But according to the IDC report, such massive data sets can become like black holes, “trapping stored data, applications, and services in a single location, unless IT environments are architected to allow the migration and management of stored data, along with the applications and services that rely on it, regardless of operational location.”
Because data gravity can affect an entire IT infrastructure, it should be a major design consideration when planning data management strategies. An important goal in designing a data ecosystem, according to IDC, is to “ensure that no single data set exerts uncontrollable force on the rest of the IT and application ecosystem.”
Ensuring applications have access to data, regardless of location
IT architecture strategy should put mass storage and data movement at its centre. This begins with optimising data location. A data-centred architecture brings applications, services and user interaction closer to the location where data resides, rather than relying on time-consuming and often costly long-distance transfers of mass data to and from centralised service providers.
IDC notes that “one way to mitigate the impact of data gravity is to ensure that stored data is colocated adjacent to applications regardless of location.”
This model can be accomplished by leveraging colocated data centers that bring together multiple private and public cloud service providers.
The fundamental goal of a data-centred architecture is data accessibility. Accessibility can impact future business innovation, improve the ability to generate metadata and new datasets, enable search and discovery, and further empower data scientists to deploy data for machine learning and AI.
Putting data at the centre of IT architecture can also improve application performance. Better reliability and durability of the data are significant benefits as well: reliability is the ability to access data when needed, and durability is the ability to preserve data over extended periods.
Put data at the centre of IT strategy
Taken together, these factors have significant implications for enterprise data management planning, from defining an overall IT strategy to formulating a business initiative. Planning out the necessary workloads and jobs means accounting for data gravity.
Key questions to ask include:
- What is the volume of data being generated or consumed?
- How is data distributed across the data center, private clouds, public clouds, edge devices, and remote and branch offices?
- What is the velocity of the data being transmitted across the entire IT ecosystem?
Addressing these considerations will increase the efficiency of the data infrastructure and can reduce costly data pipeline issues down the line.
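One way to start answering the three questions above is a simple inventory that records, per dataset, where it lives, how much is stored, and how much moves each day. The sketch below is only an illustration: the dataset names, locations, and figures are all made up, and a real inventory would come from the organisation's own monitoring tools.

```python
from collections import defaultdict

# Hypothetical inventory rows: (dataset, location, stored_tb, moved_tb_per_day).
# All names and figures are invented for illustration.
inventory = [
    ("video-archive", "core-datacentre", 900.0, 1.0),
    ("web-analytics", "public-cloud", 80.0, 6.0),
    ("factory-telemetry", "edge", 20.0, 3.0),
]

stored_by_location = defaultdict(float)
moved_by_location = defaultdict(float)
for _name, location, stored_tb, moved_tb_per_day in inventory:
    stored_by_location[location] += stored_tb
    moved_by_location[location] += moved_tb_per_day

total_stored = sum(stored_by_location.values())  # volume

# Distribution (share of stored data) and velocity (TB moved per day) per location.
for location, stored in stored_by_location.items():
    share = stored / total_stored
    print(f"{location}: {stored:.0f} TB stored ({share:.0%}), "
          f"{moved_by_location[location]:.1f} TB/day moved")
```

A location that holds most of the stored volume is a likely centre of gravity, while a small but fast-moving location may deserve attention for its transfer costs.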
IDC advises in its report, “Don’t let a single workload or operational location dictate the movement of storage or data resources.” Because data has gravity, data infrastructure must be designed to prevent large individual workloads from exerting a dominant gravitational pull on storage resources.
This means always maintaining awareness of which datasets are being pulled where, the most efficient path for moving the data, and what helps those workloads run best. This can also mean automating the movement of data to reduce storage costs, or moving lower-performing datasets that are not immediately needed.
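Automated data movement of this kind is often expressed as a placement policy. The following is a minimal sketch, assuming a hypothetical two-tier setup ("primary" and "archive") and an arbitrary 90-day inactivity threshold; neither comes from the IDC report, and the dataset names are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Dataset:
    name: str               # hypothetical dataset label
    size_tb: float          # stored volume in terabytes
    days_since_access: int  # days since the dataset was last read


def choose_tier(ds: Dataset, cold_after_days: int = 90) -> str:
    """Return 'archive' for data not read recently, otherwise 'primary'.

    The 90-day default is an arbitrary illustrative threshold.
    """
    return "archive" if ds.days_since_access >= cold_after_days else "primary"


def plan_moves(datasets: list[Dataset], current_tier: dict[str, str]) -> list[tuple[str, str, str]]:
    """List (name, from_tier, to_tier) for datasets whose tier should change."""
    moves = []
    for ds in datasets:
        target = choose_tier(ds)
        if current_tier[ds.name] != target:
            moves.append((ds.name, current_tier[ds.name], target))
    return moves


inventory = [
    Dataset("sensor-logs-2019", size_tb=400.0, days_since_access=365),
    Dataset("training-set", size_tb=120.0, days_since_access=2),
]
placement = {"sensor-logs-2019": "primary", "training-set": "primary"}

moves = plan_moves(inventory, placement)
moved_names = {name for name, _src, _dst in moves}
freed_tb = sum(ds.size_tb for ds in inventory if ds.name in moved_names)

print(moves)
print(f"Primary capacity freed: {freed_tb:.0f} TB")
```

Here only the dataset that has gone unread for a year is scheduled to leave primary storage. A production policy would also weigh retrieval cost and the location of the applications that use each dataset.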
Putting these ideas into action means deploying data architecture, infrastructure and management processes that are adaptive. While an organisation may have a good idea of its data gravity considerations today, those considerations may not be the same five years from now.
“Not every enterprise manages multiple massive data sets, but many already do,” IDC notes in the report. “And, given the pace of digitisation of business and the importance placed on the value of enterprise data and data gathering, many organisations will find themselves managing massive data sets in the near future.”
Every data management system should change to accommodate new data requirements. Data management and the data architecture to support it must be agile and adapt to shifting business needs.