I’ve been talking about two-tier storage infrastructures for a while now. End users are targeting this kind of approach to cope with capacity growth and performance needs. The basic idea is to leverage flash memory characteristics on one side (all-flash, hybrid, hyperconverged systems) and, on the other, to implement huge storage repositories where everything else (including pure Trash) can be stored safely at the lowest possible cost. The latter is lately also referred to as a data lake.

We are finally getting there, but there is something more to consider: the characteristics of these storage systems. In both cases we are going beyond classic/traditional storage paradigms. In fact, some of these systems are starting to understand how they are actually used and what is stored in them. With the help of analytics, they are building a new set of functionalities that can make a huge difference in how they are used and implemented, improving both TCO and the business.

Smarter primary storage

When it comes to primary storage, analytics is primarily used to improve TCO and to make life simpler for sysadmins. The array continuously collects data from tons of sensors; the data is then sent to the cloud, aggregated and organized with the goal of giving you information and insights about what is happening to your storage, compared with similar installations. Thanks to predictive analytics, these tools can open support tickets or send alarms before issues become evident. They can be very helpful in a wide range of situations, from troubleshooting to capacity planning.
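To make the idea concrete, here is a minimal sketch of how cloud-side predictive alerting on array telemetry could work. Everything here is hypothetical (data, window, threshold); real tools like InfoSight are obviously far more sophisticated.

```python
# Minimal sketch of predictive alerting on array telemetry.
# All values are invented; tools like InfoSight are far richer.
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Flag samples that drift beyond `threshold` standard deviations
    from the trailing window -- a crude early-warning signal."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            alerts.append((i, samples[i]))
    return alerts

# Example: read latency (ms) reported by an array every five minutes.
latency = [1.1, 1.2, 1.0, 1.1, 1.3] * 5 + [4.8]  # spike at the end
print(detect_anomalies(latency))                  # -> [(25, 4.8)]
```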

Sometimes the analytics tool crosses the storage boundary. A good example comes from Nimble Storage, whose InfoSight is now capable of analyzing data coming from the array, the network and the hypervisor. From a certain point of view, this is becoming the most interesting feature to look at when it is time to buy a new storage system and efficiency is at the top of the requirements list.

The Role of Cloud

Cloud has a fundamental role in primary storage analytics, with three major advantages. First, the storage system doesn’t waste resources on analytics and can devote all its power to IOPS, latency and predictability. Second, the cloud makes it possible to aggregate data coming from all over the world, enabling comparisons otherwise impossible to make. And, last but not least, the cloud helps to simplify the infrastructure because there is no need for a local console or analytics server.

There is, however, one notable exception. DataGravity, which offers enterprise storage for the midmarket, has a peculiar architecture capable of running analytics directly on the system. Unlike other primary storage systems, this array doesn’t focus on infrastructure management but primarily on stored-data analytics. The technology developed by the company allows end users to dig into their data and produce many different kinds of insights, with applications ranging from data discovery/recovery to auditing, policy compliance and security. It’s a totally different approach, one that is quite hard to find even in bigger systems, and it can have a great impact on both business and operations.
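As an illustration only (this is not DataGravity’s engine or API), a stored-data audit of this kind could look like a scan for sensitive patterns across a share:

```python
# Illustrative only: a compliance-style scan for sensitive data patterns.
# The patterns are deliberately crude; a real product goes much deeper.
import os
import re

PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # rough match
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def audit_tree(root):
    """Walk a file tree and report files containing sensitive patterns."""
    findings = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    text = fh.read(1_000_000)  # cap the per-file scan
            except OSError:
                continue
            for label, rx in PATTERNS.items():
                if rx.search(text):
                    findings.append((path, label))
    return findings

for path, label in audit_tree("/mnt/share"):  # example mount point
    print(f"{label}: {path}")
```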

We produce a lot of Trash

Trash includes a lot of things. It’s not just waste that you have to store forever in a landfill: in many cases it is recyclable and can bring value. The problem is having the right tool to do just that.

Scale-out storage systems are becoming much more common now, and the trend is clear: they are embedding a series of functionalities to manage, analyze and automate operations on large amounts of data without the need for external compute resources. Most of these systems have recently started to expose HDFS so they can be easily integrated with Hadoop for in-place data analytics. In other cases, such as HDS HSP, we can see the evolution of this model with the analytics part already embedded in the product, like a specialized hyperconverged platform; a solution I’m sure others will offer in the future as well.
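As a rough idea of what in-place analytics looks like once storage speaks HDFS, here is a classic Hadoop Streaming mapper written in Python. The access-log format, paths and job invocation are examples of the pattern, not any vendor’s specifics.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper that could run in place
# against a scale-out array exposing HDFS. Assumes an access log shaped
# like "<path> <bytes_read>" (an invented example format).
import sys
from pathlib import PurePosixPath

for line in sys.stdin:
    try:
        path, nbytes = line.split()
        ext = PurePosixPath(path).suffix or "(none)"
        print(f"{ext}\t{int(nbytes)}")  # emit "<extension>\t<bytes>"
    except ValueError:
        continue  # skip malformed lines

# Launched with something like (the reducer would sum values per key):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/access -output /reports/bytes_by_ext \
#     -mapper mapper.py -reducer reducer.py
```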
What I also find noteworthy is the ability shown by Coho Data to run code, triggered by events, directly on cluster nodes. This could become very helpful for data preparation and could lead, again, to systems capable of running specific full-fledged analytics tools.
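A hypothetical sketch of such an event trigger might look like this; the handler contract and registration call are invented for illustration, not Coho Data’s actual API.

```python
# Hypothetical sketch of event-triggered, in-cluster data preparation.
# The event structure and registration call are invented for illustration.
import json

def on_object_written(event):
    """Runs on the cluster node that received the write: normalize a raw
    JSON payload into one record per line, ready for analytics."""
    records = json.loads(event["data"].decode("utf-8"))
    prepared = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    event["store"].put(event["key"] + ".prepared", prepared.encode("utf-8"))

# A platform of this kind would let you bind the handler to an event, e.g.:
#   cluster.register_trigger(bucket="ingest", event="write",
#                            handler=on_object_written)  # hypothetical call
```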

Different solutions

In any case, solutions capable of analyzing the data and/or metadata of huge file repositories are already available (Qumulo is one example), and some of them are now starting to demonstrate extensive search capabilities too.
At the same time, others are taking full advantage of the high resiliency of these distributed systems to implement backup features and copy management solutions to support primary storage (Cohesity is very promising here).
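To give a flavor of metadata analytics, here is a toy version using a plain tree walk; a system like Qumulo maintains this kind of aggregate natively, in real time and at scale. The mount point is an example.

```python
# Toy version of repository metadata analytics; a real scale-out system
# keeps these aggregates natively instead of re-walking the tree.
import os
import time
from collections import Counter

def capacity_report(root, cold_days=365):
    """Aggregate bytes per file extension and flag 'cold' capacity
    (files not accessed for more than `cold_days` days)."""
    by_ext, cold_bytes = Counter(), 0
    cutoff = time.time() - cold_days * 86400
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue
            by_ext[os.path.splitext(name)[1] or "(none)"] += st.st_size
            if st.st_atime < cutoff:
                cold_bytes += st.st_size
    return by_ext, cold_bytes

sizes, cold = capacity_report("/mnt/repository")  # example mount point
print(sizes.most_common(5), f"cold: {cold} bytes")
```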

Swiss Army knives without clouds

To allow interaction between these systems, APIs and simple query languages will become far more common than we might expect, enabling the development of powerful data-driven vertical applications.
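A small, entirely hypothetical example of what such an API-driven vertical application could look like; the endpoint and query schema are invented.

```python
# Entirely hypothetical: a small data-driven vertical application built on
# a storage system's query API. The endpoint and query schema are invented.
import requests

STORAGE_API = "https://storage.example.com/api/v1"

def find_stale_projects(days=180):
    """Ask the storage system itself (no separate index needed) which
    paths have not been read in `days` days."""
    resp = requests.get(
        f"{STORAGE_API}/query",
        params={
            "select": "path,size,last_access",
            "where": f"last_access < now-{days}d",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]

for row in find_stale_projects():
    print(row["path"], row["size"])  # candidates for archiving or reuse
```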

At the same time, because of the very nature of the operations performed by these systems, and the sheer size of the storage infrastructure, the analytics component is always implemented on-premises. In fact, we are talking about large clusters that can use part of their own resources to run specific analytics tasks.

Closing the circle

Storage is changing very quickly, and traditional unified storage systems are no longer the answer to every question (which is also why companies like NetApp are no longer growing).

On one side, we are seeing increasing demand for performance, very predictable behavior and specific analytics features that help attain maximum efficiency and simplify the job of IT operations. On the other side of the fence, we need the cheapest and most durable options to store as much as we need (and can), with the potential to reuse the data or make it quickly available when needed.

Analytics is rapidly becoming the common denominator for building smarter, data-driven infrastructures! The scope differs between primary and secondary storage, but the basic concepts are similar, and they are all designed to squeeze the most out of the resources we manage (performance and capacity).

It’s not at all surprising, then, that companies with the right technology and potential, like Nutanix, will soon be targeting scale-out storage with specialized products!