I’ve been talking about two-tier storage infrastructures for a while now. End users are adopting this kind of approach to cope with capacity growth and performance needs. The basic idea is to leverage Flash memory characteristics on one side (all-flash, hybrid, hyperconverged systems) and, on the other, to implement huge storage repositories where all the rest (including pure Trash) can be stored safely at the lowest possible cost. The latter is lately also referred to as a data lake.
We are finally getting there, but there is something more to consider, and it’s about the characteristics of these storage systems. In both cases we are moving beyond classic/typical storage paradigms. In fact, some of these systems are starting to understand how they are actually used and what is stored in them. With the help of analytics they are now building a new set of functionalities which can make a huge difference in how they are used and implemented, improving both TCO and the business.
Smarter primary storage
When it comes to primary storage, analytics is primarily used to improve TCO and make life simpler for sysadmins. The array continuously collects tons of sensor data, which is then sent to the cloud, aggregated and organized with the goal of giving you information and insights about what is happening to your storage while comparing it with similar installations. Thanks to predictive analytics, these tools can open support tickets or send alarms before issues become evident. They can be very helpful in a wide range of situations, from troubleshooting to capacity planning.
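As a rough illustration of the kind of predictive logic such a cloud back-end might apply, here is a minimal Python sketch that flags telemetry samples deviating from a recent baseline. The metric, window size and threshold are hypothetical and not tied to any vendor’s actual implementation.

```python
# Minimal sketch: flag telemetry samples that deviate from the recent baseline.
# The metric, window and threshold below are illustrative assumptions only.
from statistics import mean, stdev

def detect_anomalies(samples, window=30, sigmas=3.0):
    """Return (index, value) pairs that fall outside the rolling baseline."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sd = mean(baseline), stdev(baseline)
        if sd > 0 and abs(samples[i] - mu) > sigmas * sd:
            alerts.append((i, samples[i]))
    return alerts

# Example: read latency (ms) collected by the array and shipped to the back-end
latency_ms = [1.2, 1.3, 1.1, 1.2, 1.4] * 10 + [9.8]
print(detect_anomalies(latency_ms))  # -> [(50, 9.8)]
```

In a real product this kind of check would run against aggregated fleet data rather than a single series, which is exactly where the cloud back-end adds value.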
Sometimes, the analytics tool crosses the storage boundary. A good example comes from Nimble Storage, whose InfoSight is now capable of analyzing data coming from the array, the network and the hypervisor. From a certain point of view, this is becoming the most interesting feature to look at when it is time to buy a new storage system and efficiency is at the top of the requirement list.
The role of the cloud
Cloud has a fundamental role in primary storage analytics, with three major advantages. The first is that the storage system doesn’t need to waste resources on this task and can concentrate all its power on IOPS, latency and predictability. Secondly, the cloud makes it possible to aggregate data coming from installations all over the world, enabling comparisons that would otherwise be impossible. And, last but not least, the cloud helps to simplify the infrastructure because there is no need for a local console or analytics server.
There is, however, one notable exception. DataGravity, which offers enterprise storage for the mid market, has a peculiar architecture capable of running analytics directly on the system. Contrary to other primary storage systems, this array doesn’t focus on infrastructure management but primarily on stored-data analytics. The technology developed by this company allows end users to dig into their data and produce many different kinds of insights, with applications ranging from data discovery/recovery to auditing, policy compliance and security. It’s a totally different approach, quite difficult to find even in bigger systems, and it can have a great impact on both business and operations.
We produce a lot of Trash
Trash includes a lot of things. It’s not just waste that you have to stock forever in a landfill: in many cases it is recyclable and can bring value. The problem is having the right tools to do just that.
Scale-out storage systems are becoming much more common now, and the trend is clear: they are embedding a series of functionalities to manage, analyze and run automated operations on large amounts of data without the need for external compute resources. Most of these systems have recently started to expose HDFS so they can be easily integrated with Hadoop for in-place data analytics. In other cases, such as HDS HSP, we can see the evolution of this model with the analytics part already embedded in the product, like a specialized hyperconverged platform; a solution I’m sure will also be available from others in the future.
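To make “in-place analytics” a bit more concrete, here is a hedged sketch of a job reading data straight from such a system’s HDFS interface, using Spark as the analytics engine. The hostname, port and path are placeholders, not any vendor’s defaults.

```python
# Sketch: querying data "in place" on a scale-out array that exposes an
# HDFS-compatible endpoint. The endpoint and path are invented placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

# Read the logs where they already live, instead of copying them
# into a separate Hadoop cluster first.
logs = spark.read.json("hdfs://scaleout-array.example.com:8020/logs/*.json")
logs.groupBy("status").count().show()

spark.stop()
```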
What I also find noteworthy is the ability shown by Coho Data to run code, triggered by events, directly on cluster nodes. A solution that could become very helpful for data preparation and that could lead, again, to systems capable of running specific, full-fledged analytics tools.
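Coho Data’s actual interface isn’t shown here, but the general shape of event-triggered, node-local data preparation could look something like the following hypothetical handler; the storage module, register_handler() hook and event object are invented for illustration, not a real vendor API.

```python
# Hypothetical example of event-triggered data preparation on a storage node.
# The 'storage' module, register_handler() and the event object are NOT a real
# vendor API; they only illustrate the idea.
import gzip
import json

def on_object_written(event):
    """Turn a freshly written CSV object into compressed JSON lines."""
    rows = [line.split(",") for line in event.read().decode().splitlines() if line]
    header, records = rows[0], rows[1:]
    prepared = "\n".join(json.dumps(dict(zip(header, r))) for r in records)
    event.write_derived("prepared/" + event.key + ".json.gz",
                        gzip.compress(prepared.encode()))

# storage.register_handler("object-written", on_object_written)  # hypothetical hook
```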
Different solutions
In any case, solutions capable of analyzing data and/or metadata of huge file repositories are already available (Qumulo is one example) and some of these are now starting to demonstrate extensive search capabilities too.
At the same time, others are taking full advantage of the high-resiliency characteristics of these distributed systems to implement backup features and copy management solutions in support of primary storage (Cohesity is very promising here).
Swiss Army knives without clouds
To allow interaction with these systems, APIs and simple query languages will become much more common than we might think, enabling the development of powerful data-driven vertical applications.
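As a purely hypothetical illustration of what such an API could look like, a metadata query against a large repository might be as simple as the following; the endpoint, query syntax and field names are invented, not taken from any real product.

```python
# Invented example of a simple metadata-query API; the endpoint, query syntax
# and field names are placeholders, not a real product's interface.
import json
import urllib.request

query = {
    "filter": "size > 1GB AND last_access < now() - 365d",  # find large, cold files
    "fields": ["path", "owner", "size"],
}
req = urllib.request.Request(
    "https://datalake.example.com/api/v1/search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for item in json.load(resp)["results"]:
        print(item["path"], item["owner"], item["size"])
```

A data-driven vertical application could be built from little more than queries like this one, which is exactly why simple, well-documented APIs matter.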
At the same time, precisely because of the nature of the operations performed by these systems and the size of the storage infrastructure, the analytics component is always implemented on-premises. In fact, we are talking about large clusters that can use part of their resources to run specific analytics tasks.
Closing the circle
Storage is changing very quickly: traditional unified storage systems are no longer the answer to every question (and this is also why companies like NetApp are no longer growing).
We are seeing an increasing demand for performance, very predictable behavior and specific analytics features to help attain the maximum efficiency and simplify the job of IT operations. On the other side of the fence, we need the cheapest and most durable options to store as much as we need/can but with the potential to reuse the data or make it quickly available when needed.
Analytics is rapidly becoming the common denominator for building smarter, data-driven infrastructures! The scope differs between primary and secondary storage, but the basic concepts are similar, and they are all designed to get the most out of the resources we manage (performance and capacity).
It wouldn’t be at all surprising if companies like Nutanix, which have the right technology and potential, soon started targeting scale-out storage with specialized products!
Enrico,
Nice post as usual 🙂
I don’t fully agree with your definition of the Data Lake as “Trash”, i.e. slow and archival storage. As you know, the key definition of BigData is the need to support volume, velocity and variety simultaneously, and the Data Lake is a critical part of this story: it provides the persistence layer through which different applications, sensors and devices can share data. Without performance there is no way to address data ingestion, high-performance analytics, or the quest for real-time and interactive BigData.
In BigData the notion of a Flash-based front-end is quite limiting (not to mention the fact that block storage and AFAs don’t fit the BigData application model), since data needs to be shared and kept consistent across applications which don’t reside on the same physical node. This is why the new approaches to Data Lakes adopted by the hyper-scale cloud providers consist of an endless repository with multiple interfaces, built-in caching, tiering, security, and some embedded processing, as you alluded to.
A good example is the recent MS Azure Data Lake (http://azure.microsoft.com/en-us/campaigns/data-lake/), which emphasizes the 3 Vs with endless Volume and Velocity (low latency and high throughput) as key messages, and different APIs (HDFS, ingestion, ..) for Variety. The networks today (not FC) are much faster than the IO: Azure has publicly mentioned the benefits of 40GbE and RDMA, and Facebook has talked publicly about their internal non-blocking fabrics.
We storage guys tend to over-emphasize the lower storage layers (i.e. AFA/hybrid) as key contributors to application performance and responsiveness; unfortunately, much of the application performance problem is related to the higher layers and how well they deal with concurrency and consistency. Providing higher levels of abstraction over shared storage, much like Amazon did with Aurora, Oracle with Exadata, or Google with Colossus, has much more to do with end-to-end performance and scalability than having a million-IOPS FC AFA with a local file system and a traditional stack. Those may be a good fit for your VDI or legacy DC stack, not so much for unstructured data, BigData or Dockers.
Also, I wouldn’t be confused by people (like Pivotal) calling a Hadoop cluster a Data Lake. The Data Lake is the persistence layer over which Hadoop, along with many other BigData or client applications, will run, just as Azure defines it and as it is defined in this nice post: https://www.linkedin.com/pulse/bigdata-datalake-vs-datawarehouse-kumar-chinnakali (which also talks about Velocity as an attribute of the Data Lake).
So IMO, if we agree that we need Data Lakes with more application-centric APIs to shared storage to handle the variety in BigData, and that those will probably revolve around objects or files with potential storage-side processing and filtering to avoid unnecessary data movement, it also makes sense for that Data Lake to manage the life cycle of the data (Flash/Trash/..), protection, security, etc. seamlessly, without the poor application guy having to change his programming model just because he uses a different tier, write data movement code into his application, or create data inconsistency with different versions of the data stored in different co-located storage silos.
Yaron
You can see more on this topic on my blog, SDSBlog.com.
Hi Yaron,
Thank you for commenting.
I probably simplified the concept too much… But our POVs are very similar.
Can’t wait to see more about what you are developing.