I’ve just finished reading these articles from Chris Mellor on The Register and Val Bercovici on NetApp’s blog. NFS for Hadoop? Really? It simply makes no sense, at all!
Yes, you can find corner cases… you can always find a corner case for something you love, but in this case that’s about all you can find.
And I’m not talking about data ingestion here.
Storing (BIG) data on primary storage
One of the benefits of HDFS is that it is a distributed filesystem with all the embedded availability, replication and protection mechanisms you need to store huge amounts of data safely. Above all, it’s very inexpensive.
In fact, you can build your HDFS-based storage layer simply by adding disks to cluster nodes, and all the management tools are integrated. At the end of the day it’s just a file system that you get for free with any Hadoop distribution!
Despite all its defects, HDFS is optimized for that job: it’s “local” to the cluster, it is designed to move big data chunks, and it doesn’t need the special attention usually required by primary storage. The TCA and TCO of HDFS are very low.
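To make the cost point concrete, here is a back-of-envelope sketch. The replication factor of 3 is HDFS’s default; the per-terabyte prices are purely hypothetical placeholders, not real quotes:

```python
# Back-of-envelope acquisition-cost comparison: HDFS on commodity disks
# vs. a primary storage array. Replication factor 3 is HDFS's default;
# all per-TB prices below are hypothetical.

def raw_capacity_tb(usable_tb: float, replication: int = 3) -> float:
    """Raw disk capacity needed to store usable_tb with HDFS block replication."""
    return usable_tb * replication

def acquisition_cost(usable_tb: float, price_per_raw_tb: float,
                     replication: int = 3) -> float:
    """Total disk acquisition cost for a given usable capacity."""
    return raw_capacity_tb(usable_tb, replication) * price_per_raw_tb

usable = 500  # TB of actual data

# Hypothetical prices: commodity SATA in cluster nodes vs. enterprise array.
hdfs_cost = acquisition_cost(usable, price_per_raw_tb=30, replication=3)
array_cost = acquisition_cost(usable, price_per_raw_tb=400, replication=1)

print(f"HDFS (3x replication, commodity disks): ${hdfs_cost:,.0f}")
print(f"Primary array (replication not counted): ${array_cost:,.0f}")
```

Even paying for three full copies of every block, commodity disks come out far ahead in this (admittedly simplified) comparison, and that is before counting the array’s licensing and management overhead.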
Primary storage can easily be positioned on the opposite side:
– it’s anything but cheap,
– it definitely struggles to manage Big Data analytics and traditional enterprise workloads at the same time (especially if they need to leverage the same resources; QoS is still an afterthought for most storage vendors!),
– it also introduces huge management costs when it comes to backup and remote replication: costs that become unsustainable if your environment scales beyond a few hundred terabytes.
In his article, Val Bercovici talks about a hypothetical use case with HDFS in the role of a cache (or primary file system) and NetApp as a secondary repository. In that design HDFS comes out on top of what is usually sold as primary storage… so why would you use primary storage for a secondary storage task?
Don’t get me wrong: I totally agree with the caching-layer part, and I’ve been talking about it for months. But I think secondary storage has to be the slowest, most automated, most scalable and cheapest part of this kind of design. And this is where NetApp doesn’t really fit in… does it?
Analyzing (BIG) data in place
Analyzing data in place is something I really like indeed, but doing it on NFS and NetApp FAS is just too costly.
Many limits and constraints contribute to making NetApp less than ideal here, and I won’t even mention the higher cost of NetApp FAS compared to better-suited alternatives for this particular use case.
In fact, if you look at what is happening all around, enterprises are piling up data. Like it or not, they are starting to build data lakes. ONTAP file system (WAFL) and data volume limits, in terms of number of objects and capacity, are just the first examples that pop into my mind to explain why you should avoid NetApp in this scenario (if I remember correctly, the size limit of a single volume is still around 100TB). Yes, you can configure a NetApp system for high capacity (and with large volumes), but then performance will suck! (And you won’t have any of the advantages usually found in object-based systems.)
On the other hand, various object storage vendors are working on similar capabilities, proposing an HDFS interface on top of their platforms. Working with the same filesystem interface both inside and outside the cluster is much better at every level. And, going back to the first use case presented in Val’s blog, it also enables seamless use of the object storage system for secondary copies of data.
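Hadoop’s stock S3A connector already hints at how this works: point the FileSystem API at an S3-compatible object store and the cluster can reach secondary copies through the same interface it uses internally. A minimal core-site.xml sketch (the endpoint and credentials are placeholders, not a real deployment):

```xml
<!-- core-site.xml fragment: reach an S3-compatible object store
     through Hadoop's S3A connector. All values are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://objectstore.example.com</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ACCESS_KEY_HERE</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_KEY_HERE</value>
  </property>
  <property>
    <!-- most on-premises object stores need path-style URLs -->
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place, a secondary copy can be pushed with standard tooling, e.g. `hadoop distcp /datalake s3a://backup-bucket/datalake` (the bucket name is invented for the example).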
Scalability could be a (BIG) issue
NetApp FAS, like almost every other primary storage system on the market, doesn’t scale enough to cover Big Data needs. And it’s not just the complexity of their clunky scale-out solution: limitations abound, from the number of managed objects in a single domain (i.e. file system/volume) to the complexity of managing a storage infrastructure with these characteristics.
I’m not saying it’s not a good solution for what it is meant to be (a primary storage solution), I’m saying that it doesn’t have the necessary characteristics to become a huge data repository that can also be used to analyze data in place.
The (BIG) cost of data protection
Last, but not least, NetApp ONTAP doesn’t have the set of features and capabilities that are at the base of a huge scale-out, object-based storage infrastructure. Automation and policies (characteristics you usually find in object storage) are totally absent: retention management, multiple data copies and geo/local replication are just a few examples. Most of the work has to be done manually, which is OK in traditional environments, but it doesn’t work if you want to manage a data lake of several petabytes with limited resources (and money).
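For contrast, this is the kind of policy-driven automation object stores give you out of the box: an S3-style lifecycle rule (the prefix and day counts are invented for illustration) that tiers data down and eventually expires it without anyone touching it:

```json
{
  "Rules": [
    {
      "ID": "age-out-raw-datalake",
      "Filter": { "Prefix": "datalake/raw/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
```

Seven years of retention (2555 days) becomes one line of declarative policy instead of a recurring manual task: exactly what you need at data-lake scale.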
License costs and features designed for other kinds of applications are simply inadequate to cope with the management of an infrastructure of this size.
Closing the circle
I like having storage separate from the compute nodes for Big Data, but expecting NFS and a traditional array to do just that isn’t realistic… it won’t work.
The biggest problem with NetApp is ONTAP. They love it too much, and they are still trying to push it everywhere, even when it does not make sense at all. And it’s ridiculous.
Almost all vendors are diversifying their product line-ups to serve users as best they can. I like the concept of the Swiss Army knife, but you can’t use it to cut down a tree!
Another aspect of NetApp that drives me crazy is that they actually have an object storage system (StorageGRID), and instead of developing it seriously (like adding an HDFS interface), they are wasting resources on useless ONTAP features. In the meantime, if you look around, all the storage vendors are targeting Big Data analytics and IoT with object-based scale-out storage solutions (often deployed on commodity hardware).
FlashRay (their real all-flash array) is still nowhere to be found, and they don’t really have a strategy for Big Data… not to mention the cloud where, again, they are pushing ONTAP.
In a world that is growing massively in terms of IOPS/latency needs on one side and huge capacities on the other, there won’t be space for traditional unified storage systems.
NetApp is stuck in the middle, and they risk being crushed and becoming irrelevant.