I’m talking about Big Data, Openstack and Object Storage. In the last two days I’ve come across a couple of articles (here and here) discussing the adoption of Hadoop, Openstack and object storage. Both articles start from surveys that report scarce adoption of these technologies in the enterprise space… and the questions that arise time after time are: is it too soon? Or is real interest lacking in traditional enterprises? The short answer is yes (to both questions), but I also think some elaboration is necessary.
Do you have the problem?
These technologies were all conceived to solve problems at a big scale!
Object storage = storing huge amounts of unstructured data.
Big Data (Analytics) = analyzing huge amounts of data.
Openstack (and cloud management platforms in general) = managing huge pools of compute, networking and storage resources.
The common word here is HUGE. Otherwise, it’s like shooting sparrows with Bazookas!
Yes, you could be interested, and yes, you could have them in a lab to better understand what they do and how you could leverage them, but at the end of the day you’ll stick to your traditional infrastructure: your NAS for unstructured data, SQL databases for analytics, and VMware or Microsoft virtualization stacks with some fancy automation and provisioning tools. That’s all you need (today), and this is why most Openstack and Big Data analytics infrastructures are still PoCs in the labs.
It’s just too soon
In some cases data (and infrastructure) growth is quickly heading towards the HUGE territory mentioned above. It’s just a matter of time, and if you don’t want to outgrow your IT team, you’ll be looking at these technologies in the not-too-distant future.
In fact, if you want each sysadmin to manage petabytes instead of terabytes, or thousands of VMs instead of hundreds, you will need something other than what you are used to.
At the same time, if your organization is not experiencing exponential growth in data and compute needs, but the trend is more linear, any new hardware generation will probably suffice to avoid structural changes. Furthermore, incremental updates to legacy technologies (like adding in-memory capabilities to a traditional RDBMS) can give some extra juice, and they will still be cheaper to implement than starting from scratch with next-generation technology (and the investment needed to train people within your organization!).
But you are already using them!
On the other hand, most of us (both consumers and enterprises) are already using the technologies mentioned in this article. Actually, many modern solutions we are adopting in our organizations are based on these technologies.
Take object storage as an example. Somewhere in your organization there is a sync & share solution, a cloud-storage gateway of some sort, backups being sent to the cloud, or something else! In all these situations, even when the front-end is installed locally, you are already leveraging object storage at the back-end. It’s likely that even if you sum together all the data managed by these applications, it would still be cheaper to buy a service than to build a new on-premises infrastructure (but have you ever actually run the numbers?). And in the adoption charts of the surveys mentioned at the beginning of this article, all those (numerous) companies accessing the same object storage platform provided by a single service provider are counted as one… even if the SP has a multi-petabyte installation serving thousands of tenants!
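Running those numbers is easier than it sounds. Here is a minimal back-of-envelope sketch in Python; every price and ratio in it is a purely illustrative assumption (not real vendor or cloud pricing), and the key point it captures is that on-prem has a fixed entry cost (a minimum viable cluster) while a service scales linearly from zero:

```python
# Back-of-envelope: object storage as a service vs. building on-premises.
# ALL figures below are illustrative assumptions, not real pricing.

def service_cost(tb, price_per_gb_month=0.03, months=36):
    """Total cost of a cloud object-storage service over `months`."""
    return tb * 1024 * price_per_gb_month * months

def onprem_cost(tb, hw_per_tb=300.0, min_capex=100_000.0,
                yearly_opex_ratio=0.20, years=3):
    """Hardware capex (with a minimum-cluster floor) plus estimated
    yearly operations (power, space, admin time)."""
    capex = max(min_capex, tb * hw_per_tb)
    return capex + capex * yearly_opex_ratio * years

for tb in (50, 500, 5000):
    svc, prem = service_cost(tb), onprem_cost(tb)
    cheaper = "service" if svc < prem else "on-prem"
    print(f"{tb:>5} TB: service ${svc:>12,.0f} vs on-prem ${prem:>12,.0f} -> {cheaper}")
```

With these (made-up) inputs the service wins at 50 TB and on-prem wins from a few hundred TB up, which is exactly the HUGE-or-not argument: below the crossover point, buy; above it, build. Plug in your own quotes to find where your crossover sits.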
Data lakes or data ponds?
In some cases the quantity of data is simply not huge; in others, organizational issues make a data lake very hard to build, while collecting many different, smaller sets of data is quite easy.
This means that you probably have different projects within the company, with different types of data under management, stored on different platforms. This leads to smaller clusters (or cloud services), less overall efficiency and, again, without consolidation, the impossibility of building a single huge infrastructure.
Lack of ease of use and appliances
Up to now, another big obstacle to the adoption of these technologies, especially in medium-sized enterprises, has been the lack of prepackaged appliances and ease of use. Fortunately, this is now changing quickly, and vendors are finally presenting prepackaged, in some cases hyper-converged, solutions that put different components together in a pre-assembled fashion. Part of the benefit comes from simplicity, but fast provisioning and automation also play an important role. With this approach the IT department can give freedom and flexibility to all business/organizational units, and can allow them to choose the right product for each of their projects.
For example, if you look at solutions like the recently launched HDS HSP, a hyper-converged appliance based on KVM, Openstack and a proprietary distributed file system highly optimized for big data workloads, you’ll find that you can easily build a data lake and leverage it through different data analytics tools in a cloud-ish way. It’s like having a specialized Big-Data-as-a-Service-in-a-box! It doesn’t come cheap (you pay for the integration, support and industrialization of the product), but the minimum configuration is 5 nodes… which is not much bigger than the 3/4-node configurations of most Hadoop clusters out there, while adding a lot of flexibility by supporting many different Hadoop distributions, NoSQL DBs, whatever you need and, above all, enabling the creation of a data lake.
Closing the circle
On-prem Big Data, Openstack (private clouds) and Object Storage are not for everyone. If you don’t have the problem, you don’t need them. It’s just common sense, isn’t it?
In fact, only surveyors and some analysts seem unaware of this. If you don’t have the problem, leveraging external services is the best choice.
If you are experiencing exponential growth of data and infrastructure then, sooner or later, you are going to need them. In that case, it’s time to start building the two-tier strategy I have mentioned many times on this blog (here’s an example): a strategy where the secondary tier is the data lake (maybe based on object storage!). Consolidation of different storage/data islands will become an important part of this process but, at the same time, flexibility remains the pillar for maintaining simplicity and usability of data and resources… This is why I’m sure we will soon see more data-centric hyper-converged systems in the market, especially in mid-sized organizations where there aren’t enough resources to build these kinds of infrastructure from scratch!