Guide to Big Data

Big Data has become big business. According to IDC, “Worldwide revenues for big data and business analytics (BDA) will reach $150.8 billion in 2017, an increase of 12.4 percent over 2016.”

And the market for big data solutions is likely to get even bigger. The same IDC report added, “Commercial purchases of BDA-related hardware, software, and services are expected to maintain a compound annual growth rate (CAGR) of 11.9 percent through 2020 when revenues will be more than $210 billion.”
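
As a rough sanity check, the 2020 projection can be reproduced from the two quoted figures with the standard compound-growth formula. The sketch below assumes that the $150.8 billion figure for 2017 and the 11.9 percent CAGR describe the same revenue base, which the quotations do not state explicitly. The function name is illustrative only.

```python
def project_revenue(base, cagr, years):
    """Compound a starting value forward at a fixed annual growth rate."""
    return base * (1 + cagr) ** years


# Figures quoted from IDC in the text: $150.8 billion in 2017,
# growing at an 11.9 percent CAGR through 2020 (three years of growth).
revenue_2020 = project_revenue(150.8, 0.119, 2020 - 2017)
print(f"Projected 2020 revenue: ${revenue_2020:.1f} billion")
```

Under that assumption the result comes to roughly $211 billion, which is in line with the "more than $210 billion" in the report.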

For enterprises, managing and analyzing big data – and developing a big data strategy – has become a critical part of how business is done. According to the Big Data Executive Survey 2016 from NewVantage Partners, a majority (62.5 percent) of the Fortune 1000 companies surveyed had at least one instance of big data in production, and only 5.4 percent said they had no big data initiatives planned or underway.

Clearly, big data is a key concern for business and IT leaders.

What is Big Data?

One of the first people to use the term “big data” was John Mashey, who began discussing it in the late 1990s while working at SGI.

The authoritative definition of big data came from Doug Laney, who was an analyst at META Group, which has since become part of Gartner. In 2001, Laney published a paper called “3D Data Management: Controlling Data Volume, Velocity, and Variety.” His three Vs — volume, velocity and variety — have since become the industry-standard way to define big data.

  • Volume: The most obvious characteristic of big data is that it includes an enormous amount of information. In the early days of the big data trend, much of that data came from e-commerce transactions. In the years since, mobile devices, social media and the Internet of Things (IoT) have all contributed to the ever-growing volume of data stored in enterprise IT systems.
  • Velocity: Enterprises are creating new data at a very rapid pace. Today’s organizations need to deal with real-time streams of data from sources like Twitter, Facebook, IoT sensors, RFID tags and mobile apps. Businesses must find a way to keep pace or they will fall behind the competition.
  • Variety: In the past, organizations were able to store much of their data in structured relational database management systems (RDBMSes). Today, however, much enterprise data is unstructured and includes text documents, photos, videos, audio files, email messages and other types of information that don’t fit into a traditional database. In fact, most organizations find it more difficult to deal with the variety of their data than with its volume or velocity. In the NewVantage survey, for example, 40 percent of those surveyed said the variety of their data was the primary technical driver for their big data initiatives, compared to just 14.5 percent who said the same about volume and 3.6 percent who selected velocity.

Some vendors have attempted to add a fourth V to the original three. They may talk about variability, which refers to data flows that speed up or slow down at different times; veracity, which refers to how trustworthy, accurate and consistent the data is; or value, the amount of money organizations can make from the data. However, none of these fourth Vs have caught on widely, and most people still use the three Vs to describe big data.

Figure: Big data is often described in terms of the three Vs (volume, velocity and variety).

Big Data Technologies

In order to deal with the massive volume, velocity and variety of their big data, enterprises deploy a wide variety of technologies, including the following:


Storage

To have enough room to store their big data, enterprises need a lot of physical or cloud-based storage hardware. Often, they choose virtualized storage solutions that offer excellent scalability.

Organizations also need software that can store their big data. They might choose a data warehouse, a data lake, a NoSQL database and/or a distributed storage solution, such as Hadoop.

Data Management

Vendors also offer a variety of tools that can help organizations move, integrate, clean and otherwise prepare their data for analytics. These tools fit into a variety of categories, including data integration, data virtualization, data preparation, ETL, data quality and data governance. Many companies use Big Data virtualization to help with management.


Analytics

For most organizations, the goal of big data initiatives is to generate valuable insights that the company can then use to become more efficient, better serve customers or become more competitive. Big data analytics tools include data mining, business intelligence, predictive analytics, machine learning, cognitive computing, artificial intelligence, search and data modeling solutions. A related technology, in-memory data fabric, can speed up big data analytics tasks.


Security

Large repositories of data can be an attractive target for hackers, so enterprises need to make sure that they are appropriately securing their big data. Popular big data security technologies include encryption and access management solutions.

On the flip side, some organizations run big data analytics on their security and log data in order to detect, prevent and mitigate attacks. Software with these capabilities is often referred to as security intelligence or security information and event management (SIEM) software.
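
To give a flavor of what analytics on log data can look like, the sketch below counts failed logins per source address and flags any address that reaches a threshold. The event format, the field names and the threshold are all assumptions made for illustration. Real SIEM products ingest far richer data and apply much more sophisticated rules.

```python
from collections import Counter

# Hypothetical, heavily simplified log events (invented for this example).
events = [
    {"ip": "10.0.0.5", "action": "login", "ok": False},
    {"ip": "10.0.0.5", "action": "login", "ok": False},
    {"ip": "10.0.0.7", "action": "login", "ok": True},
    {"ip": "10.0.0.5", "action": "login", "ok": False},
    {"ip": "10.0.0.9", "action": "login", "ok": False},
]


def flag_suspicious(events, max_failures=3):
    """Return the source addresses with at least `max_failures` failed logins."""
    failures = Counter(
        e["ip"] for e in events if e["action"] == "login" and not e["ok"]
    )
    return sorted(ip for ip, count in failures.items() if count >= max_failures)


print(flag_suspicious(events))
```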

Cloud-Based Tools

Many vendors offer cloud-based solutions for storing, managing, analyzing or securing big data. The main advantages of a cloud-based big data tool are the affordability and easy scalability that the cloud offers. However, some organizations have security or compliance concerns that prevent them from using cloud-based big data tools.

Big Data Analytics

For most organizations, the primary purpose in launching a big data initiative is to analyze that data in order to improve business outcomes. In the NewVantage Partners survey, the number one business driver of big data projects was “greater business insights,” which was selected by 37 percent of respondents.

Organizations generate those insights through the use of analytics software. Vendors use a lot of different terms, such as data mining, business intelligence, cognitive computing, machine learning and predictive analytics, to describe their big data analytics solutions. In general, however, these solutions can be separated into four broad categories:

  1. Descriptive analytics. This is the most basic form of data analysis. It answers the question, “What happened?” Nearly every organization performs some kind of descriptive analytics when it puts together its regular weekly, monthly, quarterly and annual reports.
  2. Diagnostic analytics. Once an organization understands what happened, the next big question is “Why?” This is where diagnostic analytics tools come in. They help business analysts understand the reasons behind a particular phenomenon, such as a drop in sales or an increase in costs.
  3. Predictive analytics. Organizations don’t just want to learn lessons from the past; they also need to know what’s going to happen next. That’s the purview of predictive analytics. Predictive analytics solutions often use artificial intelligence or machine learning technology to forecast future events based on historical data. Many organizations are currently investigating predictive analytics and beginning to put it into production.
  4. Prescriptive analytics. The most advanced analytics tools not only tell organizations what will happen next but also offer advice on what to do about it. They use sophisticated models and machine learning algorithms to anticipate the results of various actions. Vendors are still in the process of developing this technology, and most enterprises have not yet begun using this level of big data analytics in their operations.
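
The four categories can be illustrated on a toy data set. The sketch below is only a minimal illustration. The sales figures, the function names and the assumed effect of each action are invented, and a simple least-squares trend line stands in for the machine learning models that commercial tools use.

```python
from statistics import mean

# Hypothetical monthly sales (units sold), invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]
sales_by_region = {
    "north": [60, 66, 72, 78, 84, 90],
    "south": [40, 44, 48, 52, 56, 60],
}


def descriptive(series):
    """What happened? Summarize the history."""
    return {"total": sum(series), "average": mean(series)}


def diagnostic(by_segment):
    """Why? Show how much each segment changed from first to last period."""
    return {name: values[-1] - values[0] for name, values in by_segment.items()}


def fit_trend(series):
    """Least-squares line through (0, y0), (1, y1), ...: (slope, intercept)."""
    xs = range(len(series))
    x_bar, y_bar = mean(xs), mean(series)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def predictive(series, periods_ahead=1):
    """What happens next? Extrapolate the fitted trend."""
    slope, intercept = fit_trend(series)
    return intercept + slope * (len(series) - 1 + periods_ahead)


def prescriptive(series, actions):
    """What should we do? Pick the action with the best projected outcome.

    `actions` maps an action name to an assumed multiplier on the forecast.
    """
    baseline = predictive(series)
    projected = {name: baseline * uplift for name, uplift in actions.items()}
    return max(projected, key=projected.get), projected


print(descriptive(monthly_sales))
print(diagnostic(sales_by_region))
print(predictive(monthly_sales))
print(prescriptive(monthly_sales, {"no_change": 1.0, "discount": 1.1}))
```

Each function answers one of the four questions in turn. In practice the hard part of prescriptive analytics is estimating the effect of each action, which this sketch simply takes as given.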

Big Data Challenges

While big data offers tremendous business opportunities, it also poses some challenges, including the following:

Dealing with data growth. According to IDC’s Digital Universe report, the amount of digital information stored on the world’s systems is growing by 40 percent each year. For enterprises, simply storing that ever-growing amount of information can be a difficult and costly proposition. Analyzing those vast quantities of data poses additional challenges because as data stores grow, analytics processes take longer and require more computing power.
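
To put the 40 percent figure in perspective, a short calculation shows how quickly growth at that rate compounds. The sketch below assumes the rate stays constant from year to year, which is a simplification, and the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.40  # growth rate cited in the text


def doubling_time(rate):
    """Years needed for a quantity growing at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)


def growth_factor(rate, years):
    """How many times larger the quantity is after `years` of growth."""
    return (1 + rate) ** years


print(f"Doubling time: {doubling_time(ANNUAL_GROWTH):.2f} years")
print(f"Growth over 5 years: {growth_factor(ANNUAL_GROWTH, 5):.2f}x")
```

At a steady 40 percent per year, stored data doubles roughly every two years and grows more than fivefold in five years.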

Generating insights in a timely manner. Many organizations are looking to analyze and respond to their big data in real time. That requires specialized hardware and software with advanced capabilities. In the past, business analysts may have generated business intelligence (BI) reports on a weekly or monthly basis, but now many organizations are pressuring their analysts to create those same reports — and more — several times per day.

Recruiting and retaining big data talent. Big data experts and data scientists are some of the most sought-after employees in the extremely competitive IT talent market. According to the 2017 Robert Half Technology Salary Guide, the average big data engineer earns between $135,000 and $196,000, while data scientists make $116,000 to $163,500 and business intelligence analysts bring home $118,000 to $138,750. Many organizations find it difficult to hire the big data experts that they need. To fill in the gaps, they often look for big data analytics tools that promise to let business users serve their own needs.

Integrating disparate data sources. Most organizations have data that comes from a wide variety of different enterprise applications and both internal and external sources. Before they can perform analytics on those different data sets, they need a way to bring all that data together. Several vendors offer big data integration tools that can help, but integration remains difficult for many organizations.
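
At its simplest, integration means mapping records from different systems onto a shared key. The sketch below combines two hypothetical sources, a CRM export and a list of web orders, which use different field names for the customer identifier. All of the data and names are invented for illustration, and real integration tools also handle schema mapping, scale and scheduling.

```python
# Hypothetical records from two separate systems.
crm_customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
web_orders = [
    {"cust": 1, "amount": 40.0},
    {"cust": 1, "amount": 15.5},
    {"cust": 3, "amount": 99.0},  # no matching CRM record
]


def integrate(customers, orders):
    """Combine both sources into one view keyed by customer id.

    Orders that match no known customer are returned separately so they
    can be investigated rather than silently dropped.
    """
    view = {
        c["customer_id"]: {"name": c["name"], "total_spent": 0.0, "orders": 0}
        for c in customers
    }
    unmatched = []
    for order in orders:
        row = view.get(order["cust"])
        if row is None:
            unmatched.append(order)
            continue
        row["total_spent"] += order["amount"]
        row["orders"] += 1
    return view, unmatched


combined, unmatched_orders = integrate(crm_customers, web_orders)
print(combined)
print(unmatched_orders)
```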

Validating data. Big data analytics can only yield valuable insights if it is based on accurate data. Unfortunately, many organizations find that the data they have on their various systems is not consistent. Before they can analyze that data effectively, they need to have a process — and technology — for cleaning and standardizing that data.
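
A cleaning and standardizing step might look something like the sketch below. It normalizes a few fields, rejects records that fail a deliberately crude email check, and drops duplicates. The field names, the alias table and the validation rule are assumptions chosen for illustration, not a recommended standard.

```python
# Hypothetical alias table mapping free-text country values to one code.
COUNTRY_ALIASES = {"USA": "US", "U.S.": "US", "UNITED STATES": "US", "UK": "GB"}


def standardize(record):
    """Normalize whitespace, letter case and country codes."""
    country = record["country"].strip().upper()
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "country": COUNTRY_ALIASES.get(country, country),
    }


def is_valid(record):
    """Crude check: non-empty name and an email shaped like local@domain.tld."""
    local, _, domain = record["email"].partition("@")
    return bool(record["name"]) and bool(local) and "." in domain


def clean(records):
    """Return (cleaned, rejected): unique standardized records and bad input."""
    cleaned, rejected, seen = [], [], set()
    for raw in records:
        rec = standardize(raw)
        if not is_valid(rec):
            rejected.append(raw)
        elif rec["email"] not in seen:
            seen.add(rec["email"])
            cleaned.append(rec)
    return cleaned, rejected


raw_records = [
    {"name": "  ada   lovelace ", "email": " Ada@Example.com ", "country": "usa"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "country": "US"},
    {"name": "Bob", "email": "not-an-email", "country": "uk"},
]
cleaned_records, rejected_records = clean(raw_records)
print(cleaned_records)
print(rejected_records)
```

Here the first two records collapse into one after standardization, and the third is set aside for review instead of being passed on to analytics.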

Securing big data. Big data repositories can be particularly attractive targets for advanced persistent threats (APTs), meaning attackers such as nation-states and competitors that have the resources to launch a sophisticated, hard-to-detect cyberattack. Organizations need to make sure that they are protecting their big data stores with appropriate security measures, including encryption and access control.

Big Data Security

The task of securing big data is complicated by the volume, velocity and variety of the data. Any big data store is likely to include some sensitive information, such as customer credit card numbers, usernames, passwords, email addresses, etc. Enterprises often address this issue with encryption technology, but standard encryption techniques can slow down data retrieval or make big data analytics difficult or even impossible.

To get around this problem, organizations have a couple of options. The first is to encrypt only the sensitive data, for example with attribute-based encryption, which ties the ability to decrypt to a user’s attributes. A database could, for instance, encrypt credit card numbers and names while leaving customers’ ages and genders in the clear. This allows business users to analyze the non-identifying fields while access to personal information stays restricted.

Another option is fully homomorphic encryption, which allows users to run computations directly on encrypted data. The computation produces encrypted results, which can then be decrypted by the holder of the key for the original data. This option keeps the data protected even while it is being analyzed.
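
The idea of computing on encrypted data can be demonstrated with a much simpler scheme than fully homomorphic encryption. The sketch below implements a toy version of the Paillier cryptosystem, which is only additively homomorphic: two ciphertexts can be combined so that the result decrypts to the sum of the plaintexts, but arbitrary computations are not supported. The primes are far too small to be secure and the function names are illustrative. This is a teaching aid, not something to deploy.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Build a toy key pair from two small primes (insecure by design)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid for the generator g = n + 1
    return n, (lam, mu)  # public key n, private key (lam, mu)


def encrypt(n, m):
    """Encrypt an integer m with 0 <= m < n under the public key n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Combine two ciphertexts so the result decrypts to the plaintext sum."""
    return (c1 * c2) % (n * n)


public_key, private_key = keygen()
encrypted_total = add_encrypted(
    public_key, encrypt(public_key, 20), encrypt(public_key, 22)
)
print(decrypt(public_key, private_key, encrypted_total))
```

The party that adds the two ciphertexts never sees 20, 22 or their sum. Only the holder of the private key can read the result.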

Organizations should also make sure that any big data solutions they deploy have role-based access control with an audit trail. This protects against insider threats and gives organizations a way to see who may have accessed data. In addition, organizations should use real-time monitoring, and intrusion detection and prevention systems to help thwart attacks against their big data systems.
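
In outline, role-based access control with an audit trail comes down to two things: a mapping from roles to permissions, and a record of every access decision. The sketch below shows that outline only. The roles, permission strings and in-memory log are hypothetical, and a production system would use a tamper-resistant log store and an established authorization framework.

```python
from datetime import datetime, timezone

# Hypothetical roles and the permissions granted to each.
ROLE_PERMISSIONS = {
    "analyst": {"read:sales"},
    "admin": {"read:sales", "read:customers", "write:customers"},
}

# In-memory audit trail; every decision is recorded, allowed or not.
audit_log = []


def check_access(user, role, permission):
    """Return True if `role` grants `permission`, and record the decision."""
    allowed = permission in ROLE_PERMISSIONS.get(role, set())
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "permission": permission,
            "allowed": allowed,
        }
    )
    return allowed
```

Calling `check_access("dana", "analyst", "read:customers")` would return `False` and still leave an entry in `audit_log`, so denied attempts are visible when the log is reviewed.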

Another issue that complicates big data security is the prevalence of open source solutions. Many organizations have cobbled together their own big data solutions using freely available software. However, open source software doesn’t always have the same level of built-in security as proprietary solutions. NoSQL databases, in particular, often come under criticism for a lack of protection against attacks. Organizations that rely heavily on open source software will need to be especially vigilant to make sure that they are using appropriate levels of data protection for their big data.

Open Source Big Data Tools

Some of the best and most popular big data solutions are available under an open source license. In fact, the open source Apache Hadoop project dominates the big data space, and analysts at Forrester have gone so far as to call Hadoop a “must-have for large enterprises.”

The best-known open source big data solutions include the following:

Big Data Companies

The list of big data vendors seems to be endless. Frankly, if you name any technology company, it’s quite likely that it has a product with a “big data” label. Additionally, new big data startups are constantly being created. The vendors below are some of the best known big data companies in each category:

Big data storage vendors

Hadoop distribution vendors

Big data analytics vendors

Cloud-based big data vendors
