4 Vs of Big Data

Author: Leighton Johnson, CISSP, CISM, CTO at ISFMT, Inc.
Date Published: 12 June 2019

Big data requires strong data handling processes in data-intensive systems. Today, with the incredible growth of data collection into systems of diverse kinds and sizes around the world, we need to understand big data basics for review, audit and security purposes. The characteristics of big data that force new architectures are as follows:

  • Velocity (i.e., rate of flow)
  • Volume (i.e., the size of the dataset)
  • Variety (i.e., data from multiple repositories, domains or types)
  • Veracity (i.e., provenance of the data and its management)

These 4 characteristics are known colloquially as the Vs of big data. The 4 Vs are used in the following ways:

  • Velocity describes the speed at which data are processed. The data usually arrive in batches or are streamed continuously. The distributed programming frameworks that process data at this speed, like certain nonrelational databases, were not developed with security and privacy in mind. A malfunctioning computing node might leak confidential data. Because nodes are highly connected and interdependent, an attack on only part of the infrastructure could compromise a large fraction of the system.
  • Volume describes how much data are coming in, typically ranging from gigabytes to exabytes and beyond. Volumes of this size have necessitated storage on multitiered storage media. The movement of data between tiers has, in turn, created a need to catalog threat models and survey novel techniques for network-based, distributed, auto-tier storage systems. On the positive side, large volumes of data allow analytics to be performed that help detect security breach events. This is an instance where big data technologies can help to fortify security.
  • Variety describes the organization of the data, including whether the data are structured, semi-structured or unstructured. Retargeting traditional relational database security to nonrelational databases has been a challenge. These systems were not designed with security and privacy in mind, and these functions are usually relegated to middleware. Traditional encryption technology also hinders the organization of data based on semantics.
    An emerging phenomenon introduced by big data variety is the ability to infer identity from anonymized datasets by correlating them with apparently innocuous public databases. Sensitive data are typically shared only after anonymization and aggregation have removed apparently unique identifiers and indirectly identifying information, so this kind of correlation can undermine the protection those processes are meant to provide.
  • Veracity includes provenance and curation. Provenance is based upon the pedigree of the data, the metadata and the context of the data when collected. This is important both for data quality and for protecting security and maintaining privacy policies. Big data frequently moves across individual boundaries to groups and communities of interest and across state, national and international boundaries. An additional area of the pedigree is the potential chain of custody and collection authority of the data. Curation is an integral concept that binds veracity and provenance to principles of governance and data quality assurance. Curation, for example, may improve raw data by fixing errors, filling in gaps, modeling, calibrating values and ordering data collection. Furthermore, there is a central and broadly recognized privacy principle incorporated in many privacy frameworks (e.g., the Organisation for Economic Co-operation and Development [OECD] principles, the EU General Data Protection Regulation [GDPR], US Federal Trade Commission [FTC] fair information practices) that data subjects must be able to view and correct information collected about them in a database.
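
The re-identification risk described under variety can be made concrete with a small sketch. The Python example below is illustrative only: the records, the field names (zip, birth_year, sex) and the link function are hypothetical, and they are assumptions made for this example rather than anything taken from a real dataset or tool. It shows how an "anonymized" dataset with names removed can be joined to a public list on the attributes both still share.

```python
from collections import defaultdict

# Hypothetical "anonymized" dataset: names removed, other attributes kept.
anonymized_health = [
    {"zip": "00001", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "00001", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "00002", "birth_year": 1975, "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical, apparently innocuous public dataset (e.g., a membership roll).
public_roll = [
    {"name": "Alice Example", "zip": "00001", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "00001", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "00002", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "zip": "00002", "birth_year": 1975, "sex": "F"},
]

# Attributes present in both datasets (assumed for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build a join key from the shared attributes."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(anonymized, public):
    """Return (name, diagnosis) pairs where exactly one public record matches."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[key(person)].append(person["name"])

    matches = []
    for record in anonymized:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link(anonymized_health, public_roll)
for name, diagnosis in reidentified:
    print(f"{name}: {diagnosis}")
```

In this sketch, the first two records are re-identified because each matches exactly one person on the public list. The third is not, because two people share the same combination of attributes. This is why aggregation, which makes each combination less unique, is used alongside the removal of direct identifiers.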

Big data systems and analysis by organizations large and small are here to stay, and we need to stay up to date with the technologies, analytics and utilization of these types of systems as they advance in the future.

Leighton Johnson, CISA, CISM, CIFI, CISSP, is a senior security consultant for the Information Security and Forensics Management Team of Bath, South Carolina, USA.