This blog was co-written by Myles Suer, Sean Keenan, VP of product, data catalog and data preparation, and Richard LeBlanc, senior product manager.
Today’s market winners have moved, or are well along in the process of moving, from siloed data and analytics efforts to being truly analytical competitors. Those that haven’t chosen to take this journey will join the growing scrap heap of organizations that have been digitally disrupted. The key distinguishing factor is organizational data readiness.
Market winners have the ability to use analytics, including machine learning, to shape their businesses’ execution. A recent survey by Deloitte LLP shows that analytics are increasingly being used to enhance company products and services. No wonder more and more markets are being disrupted. Another survey finds the most common places for analytics usage are product innovation, customer service and experience, sales execution, and supply chain operations, with customer engagement being a common objective.1
Given the power of analytics to shape business, it is an axiom that “you can't be really good at analytics without really good data.”2 The goal for businesses of all stripes should be to industrialize the making of great data.
At Boomi, we call this state ‘Data Readiness.’
To achieve data readiness, organizations need to “gather, input, clean, integrate, process, and safeguard data in a systematic, sustainable way.”3 So, what goes into being systematic and sustainable, and how can a unified, integrated platform accelerate this business goal?
The sad fact is that most enterprises — especially large, legacy organizations — do not have a complete view of the data they have, where it is located, or how it is protected. This is ground zero for the journey to data readiness.
In order to take the first step, data discovery is needed — ideally through a tool or system that can automatically profile, categorize, and tag data potentially useful for analysis.
What analysts and data scientists are looking for is a Google-like search that returns a rich library of data sets. Discovery also needs to look for unprotected personally identifiable information (PII) such as credit card numbers, Social Security numbers, phone numbers, and FICO scores. In healthcare, HIPAA defines 18 identifiers that must be removed before data can be considered de-identified.
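To make discovery concrete, here is a minimal sketch of how automated PII tagging might work. The patterns, threshold, and function names are assumptions for illustration, not a description of any particular product; a production scanner would validate matches more rigorously (for example, Luhn checks on card numbers) and cover far more data classes:

```python
import re

# Illustrative patterns only (assumed for this sketch); real scanners
# cover many more PII classes and validate matches more rigorously.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "credit_card": re.compile(r"^(?:\d[ -]?){13,16}$"),
}

def tag_column(values, threshold=0.8):
    """Tag a column with a PII class if most of its values match a pattern."""
    tags = []
    for name, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(str(v).strip()))
        if values and hits / len(values) >= threshold:
            tags.append(name)
    return tags

print(tag_column(["123-45-6789", "987-65-4321"]))  # ['ssn']
```

A profiler along these lines would run over sampled column values during discovery and write its tags into the catalog, so sensitive columns surface in search before anyone grants access to them.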
Organizations need to also find where their data comes from, including provenance and lineage. This is essential to determining the validity of the data for use by applications or analysis. For example, regulatory standards such as the EU General Data Protection Regulation (GDPR) and US Federal Reserve Board CFO Attestation require banks to be accountable for their datasets, including lineage, for anti-money laundering (AML) program compliance.
The next step toward becoming an analytical competitor is dealing with data silos and an ever-growing variety of data types. It should come as no surprise that “integration with existing systems and processes represents the greatest challenge for organizations in Deloitte’s cognitive-aware survey.”4
With this said, the goal of data integration remains aggregating data from multiple sources inside and outside the organization. The internal problem is that data from transactional systems has typically been stovepiped, which is, of course, a result of how those systems were built.
It should also come as no surprise that the research of MIT-CISR has found that 51 percent of organizations have their data locked in silos, and another 21 percent have their data connected with band-aids and duct tape.
Things are made even more difficult as organizations attempt to integrate data from an increasingly hybrid and multicloud world. What makes this difficult is the fact that today we need to integrate larger and more complex data sets.
At the same time, data management teams need to provide access to data across what is an increasingly distributed landscape. The diversity, scale, and complexity of these organizational data sets makes data integration and data management design complicated.
Meanwhile, traditional data integration is failing to meet new business requirements that combine real-time connected data, self-service, automation, speed, and intelligence. New and expanding data sources, batch data movement, rigid transformation workflows, growing data volumes, and the distribution of data across multi- and hybrid cloud environments only exacerbate the integration issue.
Once the integration efforts we mentioned above are complete, the next issue is data quality.
The goal here should be to make clean, consistent data available to applications. The problem is that “if people can’t trust the data, new systems and processes will have little impact."5 And for most organizations, “company data, one of its most important assets, is patchy, error-prone, and not up to date.”6
This represents a business problem, because as digital transformation has accelerated, it has become clear that business users need faster, easier access to truly trustworthy data. More importantly, trustworthy data is critical to making sound business decisions and to delivering on the mandates of digital transformation.
Fixing data requires businesses to establish four capabilities: data quality, data preparation, master data management, and data stewardship.
Transforming data requires that users be able to cleanse, enrich, parse, normalize, transform, filter, and format data.
Organizations need the ability to evaluate data for completeness, conformity, and accuracy. At the same time, user feedback should be able to drive improvements in metadata descriptions and tags, so the next person to use a data set doesn’t have to decipher a meaningless attribute description to determine whether it’s relevant for analysis. Data preparation is all about automatically cleansing, enriching, normalizing, and transforming data seamlessly.
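As a rough illustration of these evaluation and preparation steps, the sketch below profiles records for completeness and normalizes a couple of common fields. The field names and normalization rules are assumptions for the example, not a real product API:

```python
def profile_completeness(records, required_fields):
    """Score each required field by the fraction of records with a non-empty value."""
    scores = {}
    for field in required_fields:
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        scores[field] = filled / len(records) if records else 0.0
    return scores

def normalize_record(record):
    """Cleanse and normalize a record: trim and lowercase emails, keep phone digits."""
    out = dict(record)
    if out.get("email"):
        out["email"] = out["email"].strip().lower()
    if out.get("phone"):
        digits = "".join(ch for ch in out["phone"] if ch.isdigit())
        out["phone"] = digits[-10:] if len(digits) >= 10 else digits
    return out

records = [
    {"email": " Ann@Example.COM ", "phone": "(555) 123-4567"},
    {"email": "", "phone": None},
]
print(profile_completeness(records, ["email", "phone"]))  # {'email': 0.5, 'phone': 0.5}
```

Completeness scores like these are what would feed the metadata descriptions and tags mentioned above, telling the next analyst how much of a field is actually populated.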
With these steps completed, matching can occur to determine the uniqueness of data as part of a governance process prior to synchronization. To master data, it is important to establish validation rules, enrichment, and classification for the nouns of the business.
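The matching step can be sketched in a few lines. Real master data management uses probabilistic or machine-learning matching, but the key-collision idea looks like this (the normalization rule and field names are assumptions for illustration):

```python
def match_key(record):
    """Build a simple match key from a normalized name plus postal code."""
    name = "".join(ch for ch in record.get("name", "").lower() if ch.isalnum())
    return (name, record.get("postal_code", "").strip())

def find_duplicates(records):
    """Group records whose match keys collide; collisions feed survivorship rules."""
    groups = {}
    for r in records:
        groups.setdefault(match_key(r), []).append(r)
    return [group for group in groups.values() if len(group) > 1]

dupes = find_duplicates([
    {"name": "Acme Corp.", "postal_code": "94105"},
    {"name": "ACME Corp", "postal_code": "94105 "},
    {"name": "Globex", "postal_code": "10001"},
])
print(len(dupes), len(dupes[0]))  # 1 2
```

Each collision group is a candidate duplicate set; governance then decides which attributes survive into the golden record before synchronization.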
Clearly, with all the effort described above to discover, polish, and improve data, organizational data has value. The problem with classical data protection methods is that they involve either building castles and moats to keep bad actors out, or encrypting databases.
These are not bad things in themselves. However, external and internal bad actors have proven, via phishing and other techniques, that they can get past everything you have put in place. More importantly, they have learned that the database administrators who hold the database keys are the people to target.
Given this, data discovery is also critical to security. Because, as we said earlier, most systems were built as stovepipes, organizations need to find PII wherever it lives, apply governance policies to it, and mask data in motion. With appropriate governance, any hack, whether internal or external, will be limited in scope and impact because no one person will have access to every sensitive record.
Michelle Dennedy, former chief privacy officer at Cisco, says, “the process of compiling a list of information assets is an important first step for performing a risk management assessment to identify the level of risk to information.”7 With discovery and data governance policies for privacy data classes, organizations can protect their data in motion.
Recently, multiple analyst organizations have suggested that the processes for creating and managing data need to be rethought, and that organizations should consider applying DevOps-thinking to their data processes.
DataOps is about establishing a collaborative data management practice that improves the communication, integration, and automation of data across organizations. In DataOps, the deployment frequency is increased via automated testing, metadata and version control, monitoring, and improved collaboration between data stakeholders. It aims to shorten data system development and delivery cycles. As an ideal, DataOps is about providing continuous delivery of new data capabilities and higher quality of deliverables.
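In the spirit of DataOps, data pipelines get automated tests just as application code does. Here is a minimal sketch, with check names and rules invented for the example:

```python
def run_data_checks(records, checks):
    """Run named data-quality checks, returning the names of any that fail."""
    return [name for name, check in checks.items() if not check(records)]

# Illustrative checks; a DataOps pipeline would version these alongside its
# code and block the deployment when any check fails.
checks = {
    "not_empty": lambda rs: len(rs) > 0,
    "ids_unique": lambda rs: len({r["id"] for r in rs}) == len(rs),
    "amounts_non_negative": lambda rs: all(r["amount"] >= 0 for r in rs),
}

good = [{"id": 1, "amount": 10}, {"id": 2, "amount": 0}]
print(run_data_checks(good, checks))  # []
```

Running such checks on every change is what lets deployment frequency increase without sacrificing the quality of deliverables.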
Analysts say there is a need for a unified, intelligent, and integrated end-to-end platform that supports existing, new, and emerging data use cases.
A data fabric consists of a unified architecture, and services or technologies running on that architecture, that helps organizations manage their data. A data fabric that provides holistic data management capabilities and outputs clean, consistent, integrated, and secure data can deliver data that is ready for analysis and action. To achieve this end, the data fabric needs to be agile and scalable and, even more important, able to integrate data from disparate sources and secure data in motion or at rest.
And while doing this, a data fabric must minimize complexity and hide heterogeneity by embodying a coherent data model that reflects business requirements rather than the details of underlying systems and sources. It needs to expose metadata and create a data catalog that complies with enterprise data policies.
In short, a data fabric focuses on automating the processes for integration, transformation, preparation, curation, security, governance, and orchestration to enable analytics and insights more quickly. It should minimize complexity by automating processes, workflows, and pipelines, generating code, and streamlining data to accelerate various business use cases.
The data fabric should also be able to detect data types, including PII such as credit card numbers, Social Security numbers, phone numbers, and FICO scores, plus types like URLs. At the same time, it needs to provide comprehensive features for maintaining control and security of the data and understanding its lineage. This includes supporting data stewards, who can select any data attribute in any data set connected to the data fabric, and then select a function to mask the data.
This control should extend to row-level data, allowing users to see PII on some records but not others, which is especially important for compliance with regulations such as the EU GDPR.
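The attribute-masking workflow described above can be sketched as follows. The policy shape, role names, and masking rule are assumptions for illustration only, not any product’s actual API:

```python
def mask_value(value, keep_last=4):
    """Mask a value, keeping only its last few characters visible."""
    s = str(value)
    return "*" * max(len(s) - keep_last, 0) + s[-keep_last:]

def apply_policy(record, user_roles, policy):
    """Return a copy of the record with fields masked unless the user's roles allow them."""
    out = dict(record)
    for field, allowed_roles in policy.items():
        if field in out and not (user_roles & allowed_roles):
            out[field] = mask_value(out[field])
    return out

# A steward-defined policy: only the "steward" role sees SSNs in the clear.
policy = {"ssn": {"steward"}}
print(apply_policy({"name": "Ann", "ssn": "123-45-6789"}, {"analyst"}, policy))
# {'name': 'Ann', 'ssn': '*******6789'}
```

Applying the same policy per record, rather than per column, is what yields the row-level behavior: the same user can see PII on records they are entitled to and masked values everywhere else.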
Today’s market winners clearly are working to become analytical competitors.
The 28 percent of organizations that have been investing in their data and data readiness are well-positioned to win, but the 72 percent that haven't do not have 10 years to get there. They need a data fabric that can eliminate complexity, quickly integrate what they have today, and deliver clean, trustworthy, integrated, and secure data — and do so in an increasingly multicloud environment.
This is no longer about CIOs and CDOs buying and integrating point solutions. What is needed today is a solution that can be a single conduit for data.
Is your data difficult and expensive to decipher, manage, and control? "Five Steps for Accelerating Data Readiness," a TDWI Checklist Report, discusses modernizing technology strategies to create a faster path to data readiness. Get your free copy here.