The Problem of Data requires a fundamental shift in data thinking.

Vivek Kumar
9 min read · Jul 28, 2022

The Problem

Companies today can’t live without data, yet they haven’t quite figured out how to navigate it. Data debt keeps piling up and, as the saying goes: more data, more problems. Walk into any meeting about data and you will find it filled with complaints.

Let me share some examples of these complaints.

Leadership & Business & Product

  • Delay in data & analysis. Bad data. Lacks actionable insights. Data dumps
  • Lack of a self-serve system and unnecessary dependency on the data team.
  • All the business and product teams are unsatisfied with the data team.
  • Why can’t we empower the business & product team to do their own analytics?
  • Data stack is costly and we are not getting a return on our expenses.

Data Leaders & Managers

  • Attrition is not slowing down. People are not happy with the work.
  • Analysts want to become either product managers or business managers.
  • The problem lies with upstream data, data capture, and data warehousing.
  • No clear ownership of data assets. This is a data engineering or software engineering problem.

Data Operators: Analysts & Data Scientists

  • We are always cleaning the bad data and don’t get the time to solve actual analytics problems.
  • Correct Data doesn’t get captured. Data Quality suffers in the data ecosystem. We are not engineers.
  • No one knows what data asset has what use case. Which data to use for what insights?
  • ML models are not working because of bad data.
  • There is not enough data or too much data which is just a mess.

Data Tech: Data Engineering & Software Engineering

  • We don’t want to keep writing an API for every new data capture.
  • We don’t get a clear spec for new data asset capture.
  • Business or Analytics doesn’t keep us in the loop about the data.
  • We want to focus on building core products and enhancing core infrastructure.
  • Why are we writing queries and doing analytics work?

Does all of this sound familiar? Of course, it does. Anyone working in proximity to data knows this.

What is happening now?

So how do we solve it?

We go in search of the truth that will solve it all. During this search, I stumbled across the following.

  • Use Modern Data Stack or Post-Modern Data Stack
  • The emergence of Data Product Managers & Data as a Product
  • Data Design & Data Modelling first approach
  • Agile Analytics & emergence of analytics engineering (dbt)
  • Clear & defined ownerships of data assets. Consumers vs Producers
  • A more collaborative approach, processes & culture change

But are we any closer to solving it? I think not. Are these approaches incorrect? Nope. They work fine but solve only part of the data problem.

So what are we missing?

The Data Problem is a puzzle whose pieces are puzzles in themselves. You may already be aware of the following pieces of the data ecosystem.

  • Software Engineering: Manage the data upstream
  • Data Engineering: Manage the data downstream
  • Analytics & Data Science: Consume the data for data applications
  • Data Products-AI/ML: Productize Analytics & Data Science

While building companies, founders have always taken one of the following approaches.

  • Engineering First Approach
  • Business First Approach
  • Product First Approach

Building companies this way has created The Data Problem: there has never been a data-first approach.

The result is an inherently broken data ecosystem, with ownership issues, misinformed roles, reduced impact, career stagnation, and poor data org design.

We blame it all on the people who own one part of it, and expect them to solve the parts they don’t own, have no incentive to solve, and that don’t align with their primary responsibilities.

There has never been a single data org that owns the complete data ecosystem at any company. If no one owns the data ecosystem, then who is supposed to solve it?

We never talk about this. Why is that? Why has there never been a single data org responsible for the complete data ecosystem at a company?

  • Before the 2000s, when people were building companies, the focus was first on business, then it moved to tech, and then to product. Data was a byproduct of these innovations and its impact was not understood, so data ownership stayed scattered.
  • Once data became highly valuable and people realized it could bring them prestige, fame, and money, it became critical for them to keep data to themselves. So these functions started to own parts of the ecosystem and didn’t want to let go, even when it hurt the company overall.
  • Even if anyone wanted to own the complete data ecosystem, they couldn’t. They would have to understand it all: Data, Business, Tech & Product. No one did. Fear took over, and everyone is afraid to venture into unknown territory.

So the system is against solving it and there is no one to solve it. How will it get solved? Who is going to build a data strategy for the complete data ecosystem at the company?

Is there a way? Well, the good thing is that there is. You know it already.

(Please let’s not go in the direction where we can’t expect people to know it all.)

The direction we need to take

Single Org & Single Ownership for data: But how will this happen?

The Chief Data Officer role has existed for quite some time now. If so, why hasn’t the problem been solved?

  • They don’t understand the complete data ecosystem.
  • They don’t know how to bring the pieces together.

So I am keeping the same designation for easier understanding, with a few modifications that should help you solve this problem. This is what I mean when I talk about a CDO.

Who are they?
They can quite literally see and imagine data at each stage, and the impact this data can bring. They must see each data asset as a single product that goes through an assembly line to produce something beautiful and impactful for the business. Basically, it is like running a data product company inside another company.

  • Partner with Tech to produce raw data assets and refine them to make them usable.
  • Build analytics & data products that are consumed directly by business or product.

To deliver this, the CDO needs to own data at each stage, define the scope of the data org, and be responsible for it. It can’t be owned by any other team, because these are problems of data, not of tech, product, or business. Just because data passes through tech or product, that doesn’t make it their problem. They are responsible for a safe and timely passage and that’s it. Nothing more and nothing less.

All this needs to be solved by data operators, and they need to be familiar with the paths and media through which data passes. It is scary. Why?

  • This requires a massive mind shift in looking at data.
  • This requires upskilling and learning the things we are afraid of.
  • It means taking ownership of the most broken system in any org.
  • We don’t know where the data org starts and ends. There are no clear boundaries.

This is why no one wants to bite the bullet. This is where the deadlock happens and everything gets stuck. We keep trying to solve people & process problems with tools, which is exactly why the Modern Data Stack is failing. So let’s solve it. Let’s start by defining the data ecosystem, the data org, and the parts that need to be owned by the data org.

The Data Ecosystem

Upstream Data: Production & Management at Source

  • All the ways and places through which data enters the data ecosystem and gets stored in the company’s non-analytical databases or data lakes should reside here.
  • Instrumentation, Logging, Sensors, external data, internal data, user-generated data, third-party data, APIs, Feature Engineering, Data Architecture
  • Data Storage & Movement, Reliable data flow, pipelines, uptime, Security, Streaming, compute and storage, infrastructure
  • Owners: Central Data Team | Partners: Software & Data Engineering

Downstream Data: Production & Management at Destination

  • All the ways in which data moves and is transformed to make it meaningful, related to the real world (semantic layer), and consumable for the rest of the data team and company (self-serve).
  • Data Cleaning, Aggregation & Transformation, Data Quality, Data Modelling, Data Warehousing, Security, access, availability, resiliency, job orchestration, compute and storage, Data Architecture
  • Data Governance & Management, Data Privacy, Data Compliance, tools management, Integrations, Data lineage, Fact and Dimension tables, Summary tables
  • Owners: Central Data Team | Partners: Data Engineering

Data: Passive Consumption | Applications & Products

  • All the ways via which data can be used to provide visibility to the different teams & companies.
  • Business Intelligence & Reporting, Central dashboards, Data Dumps, Finance reporting, Investors’ Data
  • Data Applications — Incident Reporting & Monitoring, CRM, Marketing, Sales
  • Owners: Central Data Team | Partners: Analytics | DS | Business

Data: Active Consumption | Applications & Products

  • All the ways via which data can be used to drive decision-making across the company.
  • Analytics: Descriptive analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive analytics
  • Experimentation & Product Optimization
  • Data Science: Personalisation, Fraud Identification, Recommendation, Search engine, Matchmaking, ML & AI
  • Owners: Analytics & DS | Partners: Business | Product | Leadership
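
The four stages above can be written down as a small ownership map, so that the question "who owns this data concern?" has exactly one answer. What follows is a minimal sketch in Python and not part of the original framework: the `Stage` class, the `STAGES` list, and the `owner_of` helper are hypothetical names, and the concerns listed are only a small sample of the bullets above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    owners: tuple[str, ...]
    partners: tuple[str, ...]
    concerns: tuple[str, ...]  # illustrative sample of the bullets above


# Owners and partners are copied from the four stage descriptions.
STAGES = [
    Stage("Upstream Data", ("Central Data Team",),
          ("Software Engineering", "Data Engineering"),
          ("instrumentation", "logging", "pipelines")),
    Stage("Downstream Data", ("Central Data Team",),
          ("Data Engineering",),
          ("data quality", "data warehousing", "data lineage")),
    Stage("Passive Consumption", ("Central Data Team",),
          ("Analytics", "DS", "Business"),
          ("business intelligence & reporting", "central dashboards")),
    Stage("Active Consumption", ("Analytics", "DS"),
          ("Business", "Product", "Leadership"),
          ("experimentation", "recommendation", "predictive analytics")),
]


def owner_of(concern: str) -> tuple[str, ...]:
    """Return the owners of the single stage that lists the given concern."""
    matches = [s for s in STAGES if concern.lower() in s.concerns]
    if len(matches) != 1:
        # Zero or several matches means ownership is unclear, which is
        # exactly the situation the framework is trying to remove.
        raise LookupError(f"{concern!r} maps to {len(matches)} stages, expected 1")
    return matches[0].owners


print(owner_of("data quality"))
print(owner_of("experimentation"))
```

The lookup fails loudly when a concern belongs to no stage or to more than one, which is one way to surface the "no clear ownership" complaint early instead of in a meeting.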

Now we know the playground for the data org and where its boundaries lie. Let’s define the Data Org structure, which will be based on the hub & spoke model.

The Design of The Data Org

CDO:

  • Central Data Team Leader: Tech & Data Design Oriented
  • Analytics Leader: Business & Product Oriented
  • Data Science Leader: AI & ML Oriented

Central Data Team: This team ensures that correct raw data flows from source to destination and raw data is transformed into usable data assets at the destination.

  • Upstream Data: Production & Management at Source
  • Downstream Data: Production & Management at Destination
  • Data: Passive Consumption | Applications & Products
  • People: Data Stewards, Analytics Engineers, or Data Analysts, Product Manager — Data Platform
  • Skills: SQL, Excel, Python, Cloud, Data warehousing, Knowledge of Data Engineering, Software Engineering

Active Consumption | Applications & Products

Analytics Team: This team ensures that real-time decision-making is enabled for the various teams at the company. They are closer to business problem solving and use data to deliver actionable solutions & insights.

Business Analytics | Product Analytics | Fraud Analytics | Game Analytics

  • People: Business Analyst, Product Analyst, Fraud Analyst
  • Skills: SQL, Excel, Python, Cloud, Data warehousing, Knowledge of Data Engineering, Software Engineering

Data Science Team: This team ensures that data is productized in its most optimal way.

  • People: Data Scientists, AI/ML Engineers, Product Manager — Data Products
  • Skills: SQL, Excel, Python, Cloud, Data warehousing, Knowledge of Data Engineering, Software Engineering

The Implementation of The Data Org

Is Org Design and CDO enough to make this whole thing work? No. Let’s focus on processes, guidelines, and policies to make it effective.

People

  • Data Operators in the Data Org should be given the flexibility to move across the teams as their interests and passion evolve. It works really well for retention and opens growth paths for individuals.
  • People with more business and product thinking/aptitude should be brought into analytics. They are the ones who want to make an impact on business/product.
  • People with more interest in the pure data (design + tech) side should be brought into the Central Data Team.

Process

  • The Central Data team should be in the loop whenever any new data is introduced into the system to manage data quality.
  • Product Analytics should be responsible for defining data specs for any new or updated feature, and for 1st-party data in general. The central data team should take care of the rest.
  • Product Analytics should focus on product optimization via data, not the PM work.
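
One way to make the first two process points concrete is to treat a data spec as a small structured record that the central data team reviews before new data enters the system. The sketch below is only an illustration under assumptions: `FieldSpec`, `EventSpec`, and `review_spec` are hypothetical names, and the three checks are examples, not a complete quality policy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    name: str
    dtype: str
    description: str = ""


@dataclass
class EventSpec:
    event: str
    feature: str
    defined_by: str  # e.g. "Product Analytics" for a new or updated feature
    fields: list[FieldSpec] = field(default_factory=list)


def review_spec(spec: EventSpec) -> list[str]:
    """Return the problems a reviewer would send back; an empty list means ok."""
    problems = []
    if not spec.defined_by.strip():
        problems.append("spec has no owner")
    if not spec.fields:
        problems.append("spec lists no fields")
    for f in spec.fields:
        if not f.description.strip():
            problems.append(f"field {f.name!r} has no description")
    return problems


# Hypothetical spec for a new feature, with one undocumented field.
spec = EventSpec(
    event="checkout_started",
    feature="checkout",
    defined_by="Product Analytics",
    fields=[
        FieldSpec("user_id", "string", "Unique user identifier"),
        FieldSpec("cart_value", "float"),
    ],
)
print(review_spec(spec))
```

A record like this gives engineering the clear spec they ask for, and gives the central data team a fixed point at which they are brought into the loop.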

Culture

  • NA

Systems

  • NA

Roles & Hierarchy in the Data Org

At GetMega, I defined the complete Data Org, so feel free to use it as a reference. I will write in detail about each role and its responsibilities in another article.

Generic Hierarchy

  • Analyst I / Data Steward I / Data Scientist I
  • Analyst II / Data Steward II / Data Scientist II
  • Analyst III / Data Steward III / Data Scientist III
  • Senior Analyst / Data Steward / Data Scientist
  • Lead Analyst / Data Steward / Data Scientist
  • Manager
  • Senior Manager
  • Director
  • VP

Notes :

This is a work in progress. I will keep updating it as and when needed.

This is a framework to build a better data org. You can modify it based on your needs but don’t go too far.

I will be writing more articles to expand on some of the ideas which are briefly discussed here.
