When You Should NOT Build Your Own Internal Data Platform
A data platform is about enabling an organization to work with data at scale, in a self-service manner.
That is how Kristof Martens, tech lead at several of Dataminded's customers, describes a data platform. It is also what Wim Vancuyck, data lead at imec, envisions in his quest to enable more than 5000 researchers to work with data at imec, the world’s leading independent nano-electronics R&D hub. We recently sat down together and discussed why companies should actually not build their own internal data platform. These are our main conclusions.
The impact of company culture
Let’s start by explaining the different factions that we typically see when a company decides on a data platform: business, IT and governance. Business teams typically want to go fast and have something available yesterday. That’s why they mostly lean towards buying off-the-shelf tooling: click install and off you go.
IT departments, on the other hand, focus on standardization and development best practices, and tend to lean towards building a solution, as it offers more flexibility and control. And let’s be honest: building a data platform often looks easy at the start. The third party, the governance team, has a narrower view of a data platform: a data catalog will solve all problems. In the end, it’s the faction with the most power in the company that determines its buy-versus-build mentality.
Funnily enough, at imec it’s the other way around: the R&D department consists of people who are used to coding, which led to building and rebuilding quite often. The IT department, by contrast, is small compared to the number of researchers, and loves to work with partners and buy tools to scale fast and introduce standardization. It’s all about preventing all those researchers from reinventing the wheel over and over again.
This buy mentality, combined with the desire to keep costs under control, the love for open source, and the preference to avoid vendor lock-in, has led to a modular approach to the data platform at imec. This data platform consists of multiple Azure components, combined with tools like Matlab or Conveyor. Conveyor can be considered the cornerstone, offering multiple open source technologies as managed services, properly glued together. In theory, this modular approach allows imec to migrate the platform to other cloud providers, again mitigating vendor lock-in.
The pitfalls of building your own data platform
“We can build this ourselves!” This is a commonly heard argument for building your own data platform, and the people making it are probably right. The question is not whether you can build your own data platform, but whether you want to or should. At imec they have the following guideline:
Buy where you can, build where you want to make the difference.
Often the platform is not where you make the difference: what you use it for is the value add.
Remember that a data platform is about enabling an organization to work with data at scale in a self-service manner. This raises a lot of questions, like:
- How will I provide proper separation between teams and projects?
- How can I encourage best practices?
- How do I share data?
- How do I keep track of governance and compliance concerns?
- How do I implement a data product lifecycle?
It is hard to come up with an answer to all these questions beforehand. If you want to have all of this in place before a first use case, it can easily take several years. Gluing tested and proven components together, as imec does, speeds up this process drastically. After the consideration phase, imec’s data platform was up and running in just 4 months. Even better: within the same 4 months the first 14 use cases were live.
Conveyor was chosen to avoid reinventing the wheel. It offered a set of open standards that were on imec’s wish list, and they were already properly integrated. Imec could have done this integration themselves, but they acknowledged that it had been done before. Others, with more experience, might be better at it and could take the long-term maintenance off their hands. This long-term maintenance might be the most neglected pitfall of building your own data platform: if you build it yourself, you need to maintain it yourself. Platform teams often get bogged down in operational issues after a while and can no longer add any new functionality.
Scaling your data platform
When talking about scaling a data platform, we immediately think about scaling the infrastructure, but scaling a data platform also involves:
- Increasing the number of use cases running on top of it.
- Growing the number of people working with it, along with the knowledge required to work on it.
- Securely sharing data between use cases.
Imec is a good example of how technically scaling the platform is not the biggest challenge: scaling people is.
It is hard to onboard people to work with data, both internally, where you need to sell the platform, and externally, where you need to attract experts. This is one of the reasons imec prefers buying platforms and components: it avoids hiring a lot of people into the data platform team. Data workers may increasingly be embedded in the business teams, a consequence of the ongoing data mesh discussion. A small data platform team should enable them to focus on value creation with standards, guidelines and paved roads. This renewed focus is the new IQ: efficiency is key.
Kristof recognizes this separation of responsibilities at many companies. A data platform team takes care of shared services, compute, infrastructure and data access, allowing use case teams to focus on building and maintaining the data use cases themselves. It’s the platform team’s responsibility to sell the platform to more and more people internally. You can only do this with a proper customer-centric mindset.
The cost of innovation
Sometimes standardization is considered the opposite of innovation. The goal of imec is to balance both. Yes, the IT department wants to introduce standards, but researchers value freedom and flexibility. The key is having a good offering tailored to each user profile. Remember the customer-centric mindset of the platform team? The users at imec strongly demand paved roads and standardization within their preferred tools because this takes away complexity: standardization becomes an enabler, and even an accelerator, for innovation.
The modular approach at imec limits the dependency on a single partner. The open standards offered by Conveyor and the interchangeable Azure components make it possible to swap them for new, innovative tools over time.
But what does this approach cost? Make a business case when buying a tool and compare it to the cost of building it. However, do not forget the maintenance cost in the long run. Even open source technologies cost money, as you need to keep maintaining their installation and keep upgrading them. Imec argues that the tools they have bought have a lower Total Cost of Ownership than a build approach. To keep costs under control, they focus on avoiding licenses without added value as well as idle resources.
As a small remark, Kristof notices that many companies find it easier to accept people costs than tooling costs. Hiring extra people often does not require a procurement process, and the business case of people versus licenses is rarely made.
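As a rough illustration of such a business case, the comparison could be sketched as below. This is a minimal sketch, not imec's actual calculation: the `PlatformOption` class, its cost categories and all figures are hypothetical placeholders that you would replace with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class PlatformOption:
    """One side of a buy-versus-build business case (all figures hypothetical)."""
    name: str
    upfront_cost: float    # one-off cost: initial build, or onboarding and integration
    yearly_license: float  # recurring tooling or license cost
    yearly_people: float   # recurring people cost for maintenance and upgrades

    def total_cost(self, years: int) -> float:
        """Total cost of ownership over the given number of years."""
        return self.upfront_cost + years * (self.yearly_license + self.yearly_people)


# Placeholder numbers for illustration only; they are not imec's figures.
build = PlatformOption("build", upfront_cost=300_000, yearly_license=0, yearly_people=200_000)
buy = PlatformOption("buy", upfront_cost=50_000, yearly_license=100_000, yearly_people=50_000)

# Compare both options over several time horizons.
for years in (1, 3, 5):
    print(years, build.total_cost(years), buy.total_cost(years))
```

The point of the sketch is that the recurring people cost sits in the same formula as the license cost, so the long-term maintenance of a build approach cannot be left out of the comparison.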
Security and compliance
Off-the-shelf data platforms usually have a sensible story that covers most if not all security aspects. Companies want to have control over who can access which data, who can alter production processes, how to enable software engineering best practices, how to comply with regulatory requirements, and so on.
When buying a tool, it is mostly a matter of translating the concepts that these data platforms offer to the existing company processes. As this translation is never a 100% match, companies often favor building a custom data platform over updating the company processes.
Kristof argues that it makes sense to adapt the company processes to the concepts of a data platform. Again, these are proven concepts, distilled from experiences at many different customers. The agility of your organization heavily impacts the feasibility of this change. Yet, if your level of agility does not allow for this, do you believe your company is the right one to build a tool itself?
Looking back
Imec’s approach has always been to buy where possible and to build where you can make the difference. However, they do not believe in buying a Swiss Army knife that can do everything and probably comes at the cost of a Rolls-Royce. Therefore imec takes a modular approach, which requires glue code between the different selected components. This is where Conveyor came in: a proven technology. An added benefit is that imec does not need to maintain the components, nor the glue between them.
Overall, imec is very happy with the choices made. The world of technology and data is moving fast, which means that over time you might no longer have the best platform possible. This does not imply that you made the wrong choices, but rather stresses the importance of building a modular platform. With this approach, you have the flexibility to change components when desired.
A data platform is there to scale the self-service usage of data throughout your organization. We notice that it’s more difficult to scale the number of users than your infrastructure. Your data platform should be customer-centric in order to convince more and more people to use it. A great user experience is crucial. A small data platform team focusing on this enablement frees up your data workers to focus on value-adding use cases. Furthermore, developing those use cases on paved roads is what allows you to reach your data ambitions.