Data Management in 2024: Expert Insights from DataPebbles

We spoke with Surajeet Bhuinya, the founder of DataPebbles, a partner of CrateDB based in the Netherlands. Surajeet is an expert in the field of data management and shares his insights on the challenges that companies face with data management and how to solve those challenges.  

DataPebbles is a boutique data consultancy with years of data engineering practice. They help businesses with cloud migration, visualizing insights, data pipelines, and automation of machine learning applications. Most of their clients use Kubernetes on-prem or in the cloud.  

CrateDB: Surajeet, what are the key data challenges companies are facing today?  

Surajeet Bhuinya (SB): One of the biggest challenges faced by companies today is understanding data management. Often, people focus on finding a single piece of data without realizing that a well-planned data pipeline with automation can lead to an overwhelming amount of data. In such cases, it becomes necessary to manage the data effectively, understand its source, and track its lineage. 

Unfortunately, many companies fail to plan for this from day one, leading to issues down the road. Despite recognizing the importance of data management, companies often overlook the need for documentation and a data glossary. These simple yet crucial steps are essential for understanding the data and avoiding issues in the future.  

CrateDB: What does it take to get there, and what's the approach to help with this challenge?  

SB: When it comes to setting up a new data pipeline or data insights, technology can be incredibly helpful. At our consultancy, we always recommend two basic things to our clients: automatic metadata and automatic lineage. These two steps are essential and can be easily implemented.  

While data quality reports, data governance, and access management are also important, it's crucial to start with data understanding, domain knowledge, and documentation. At our consultancy, we primarily work with Spark, which supports an open standard for lineage called OpenLineage. When the data platform is set up properly, lineage can be obtained without writing any special code. Similarly, metadata can be obtained for free or at a low cost with the right software and tools.  
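As an illustration of lineage "without writing any special code", the following is a minimal sketch of how a Spark job might be pointed at an OpenLineage backend purely through configuration. It assumes the OpenLineage Spark integration jar is already on the job's classpath; the namespace `demo_pipeline`, the URL `http://localhost:5000`, and both helper function names are hypothetical placeholders, and the property names should be checked against the OpenLineage version in use.

```python
# Sketch: enable OpenLineage for a Spark job through configuration only.
# Assumption: the OpenLineage Spark integration jar is available to the job.
# The helper names, namespace, and URL below are illustrative placeholders.


def openlineage_spark_conf(namespace, transport_url):
    """Return Spark properties that register the OpenLineage listener."""
    return {
        # Listener class shipped with the OpenLineage Spark integration.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Send lineage events over HTTP to a lineage backend.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": transport_url,
        # Logical grouping under which the job's lineage is recorded.
        "spark.openlineage.namespace": namespace,
    }


def to_spark_submit_args(conf):
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = openlineage_spark_conf("demo_pipeline", "http://localhost:5000")
    print(" ".join(to_spark_submit_args(conf)))
```

The job code itself stays unchanged: the generated arguments are passed to `spark-submit`, and the listener reports inputs and outputs as the job runs.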

We strongly advise using open-source software for its flexibility, and it is important to use tools that will continue to work in the future. Preparing documentation from the start is also crucial, as it will be useful later on. Following these steps can go a long way in ensuring effective data management.

CrateDB: What are the critical differentiators of working and engaging with CrateDB as a partner?

SB: At our consultancy, we prioritize open-source software as it offers greater flexibility and scalability than proprietary software.

In the past, I have found it challenging to create a reference data architecture for our customers that limits the number of tools while still providing a wide array of features, especially ad hoc queries on structured data and storage for unstructured data. Typically, you end up choosing two technologies:

  • one specialized in ad hoc, on-demand queries
  • one for highly structured, performance-optimized queries.

I am looking forward to putting CrateDB in the mix as a single storage layer for both ad hoc and structured queries without any compromise on performance. Additionally, I am excited to use its vector database feature to ensure the data architecture we present is AI-ready.  

CrateDB: How is your experience so far with CrateDB?

SB: Our experience engaging with the CrateDB team has been great. We appreciate the amount of focus they give to their partners, which is evident in the amount of involvement we receive. In comparison to other software partners, CrateDB stands out for the level of attention they provide.  

Thank you, Surajeet. At CrateDB, we believe technically skilled partners are key to our success. We are excited about our partnership with DataPebbles and how it will develop in the future.