Data-Driven Culture: Part II - What is good data?

In the first part of this series, we noted that African tech start-ups often face the challenge of gathering quality data. However, there are few clear guidelines on how to gather quality data on the continent, as data practitioners are still in the early stages of establishing and adopting industry-wide data practices. After engaging with many start-ups that position themselves as data-driven, we have identified four principles that African start-ups can use as guidelines for establishing data quality: i) Accuracy, ii) Completeness, iii) Reliability and iv) Relevance.

1.   Accuracy

When information is inaccurate, it’s difficult to objectively perceive progress and make well-informed decisions. Erroneous data is very costly to deal with as it creates mistrust among internal decision-makers, product builders and even customers. It is therefore important to collect and disseminate data that reflects reality as accurately as possible.

One of the ways to establish data accuracy is to ensure that all data sources are verified. For a start-up, the most critical data source is the customer. It’s crucial to capture customer details correctly: addresses when delivery is involved, identification documents when money transfer is involved, and demographic details when it’s relevant to understand customer characteristics. In some sectors (FinTech, for example), the Know Your Customer (KYC) verification process is mandatory for regulatory compliance.

Data verification should happen at the data collection level so that errors are flagged and corrected immediately. For instance, a FinTech company that relies on an agent distribution network would get more accurate data when identification is verified at the point of data collection (i.e. by the agent) rather than later on by data analysts. Considering that humans are prone to errors, it’s also best to automate the data verification process where possible. This automation, however, needs to be reviewed periodically to ensure that it still provides the desired level of accuracy as time and context change.
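As a minimal sketch of verification at the point of collection, the Python snippet below checks a customer record the moment an agent submits it and returns a list of problems to correct on the spot. The field names, the 8-digit ID format and the 18-120 age range are hypothetical assumptions for illustration; real KYC rules depend on the regulator and the market.

```python
import re
from datetime import date
from typing import List, Optional

# Hypothetical rules: adjust to the ID formats and regulations of your market.
ID_PATTERN = re.compile(r"\d{8}")
REQUIRED_FIELDS = ("full_name", "national_id", "date_of_birth")


def validate_customer(record: dict, today: Optional[date] = None) -> List[str]:
    """Return a list of problems found in a customer record (empty if none)."""
    today = today or date.today()
    errors = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            errors.append(f"missing {field}")

    # The ID must match the assumed format exactly.
    national_id = str(record.get("national_id") or "").strip()
    if national_id and not ID_PATTERN.fullmatch(national_id):
        errors.append("national_id must be 8 digits")

    # The date of birth must be a real date and give a plausible adult age.
    dob = record.get("date_of_birth")
    if isinstance(dob, date):
        had_birthday = (today.month, today.day) >= (dob.month, dob.day)
        age = today.year - dob.year - (0 if had_birthday else 1)
        if not 18 <= age <= 120:
            errors.append("age outside 18-120")
    elif dob:
        errors.append("date_of_birth must be a date")

    return errors
```

An agent-facing app could call `validate_customer` before saving a record and refuse submission until the returned list is empty, so that corrections happen while the customer is still present.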

2.   Completeness

Data must account for all necessary processes within a business and all relevant use cases when building predictive models. For this reason, it’s crucial to create processes that ensure that data points are easily consolidated from different sources.

For most African start-ups, completeness is usually lacking when there are both offline and online channels for interacting with customers. Offline interactions are often only partially recorded, if not lost completely. It’s therefore important to encourage and incentivize the use of online channels from the outset. If customers have a firm preference for offline engagement, it’s crucial to implement processes that ensure offline transactions are recorded.
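One way to make offline activity visible is to load it into the same structure as online transactions and measure how complete each channel is. The sketch below assumes hypothetical transaction dictionaries with `customer_id`, `amount` and `timestamp` fields; it is an illustration, not a prescribed schema.

```python
# Hypothetical fields that every transaction is expected to carry.
REQUIRED = ("customer_id", "amount", "timestamp")


def consolidate(online, offline):
    """Merge both channels into one list, tagging each record with its channel."""
    merged = [dict(r, channel="online") for r in online]
    merged += [dict(r, channel="offline") for r in offline]
    return merged


def completeness_by_channel(records):
    """Share of records per channel that have every required field filled in."""
    totals, complete = {}, {}
    for r in records:
        channel = r["channel"]
        totals[channel] = totals.get(channel, 0) + 1
        if all(r.get(f) not in (None, "") for f in REQUIRED):
            complete[channel] = complete.get(channel, 0) + 1
    return {ch: complete.get(ch, 0) / totals[ch] for ch in totals}
```

Tracking this ratio over time shows whether the processes for recording offline transactions are actually closing the gap with online channels.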

Completeness is also a crucial principle when trying to minimize bias. For instance, when trying to build a credit scoring model, it would be important to have a comprehensive view of customer characteristics that determine creditworthiness. Personal biases could easily leak into the model when trying to determine groups and customer segments that are at risk of default, especially when working with smaller sample sizes. For example, an expert may view younger borrowers as riskier and therefore overweight the age variable in a credit model without considering other variables that could diminish the explanatory power of age. It is therefore very important to have diverse teams that can deliberate on assumptions and push for the consideration of comprehensive data points. 



3.   Reliability

Reliable data should give stable and consistent results. Three common reasons for inconsistency[1] are i) hardware faults, ii) software errors and iii) human errors.

Hardware faults

In the African context, hardware faults are often linked to unreliable access to electricity and the internet. Fast-growing start-ups in particular run into problems when data volumes increase rapidly, as the infrastructure used to collect, store and manage data comes under strain. For this reason, it’s important to prioritize cloud database management from reliable service providers. Cloud solutions allow you to back up data in a virtual environment, minimizing the disruption that hardware faults can cause.

Software errors

When building a product, it’s not unusual to come across bugs that cause systemic faults in software. To tackle this challenge, it’s important to constantly test, measure and monitor interactions in the system during the building phase. Where possible, include automated self-checks that flag data consistency issues. For instance, a self-check on a messaging feature can confirm that the number of incoming messages is consistent with the number of outgoing messages, so that any redundant messages are flagged.
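The message self-check described above could look like the following sketch. It assumes, hypothetically, that every message carries a unique ID and that each incoming message should produce exactly one outgoing message; the function and key names are illustrative.

```python
from collections import Counter


def check_message_consistency(incoming_ids, outgoing_ids):
    """Compare incoming and outgoing message IDs and report discrepancies."""
    incoming = Counter(incoming_ids)
    outgoing = Counter(outgoing_ids)
    return {
        # True when both sides handled the same number of messages.
        "counts_match": sum(incoming.values()) == sum(outgoing.values()),
        # IDs that appear more than once on the outgoing side.
        "redundant": sorted(i for i, n in outgoing.items() if n > 1),
        # IDs that were received but never sent on.
        "missing": sorted(set(incoming) - set(outgoing)),
    }
```

Comparing IDs as well as totals matters because the totals can match even when one message was sent twice and another was dropped.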

Human errors

As we’ve already established, humans can be quite unreliable even when they have the best intentions. Besides automation, human-related errors can also be minimized by creating sandbox environments where people can experiment and test with real data without affecting users. Clear guidelines and documentation on version control are also important in this context. Moreover, performance metrics (such as low error rates) can incentivize data engineers and analysts to test for errors and resolve any inconsistencies that are flagged.

4.   Relevance

The data that is collected and analyzed should be relevant to the success of the business. It’s therefore important to be clear on how you define product or business success and how you want to measure it.

Everyone within the business should also be aligned on why you chose specific metrics and the definition of the measures that have been chosen. This way, the data that is collected and disseminated can be interpreted similarly across the business, creating alignment when there is actionable data.
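A lightweight way to keep everyone aligned is to define each metric once, in code, alongside a plain-language description. The sketch below uses a hypothetical "repeat purchase rate" metric; the metric name, its definition and the shape of the `orders` input are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical shared glossary: one agreed definition per metric.
METRIC_DEFINITIONS = {
    "repeat_purchase_rate": (
        "Share of customers with at least one order "
        "who placed two or more orders."
    ),
}


def repeat_purchase_rate(orders):
    """Compute the metric from an iterable of (customer_id, order_id) pairs."""
    orders_per_customer = Counter(customer_id for customer_id, _ in orders)
    if not orders_per_customer:
        return 0.0
    repeaters = sum(1 for n in orders_per_customer.values() if n >= 2)
    return repeaters / len(orders_per_customer)
```

Because the description and the calculation live side by side, a dashboard, a report and an analyst's notebook can all reuse the same function instead of each re-deriving the number slightly differently.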

In our next piece, we will explore some best practices when trying to choose the right business metrics.

Summary

 

A start-up that is striving to work with quality data ought to:

  • Verify all data sources, preferably at the data collection level
  • Automate data verification processes where possible
  • Incentivize customers to engage with the business on online channels
  • Create processes that capture data during offline engagements with customers
  • Have diverse teams that can deliberate on assumptions and push for the consideration of comprehensive data points
  • Prioritize cloud-based data management
  • Test, measure and monitor interactions within a system during the building phase
  • Automate self-checks to flag any software errors
  • Create sandbox environments where people can experiment and test with real data without affecting users
  • Have clear guidelines and documentation on version control
  • Have clarity on measures of success across the business

[1] Martin Kleppmann, Designing Data-Intensive Applications



By Wanjiku Kimani

Venture Partner at The Baobab Network

