Fifty shades of plaid (part 1)

It’s a good sign when you become the Something for X. It enshrines you as a business model that has changed the market, one whose learnings can be applied to other niches. There’s Uber for X (e.g. Rover, for dogs) and AirBnB for X (e.g. HipCamp, for camping), and, particularly since Plaid’s proposed acquisition by Visa, there has been a growing trend of talking about the ‘Plaid for X’.

But, as has also been the subject of some of my other blogs (Banking-on-a-Headache, Rejecting the Gold Rush of Pick and Shovel Analogies), there is often more nuance and complexity to fintech business models than first meets the eye, and it’s important to dig into the different ‘shades’ to understand where the real opportunities are.

Nowhere is this more the case than in the fintech data API space.

With regulatory tailwinds, technology advances, and growing consumer interest in being in control of one’s own data, there’s been an explosion of solutions that open up, connect and enrich previously disparate data sources. We’re seeing start-ups tackle payroll data, real-estate data, climate data, accounting data… to name just a few.

The problems these companies are addressing, however, are often quite different. It’s easy to bundle them into the generic category of ‘data API platform’, but actually they’re solving different data challenges depending on the industry. This is an attempt to split out the different problems being addressed, dive into more detail in each area, and explore where ‘Plaid-like’ (or in the case of our own portfolio, ‘Truelayer-like’) dynamics are likely to develop.

I’ve split the broader space into four distinct groups: data access, aggregation, insight and attestation.

The focus of this first blog is data access, which acts as the foundation for the other categories. Although aggregation, insight and attestation may appear to be higher-value activities, without reliable access to the underlying data their value and utility is negated.

DATA ACCESS

I remember when I was working in consulting back in 2010, the hot topic of the time was ‘Single Customer View’ (SCV). Companies had huge swathes of data sitting in various spreadsheets and databases, but the challenge was joining it up and using it effectively. “Data is the new oil… let’s mine it, refine it and use it as a defensive moat…”

A decade later, with the ongoing rise of goliaths such as Salesforce and newer unicorns such as Segment, great strides have been made. For many companies, maximising the value of customer data is of the utmost priority – personalisation, upsell, cross-sell… they all depend on a broad and deep view of the customer. 

But bigger forces have also been at play. Data has also become a liability, with regulatory interventions (e.g. GDPR) helping customers take more control of their personal information. We’ve entered a new era where consumers and businesses alike are demanding access to their data whenever and however they want. The days of customer data being the ‘possession’ of those who collect it are numbered, and supporting this change has been a rise in ‘data access platforms’ that have helped unlock previously inaccessible data silos.

Bank transaction data is one of the most well-known and mature examples within financial services, but there are emerging opportunities in other data areas too. Argyle and Finch, for instance, are trying to build the ‘pipes’ for employment data, helping fintechs and consumers access payroll and HR data currently ‘locked’ in employer ERP systems. Similarly, businesses such as Codat and Railz are opening up access to accounting data. Unlocking this data enables lenders and other third parties to utilise and verify critical business information that would otherwise need to be sent manually. Uncapped, one of our portfolio companies, for example, uses accounting data to understand the finances of the SMEs it lends to, helping it make lending decisions within 24 hours and take repayments as a share of revenue.

The extent to which these platforms are able to succeed and scale depends on a multitude of factors, including their coverage, accuracy and reliability, as well as the incentives of the underlying data source. But critical to a lot of this is the data collection technology used. As was the case with bank transaction data, these new data platforms are often using screen scraping and API reverse engineering on their journey to (hopefully) getting direct API connections (aka the ‘Plaid’ playbook). But as was also the case with bank data, there are some significant impediments and challenges along the way which can impact the uptake and usability of such solutions. It arguably took over a decade for consumers to start trusting bank data connections en masse as the technology went from ‘hacky workaround’ to core data rail.

So, will these new platforms face similar challenges, or will the journey be a smoother one…?

Screen scraping

Prior to PSD2 in Europe and the direct data connections established by the likes of Plaid in the US, screen scraping was the dominant approach to accessing bank transaction data. Yodlee, for instance, was founded more than 20 years ago and helped power early consumer propositions such as Mint.com in the US (and also Pariti, the startup I founded in 2014). For those less familiar with scraping, it’s similar to what Google uses to crawl website content and, in the case of bank data, involves logging in to a user’s online banking website on their behalf to ‘scrape’ their latest balance and transactions. It is technology that can be applied to most financial accounts at most banks (assuming they have an online banking portal). It also doesn’t require agreements with the banks themselves – as long as users provide their login credentials, the platforms can simply scrape the banking sites on their behalf.
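
To make this concrete, below is a minimal sketch of what a bank scraper does, written in Python with the requests and BeautifulSoup libraries. The login URL, form fields and CSS selector are all hypothetical – a real banking portal would add session handling, CSRF tokens, MFA challenges and anti-bot defences.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical banking portal endpoints - a real integration would also
    # need to handle CSRF tokens, MFA challenges and anti-bot measures.
    LOGIN_URL = "https://onlinebanking.example.com/login"
    ACCOUNTS_URL = "https://onlinebanking.example.com/accounts"

    def scrape_balance(username: str, password: str) -> str:
        session = requests.Session()
        # Log in on the user's behalf, using the credentials they shared
        session.post(LOGIN_URL, data={"username": username, "password": password})
        # Fetch the accounts page and parse the rendered HTML
        html = session.get(ACCOUNTS_URL).text
        soup = BeautifulSoup(html, "html.parser")
        # Brittle by design: if the bank renames this element, scraping breaks
        return soup.select_one(".account-balance").get_text(strip=True)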

This lack of ‘needing permission’ is a critical point – looping back to my earlier point about companies seeing data as a competitive moat, historically the incentives for banks to share data were low. By having ‘exclusive’ access to the spending behaviour of a customer, they were in a stronger position than an external finance provider to offer them other products. Bank customers were captive to their bank because the bank had a data advantage. But screen scraping technology has helped change this – whether the banks liked it or not, consumers are now able to ‘unlock’ their data and share it with third parties.

As a pioneer in the space, Yodlee grew rapidly and reached almost 50 million end users prior to its listing on NASDAQ in 2014. But as much as the scraping approach can help companies scale quickly, it can also be an Achilles heel. As any developer who has utilised screen scraping can attest, the technology has some significant challenges and hurdles.

The first is data consistency. If a website changes, or if a marketing banner pops up, data often cannot be collected, or is collected inaccurately. Similarly, because screen scraping involves ‘capturing’ data at a single point in time, data points that adjust over time are difficult to accurately update and reconcile. For instance, as a bank transaction settles, both the name and the amount (if a foreign transaction) can change from the initial pending information.

The second challenge is data timeliness. Scraping solutions often have a ‘cat and mouse’ problem with websites. If they scrape too frequently, they can degrade the site’s performance and indirectly add to its costs, so to avoid being blocked they spread out the times at which they scrape each site. For the end user of the data, this means there is a window of time during which the data may not be accurate.
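
As an illustration of that trade-off, a scheduler might randomise each site’s scrape window along these lines (a sketch, with the interval lengths picked arbitrarily):

    import random

    def next_scrape_delay(base_hours: float = 24.0, jitter_hours: float = 6.0) -> float:
        """Seconds to wait before the next scrape of a given site."""
        # Spreading scrapes across a window avoids hammering any one site
        # and avoids a predictable pattern that is easy to block - at the
        # cost of data that can be up to a day or more stale
        return (base_hours + random.uniform(-jitter_hours, jitter_hours)) * 3600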

Both of these issues present significant technical challenges for developers building on top of screen-scraped data. At Pariti, for instance, we had to develop logic to remove duplicate transactions where the scrapers had caused an error, to reconcile transactions as they moved from pending to settled, and to handle downtime errors when the bank’s website couldn’t be reached at all. Added to that, regulators have also started to clamp down on scraping practices in certain markets. PSD2, for instance, has all but removed the use of screen scraping for bank transaction data in Europe in favour of regulated open API infrastructure.
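
To give a flavour of that logic (an illustrative sketch, not Pariti’s actual code – the field names are assumptions), reconciliation might look like this:

    from dataclasses import dataclass

    @dataclass
    class Txn:
        txn_id: str    # bank-side identifier, where the site exposes one;
                       # without it, matching needs date/amount heuristics
        name: str
        amount: float
        pending: bool

    def reconcile(stored: list[Txn], scraped: list[Txn]) -> list[Txn]:
        """Merge a fresh scrape into already-stored transactions."""
        by_id = {t.txn_id: t for t in stored}
        for txn in scraped:
            existing = by_id.get(txn.txn_id)
            if existing is None:
                by_id[txn.txn_id] = txn    # genuinely new transaction
            elif existing.pending and not txn.pending:
                by_id[txn.txn_id] = txn    # pending -> settled: name/amount may change
            # else: a duplicate of what we already hold - drop it
        return list(by_id.values())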

So, does this mean that all solutions based on screen scraping are doomed to fail?  Not at all…

Even for bank transaction data, screen scraping persists and continues to be utilised for financial accounts where APIs don’t exist (including by Plaid). And beyond bank data, there are many use cases where screen scraping remains a low-cost, easy way to access data trapped in silos. We have a jobs board on the Mouro Capital website, for instance, which scrapes our portfolio companies’ jobs pages to show available roles: https://talent.mourocapital.com. For this use case, it works well.

But by and large, the use cases most appropriate for scraped data are those where:

  • The incentives of the data source are to restrict the opening up of data (making direct API connections unlikely)

  • The frequency of new data points is low

  • Data points remain fairly static once posted

  • Sensitive / complex login credentials are not required for access

  • The data is easily accessible via the web

Reverse engineering APIs

A second, more reliable method of accessing data is to utilise a company’s private mobile APIs. These are not intended for public or third-party use, but with the right reverse engineering and login credentials, they can be used to indirectly access data in a more structured format.
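
In sketch form, a reverse-engineered integration replays the same JSON calls the mobile app makes, using the user’s credentials and whatever headers the app normally sends. Every endpoint, header and field below is hypothetical, discovered only by inspecting app traffic, and liable to change without notice.

    import requests

    # Hypothetical private mobile API, reverse engineered from app traffic
    BASE = "https://mobile-api.example-bank.com/v2"
    APP_HEADERS = {"User-Agent": "ExampleBank/5.4.1 (iOS)", "X-App-Version": "5.4.1"}

    def fetch_transactions(username: str, password: str) -> list:
        # Authenticate exactly as the app itself would
        resp = requests.post(f"{BASE}/auth/login", headers=APP_HEADERS,
                             json={"username": username, "password": password})
        resp.raise_for_status()
        token = resp.json()["access_token"]
        # Structured JSON comes back - no HTML parsing required
        txns = requests.get(f"{BASE}/transactions",
                            headers={**APP_HEADERS, "Authorization": f"Bearer {token}"})
        txns.raise_for_status()
        return txns.json()["transactions"]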

Similar to scraping, such solutions help unlock data silos where the incentives of the underlying data sources are not to share the data easily. But they also require a lot of trust from end users, as they involve providing sensitive login details, and, as you may imagine, companies are increasingly seeking ways to prevent such unauthorised access to their APIs. Regardless, such approaches are widespread. A Software Engineer job posting at Argyle, for instance, lists familiarity with “Android/iOS device verification frameworks (SafetyNet Attestation / DeviceCheck) and ways to bypass them” as a “big plus”. No guessing what approach they use…

Prior to PSD2 and the availability of open banking APIs, this was the primary approach deployed by platforms such as Truelayer in Europe. It is a major step up from scraping – it takes less time to collect the data, data consistency issues are reduced, and it theoretically enables more frequent updates of data – but there are still limitations.

The use cases most appropriate for reverse-engineered API data are therefore those:

  • Which don’t require real-time data or notifications (batch updates a few times per day are ok)

  • Where the data is accessible via a mobile application (and therefore the APIs exist)

  • Where customers are willing and able to share their login credentials

Direct API access

The final and most reliable option is to build directly against a company’s dedicated public API. With robust access to structured data directly from the source, a richer, more real-time experience can be created for customers. In markets such as Europe, where there have been regulatory interventions, the availability of such open APIs for financial data has grown significantly, and increasingly companies are embracing – or are at least being forced to embrace – a more interoperable, open and connected future (I’ll come onto the aggregation opportunity this presents in part 2 of this blog).
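
The contrast with the previous two approaches is clear in sketch form (endpoints again hypothetical, in the style of an open banking API): the user consents via an OAuth redirect, so their bank credentials never pass through the third party at all.

    import requests

    # Hypothetical open-banking-style API endpoints
    TOKEN_URL = "https://api.example-bank.com/oauth/token"
    ACCOUNTS_URL = "https://api.example-bank.com/open-banking/v3/accounts"

    def fetch_accounts(client_id: str, client_secret: str, auth_code: str) -> dict:
        # Exchange the user's consent code for an access token scoped to them
        token = requests.post(TOKEN_URL, data={
            "grant_type": "authorization_code",
            "code": auth_code,
            "client_id": client_id,
            "client_secret": client_secret,
        }).json()["access_token"]
        # A stable, documented contract - no scraping or reverse engineering
        resp = requests.get(ACCOUNTS_URL, headers={"Authorization": f"Bearer {token}"})
        resp.raise_for_status()
        return resp.json()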

But this is not the case in all markets, and getting access to these APIs often involves direct agreements with individual data sources, and sometimes even building out the APIs for them. This is where Plaid ultimately succeeded in the US. They moved from screen scraping and reverse-engineered APIs, when the banks were initially resistant, to eventual direct integrations and agreements with some of the largest financial institutions in the world. Such access is highly defensible and has, in essence, enabled them to create a whole new ‘data rail’ between themselves and the banks, but it is a long and difficult path to get to this point.

And so, what does this all mean for the new data access platforms being born in other verticals?  My main takeaways would be as follows: 

  1. Those platforms that are able to get direct API integration, at scale, will likely be able to build the most reliable and defensible solutions. But this is no easy task, and sometimes the incentive for the underlying providers of data is to not share it! Banks went on a long educational (if not mandated) journey to opening up their data, but this need not be so difficult for all industries. By opening up HR and payroll data, for instance, third parties can operationally help employers by reducing their costs (e.g. automated employment references, salary checks, etc.). In this example, the data is not being withheld by employers; rather, there is no easy technical solution for opening it up.

  2. Screen scraping and reverse-engineered API access can work for some use cases and can be a good first step to prove out a market. But there are challenges if reliable, rich experiences are required, and high levels of end-user trust are needed if the solution involves sensitive customer login details.

  3. Coverage is key. Only having a handful of data sources accessible can significantly impact the overall utility of a platform. The future of ‘data access’ solutions in sectors such as payroll, HR, pensions, insurance, etc is therefore closely twinned with the provision of mobile apps and uptake of cloud/SaaS platforms in those areas. The data needs to be online for it to be accessible.

  4. The scale of the opportunity for new data access platforms depends on the breadth and value of use cases that can be built on top of them. The utility of bank transaction data is extremely wide (from proving affordability to understanding spending behaviour), and is of interest to a lot of third parties. But not all data platforms present the same scale of opportunity for ‘access’ activities. In these smaller areas, the extent to which providers can offer higher-value ‘insight’ and ‘attestation’ services will ultimately drive the size of their respective opportunities (a topic I’ll come on to in Part 2).

I’m excited about a world of open, connected data and the innovation this can bring. Open banking regulation and the likes of Plaid and Truelayer have been a catalyst for financial services organisations to think differently about data, but there is still a long way to go, both in banking and in other sectors. I’m bullish about the future of the new data platforms and use cases outside of transaction data, but the path they will need to take is very much dependent on their own sector-specific data challenges. A one-size-fits-all ‘Plaid playbook’ is unlikely to work universally and, as with bank data, there will inevitably be many casualties along the way.

In Part 2 of this series, I’ll dig into the other areas of the ‘data API platform’ landscape: aggregation, insight and attestation. If the story of data access has been one built over the last ten years, many of the use cases in these other categories (and the technologies underpinning them) are only just getting started, so it’s a chance to speculate a bit more about what is yet to come!

© 2022 Matthew Ford