Our experience of tier 1 financial services organisations is that the evolution of how their infrastructure was designed and built has resulted in complex, tightly coupled architectures with significant duplication and data quality issues that cannot easily be changed. This has produced multiple business problems whose root cause is large amounts of disparate data spread across highly specialised systems, with no ability to view holistically across system, organisational or geographic boundaries. The situation is exacerbated by an uncertain regulatory and market environment subjecting these systems to enormous change, and by an increasing need for faster response times.
This problem can be summarised through the following data characteristics:
- Variety – many disparate data sources
- Volume – more granularity in a larger world
- Variability – changes in shape and format
- Velocity – data that needs to be moved quickly for performance
- Veracity – too much copying of data, which breaks the link with the golden source, and no resilience when in-memory processes are used
This requires a Toolset that can adapt to an environment with these characteristics.
Our Toolset, DataFusion, is designed to provide business views of highly federated data: a centralised query engine federates queries to all the underlying systems and joins the results in a highly performant and scalable way. The architecture allows the data to meet the requirements of the 5 V’s, and the data is harmonised into a common schema, creating a virtual data warehouse. The critical feature is that we use a data-query approach rather than a data-copy approach.
The Central Query Engine is essentially a lightweight Appliance on inexpensive hardware. This query engine allows a user to write simple SQL queries to access data across the disparate underlying data sources as though they were directly operating on the underlying system itself. Key features of this are:
- Ultra-high performance and scalability to billions of rows
- Proven – Leverages existing technology & techniques from astronomy & telecoms
- Cost & speed of deployment – Quickly deployed on inexpensive Intel servers & SAN storage, no significant tech infrastructure engagement
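To make the federation idea concrete, the sketch below shows the core mechanic in miniature (the names and shape are illustrative, not the DataFusion API): the central engine queries each underlying source independently, here simulated as in-memory SQLite databases, and joins the partial results centrally rather than copying data into a warehouse.

```python
# Minimal sketch of query federation (hypothetical names, not the
# actual DataFusion interface): each underlying system is queried
# independently and the results are joined by the central engine.
import sqlite3

def make_source(rows, schema):
    """Stand-in for an underlying system: an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute(schema)
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return db

trades = make_source([("T1", "ACME"), ("T2", "GLOBO")],
                     "CREATE TABLE t (trade_id TEXT, counterparty TEXT)")
positions = make_source([("ACME", 100), ("GLOBO", -50)],
                        "CREATE TABLE t (counterparty TEXT, qty INTEGER)")

def federated_join(left_db, right_db):
    """Query each source separately, then join centrally on counterparty."""
    left = left_db.execute(
        "SELECT trade_id, counterparty FROM t ORDER BY trade_id").fetchall()
    right = dict(right_db.execute("SELECT counterparty, qty FROM t").fetchall())
    return [(tid, cpty, right[cpty]) for tid, cpty in left if cpty in right]

print(federated_join(trades, positions))
# → [('T1', 'ACME', 100), ('T2', 'GLOBO', -50)]
```

In the real engine the per-source queries are pushed down through the plug-ins described below, but the join-without-copy principle is the same.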
The Federated Plug-ins on each of the underlying data sources essentially act as a broker between the central query engine and the underlying data source to expose the underlying data schema to the query engine and translate (or map) that data schema to a global schema owned centrally. Key features of the plug-ins are:
- Data access plug-ins allow the central query engine to federate queries and access the local data model. These can range from simple mapping layers through to more complex plug-ins that can adapt to accommodate weaknesses in the underlying data sources. This approach leverages the existing systems and corporate knowledge by ensuring that these plug-ins are owned by the local teams, since they know and understand their system and data model best and can manage the necessary complexities of accessing the local data.
- Workflow plug-ins allow data to be exposed to the workflow engine to provide data governance and de-couple the data from the business process (“take process to the data, rather than data to the process”)
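At its simplest, a data access plug-in is just the mapping layer described above. The sketch below (illustrative names, not the actual plug-in interface) shows a local system's idiosyncratic column names being exposed under the centrally owned global schema:

```python
# Hedged sketch of a data-access plug-in as a simple mapping layer.
# The plug-in translates a local data model, field by field, into the
# global schema owned centrally by the query engine.

class MappingPlugin:
    """Exposes local rows under the global schema."""
    def __init__(self, field_map, fetch_local):
        self.field_map = field_map      # global field -> local field
        self.fetch_local = fetch_local  # callable returning local row dicts

    def query(self):
        for row in self.fetch_local():
            yield {g: row[l] for g, l in self.field_map.items()}

# A local system with its own idiosyncratic column names:
local_rows = [{"TRD_REF": "T1", "CPTY_NM": "ACME", "NOTL_AMT": 1_000_000}]
plugin = MappingPlugin(
    {"trade_id": "TRD_REF", "counterparty": "CPTY_NM", "notional": "NOTL_AMT"},
    lambda: local_rows,
)
print(list(plugin.query()))
# → [{'trade_id': 'T1', 'counterparty': 'ACME', 'notional': 1000000}]
```

Because the mapping lives with the local team, a change to `TRD_REF` in the source system is absorbed in the plug-in without touching the central engine.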
The benefits of using this approach are:
Accountability – The critical feature of this approach is that owners of the data remain accountable for their data and cannot absolve themselves of this responsibility in the way they can when providing a feed to some centralised store. Whilst this may seem trivial, our years of experience running reporting functions in large organisations have taught us that this accountability is the most critical requirement for ensuring better data quality.
Transparency – the performance of the query engine allows significant volumes of data to be queried, which means that drill-down right down to the most granular trade or position level is possible, giving end users a fully transparent view of the original source data (“Sunshine is the best disinfectant”)
Realistic ambition – Many large organisations attempt to resolve their data problems by proposing a re-building of the entire systems architecture from scratch. This however is too difficult and not realistic. A more pragmatic approach would be to use current systems and find a way to show early progress towards the end goal. Showing these early “quick wins” will begin the momentum necessary to get the wider buy-in across a large organisation. This can be achieved using simple plug-ins on existing systems which are integrated via the central query engine which is significantly more achievable than a greenfield development.
Empowers local teams – a softer, but no less important, benefit is that this approach empowers local teams since they retain ownership of the data and local IT teams are empowered to continue to enhance their local systems.
Facilitates change – using a query-based approach rather than a copy-based one allows underlying data sources to continue to change (as they typically do, especially in highly regulated financial services environments) without requiring change to the central query engine. Individual plug-ins can be adapted rapidly to accommodate pipelines of work in underlying systems without large amounts of central co-ordination, planning and testing effort.
A posteriori – Most data warehouse programmes require a priori knowledge of all the possible questions to be answered, in order to decide on the data elements required to build the physical data model in the database. Our approach is a posteriori: the data elements do not all need to be established in advance, since the query engine has a “soft schema” — a virtual view of a logical data model that can cope with added data elements trivially. This allows a pragmatic approach to delivery in which a small number of data fields can be adopted in the initial schema and enhanced with other fields at a later date, without the need for expensive changes to a physical data model within a data warehouse. More importantly, the end users do not need to know in advance all of the possible questions they wish answered using this infrastructure; they can simply extend the infrastructure across new data sources or expand the coverage of existing data sources as their needs develop.
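The “soft schema” idea can be sketched in a few lines (illustrative, not the actual engine): the virtual schema is just a growing list of field names, so adding a data element later is trivial and never requires a physical data-model change.

```python
# Sketch of a "soft schema": a virtual view over source records that
# tolerates late-added fields instead of requiring an up-front model.

class SoftSchema:
    def __init__(self, fields):
        self.fields = list(fields)

    def add_field(self, name):
        """Adding a data element later is a trivial, cheap operation."""
        if name not in self.fields:
            self.fields.append(name)

    def project(self, record):
        """View any source record through the current schema; fields a
        source does not yet supply come back as None, not as errors."""
        return {f: record.get(f) for f in self.fields}

schema = SoftSchema(["trade_id", "notional"])
print(schema.project({"trade_id": "T1", "notional": 5}))
# → {'trade_id': 'T1', 'notional': 5}

schema.add_field("currency")   # later extension, no migration needed
print(schema.project({"trade_id": "T2", "notional": 7, "currency": "GBP"}))
# → {'trade_id': 'T2', 'notional': 7, 'currency': 'GBP'}
```

Sources that have not yet mapped the new field simply surface it as empty, so existing queries keep working while coverage grows.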
Cost – The central query engine is designed to run on vanilla hardware and uses highly compressed data which allows it to have performance superior to comparative Appliances for a fraction of the cost.
Facilitates de-duplication – initially our approach adopts the default view of using existing systems, however we recognise that in many large organisations there are indeed many instances of systems and data duplication which creates a significant opportunity for cost savings and streamlining of ongoing operational and IT support. Using our query engine it is possible to query across duplicate systems or data sources and perform the necessary analysis required to identify differences, overlaps and gaps which is an important (and significant) element of the effort required prior to de-duplication. It can also provide a clear representation of the logical data model for the data superset which is required to be stored in the target system.
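The pre-de-duplication analysis described above reduces, at its core, to comparing the key sets of the candidate duplicate systems. A minimal sketch with hypothetical data:

```python
# Sketch of the differences/overlaps/gaps analysis that precedes
# de-duplication (hypothetical data): compare two candidate duplicate
# systems queried through the central engine.

system_a = {"T1": 100, "T2": 200, "T3": 300}   # trade_id -> notional
system_b = {"T2": 200, "T3": 999, "T4": 400}

overlap = system_a.keys() & system_b.keys()    # records in both systems
only_a  = system_a.keys() - system_b.keys()    # gaps in system B
only_b  = system_b.keys() - system_a.keys()    # gaps in system A
differs = {k for k in overlap if system_a[k] != system_b[k]}  # breaks

print(sorted(overlap), sorted(only_a), sorted(only_b), sorted(differs))
# → ['T2', 'T3'] ['T1'] ['T4'] ['T3']
```

The union of the fields and records found this way is effectively the logical data model of the superset that a consolidated target system would need to hold.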
Environment aware – when accessing data across multiple systems, realistically not all data sources or systems will be of the same quality. For this reason the plug-ins can adapt to make up for weaknesses in these systems; for example, they can make fragile systems appear faster by using a resilient caching layer which asynchronously updates against the underlying system.
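The caching behaviour can be sketched as follows (illustrative names, not the actual plug-in code): reads are always served instantly from the cache, while the slow call to the fragile source happens on a background thread, off the query path.

```python
# Sketch of a resilient caching layer in a plug-in: queries hit the
# cache; the fragile underlying system is refreshed asynchronously.
import threading

class CachingPlugin:
    def __init__(self, fetch_slow):
        self._fetch_slow = fetch_slow
        self._lock = threading.Lock()
        self._cache = fetch_slow()       # one initial synchronous load

    def query(self):
        with self._lock:                 # fast, never touches the source
            return dict(self._cache)

    def refresh_async(self):
        def _refresh():
            fresh = self._fetch_slow()   # slow call, off the query path
            with self._lock:
                self._cache = fresh
        t = threading.Thread(target=_refresh, daemon=True)
        t.start()
        return t

state = {"T1": 100}                      # stand-in for a fragile system
plugin = CachingPlugin(lambda: dict(state))
state["T1"] = 150                        # underlying system moves on
print(plugin.query())                    # stale but instant: {'T1': 100}
plugin.refresh_async().join()            # background update completes
print(plugin.query())                    # now current: {'T1': 150}
```

The trade-off, of course, is bounded staleness between refreshes, which is acceptable precisely for the weak sources this pattern targets.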
Security – the default position, for the sake of simplicity, is for the plug-ins to inherit the security privileges of the underlying systems; this allows the architecture to obey the necessary audit and compliance rules, which can be onerous in large, heavily regulated financial services organisations. In addition, it is possible to add further security where required, allowing data views to be tailored to specific individuals. A further feature is the ability to use the plug-ins to anonymise data on-the-fly, such that queries on data sources subject to strict data privacy laws remain in compliance with those laws. This approach cannot be achieved through a centralised data copy.
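On-the-fly anonymisation in a plug-in can be as simple as the sketch below (field names, salt handling and token length are all illustrative assumptions): fields covered by local privacy law are replaced with a salted one-way hash before the results ever leave the source jurisdiction.

```python
# Sketch of on-the-fly anonymisation in a plug-in: privacy-sensitive
# fields are replaced by a salted one-way hash; other fields pass
# through untouched, so joins and aggregates still work centrally.
import hashlib

PRIVATE_FIELDS = {"client_name", "account_id"}   # assumed policy list
SALT = b"per-deployment-secret"                  # assumed local secret

def anonymise(record):
    out = {}
    for field, value in record.items():
        if field in PRIVATE_FIELDS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            out[field] = digest[:12]   # stable token, not reversible
        else:
            out[field] = value
    return out

row = {"client_name": "J. Smith", "account_id": "AC-99", "notional": 5_000}
anon = anonymise(row)
print(anon["notional"], anon["client_name"] != "J. Smith")
```

Because the hashing is deterministic per deployment, the same client yields the same token across queries, so anonymised records can still be joined and counted without exposing the underlying identity.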