How data synchronisation solutions architecture can deliver information to your audience faster, more securely, and more robustly
In this article we cover how Info Rhino adapted our own software and third-party software to automate content and data publication to our website, delivering it automatically to our audiences. Whilst we use our proprietary software to achieve many of these tasks, the important point to understand is that we work by thinking in terms of responsibilities to meet the needs of each individual client. So whilst we may offer this specific solution for your needs, we can equally come up with the right solution for your organisation.
If you want to find out more about this service, feel free to contact us here.
If you are a data evangelist at an enterprise wishing to deliver more to your customers, get in touch.
The problem statement
How do we consistently consume data from multiple external sources, including APIs, transform it, and deliver it online to a wide audience quickly - at low cost?
Defining our requirements
We want to collect multiple data feeds from API providers and publish them both to online data stores and to a Data Mart. Our focus is to minimise licensing costs by using software hosted by our website hosting provider. We can reduce costs because our web hosting provider allocates ample storage and a SQL Server database of up to two gigabytes. By focusing on promises, whereby a process or service performs a lightweight action on data, we can make our promises interchangeable if required.
Looking around for inspiration
The obvious thing to do in technology is to avoid reinventing the wheel. Most companies are not going to tell us how they rapidly deliver data from disparate sources onto a web platform for their audience. What we know from our experience of consulting with enterprises is that there can be a large technology team performing many small roles to manage the data within the enterprise. As an experiment, take any company in the FTSE or NYSE: we would find the information made available to their audience on their websites to be minimal. There are many reasons for this, some being licensing- and compliance-related; mainly, enterprises use enterprise-level architecture that requires highly specialised skills and isn't easily replicable.
We looked to more modern industries, such as cryptocurrency exchanges and providers of cryptocurrency data. We tried to envision how these organisations would build their data architecture so they could operate in a lean manner with a lower headcount, yet still offer a great service to their customers. An obvious starting point is the scale-out approach, where a set of processes is defined and can be templated for deployment to a new host or area.
When deciding to build out your data architecture, always look to design a promise-based architecture, so that processes are interchangeable and run as cheaply as possible until scaling up is required. A major consideration when deciding upon your data architecture is the hosting. Sometimes being a contrarian can be a massive benefit. We do use the cloud for many functions, but here we went the other way: Windows hosting and desktop/server is more than sufficient to get our technology off the ground.
Content management systems allow website designers and content managers to maintain information fairly easily on small to medium-sized websites. A bigger challenge with website content management is the need to move data between different media. For example, most content starts out in Word documents, OneNote, or other text editors before being taken into CMS editor screens and configuration files, eventually appearing on the website.
Content is not just text but images, data, key files, and different document formats for different audiences.
The Internet is becoming more automated. It is not only bots; AI language processing models are also being run against online content. Normally this data is structured, but a key element of data is timeliness.
Thinking about the Data Synchronisation process
We took a good look at the many technologies that exist for storing data. There are many different cloud storage providers; there is blockchain-based data storage; and there are APIs within cloud providers that can capture data to data lakes and container storage, to name but a few. The challenge with each option is the possibility that future versions of these technologies either won't exist or will change, meaning there will be a need to upgrade code.
Do we need a complex event processing architecture?
One of the very powerful features of complex event processing is the ability to continually bring in new content and data and apply decision-making logic to it in real time or near real time. The advantage of a complex event processing architecture is the ability to minimise gaps in data points. For our use case, and for many use cases of both large enterprises and smaller organisations, regular publication of information is far more powerful for audiences. Rather than trying to pack everything into streams using workflows and complex programming, the data warehouse/data lake approach, where we have wide and varied dimensional and fact-based data available to multiple consumers, offers far more value.
When we looked at our requirements for publishing cryptocurrency data and analytics to our website platform, the answer was relatively straightforward and cost-effective.
High on the list of things to avoid is writing proprietary code that directly interfaces with third party cloud storage providers unless we need to.
Data risk assessment and Data Governance
There is a very real risk that, as legislation tightens, automation coupled with AI will lead to many more false positives when it comes to posting information on centralised data stores. Honest actors can find themselves facing significant challenges when working with cloud storage providers. There are many excellent blockchain-based providers, but to keep things a lot simpler, why not have synchronisation between your content and your website?
How we have solved our needs for data synchronisation and how you can benefit by working with our technologies
We cannot eliminate manual processes altogether, and neither should we. Developing front ends for content management systems is an expensive process at every point, and often over the top. Most of the time we are just taking information from one format and putting it into another system.
Our systems have been improving. We look at structured data and, where possible, bring it into reporting solutions whereby users can access that information through dashboards and other front-end solutions. We think more in terms of whether the data can be automatically brought into a website.
Ad hoc and Scheduled
We support periodic and scheduled onboarding of data: the ability to discover data on a time schedule, and to be notified of changes.
Once data is known of, it should be made available in the right format to the right audience. Our Web Data Platform has a report manager and other data-aware solutions within it that know how and where to present this information.
Requirements gathering and solution architecture process
We tend to focus on what is known as promise theory; in a way, we see "Jobs To Be Done" as complementary to promise theory. Rather than focusing on building more technology into the Web Data Platform, we thought about the primary need.
- Situation - we have data on our systems
- Motivation - our audience can benefit from our information
- Expected outcome - more users will visit our website and consume our services
Defining possible requirements
Once we know these three basic elements, we realise the technology is secondary to the requirement. Rather than reaching for all the modern technologies available, the requirement is quite straightforward, although still not necessarily the simplest to achieve.
- Data is collected from different systems continuously and periodically
- Data is additionally processed and stored after it is collected
- Our audiences will want to see and consume this information in a variety of formats
- We don't want to put too much extra technology into the Web Data Platform
- Rather than building more into each individual process, we may want to add new responsibilities that are independent
Thinking of responsibilities
- Data collection
- Data processing
- Data publication
- Data onboarding
- Data presentation
We now start to recognise that we may have categories of information that can be grouped into domains. This can help us simplify our thinking about common types of responsibilities. We may be in a position to create templates of responsibilities, to automate and parameterise the creation of many of these artefacts and responsibilities.
The importance of responsibilities
Responsibilities are discrete actions that are almost entirely independent of other responsibilities. They are stateless in that they accept or detect an input, perform a process, and produce an output. In the example of data presentation, it has no need to understand how data onboarding occurred, and it certainly does not need to understand how data publication occurred. We think of this as allowing for interchangeability of responsibilities - effectively, Jobs To Be Done. A great analogy: we don't need to know how the grass was cut, we just appreciate that the lawn is tidy.
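As an illustration, here is a minimal sketch of what a stateless responsibility contract could look like in C#. The interface and the example class are hypothetical, not our actual implementation:

```csharp
using System.IO;
using System.Linq;

// A hypothetical contract for a stateless responsibility: accept or
// detect an input, perform a process, produce an output.
public interface IResponsibility<TInput, TOutput>
{
    TOutput Execute(TInput input);
}

// Example: a presentation responsibility that renders a CSV file as an
// HTML table. It needs no knowledge of how the file was onboarded or
// published - only that it exists.
public sealed class CsvToHtmlPresenter : IResponsibility<string, string>
{
    public string Execute(string csvPath)
    {
        var rows = File.ReadAllLines(csvPath)
            .Select(r => $"<tr><td>{string.Join("</td><td>", r.Split(','))}</td></tr>");
        return $"<table>{string.Concat(rows)}</table>";
    }
}
```

Because each responsibility shares the same shape, one implementation can be swapped for another without the rest of the pipeline noticing.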
We now know the processes required to meet our data synchronisation needs;
- Scheduled processing
- Completion notification
- Data Consumption
- Data Processing
- Data Delivery
- Data Presentation
Process Tasks Breakdown
We won't list all of these but would rather give a couple of examples;
- Detection - looking for new information in a folder (a sketch follows below).
- Data Delivery - synchronising information between point A and point B.
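To make Detection concrete, here is a minimal sketch using .NET's FileSystemWatcher; the folder path and file filter are placeholder assumptions:

```csharp
using System;
using System.IO;

// Watch an inbound folder and hand each newly created file to the
// next responsibility in the chain (here we just log it).
class DetectionExample
{
    static void Main()
    {
        using var watcher = new FileSystemWatcher(@"C:\Data\Inbound", "*.csv");
        watcher.Created += (sender, e) =>
            Console.WriteLine($"New file detected: {e.FullPath}");
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching for new files. Press Enter to stop.");
        Console.ReadLine();
    }
}
```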
Risks appraisal, Cost Benefit Analysis, Business Continuity
You can see how we have not looked at costs until we have an understanding of the main responsibilities, processes, and tasks that we need to achieve. This is for a very important reason: we don't want to settle on an implementation before understanding our risk appetite. For a specific implementation, we know there are things we absolutely need and things we can live without. For example, we know our web host gives us an abundance of cheap storage space, so we don't need cloud-based infrastructure at the moment; we can run most of our processes from a desktop or a VM in the cloud, with potentially Docker/Kubernetes cloud hosting down the line. We always think in terms of the customer service level we wish to offer our audience, and what our competitors do too.
We look at the medium within which the process operates. For example, we understand that Data Delivery moves information from a file system on a PC to a web server. It should be fairly straightforward to see that the FTP protocol is probably the best way to achieve this. If we can find software to synchronise this information, it may meet our needs and lower the development time needed to write bespoke code to talk to an API.
Focusing on low code to no code, convention over configuration, configuration over coding, coding only in the problem domain
Each application we build focuses on using intelligent configuration to process information dynamically. The advantage of this is that it keeps coding within the problem domain rather than spread across every requirement. A great example is the data mart. It is acceptable to build database code to transform data and deliver reports when we only have a small number of interfaces. Once this becomes burdensome or complicated, we find techniques to generate much of the code, speeding up development. Companies with their own databases may already have a data warehouse in place, and with a few simple steps we could incorporate our software to deliver data to their data warehouse and to publish it. We can change some application code if the time taken is fairly minimal. Always think in terms of removing repetitive tasks and automating.
The technologies behind our solution
We will list our applications with a brief description of each one. The important thing is to use our strengths where necessary - for example, C#, dotnet, Business intelligence, automation, parallel and asynchronous execution.
The most important characteristic of the software we have built is that it is typically "one and done". We expect each application to perform one to a few tasks, and once it is running, we don't expect to make major changes to it.
We set up definitions of jobs and processes that can be run to perform one or more tasks. We keep this lightweight, and the technology can detect processes, making this information easier to maintain.
Executor Processor - Batch Process Publication
Job store information is translated into batches of processes for execution. These processes are typically applications or batch files that perform a specific responsibility.
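To make this concrete, here is a simplified sketch of that translation step. JobDefinition is a hypothetical shape; the real job store holds richer metadata:

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// A hypothetical job store entry: which executable or batch file to
// run, and with which arguments.
public record JobDefinition(string ExecutablePath, string Arguments);

public static class BatchRunner
{
    public static void RunBatch(IEnumerable<JobDefinition> jobs)
    {
        foreach (var job in jobs)
        {
            // Each entry is an application or batch file that performs
            // one specific responsibility; run it and wait for completion.
            using var process = Process.Start(new ProcessStartInfo
            {
                FileName = job.ExecutablePath,
                Arguments = job.Arguments,
                UseShellExecute = false
            });
            process?.WaitForExit();
        }
    }
}
```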
Processor application - Execution
This is a lightweight application that runs other applications within it. The Processor can either run to completion on a schedule, or run based upon detecting a file, which can itself happen zero or more times. One benefit is that this software can react to events to execute processes.
WinSCP FTP application (Open Source)
Script automation capabilities exist within WinSCP. Whilst we have written FTP solutions in dotnet, we want to avoid reinventing the wheel. This solution is perfectly capable of synchronising information between a client and an FTP server, and can back up information.
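WinSCP also ships a .NET assembly, so the same synchronisation can be driven from C#. A minimal sketch, assuming placeholder host details, credentials, and paths:

```csharp
using WinSCP; // WinSCP's .NET assembly (the WinSCPnet package)

class SyncExample
{
    static void Main()
    {
        var options = new SessionOptions
        {
            Protocol = Protocol.Ftp,
            HostName = "ftp.example.com", // placeholder host
            UserName = "user",
            Password = "password"
        };

        using var session = new Session();
        session.Open(options);

        // Push local changes up to the web server; 'false' keeps remote
        // files that no longer exist locally.
        var result = session.SynchronizeDirectories(
            SynchronizationMode.Remote,
            @"C:\Data\Publish", "/public_html/data", false);
        result.Check(); // throws if any transfer failed
    }
}
```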
DOS Command Prompt capabilities
Whilst we have many input/output code processes within the technology, we always seek to avoid reinventing the wheel where possible. In some circumstances a simple batch file with XCopy or RoboCopy can be preferable to writing an extra class or library feature in .NET or Java.
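For instance, a two-line batch file can mirror a folder with RoboCopy, with no bespoke code required; the paths below are placeholders:

```bat
:: Mirror the publish folder to a backup location; retry twice on
:: failure, wait 5 seconds between retries, append to a log file.
robocopy "C:\Data\Publish" "D:\Backup\Publish" /MIR /R:2 /W:5 /LOG+:C:\Logs\sync.log
```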
Reporting of artefacts
To reduce complexity, we have processes within our DevOps software to detect files in and below a folder and bring them into centralised locations, where we can see what processes and artefacts we have - for example, the locations of log files or batch files. We could take this information and automate further tasks, for example housekeeping.
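A minimal sketch of that detect-and-centralise step, assuming hypothetical folder locations and file patterns (ours live inside the DevOps software):

```csharp
using System.IO;

// Walk a root folder and write every log and batch file into one
// central inventory file, with last-modified timestamps.
class ArtefactInventory
{
    static void Main()
    {
        var root = @"C:\Processes"; // placeholder root folder
        using var inventory = new StreamWriter(@"C:\Central\artefacts.csv");
        inventory.WriteLine("path,lastWriteUtc");

        foreach (var pattern in new[] { "*.log", "*.bat" })
        {
            foreach (var file in Directory.EnumerateFiles(
                root, pattern, SearchOption.AllDirectories))
            {
                inventory.WriteLine($"{file},{File.GetLastWriteTimeUtc(file):O}");
            }
        }
    }
}
```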
DevOps for deployment automation
Our Full Deployer application has a host of features for helping to deploy and publish applications in addition to generating and maintaining configuration. This is used in many of our Domain processes. We have multiple websites that are easily deployable to our integration test server, and continuous integration is occurring.
Web data platform detection
We added jobs to the WDP that can detect new content and bring this into our application data store.
ETL Process - Two applications - Data Processor, Report Publisher
Two lightweight C# applications whose sophistication lies in their simplicity. Two simple roles;
- Deliver data generically to the database to push it into the data mart - Data Processor (sketched after this list)
- Run reports within the data mart to download this data to an area - Report Publisher
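As an illustration of the Data Processor's generic role, here is a minimal sketch using .NET's SqlBulkCopy; the staging table name and connection string are placeholders, not our actual schema, and the real applications are configuration-driven:

```csharp
using System.Data;
using Microsoft.Data.SqlClient; // NuGet package Microsoft.Data.SqlClient

public static class DataMartLoader
{
    // Push any tabular payload into a data mart staging table with one
    // generic call; the DataTable's columns define the shape.
    public static void Load(DataTable rows, string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();

        using var bulkCopy = new SqlBulkCopy(connection)
        {
            DestinationTableName = "etl.StagingPrices" // placeholder table
        };
        bulkCopy.WriteToServer(rows);
    }
}
```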
In terms of the data delivery pipeline, our processes become fairly predictable;
- Data Retrieval
- Data Transformation (Optional calculations)
- Data Transformation (putting the data in a format suitable for dynamic interpretation)
- Data Processing (Pushing the data to the data mart ETL, dimension, and fact tables)
- Data Publication (Producing Reports and delivering them to the website for automatic ingestion)
Web data platform data presentation
We have multiple interfaces for website audiences to consume our content and data;
- Searchable APIs
We already have our own website and don't want to change it
Many customers will already have their own website. For these customers we can set up a standalone solution, or take elements of our data architecture into the cloud to help enterprises get their data to a wider audience.
We hope you enjoyed this article; feel free to reach out if anything in it is of interest or difficult to conceptualise. We hope you see how focusing on the job to be done is a much better way to break down requirements for your organisation or customer.