Automated Data Management with Data Lakes on AWS

Modernizing Global Training App for a Leading Ethics & Legal Compliance Education Firm in the US

The Opportunity

Ethics and compliance (E&C) play a vital role in strengthening business culture. Derived from a company’s core values, ethics determine decisions, choices, actions, and behaviors, and can be thought of as codes of conduct. Alongside ethics sits compliance – conforming to additional external standards, rules, or laws. For example, a financial organization provides baseline employee training that underpins its company codes of conduct, whereas industry-specific regulations may require additional training, e.g., cybersecurity training.

Most enterprises provide annual online E&C programs, using various available platforms. The customer at the center of this case study inspires principled performance in organizations by providing end-to-end ethics and compliance management programs. However, the question remained: how can the value of these programs be measured?

The customer was looking for an innovative solution to consolidate data from disparate sources and then display company insights and industry/competitor benchmarks based on measuring ongoing program effectiveness.

Key Challenges

Providing 500+ courses in 70+ languages, the company inspires 30+ million learners each year – that generates a lot of data. It provided different dashboards to its clients representing critical metrics on the effectiveness of its E&C programs. The dashboards were developed using Angular, Node.js, and third-party open-source libraries, with MongoDB as a data warehouse.

The entire architecture depended on many separate systems, with incoming data being collated from a variety of sources to display different results/reports:

Dashboard / Report: Data Sources
Culture Pulse Survey and EOCS (End of Course Survey):
  • MongoDB
  • QuestionPro (API)
  • Salesforce (for benchmarking)
Internal Usage POC:
  • Pendo (API)
  • MongoDB
  • PostgreSQL
Certification Data:
  • MongoDB
  • PostgreSQL
  • Salesforce (API)
Manager’s Report:
  • PostgreSQL

After data ingestion, the transformation and business logic were implemented in JavaScript, which produced two datasets: one for visualization and one for benchmarking, a major component for the end-user. Because benchmarking had to be displayed across all regions, the scripts were repeated for each region, and a further processing script was required to manage them. The datasets were then presented visually through an in-house management platform, which was where most of the manual transformation and transition work took place.

One of the key challenges was the ability to segregate client data efficiently and effectively, and reduce manual data processing. In addition, due to the number of languages served, the company wanted to effortlessly manage region-level data being generated using a generalized structure. The core challenges included:

1. Adapting the system data into a data lake

No data lake had been implemented. Extracted data was held in Amazon S3 storage that was not organized as a data lake and could not accommodate the different file formats within the system. The files were also unencrypted and lacked the reliable structure needed for accuracy in the analytical phase.

2. Lack of user-level authentication and authorization

User-level authorization for accessing and visualizing permission-based data was unavailable in the dashboards, so in theory the data could be accessed by any third party. Along with this, each end-user could view the general market benchmarking used for positioning and business growth purposes.

3. Use of an analytic layer before visualization

After data extraction and before the data moved on to visualization, an analytical layer was required to maintain data quality checks (DQC) and to apply the necessary transformations on the fly, without interfering with other processes within the system.

4. Low-security environment

The existing system didn’t provide a fully secure environment for either data or scripts. A private level of security had to be adopted for each individual visual dashboard, as the benchmarking could otherwise reveal other end-users’ data. Access to data for individual accounts in different regions needed to be controlled, along with the security of data within the environment.

5. Unreliable CI/CD process

The existing system could not reliably maintain continuous integration and continuous deployment (CI/CD): when business requirements were updated or changed, the changes could not be rolled out without interfering with the existing pipeline. A proper CI/CD process was needed within the secured pipeline to handle future business requirements.

The Solution

SourceFuse was selected to partner with the company following a successful RFP response, based on its cloud migration experience and technical expertise.

Taking a discovery-first approach, SourceFuse carried out a full assessment of the customer’s current ecosystem and took the time to identify and understand the desired business objectives. It then developed the business case for providing a very well-structured Data Lake on AWS which could create an infrastructure capable of handling data from different regions. All incoming data is now consolidated into a single data lake providing the appropriate level of advanced security, scalability, and availability.

Additionally, the proposed AWS infrastructure provided a reliable structure to manage the various types of data ingestion and transformation. In this way, the datasets created would be represented visually via a dashboard in a flawless form, and the dashboard would also incorporate end-user access authentication and authorization.

SourceFuse delivered modernized, automated data migration and processing of extracted data, all residing in a well-architected cloud Data Lake. This handles complex data formats while controlling the flow of data in a highly segmented form, helping to rapidly process data to its endpoint. The speedy data mart, used for processing the results, increases the data refresh rate from daily to hourly, which enables real-time data visualization on the dashboard. A secured data pipeline enables unified datasets, restricts access to authorized end-users only, and enables cross-region data sharing for comparative data visualization.

Application Features

Language and sentiment analysis

As the customer’s system serves more than 70 languages, the application handles language detection and then generates the corresponding sentiment. In addition, it provides an analytic description for the various client dashboards.
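The case study does not name the service behind this feature. As a minimal sketch, assuming Amazon Comprehend accessed through boto3 (an assumption, not something the source confirms), the "detect the language, then score sentiment" flow might look like the following. The function name `analyze_comment` and the `SENTIMENT_LANGUAGES` set are hypothetical, and the client is passed in so the logic can be exercised with a stub.

```python
from typing import Any, Dict, Optional

# Assumption: language codes the sentiment step can score directly. Check the
# service documentation for the current list; other languages would need a
# translation step first, which is outside this sketch.
SENTIMENT_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "ar", "hi", "ja", "ko", "zh", "zh-TW",
}


def analyze_comment(client: Any, text: str) -> Dict[str, Optional[str]]:
    """Detect the dominant language of a survey comment, then score sentiment."""
    detected = client.detect_dominant_language(Text=text)
    languages = detected.get("Languages", [])
    if not languages:
        return {"language": None, "sentiment": None}
    # Keep the candidate with the highest confidence score.
    best = max(languages, key=lambda item: item["Score"])
    code = best["LanguageCode"]
    if code not in SENTIMENT_LANGUAGES:
        # Language detected, but sentiment is not scored for it here.
        return {"language": code, "sentiment": None}
    scored = client.detect_sentiment(Text=text, LanguageCode=code)
    return {"language": code, "sentiment": scored["Sentiment"]}
```

With real credentials, `client` would be `boto3.client("comprehend")`; comments in languages outside the assumed set come back with a language code but no sentiment.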

Handling and merging complex datasets into a unified form

Different dashboards needed different types of datasets. According to business needs, this is now managed by compact, unified datamarts with PII data protection. These integrated datamarts provide the data readiness on which the large dashboard datasets subsequently depend.
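The source does not say how PII is protected inside the datamarts. One common approach, shown here purely as an illustration, is to replace PII columns with a keyed hash so that rows from different sources can still be joined without exposing the underlying value. The column names in `PII_FIELDS` and both function names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List

# Hypothetical PII column names; the real schema is not given in the source.
PII_FIELDS = ("email", "employee_name")


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a stable keyed hash (HMAC-SHA256) of a normalized value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def build_datamart_rows(rows: Iterable[Dict[str, str]], secret: bytes) -> List[Dict[str, str]]:
    """Copy each row, replacing non-empty PII columns with their pseudonyms."""
    protected = []
    for row in rows:
        clean = dict(row)  # leave the source row untouched
        for field in PII_FIELDS:
            if clean.get(field):
                clean[field] = pseudonymize(clean[field], secret)
        protected.append(clean)
    return protected
```

Because the hash is keyed and deterministic, the same learner maps to the same pseudonym across sources, which is what allows the unified datamart to merge them.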

Centralized Data Lake with encrypted files and supporting multiple file formats

Here the data lake overcomes the difficulty of handling multiple data and file formats, which makes data processing much faster. In addition, incorporating AWS Key Management Service (KMS) improves data protection.
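As a rough sketch of what KMS-backed encryption of lake files can look like, the helper below builds the arguments for an S3 `put_object` request that asks for server-side encryption with a KMS key. The key layout in `lake_key`, the function names, and any bucket or key identifiers are assumptions for illustration; the source does not describe the actual layout.

```python
from typing import Any, Dict


def lake_key(region: str, source: str, filename: str) -> str:
    """Hypothetical object key layout: one prefix per region and source system."""
    return f"raw/region={region}/source={source}/{filename}"


def build_put_object_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> Dict[str, Any]:
    """Arguments for S3 put_object requesting server-side encryption with KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # encrypt at rest with a KMS key
        "SSEKMSKeyId": kms_key_id,          # which KMS key to use
    }
```

With boto3 installed and credentials configured, the upload itself would be `boto3.client("s3").put_object(**build_put_object_args(...))`. Keeping the file format out of the key logic means CSV, JSON, or Parquet files can share the same layout.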

Automated data pipeline and capture of changes in data

The whole data pipeline is organized so that frequent source data changes are immediately reflected in the final dashboard. The whole process is automated to minimize both data processing time and manual tasks. The automation also handles bugs and exceptions while processing complex data.
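The case study does not describe how changes are captured. One simple technique, snapshot comparison, is sketched below as an illustration only; log-based capture is another common option. The function name and row shape are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, str, Row]  # (operation, primary key, row)


def capture_changes(previous: Dict[str, Row], current: Dict[str, Row]) -> List[Change]:
    """Compare two snapshots keyed by primary key and list the differences."""
    changes: List[Change] = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            # Report the last known version of the removed row.
            changes.append(("delete", key, row))
    return changes
```

Only the rows returned here need to flow downstream, which is what makes an incremental refresh cheaper than reloading every source in full.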

Simplified visualization through Amazon QuickSight

Leveraging Amazon QuickSight, the application dashboard provides a very broad way to visualize data, allowing end-users to grasp the detailed representation of their data. Additionally, it enables end users to observe how their current profiled data has been standardized into several levels.

Cross-region feature with dashboard access and viewing control

With dashboard visualization, the application enables end-users to view comparative data and check how they measure up against the benchmark on various factors. In addition, the application ‘master-user’ has overarching cross-region access and control, as per requirement.
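The access rule described above can be illustrated with a small sketch: ordinary users see only their own client's rows for their own region, the master user sees everything, and the benchmark is computed as an aggregate over all rows. The `User` type, the field names, and both functions are hypothetical; in practice a rule like this would typically be enforced in the BI layer rather than in application code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class User:
    client_id: str
    region: str
    is_master: bool = False


def visible_rows(user: User, rows: List[Dict]) -> List[Dict]:
    """Master users see every region; others see only their own client and region."""
    if user.is_master:
        return list(rows)
    return [
        row for row in rows
        if row["client_id"] == user.client_id and row["region"] == user.region
    ]


def benchmark(rows: List[Dict], metric: str) -> float:
    """Average a metric across all rows; only this aggregate is shown to end-users."""
    return mean(row[metric] for row in rows)
```

An end-user's dashboard would then combine `visible_rows(user, rows)` with `benchmark(rows, metric)`, giving a comparison against the market without exposing another client's individual rows.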

The Results

Modernized Infrastructure

The entire pipeline has been moved to the AWS Cloud infrastructure, which resolves the customer challenges and also offers a simple way to check the data and visualization. In addition to providing excellent cross-region access in accordance with business requirements, the scope of maintaining data privacy for various end-users has been considered.

Real-time Data Visualization

Data Lake on AWS has reduced data processing time, using unified datamarts for improved data fetching and processing, and has increased data refresh frequency to provide real-time data visualization. It offers full change data capture (CDC) and incremental support for ongoing data flow management, with proper bug and exception handling.

Dashboards Providing Business Intelligence

Amazon QuickSight implements machine learning to deliver business intelligence and insights, improve efficiencies, and empower data-driven decisions. With well-structured data, a simpler, real-time dashboard was generated, making it easy for end-users to compare and view their data appropriately.

Increased Security

Through the use of Amazon Virtual Private Cloud (VPC), a further layer of security has been added, creating a secure environment that prevents any group or individual with malicious intent from accessing the system.

This was an exceptionally smooth project, in large part due to the collaborative approach between the customer’s NetOps team and our DevOps experts - we really worked in sync as we proposed solutions to address their challenges. Each customer project comes with unique complexities, but our phased approach helps overcome any roadblocks early on and achieves rapid customer success.
Kabir Chandhoke
COO, SourceFuse

About The Customer

This company has helped 15 million people in 700 companies worldwide navigate complex legal and regulatory environments and foster ethical cultures. Its combination of practical tools, education, and strategic advisement helps companies translate their values into concrete practices and leadership behaviors that create sustainable, competitive advantages. Its offerings mitigate the risk of costly ethical lapses and compliance failures while building trust and earning the company a reputation for lawful and ethical conduct.

It is a trusted long-term partner to more than 400 client companies, enabling them to create an active and growing community to acquire and disseminate proven strategic and tactical insights and develop solutions based on real-world experiences.
