TL Consulting Group

machine learning

The Importance of Feature Engineering in ML Modelling

When building Machine Learning (ML) models, we often encounter unorganised and chaotic data. In order to transform this data into explainable features, we rely on the process of feature engineering. Feature engineering plays a crucial role in the Cross Industry Standard Process for Data Mining (CRISP-DM). It is an integral part of the Data Preparation Step, responsible for organising the data effectively before it is ready for modeling. The diagram below illustrates the significance of feature engineering (FE) in the data mining process. CRISP-DM Process Model What is Feature Engineering? Feature Engineering (FE) is the process of extracting and organising important information from raw data in such a way that it fits the machine learning (ML) model. Feature Engineering Process(FE) Source: https://www.omnisci.com/technical-glossary/feature-engineering) Why is Feature Engineering Important? Feature Engineering (FE) has many benefits to offer in the CRISP-DM process. They include: Provides more flexibility and less complexity in models Faster data processing Understanding of models becomes easier Better understanding of the problem and questions to be answered Feature Engineering Techniques for Machine Learning (ML) Below is a list of feature engineering techniques and we will summarise each: Imputation: Handling Outliers Log Transformation One-Hot Encoding Scaling 1. Imputation Missing values is one of the most typical problems when it comes to data preparation. Human errors and dataflow interruptions are some of the major contributors to this problem. Moreover, missing values can detrimentally impact the performance of the ML models. An example of an imputation of NA values with Zero Imputation is frequently employed in healthcare research, such as when dealing with patient records that may have missing values for certain medical measurements. By imputing the missing data using methods like mean imputation or regression imputation, researchers can ensure that a complete dataset is available for analysis, allowing for more accurate assessments and predictions. 2. Handling Outliers Handling Outliers within datasets is an important technique with the purpose of creating an accurate representation of the data. This step must be completed prior to the model training step. There are various methods of handling outliers that include removal, replacing values, capping, and discretization. These methods will be discussed in detail in future blogs. An example of outliers Handling outliers is essential in financial analysis, for instance, when examining stock market data. By detecting and appropriately treating outliers using techniques like Winsorization or trimming, analysts can ensure that extreme values do not unduly influence statistical measures, leading to more robust and reliable insights and decision-making. 3. Log Transformation Log Transformation is one of the most prevalent methods used by data professionals. The technique transforms a skewed distribution of data into normally distributed or slightly skewed data. Therefore, making the data approximate for normal applications is required for different kinds of data analysis. Examples of Log Transformed Data Log transformation is commonly applied in skewed data distributions, such as when dealing with income or population data. By taking the logarithm of the values, the skewed distribution can be transformed into a more symmetric shape, facilitating more accurate modeling, analysis, and interpretation of the data. 4. One Hot Encoding One-Hot Encoding is a technique of preprocessing categorical variables into ML models. The encoding transforms a category variable into a binary feature for each category. It typically assigns a value of ‘1’ to the binary feature it corresponds to and all other binary features are set to ‘0’. An example of One Hot Encoding One-hot encoding is widely used in categorical data processing, such as in natural language processing tasks like sentiment analysis. By converting categorical variables into binary vectors, each representing a unique category, one-hot encoding enables machine learning algorithms to effectively interpret and utilize categorical data, facilitating accurate classification and prediction tasks. 5. Scaling Feature scaling is one of the hardest problems in data science to get right. However, it is not a mandatory step for all machine learning models. It is only applicable to distance-based machine learning models. The training model process requires data with a known set of features that need to be scaled up or down where it is deemed appropriate. The outcome of the scaling operation transforms continuous data to be similar in terms of range. The most popular techniques for scaling are Normalization and Standardisation, which will be discussed in detail in future blogs. Examples of Scaling Scaling is often used in image processing, such as when resizing images for a computer vision task. Scaling the images to a consistent size, regardless of their original dimensions ensures that the images can be properly processed and analysed, allowing for fair comparisons and accurate feature extraction in tasks like object recognition or image classification. Feature Engineering Tools There are a set of feature engineering tools that are popular in the market in terms of the capabilities it provides. We have listed a few of our recommendations: FeatureTools AutoFeat TsFresh OneBM ExploreKit Conclusion In summary, Feature Engineering is a crucial step in the CRISP-DM process before we even think about training our machine learning models. One of the core advantages include the training time of models is reduced significantly. As a result, it allows for a drastic reduction of cost in terms of utilisation of expensive computing resources. In this article, we learned a number of feature engineering techniques and tools that are used in the industry. Here at TL Consulting, our data consultants are experts at using feature engineering techniques to build highly accurate machine learning models, enabling us to deliver high-quality outcomes to support our customer’s data analytics needs. TL Consulting provides advisory and transformation services in the data analytics & engineering domain and has helped many organisations achieve their digital transformation goals. Visit TL Consulting’s data-engineering page to learn more about our service capabilities and send us an enquiry if you’d like to learn more about how our dedicated consultants can help you.

The Importance of Feature Engineering in ML Modelling Read More »

Data & AI, , , ,

Top Cloud Plays in 2023: Unlocking Innovation and Agility

Top Cloud Plays in 2023: Unlocking Innovation and Agility Cloud Computing has been around since the early 2000’s, while the technology landscape continues to evolve rapidly and adoption increased (20% CAGR), offering unprecedented opportunities for innovation and digital transformation. The meaning of digital transformation is also changing with cloud decision makers viewing Digital transformation as more than a “lift and shift”, instead they see vast opportunity within the Cloud ecosystems to help reinforce their long-term success. As businesses increasingly embrace cloud, certain cloud plays have emerged as key drivers of success, underpinned by companies including Microsoft, AWS, Google Cloud and VMWare who have all developed very strong technology ecosystems that have transitioned from a manual and costly Data Centre model. In this blog, we will explore the top cloud plays, from our perspective, that organisations should consider unlocking to reach their full potential in 2023. Multi-Cloud and Hybrid Cloud Strategies Multi-Cloud and Hybrid Cloud Strategies: Multi-cloud and hybrid cloud strategies have gained significant traction in 2023. Organisations are leveraging multiple cloud providers and combining public and private cloud environments to achieve greater flexibility, scalability, and resilience through their investment. Multi-cloud and hybrid cloud approaches allow businesses to choose the best services from different providers while maintaining control over critical data and applications. This strategy helps mitigate vendor lock-in leveraging Kubernetes Container orchestration, including AKS, EKS & GKE and VMWare Tanzu, optimise costs, and tailor cloud deployments to specific business requirements and use cases. Cloud-Native Application Development Cloud-Native Application Development: Cloud-native application development is a transformative cloud play that enables organisations to build and deploy applications, through optimised DevSecOps practices, specifically designed for advanced cloud environments. This model leverages containerization, CICD, microservices architecture, and orchestration platforms again emphasising Kubernetes, a strong Cloud Native foundational play. Cloud-native applications are designed to be highly scalable, resilient, and agile, allowing organisations to rapidly adapt to changing business needs. By embracing cloud-native development, businesses can accelerate time-to-market, improve scalability, and enhance developer productivity embedding strong Developer Experience (DevEx) practices. Serverless Computing Serverless computing: is a game-changer for businesses seeking to build applications without worrying about server management. With serverless computing, developers can focus solely on writing code while the cloud provider handles infrastructure provisioning and scaling. An example of this is Microsoft Azure Serverless Platform or AWS Lambda. This cloud play offers automatic scaling, cost optimisation, and event-driven architectures, allowing businesses to build highly scalable and cost-effective applications. Serverless computing simplifies development efforts, reduces operational overhead, and enables companies to quickly respond to changing application workloads. Cloud Security and Compliance Cloud security and compliance: are critical cloud plays that organisations cannot afford to overlook in 2023 particularly with recent data breaches at Optus and Medicare. Leveraging security as a foundational element of your cloud native journey is crucial for ensuring the protection, integrity, and compliance of your applications and data. Cloud providers offer robust security frameworks, encryption services, identity and access management solutions, and compliance certifications. By leveraging these cloud security products and practices, businesses can enhance their data protection, safeguard customer information, and ensure regulatory compliance. Strong security and compliance measures build trust, mitigate risks, and protect organisations from potential data breaches. Data Analytics and Machine Learning:  Data analytics and machine learning (ML) are powerful cloud plays that drive data-driven decision-making and unlock actionable insights. Cloud providers offer advanced analytics and ML services that enable businesses to leverage their data effectively. By harnessing cloud-based data analytics and ML capabilities, businesses can gain valuable insights, predict trends, automate processes, and enhance customer experiences. These cloud plays empower organisations to extract value from their data, optimize operations, and drive innovation while providing an enhanced customer experience. As the evolution of Cloud Native, Multi-Cloud and Hybrid Cloud Strategies accelerate, strategically adopting the above drivers help enable innovation, agility, and business growth. Importantly Multi-cloud and hybrid cloud strategies provide enhanced security, flexibility, while cloud-native application development empowers rapid application deployment and better developer experience (DevEx), leveraging DevSecOps and Automation practices. These are critical initiatives to consider, if you are looking to advance your technology ecosystem and migrate and/or port workloads for optimum flexibility and Return on Investment (ROI). It is evident the traditional “lift and shift strategy” does not provide this level of value to the consumer. Instead, the above “on-demand cloud plays” may not be realised, with inefficient cloud resource management and unexpected expenses, leading to increased OPEX and TCO. By embracing these top cloud plays, it enables businesses investing in innovation to develop and deploy applications that can scale seamlessly on Cloud, adapting to changing customer demands, reduce TCO/ OPEX, accelerate time-to-market, maintain high availability and security, while future proofing themselves in this competitive digital landscape. For more information about Cloud, Cloud-Native, Data Analytics and more, visit our services page.

Top Cloud Plays in 2023: Unlocking Innovation and Agility Read More »

Cloud-Native, Data & AI, DevSecOps, , , , , , , ,

Top 5 Data Engineering Techniques in 2023

Top 5 Data Engineering Techniques in 2023 Data engineering plays a pivotal role in unlocking the true value of data. From collecting and organising vast amounts of information to building robust data pipelines, it is a complex and vital capability that is becoming more prevalent in today’s complex technology world. There are various intricacies in data engineering, while exploring its challenges, techniques, and the crucial role it plays in enabling data-driven decision making. In this blog post, we explore the top 5 trending data engineering techniques that are expected to make a significant impact in 2023. TL Consulting see Data engineering as an essential discipline that plays a critical role in maximising the value of key data assets. In recent years, several trends and technologies have emerged, shaping the field of data engineering, and offering new opportunities for businesses to harness the power of their data. These techniques enable better and more efficient management of data, unlocking valuable insights and helping enable innovation in a more targeted manner. Since Data engineering is a rapidly evolving domain, there is a continuous need to introduce new data engineering techniques and technologies to handle the increasing volume, variety, and velocity of data. Data Engineering Techniques DataOps One such trend is DataOps, an approach that focuses on streamlining and automating data engineering processes leveraging agile software engineering and DevOps. By implementing DataOps principles, organisations can achieve collaboration, agility, and continuous integration and delivery in their data operations. This approach enables faster data processing and analysis by automating data pipelines, version controlling data artefacts, and ensuring the reproducibility of data processes aligning to DevOps and CICD practices. DataOps improves quality, reduces time-to-insights, and enhances collaboration across data teams while promoting a culture of continuous improvement. DataMesh Another significant trend is Data Mesh, which addresses the challenges of scaling data engineering in large enterprises. DataMesh emphasises domain-oriented ownership of data and treats data as a product. By adopting DataMesh, organisations can establish cross-functional data teams, where each team is responsible for a specific domain and the associated data products. This approach promotes “self-service” data access through a data platform capability, empowering domain experts to manage and govern their data. Furthermore, as the data mesh gains adoption and evolves, with each team that shares their data as products, enabling data-driven innovation. Data Mesh enables scalability, agility, and improved data quality by distributing data engineering responsibilities across the organisation. Data Streaming Real-time data processing has also gained prominence with the advent of data streaming technologies. Data streaming allows organisations to process and analyse data as it arrives, enabling immediate insights and the ability to respond quickly to dynamic business conditions. Platforms like Apache Kafka, Apache Flink, Azure Stream Analytics and Amazon Kinesis provide scalable and fault-tolerant streaming capabilities. Data engineers leverage these technologies to build real-time data pipelines, facilitating real-time analytics, event-driven applications, and monitoring systems to further. This type of capability can lead to optimised real-time stream processing and can gain valuable insights into understanding of customer behaviours and trends. These insights can help you make timely and informed decisions to drive your business growth. Machine Learning The intersection of data engineering and machine learning engineering has become increasingly important. Machine learning engineering focuses on the deployment and operationalisation of machine learning models at scale. Data engineers collaborate with data scientists to develop scalable pipelines that automate the training, evaluation, and deployment of machine learning models. Technologies like TensorFlow Extended (TFX), Kubeflow, and MLflow are utilised to operationalise and manage machine learning workflows effectively. Real-time data streaming offers numerous benefits and empowers you to make informed business decisions. Data Catalogs Lastly, from our experience, Data Catalogs and metadata management solutions have become crucial for managing and discovering data assets. As data volumes grow, organising and governing data effectively becomes challenging. Data cataloguing enables users to search and discover relevant datasets and helps create a single source of knowledge for understanding business data. Metadata management solutions facilitate data lineage tracking, data quality monitoring, and data governance, ensuring data assets are well-managed and trusted. Data cataloguing accelerates analysis by minimising the time and effort that analysts spend finding and preparing data. These trends and technology advancements are reshaping the data engineering landscape, providing organisations with opportunities to optimise their data assets, accelerate insights, and make data-driven decisions with confidence. By embracing these developments, understanding your data assets and associated value, can lead to smarter informed business decisions. By embracing these trending techniques, organisations can transform their data engineering capabilities to enable some of the following benefits: Accelerated data-driven decision-making. Enhanced customer insights, transparency and understanding of customer behaviours. Improved agility and responsiveness to market trends. Increased operational efficiency and cost savings. Mitigated risks through robust data governance and security measures. Data engineering is vital for optimising organisational data assets since these are an important cornerstone of any business. It ensures data quality, integration, and accessibility, enabling effective data analysis and decision-making. By transforming raw data into valuable insights, data engineering empowers organisations to maximize the value of their data assets and gain a competitive edge in the digital landscape. TL Consulting specialises in data engineering techniques and solutions that drive transformative value for businesses enabling the above benefits. We leverage our expertise to design and implement robust data pipelines, optimize data storage and processing, and enable advanced analytics. Partner with us to unlock the full potential of your data and make data-driven decisions with confidence. Visit TL Consulting’s data services page to learn more about our service capabilities and send us an enquiry if you’d like to learn more about how our dedicated consultants can help you.

Top 5 Data Engineering Techniques in 2023 Read More »

Data & AI, , , , ,

Key Considerations When Selecting a Data Visualisation Tool

Data visualisation is the visual representation of datasets that allows individuals, teams and organisations to better understand and interpret complex information both quickly and more accurately. Besides considering the cost of the tool itself, there are other key considerations when selecting a data visualisation tool to implement within your business. These include: Identifying who are the end-users that will be consuming the data visualisation What level of interactivity, flexibility and availability of the data visualisation tool is required from these users?   What type of visualisations are needed to fit the business/problem statement and what type of analytics will drive this?  Who will be responsible for maintaining and updating the dashboards and reports within the visualisation tool? What is the size of the datasets and how complex are the workloads to be ingested into the tool? Is there an existing data pipeline setup or does this need to be engineered? Are there any requirements to perform pre-processing or transformation on the data before it is ingested into the data visualisation tool? The primary objective of data visualisation is to help individuals, teams and companies explore, monitor and explain large amounts of data by organizing and allowing for more efficient analysis and decision-making by enabling users to quickly identify patterns, correlations, and outliers in their data.  Data visualisation is an important process for data analysis and other interested parties as it can provide insights and uncover hidden patterns in data that may not be immediately apparent through either tabular or textual representations. With data visualisation, data analysts and other interested parties such as business SMEs can explore large datasets, identify trends from these datasets, and communicate findings with stakeholders more effectively.      There are many types of data visualisations that can be used depending on the type of data being analysed along with the purpose of the analysis. Common types of visualisations include graphs, bar charts, line scatter plots, heat maps, tree maps, and network diagrams.  For data visualisation to be effective, it requires careful consideration of the data being presented, the intended audience, and the purpose of the analysis. The visualisation that is being presented should be clear, concise, and visually appealing, with labels, titles, and colours used to highlight important points and make the information more accessible to the audience. The data visualisation needs to an effective storytelling mechanism for all end-users to understand easily. Another consideration is the choice of colours used, as the wrong colours can impact the consumers of the data visualisation and can impact visually impaired people (i.e., colour blindness, Darker vs Brighter contrasts as examples)  In recent years, data visualisation has become increasingly important as data within organisations continues to grow in complexity. With the advent of big data and machine learning technologies, data visualisation is playing a critical role in helping organisations make sense of their data, and become more data-driven with increased ‘time to insight’, as organisations facilitate better and faster decision-making.    Data Visualisation Tools & Programming Languages  At TL Consulting, our skilled and experienced data consultants use a broad range and variety of data visualisation tools to help create effective visualisations of our customer’s data. The most common are listed below:   Power BI is a business intelligence tool from Microsoft that allows users to create interactive reports and dashboards using data from a variety of sources. It includes features for data modelling, visualisation, and collaboration.  Excel: Excel is a Microsoft spreadsheet application and from a data visualisation perspective includes the capability to represent numerical data in a visual format.  Tableau: Tableau is a powerful data visualisation tool that allows users to create interactive dashboards, charts, and graphs using drag-and-drop functionality. It supports a wide range of data sources and has a user-friendly interface.  QlikView: QlikView is a first-generation business intelligence tool that allows users to create interactive visualisations and dashboards using data from a variety of sources. QlikView includes features for data modelling, exploration, and collaboration.  Looker:  Looker is a cloud-based Business Intelligence (BI) tool that helps you explore, share, and visualise data that drive better business decisions. Looker is now a part of the Google Cloud Platform. It allows anyone in your business to analyse and find insights into your datasets quickly.  Qlik Sense: Qlik Sense is the next-generation platform for modern, self-service-oriented analytics. Qlik Sense supports from self-service visualisation and exploration to guided analytics apps and dashboards, conversational analytics, custom and embedded analytics, mobile analytics, reporting, and data alerting.      In conjunction with the data visualisation tools listed above, there are a variety of programming languages using their various libraries that TL Consulting use in delivering outcomes to our customers that support not just Data Visualisation but also Data Analytics.  Python is a popular programming language that can be used for data analysis and visualisation. This can be done via tools such as Jupyter, Apache Zeppelin, Google Colab and Anaconda to name a few. Python includes libraries such as Matplotlib, Seaborn, Bokeh and Plotly for creating visualisations.  R is a programming language used for statistical analysis and data visualisation. It includes a variety of packages and libraries for creating charts, graphs, and other visualisations.  Scala is a strong statically typed high-level general-purpose programming language that supports both object-oriented programming and functional programming. Scala has several data visualisation libraries such as breeze-viz, Vegas, Doodle and Plotly Scala.  Go or Golang is a statically typed, compiled high-level programming language designed at Google. Golang has several data visualisation libraries that facilitate the creation of charts such as pie charts, heatmaps, scatterplots and boxplots.  JavaScript is a popular programming language that is a core client-side language of the w3.  It has rich data visualisation libraries such Chart JS, D3, FusionCharts suite, Pixi etc.      Conclusion In conclusion, there are several data visualisation tools and techniques available in the market. For organisations to extract meaningful insights from their data in a time-efficient manner, it’s important to consider these factors before selecting and implementing a new data visualisation tool for your business. TL

Key Considerations When Selecting a Data Visualisation Tool Read More »

Data & AI, , , , , , , ,

Building a Robust Data Governance Framework in 2023

In today’s data-driven world with accelerating advancements in Artificial Intelligence (AI) and advanced analytics, organisations play an important role in ensuring that the data they collect, store, and analyse is underpinned by a strong data governance framework. Embedding the right data governance framework is the enablement of an organisation’s data strategy which requires dedicated planning and strategic direction from various business & technical stakeholders and should be driven from the “top-down” rather than “bottom-up”. To achieve this, organisations should focus on defining their information and data lifecycle management, data relationships and classification, data privacy, data quality, and data integrity to become more competitive and resilient.  The key fundamental challenge for organisations is to embed data standardisation, data security & compliance horizontally across the enterprise, thereby eliminating silos with their own disparate ways of working. In addition, it’s important for organisations to align their data governance framework with their data lifecycle, business strategy and goals, enabling a more agile approach to accommodate the organisation’s current & future needs.   Data Governance Framework Best Practices As organisations collect more and more data points, it’s important to define the right standards, policies, controls, accountability, and ownership (roles & responsibilities). A data governance framework will ensure the organisation abides by these standards while ensuring data that is collected and stored is secure, with a focus on maintaining data integrity and data quality. Ultimately, the data that is consumed by end-users should enable informative, data-driven decisions to be made. A constant re-evaluation is recommended to ensure the organisation’s data governance program is modernised and caters to the latest advancements in data and technology. Prior to defining a data governance framework, a comprehensive data discovery should be performed across the business landscape to create a unified view. This would aid in establishing data governance across the following areas: Data cataloging of data relationships, data quality, and data lineage Data classification and sourcing Metadata definition (Technical and Enterprise metadata) Data compliance, security, and privacy Data analytics & engineering Data storage & sharing The following diagram is a high-level example of a data governance framework. This model should be aligned with the organisation’s data and information management lifecycle. The framework definition should be evaluated from a People, Processes & Technology/Tooling perspective considering data stewardship, efficiencies, data security & access controls, alongside standardised processes governing the technology and tools that facilitate the production, consumption, and processing of the organisation’s data. The following sections highlight a few key areas which the data governance framework should address: Alignment to the Organisation’s Cloud Strategy When uplifting the data governance program, another important consideration for organisation’s that are building technology solutions on Cloud is to define an integrated data governance architecture across their environments, whether it be hybrid or multi-cloud. Alignment to their cloud strategy can help in the following areas: Improve data quality with better management & tooling available around data cleansing and enrichment Build a holistic, unified view of the organisation’s data through discovery and benchmarking Gain higher visibility into data lineage and track data end-to-end from source to target Build more effective data catalogs to ensure it benefits organisational needs to search and access the right data when needed Proactively review, monitor, and measure the data to ensure data consistency and data integrity is preserved For example, Microsoft offers an Azure Governance service as a management and governance cloud solution that features advanced capabilities to help manage data throughout its entire IT lifecycle and track data flows end-to-end, ensuring the right people have access to reliable, accurate data they need, whenever they need it. Data Privacy & Compliance As organisations continue building insights and implementing advanced analytics to learn more about their customers and create more tailored experiences, protecting sensitive data attributes including Personal Information (PI) should be at the heart of the organisation’s data security & data privacy practices, as part of their data governance framework. With the rise of cyber-attacks & data breaches, organisations should consider implementing data obfuscation techniques to “mask” or “encrypt” their PI source data, especially across non-production environments where the access controls are considered weaker than production environments, and the “internal” threat can be considered just as high as the external cyber threats. Applying data obfuscation techniques would ensure the PI data attributes are de-sensitized prior to their use in development, testing and data analytics. In addition, organisations should ensure data controls & access policies are reviewed more frequently than ever. Understanding who has access to the underlying data sources and platforms will help organisations maintain a good risk posture and should be assessed against their data governance framework, across their environments whether on-premise or on Cloud. Augmented Analytics & Machine Learning Without advanced analytics, data loses a lot of its usability and power. Advanced analytics combines the power of machine learning and artificial intelligence to help teams make data-driven decisions based on in-depth insights. Advanced analytics tools greatly streamline the data analysis process and help to provide a competitive edge, uncovering patterns and insights that manual data analysis may overlook. With the introduction of open-source machine learning models such as Open AI’s ChatGPT, how do organisations ensure the data that is collected, analysed, and presented is highly accurate and high quality? Depending on the data models & training algorithms used, these insights can be deeply flawed and it’s important for organisations to embed the right data governance policies around the use of open-source data models, including the collection, use, and analysis of the data points collected. A few roles that data governance plays in the world of augmented analytics, machine learning, and AI include: Providing guidance on what data is collected and how it’s used to train and validate data models for machine learning models to generate advanced analytics Providing standardization on the data science lifecycle and algorithms applied for generating insights, along with data cleansing & enrichment exercises Defining the best practices and policies when introducing new data models, along with measures to fine-tune and train models to increase data accuracy

Building a Robust Data Governance Framework in 2023 Read More »

Cloud-Native, Data & AI, , , , , ,