How To Refine Your Data Lake Strategy For Analytics

The InfoSec Consulting Series #33

By Jay Pope

 

The meteoric growth in cloud use and the increasing number of Internet of Things devices mean that businesses are coping with larger volumes of data than ever before. Dealing with this data is a challenge, and its volume and variety, together with the cost of storing and managing it, can prove overwhelming. One of the technologies to which enterprises are increasingly turning to cope with this problem is the data lake. But what is a data lake, and what advantages does it offer?

 

What Is A Data Lake?

Data lakes are often confused with data warehouses, but there are key differences which are important to understand. A data lake is a central resource where you can store data whether it is in a structured or unstructured form. It can operate at any scale and it’s possible to run analytic tools over the data. A data warehouse, on the other hand, deals only with structured data. This is aggregated into various categories in order to make it easier to access for analytics purposes. In most cases, a warehouse is intended to serve a specific purpose, whereas a data lake holds data whose use has not been defined in advance and which can therefore be put to a wide range of uses.

That’s not to suggest that data lakes are without organisation. A complete lack of any kind of control can lead to a data lake becoming a data swamp. Although a data lake is not highly structured, it does use features such as metadata to allow information to be found. A correctly managed data lake also needs to have a clear governance strategy.
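As a rough illustration of how metadata keeps a lake searchable, the sketch below registers datasets in a tiny in-memory catalogue and looks them up by tag. The class names, fields, paths and the "every dataset needs an owner" rule are all assumptions made for this example; a real lake would normally rely on a dedicated catalogue service rather than anything this simple.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one object in the lake (hypothetical schema)."""
    path: str
    source: str
    data_format: str
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """A minimal in-memory metadata catalogue, for illustration only."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Example governance rule (an assumption): no dataset without an owner.
        if not entry.owner:
            raise ValueError(f"{entry.path} has no owner")
        self._entries[entry.path] = entry

    def find(self, tag):
        # Return the paths of all datasets carrying the tag, sorted by path.
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalog = Catalog()
catalog.register(
    DatasetEntry("raw/iot/2024-01-01.json", "sensors", "json", "ops", {"iot", "raw"})
)
catalog.register(
    DatasetEntry("raw/crm/export.csv", "crm", "csv", "sales", {"customers", "raw"})
)
print(catalog.find("iot"))
```

Rejecting entries that have no owner is one small way of enforcing governance at the point where data enters the lake, which is the stage at which a swamp is easiest to prevent.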

 

Data Lake Applications

The nature of the data in lakes makes them ideal for use with artificial intelligence applications. Data warehouses, in contrast, are better suited to traditional database techniques. One key advantage of a data lake is that its less-defined nature makes it more suited to research tasks, where data scientists may not have an exact idea as to what they are seeking. Because the data is in a raw format, it can be the basis of a self-service solution where people can use data analytics tools to create their own custom reports. This makes a data lake a good source of data for use in dashboard and business reporting applications.

Setting Up A Data Lake

So, what is the roadmap to developing a data lake for your business? There are four main stages involved, which we’ll look at in more detail.

Repository

The first step is to create a repository for the data. This involves IT elements such as storage, either on-premises or, more commonly, in the cloud, together with the relevant network connections. This allows the lake data to be stored in its raw format. You also need to identify what data is going to flow into the lake, where it originates and on what timescale it arrives. Security measures need to be considered at this stage, as does GDPR compliance, to ensure that the data you are holding is safe and legal.
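To make the idea of storing data in its raw format more concrete, here is a minimal sketch that lands an incoming file unchanged under a path built from its source and arrival date. The function name `land_raw`, the directory layout and the sample payload are assumptions for illustration, and a temporary local directory stands in for cloud storage.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(root, source, arrival, filename, payload):
    """Write payload unchanged to root/raw/<source>/<YYYY>/<MM>/<DD>/<filename>."""
    target_dir = Path(
        root, "raw", source, f"{arrival:%Y}", f"{arrival:%m}", f"{arrival:%d}"
    )
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # The bytes are stored exactly as received; no parsing or cleaning happens here.
    target.write_bytes(payload)
    return target


# A temporary directory plays the role of the lake's storage in this sketch.
lake_root = tempfile.mkdtemp()
stored = land_raw(
    lake_root, "sensors", date(2024, 1, 15), "readings.json", b'{"temp": 21.5}'
)
print(stored.relative_to(lake_root).as_posix())
```

Recording the source and arrival date in the path answers two of the questions raised above, where the data originates and when it arrived, without altering the data itself.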

Environment

Secondly, you need to create the environment in which your analytics or data science tasks can be carried out. This can be a ‘sandbox’ in which experiments can be run on the stored data and prototypes can be built to provide the information that the business needs. A range of proprietary and open-source tools may be used at this level.

Integration

The next stage is integrating the data lake with your business information systems. This might involve preparing data to load into a more structured data warehouse environment, where day-to-day business queries can be carried out. More speculative activities and experiments can still be carried out on the raw lake data.
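The preparation step might look something like the sketch below, which parses raw JSON records, drops those that cannot be used and loads the rest into a structured table. The record fields, the table name and the use of an in-memory SQLite database as a stand-in for the warehouse are all assumptions for the sake of the example.

```python
import json
import sqlite3

# Hypothetical raw records, as they might sit in the lake.
RAW_LINES = [
    '{"device": "a1", "temp": "21.5", "ts": "2024-01-15T10:00:00"}',
    '{"device": "a2", "temp": null, "ts": "2024-01-15T10:00:00"}',
    '{"device": "a1", "temp": "22.1", "ts": "2024-01-15T11:00:00"}',
]


def prepare(lines):
    """Parse raw JSON lines and keep only records with a usable temperature."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("temp") is None:
            # The raw copy stays in the lake; only the warehouse load skips it.
            continue
        rows.append((record["device"], float(record["temp"]), record["ts"]))
    return rows


# An in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (device TEXT, temp_c REAL, ts TEXT)")
warehouse.executemany("INSERT INTO readings VALUES (?, ?, ?)", prepare(RAW_LINES))

# A typical day-to-day query against the structured copy.
averages = warehouse.execute(
    "SELECT device, AVG(temp_c) FROM readings GROUP BY device"
).fetchall()
print(averages)
```

Because the raw lines are never modified, the incomplete record remains available in the lake for anyone who later wants to investigate why it arrived without a reading.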

Operational

The fourth and final stage is for the data lake to become a core component of the business’ data operations. It’s likely to be the case at this stage that most, if not all, of the company’s data will be passing through the data lake. It will, therefore, have become a key component of the IT infrastructure of the business and will have replaced many of the more traditional data storage silos.

Once the business has completed all four stages, it should be able to use data-as-a-service within the organisation, in addition to being positioned to introduce machine learning and artificial intelligence applications that will create value from the stored data.

 

It’s important to note that building a data lake is not an end in itself. It should be a component of a wider strategy for dealing with information. It can be used to prepare data for loading into a data warehouse, to run one-off analyses or to act as a pool of ‘big data’ for AI applications. Implementation is often combined with an agile approach so that the lake can be delivered in stages and start providing value early.

 

Does Your Organisation Need Top Security or Technical Delivery Talent?

Cyber Smart Associates provides certified Cyber Security & Technical Delivery Specialists to organisations looking to undertake their transformation challenges with foresight and confidence. With our own practising Cyber Security Consultants, we are well positioned in the industry to specialise in sourcing top security and technical delivery talent across all related technical roles and skillsets. We source the best people for permanent, contract and interim roles for organisations that need specialist skills and delivery experience.
