The Virtual Data Lake for a Data Scientist

Data scientists are the major users of data lakes, and probably the group that can benefit the most from the virtual version of the concept. For a data scientist, a data lake provides a place to store and massage the data they need, with the capacity to handle large volumes. At the same time, it provides a modern platform for data science, with a rich ecosystem of tools and libraries for machine learning, predictive analytics and artificial intelligence. The Virtual Data Lake needs to fulfill those same needs and provide some extra value.

The workflow of a data scientist with a Virtual Data Lake
A typical workflow for a data scientist is as follows:

Let’s review how data scientists can use a Virtual Data Lake with Denodo to their advantage.

Identify useful data

Use the Denodo Data Catalog to identify potential sources for the current study.

Data can be previewed directly in the catalog to quickly check whether it is useful.

Users can add their own comments to help others in the future.

Denodo’s Data Catalog web query wizard

Store that data into the lake

Since real-time access via the virtual layer is already possible, this step may not be necessary. SQL queries sent to Denodo are translated into the native technology of the data source (for example, a query on a table backed by a web application is translated from SQL into an HTTP call), and Denodo responds with a result set.

If direct access is not appropriate for some scenarios, the scientist can easily persist that same data in the lake using Denodo.
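From the client's point of view, direct access through the virtual layer is an ordinary SQL query over a standard connection. The sketch below is a minimal illustration in Python, assuming a DB-API 2.0 connection. In practice that connection might come from an ODBC driver such as `pyodbc`; here an in-memory SQLite database stands in for the virtual layer so the snippet runs anywhere. The view name `customer_orders`, its columns, and the helper `fetch_rows` are hypothetical.

```python
import sqlite3


def fetch_rows(conn, sql, params=()):
    """Run a query on any DB-API 2.0 connection and return the rows as dicts."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Assumption: SQLite is only a stand-in for the virtual layer. With an ODBC
# driver, `conn` would instead be created from a DSN pointing at the server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO customer_orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 50.0)],
)

# The client only writes SQL; where the data physically lives is hidden.
rows = fetch_rows(
    conn,
    "SELECT customer, SUM(amount) AS total "
    "FROM customer_orders GROUP BY customer ORDER BY customer",
)
print(rows)
```

Because the helper depends only on the DB-API cursor interface, the same function could be pointed at a different driver without changing the query code.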

Cleanse data into a useful format

Combine and transform data to create the dataset that will be used as the input for the data analysis.

As in the step above, the results of the models can be persisted if necessary.
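Combining and cleansing of this kind can often be expressed as a single SQL view over the sources. The following is a rough sketch rather than Denodo-specific syntax: it uses an in-memory SQLite database as a stand-in, and the table names `crm_customers` and `web_orders`, the view name `model_input`, and all columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two stand-in "sources" with typical quality problems:
    -- stray whitespace, inconsistent case, and missing values.
    CREATE TABLE crm_customers (id INTEGER, name TEXT, country TEXT);
    CREATE TABLE web_orders (customer_id INTEGER, amount REAL);
    INSERT INTO crm_customers VALUES (1, '  Acme ', 'us'), (2, 'Globex', NULL);
    INSERT INTO web_orders VALUES (1, 120.0), (1, NULL), (2, 50.0);

    -- The derived view joins both sources and cleanses the fields,
    -- producing the dataset used as input for the analysis.
    CREATE VIEW model_input AS
    SELECT c.id AS customer_id,
           TRIM(c.name) AS name,
           UPPER(COALESCE(c.country, 'unknown')) AS country,
           SUM(o.amount) AS total_amount
    FROM crm_customers c
    JOIN web_orders o ON o.customer_id = c.id
    WHERE o.amount IS NOT NULL
    GROUP BY c.id, c.name, c.country;
    """
)

dataset = conn.execute(
    "SELECT * FROM model_input ORDER BY customer_id"
).fetchall()
print(dataset)
```

Keeping the transformation in a view means the cleansed dataset is recomputed from the sources on each query; persisting it is then a separate decision.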

Analyze data

Zeppelin running queries on Denodo and Spark ML accessing Denodo via Data Frames

Execute data science algorithms (ML, AI, etc.)

Use the tool of choice for machine learning, deep learning, predictive analytics, etc. The tool can access data directly from Denodo, from the cluster if persisted, or via a file export.

Denodo can be accessed directly via JDBC and ODBC, which opens the door for most tools and languages (R, Scala, Python, etc.).

If the tables created in steps 3 and 4 were persisted in the lake, the scientist could use the lake's native libraries, for example Spark’s libraries for machine learning and analytics, or tools like Mahout.
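To make the direct-access option concrete, the sketch below reads a prepared dataset over a SQL connection and fits a simple model on it. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the JDBC/ODBC connection, the `model_input` view and its `ad_spend` and `revenue` columns are hypothetical, and the hand-written least-squares fit takes the place of whatever ML library the scientist actually prefers.

```python
import sqlite3


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Assumption: SQLite stands in for the connection to the virtual layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_input (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO model_input VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

# Pull the dataset straight from the query result into the model.
points = conn.execute("SELECT ad_spend, revenue FROM model_input").fetchall()
slope, intercept = fit_line(points)
print(f"revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```

The point of the pattern is that the modeling code only sees rows from a query, so switching between direct access and a persisted copy would not require changing the analysis itself.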

Iterate process until valuable insights are produced

This process usually involves small changes and repetitions until the right results are generated. A virtual approach to this process adds many benefits:

Final results can be saved, documented, and promoted to a higher environment to be consumed by business analysts.


As you can see, data virtualization can play a key role in the toolkit of the data scientist. It can accelerate the initial phases of the analysis, where most time is consumed. It integrates with the traditional data science ecosystem, so there is no need to move the data scientist into a different platform. Results are easily publishable via the same platform for non-technical users to review and use, without the need for advanced knowledge of the Big Data ecosystem. As a result,

