Data scientists are the major users of data lakes, and probably the demographic that can benefit the most from the virtual version of the concept. For a data scientist, a data lake provides a place to store and massage the data they need, with the capability of dealing with large volumes. At the same time, it provides a modern platform for science with a rich ecosystem of tools and libraries for machine learning, predictive analytics, and artificial intelligence. The Virtual Data Lake needs to fulfill those same needs and provide some extra value.
The workflow of a data scientist with a Virtual Data Lake
Let’s review how data scientists can use a Virtual Data Lake with Denodo to their advantage. A typical workflow is as follows:
1. Identify useful data
Use the Denodo Data Catalog to identify potential sources for the current study.
Data can be previewed directly in the catalog to quickly validate whether the data is useful.
Users can add their own comments to help others in the future.
2. Store that data into the lake
Since real-time access via the virtual layer is already possible, this step may not be necessary. SQL queries sent to Denodo are translated into the native language of each data source (for example, a query against a table backed by a web application becomes an HTTP call), and Denodo responds with a result set.
If direct access is not appropriate for some scenarios, the scientist can easily persist that same data in the lake using Denodo.
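Because the virtual layer presents itself as a standard SQL endpoint, the access code looks the same as for any other database. A minimal sketch using Python’s DB-API, with sqlite3 standing in for a real Denodo ODBC/JDBC connection (the view name customer_360 is invented for illustration):

```python
import sqlite3

def fetch_dataset(conn, view, limit=100):
    """Fetch rows from a (virtual) view through any DB-API connection.

    With Denodo, `conn` would come from an ODBC/JDBC driver instead;
    this calling code would not change.
    """
    cur = conn.execute(f"SELECT * FROM {view} LIMIT {limit}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Stand-in source: sqlite3 plays the role of the virtual layer here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_360 (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO customer_360 VALUES (?, ?)",
                 [(1, "ES"), (2, "US")])

rows = fetch_dataset(conn, "customer_360", limit=10)
print(rows)  # [{'id': 1, 'country': 'ES'}, {'id': 2, 'country': 'US'}]
```

The point of the sketch is that the scientist’s code is decoupled from the physical source: the same SELECT works whether the data sits in a warehouse, a web service, or the lake.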
3. Cleanse data into a useful format
Combine and transform data to create the dataset that will be used as the input for the data analysis.
As in the previous step, the results of these transformations can be persisted if necessary.
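In a virtual layer, this cleansing step is typically expressed as a SQL view that joins and transforms the source tables. A rough sketch of the kind of combination query involved, again with sqlite3 standing in for the virtual layer (all table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, name TEXT, country TEXT);
    INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 50.0);
    INSERT INTO customers VALUES (1, 'Ana', 'ES'), (2, 'Bob', 'US');
""")

# The cleansed dataset: one row per customer with total spend --
# the kind of derived view a scientist would define over the sources.
cleansed = conn.execute("""
    SELECT c.name, c.country, SUM(o.amount) AS total_spend
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.country
    ORDER BY c.id
""").fetchall()

print(cleansed)  # [('Ana', 'ES', 200.0), ('Bob', 'US', 50.0)]
```

Defining this as a view rather than a copy means the combination logic can be refined iteratively without reloading any data.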
4. Analyze data
Execute data science algorithms (ML, AI, etc.)
Use the tool of choice for Machine Learning, Deep Learning, Predictive Analytics, etc. The tool can access data directly from Denodo, from the cluster if persisted, or via a file export.
Denodo can be accessed directly via JDBC and ODBC, which opens the door for most tools and languages (R, Scala, Python, etc.).
If the datasets created in the earlier steps were persisted in the lake, the scientist can also use the lake’s native libraries, for example Spark’s libraries for Machine Learning and Analytics, or tools like Mahout.
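Whatever the access path, the analysis itself happens in the scientist’s tool of choice. As a toy stand-in for what a real library such as Spark MLlib or scikit-learn would do at scale, here is a closed-form simple linear regression over a fetched dataset, just to show that the modeling code is independent of where the data physically lives (the sample values are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (single feature).

    A minimal stand-in for a real ML library's regression routine.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Rows as they might come back from the virtual layer or the lake.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # perfectly linear: y = 2x
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 0.0
```

The model code only sees rows and columns; swapping direct virtual access for a persisted lake table changes the connection, not the analysis.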
5. Iterate the process until valuable insights are produced
This process usually involves small changes and repetitions until the right results are generated. A virtual approach to this process adds many benefits:
Final results can be saved, documented, and promoted to a higher environment to be consumed by Business Analysts.
Conclusions
As you can see, data virtualization can play a key role in the toolkit of the data scientist. It can accelerate the initial phases of the analysis, where most time is consumed. It integrates with the traditional data science ecosystem, so there is no need to move the data scientist to a different platform. Results are easily publishable via the same platform for non-technical users to review and use, without the need for advanced knowledge of the Big Data ecosystem.