Get Started with Data Science — Minimum Viable Tool (MVT)

Emmanuel Ogungbemi
Data Science Unlimited
4 min read · Jan 20, 2022


Image illustrating the glowing nature of data science
Source: https://www.information-age.com/hottest-jobs-data-science-right-now-123496406/

Data science is an exciting field that turns our world into numbers and actionable insights. It is a fast-growing field that shows no signs of slowing down. With recent advances in technology and access to vast amounts of data, it is now more important than ever to have skilled data scientists on your team. Data scientists need skills in many areas, including data mining and statistical analysis. One of the barriers for people wishing to start data science is knowing which tools to learn and where to start.

This article will get you started by presenting the minimum viable tools (MVT) you need to begin your data science journey.

The core activities in Data Science

Even though many tools and activities are involved in data science, they all revolve around these three:

  1. Collect data
  2. Data mining for insight
  3. Report data

1. Collect Data

Nowadays, we have almost unlimited data sources, from traditional databases to cloud storage and even online sources like Facebook or Google reviews. To analyse these data, you first need to pull them from all of these sources. Most organisations store their data in a relational database such as MySQL, Azure SQL or PostgreSQL. You will need to understand SQL (Structured Query Language), a standardised language used to query and manage relational databases.

SQL for Database Manipulation

SQL is a powerful language used by data scientists to extract information from databases. IBM developed the command language in the 1970s, and it’s now ubiquitous in the world of data science. Data scientists use SQL to query databases, extract records of interest, transform those records into new formats for analysis, and summarise data sets.
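To make this concrete, here is a minimal sketch of querying and summarising data with SQL from Python. It uses the built-in sqlite3 module with an in-memory database, so nothing needs to be installed. The sales table, its columns and its values are made up purely for illustration; a real company database would look different.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Widget", 120.0),
        ("North", "Gadget", 80.0),
        ("South", "Widget", 200.0),
        ("South", "Gadget", 50.0),
    ],
)

# Extract records of interest: only the Widget sales.
widget_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE product = ? ORDER BY region",
    ("Widget",),
).fetchall()

# Summarise the data set: total sales per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(widget_rows)
print(totals)
conn.close()
```

The same SELECT, WHERE and GROUP BY ideas carry over to MySQL, PostgreSQL and other relational databases, although each has its own small dialect differences.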

SQL stands for "Structured Query Language". It works with structured data, that is, data organised into tables of rows and columns. It is also helpful for creating new tables, modifying existing tables, creating indexes on tables for faster searching, and deleting old or unwanted records. There are other kinds of databases for unstructured data, but you can start from here.
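The table operations above can be sketched in a few lines. This is only an illustration, again using Python's built-in sqlite3 module; the customers table, the index name idx_customers_city and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create a new table.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)"
)

# Modify the existing table by adding a column.
conn.execute("ALTER TABLE customers ADD COLUMN city TEXT")

# Create an index so that searches by city are faster.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

conn.executemany(
    "INSERT INTO customers (name, active, city) VALUES (?, ?, ?)",
    [("Ada", 1, "Lagos"), ("Ben", 0, "Leeds"), ("Chi", 1, "Leeds")],
)

# Delete old or unwanted records: here, customers flagged as inactive.
conn.execute("DELETE FROM customers WHERE active = 0")

remaining = conn.execute(
    "SELECT name, city FROM customers ORDER BY name"
).fetchall()
print(remaining)
conn.close()
```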

My top picks for learning SQL are Programming with Mosh and Tutorialspoint.

2. Data Mining for Insights

Data mining helps you understand what is happening (descriptive) or what is likely to occur in the future (predictive). Most data science activities occur at this stage. Python, R, and MATLAB are programming languages commonly used for data mining. My number one recommendation for you here is Python.
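The difference between descriptive and predictive work can be shown with a toy example. The sketch below uses invented monthly sales figures: the descriptive part summarises what has already happened, and the predictive part fits a straight line by ordinary least squares to guess the next month. The helper fit_line is written here for illustration only; real projects usually rely on dedicated libraries and far more careful modelling.

```python
from statistics import mean

# Invented monthly sales figures, for illustration only.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive: what is happening in the data we already have?
average_sales = mean(sales)
highest_sales = max(sales)


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Predictive: what is likely to occur next month?
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept

print(average_sales, highest_sales, forecast_month_6)
```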

Programming Language — Python or R

Python is a popular programming language for data science. You can clean and analyse data, create machine learning models, and build complex algorithms with Python. Many industries use Python as a general-purpose programming language, from web development to scientific research.
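As a small taste of cleaning and analysing data, the sketch below tidies a messy CSV snippet using only Python's standard library. The data, the column names and the clean_rows helper are made up for this example, and dropping rows with a missing age is just one possible cleaning rule.

```python
import csv
import io
from statistics import mean

# Made-up messy data: stray spaces, a missing age, inconsistent city casing.
raw_csv = """name,age,city
 Ada ,36,lagos
Ben,,Leeds
Chi,29, LEEDS
"""


def clean_rows(text):
    """Strip whitespace, normalise city names, and drop rows without an age."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        age = row["age"].strip()
        if not age:
            continue  # one possible rule: skip records with a missing age
        cleaned.append(
            {
                "name": row["name"].strip(),
                "age": int(age),
                "city": row["city"].strip().title(),
            }
        )
    return cleaned


rows = clean_rows(raw_csv)
average_age = mean(row["age"] for row in rows)

print(rows)
print(average_age)
```

In practice, many data scientists do this kind of work with a third-party library such as pandas, but the idea is the same: load the data, fix or remove bad records, then analyse what is left.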

Python is an easy-to-use, powerful programming language. It is also consistently ranked among the most popular programming languages globally, with millions of users across all disciplines.

Why is Python so popular? It’s easy to learn but still offers complex functionality for more experienced programmers. Python gives you the ability to write naturally rather than memorising arcane syntax rules.

Machine learning, natural language processing (NLP) and deep learning are among Python's most popular use cases.

Although you can find many Python tutorials on the web, my top pick is the PCAP course by the Python Institute, an independent certification body. (Python itself was created by Guido van Rossum.)

R

R is one of the most popular programming languages for data science, with an estimated user base of over 2 million. It was created by the statisticians Ross Ihaka and Robert Gentleman at the University of Auckland and first appeared in 1993, as an implementation of the S language developed at Bell Labs. It focuses on statistical computing and graphics. It is free to use, open-source, and has an extensive set of packages that simplify complex analyses with simple code.

R helps you manipulate and query data sets while also producing graphics to visualise the insights gathered from the data. You can use R from the console, from an integrated development environment (IDE) such as RStudio, or in the browser through RStudio Cloud.

If you are in a dilemma about which programming language to learn, I recommend Python.

3. Report Data

No matter the quality of your work, it is crucial to create an understandable report for the relevant stakeholders. Most people prefer graphs and images: a picture is worth a thousand words. The most popular tools for this are Power BI and Tableau. You can learn either of these, as the knowledge gained is transferable. However, my top pick is Microsoft Power BI.

Reports and Visualisation — PowerBI or Tableau

Power BI is part of the Microsoft Power Platform. It is a tool for transforming raw data into informative and interactive visuals. Power BI can turn a dataset into a story that provides insights to anyone, regardless of their technical background. What makes it different from other tools? It is quick and easy to get started, even if you are not familiar with data science. The free version, Power BI Desktop, is enough for learning and can be downloaded from Microsoft.

Fill out this survey if you have experience using business intelligence (BI) tools, and send it to your IT friends with a similar level of expertise. If successful, you will get $50.

Conclusion

Data science is an evolving field, with new tools and methods developed every day. Data scientists must collect data from databases, explore the data for insights, and create reports. Knowledge of SQL, Python and Power BI will give you the ability to do this. Data collection can be expensive and time-consuming, which is why data scientists are so valuable. These are my recommended minimum viable tools to get you started.

Next week I will discuss how data science is contributing to sustainability.

Follow me on LinkedIn.

See you next week. Thank you.

Emmanuel Ogungbemi
Data Science Unlimited

PhD in Engineering and Computing (Sustainable & Renewable Energy). Passionate about politics and data science.