When you get started with data science, the sheer breadth of the field can feel overwhelming. It may seem like you need to become an expert in relational databases, big-data systems, computer science, statistics, linear algebra, data visualization, machine learning, SQL, data engineering, Kubernetes, Docker, and a lot more. One of the biggest misunderstandings I notice is the perception that you have to become an expert in every one of these areas to secure your first role.
One of the first things I tell most aspiring data scientists is that extracting value from data is the most important thing to focus on when becoming a data scientist. Let this become your personal mantra and let it guide you as you develop your skills. If, for example, the company you work for is a heavy user of tool Z, then by all means learn tool Z. But don’t assume you need to become an expert in tools X, Y, and Z to land a role as a data scientist. You won’t, and it is far better to build a solid foundation in a couple of techniques and tools than a shallow understanding of all of them.
There are some exceptions. I do think that anyone who wants to pursue a career as a data scientist must gain an in-depth understanding of SQL, some of the introductory machine-learning algorithms, and a scripting language commonly used in data science, such as R or Python. At the same time, as I mentioned above, data science is about getting as much value out of the data as possible. The most important skill of all is the ability to fully understand the business problem, followed by applying a data science technique to solve it.
Solve Problems Efficiently
The tools and skills you should focus on are the ones that let you solve business problems quickly and efficiently. Take automated machine learning: if you have not yet worked with AutoML, I suggest looking at TPOT, an open-source AutoML library. Once you have extracted the features you need, TPOT uses genetic programming to search for a strong machine-learning pipeline, and it will even generate the Python code for that pipeline.
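To make the idea concrete, here is a minimal sketch of the pipeline-search concept that TPOT automates. TPOT’s own interface is `TPOTClassifier(...).fit(X, y)` followed by `export("pipeline.py")`, but since TPOT may not be installed everywhere, this sketch uses scikit-learn’s `GridSearchCV` as a hand-rolled stand-in: a tiny search over one pipeline’s hyperparameters, where TPOT would search over entire pipeline structures. The dataset and parameter grid are illustrative choices, not anything prescribed by TPOT.

```python
# Stand-in for what TPOT automates: searching for a good ML pipeline.
# TPOT explores a huge space of pipelines with genetic programming;
# here we grid-search one small pipeline as an illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search a small hyperparameter grid; TPOT would also search over
# which preprocessing steps and model families to use at all.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

The point is not this particular grid; it is that the search itself is mechanical, which is exactly why tools like TPOT can automate it.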
The important point is that TPOT, and offerings like it, make it much easier to generate and build appropriate machine-learning models. An aspiring data scientist therefore should not spend too much time on manual model selection and tuning, since much of it will likely be automated soon, unless they are interested in algorithm-development work. I suspect this reality makes many data scientists slightly nervous. While TPOT and other automated solutions will not always produce the very best model, they often come extremely close, and the question becomes whether it is worth your time to chase that last 0.02% of model performance.
ETL, cleaning, and data ingestion are generally a major drain on a data scientist’s time. I have long loved Apache Drill because it lets data scientists query any self-describing data using SQL. A Python module lets you query Drill and pull the results straight into a pandas DataFrame, so querying arbitrary data and landing it in a vectorized data structure suddenly becomes trivial and time-efficient. Combine this with an auto-summarizing library like pandas-profiling and you can turn raw data into an exploratory summary in two or three lines of code. Add the automated machine-learning tools I mentioned above and you will build models in far less time than you would manually.
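Here is a sketch of that “raw data to exploratory summary in a few lines” workflow. Querying Drill itself requires a running drillbit, so in this sketch a small block of semi-structured JSON stands in for query results, and pandas’ built-in `describe()` stands in for the much richer report pandas-profiling would produce. The column names and values are invented for illustration.

```python
# Sketch: arbitrary semi-structured data -> DataFrame -> quick summary.
# A Drill query result would arrive the same way; here JSON stands in.
import io

import pandas as pd

raw = io.StringIO(
    '[{"city": "Austin", "sales": 120, "returns": 3},'
    ' {"city": "Boston", "sales": 95, "returns": null},'
    ' {"city": "Chicago", "sales": 210, "returns": 7}]'
)
df = pd.read_json(raw)  # vectorized DataFrame from arbitrary JSON

# pandas-profiling would generate a full HTML report here; describe()
# gives a one-line stand-in summary, including the missing value above.
summary = df.describe(include="all")
print(summary)
```

Note that the null `returns` value survives as a NaN, which is exactly the kind of imperfection the next section is about.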
Data Will Never Be Clean, So Deal With It
I have watched many new data scientists start a project only to discover that the data is corrupt, hard to access, incomplete, or requires considerable effort to use; in other words, a far cry from the standardized Kaggle datasets used in data science “boot camps”.
Messy data was, and still is, a major challenge in data science. So the best advice I can give a new data scientist is to learn how to work with impure or imperfect data. That means investing your focus, skills, and effort in techniques and tools that let you work successfully with these difficult datasets. I like Apache Drill because it lets you access and query large volumes of complex data quickly and efficiently without writing code. There are other tools as well, but as you develop your skills, focus on finding faster and more effective ways to access and manipulate data of any variety.
Data Science Involves More Than Just Machine Learning
When you look at the data science curriculum at a boot camp or university, you may notice an intense focus on “machine learning”. Machine learning may be the best-known component of data science, yet data science entails far more. It is really about identifying the right technique to extract value from the data. In some cases the solution is simple statistics, while in others it involves a very complex machine-learning model. The point is that the data scientist must be able to prescribe the right solution for the stakeholders.
Here is a personal story I want to share with you. While working for one client, I discovered that the most valuable analytic I built simply joined two datasets together. I can’t discuss the details, and the mechanics of the join were far from trivial, yet this very simple analytic involved no machine learning, and it drove policy.
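Since I can’t share the real datasets, here is a purely hypothetical illustration of the pattern: two plain tables, one join, a simple derived rate, and no machine learning anywhere. Every name and number below is invented.

```python
# Hypothetical illustration: two datasets, one join, no machine learning.
import pandas as pd

incidents = pd.DataFrame({
    "region": ["north", "south", "south", "east"],
    "incidents": [4, 11, 9, 2],
})
budgets = pd.DataFrame({
    "region": ["north", "south", "east"],
    "budget": [50_000, 30_000, 45_000],
})

# Aggregate, join, and compute a simple rate -- the whole "analytic".
merged = (
    incidents.groupby("region", as_index=False)["incidents"].sum()
    .merge(budgets, on="region")
)
merged["incidents_per_10k_budget"] = (
    merged["incidents"] / merged["budget"] * 10_000
)
print(merged.sort_values("incidents_per_10k_budget", ascending=False))
```

A table like this can be enough to redirect spending, which is the kind of stakeholder value the job is actually about.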
Don’t Tell Me Your Worth, Rather Prove It
I have spoken with many people after they completed a data science course or boot camp, and most of them ask how to find their first job. If you lack professional experience, I suggest finding a project you are enthusiastic about, and that you are allowed to share, and then sharing it. Apply the skills you have just acquired to something that genuinely interests you. I have seen great projects on restaurant data, sports analytics, and more. Whatever you choose, document the journey on a blog and/or GitHub. The specific problem doesn’t really matter; work hard on it and you can point to it when you start interviewing.
As an employer, this shows me a few things. First, you can solve unscripted problems, which matters because real-world problems won’t come with a script to follow. It also shows me that you can carry a project end to end and create value for stakeholders, which matters because that is the actual job of a data scientist. Finally, you get to demonstrate your technical skills in a much more meaningful way.