Following the article "What is data science?", a question arises: what skills are needed to tackle this inevitable transition?
Let's take a look at the three main components of future big data experts:
- Knowledge,
- Know-how,
- Soft skills.
Concerning knowledge, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is to big data what DMAIC is to Lean Six Sigma. It already sheds light on the different steps of such a project with a clear process:
- A business understanding phase: a data scientist is results-oriented par excellence and works hand in hand with the other business experts in his or her company, both in defining objectives and in managing the project. In short, to make the data speak, the data scientist must first and foremost make the experts speak.
- A data understanding phase: a data scientist will often go into the field to understand reported anomalies and missing values, and will carry out the necessary cleaning.
- A data preparation phase: this is the art of feature engineering, which consists of grouping variables or pre-processing data to facilitate future analyses.
- A modeling phase: this is the core activity of data scientists, who will use machine learning or deep learning depending on the subject, with a whole range of possible algorithms. They can choose between supervised learning (the target to reach is known), unsupervised learning (there is no target) and reinforcement learning (the model receives a reward when it succeeds, a principle used in video games, for example).
- An evaluation phase: since the data scientist is looking for efficiency above all, she/he will give priority to short-loop validation (on a single line, on a subset of the data) before deploying in the field.
- A deployment phase: here, understanding how networks and servers work (Spark, Hadoop, etc.) is key to getting the most out of the modeling phase. In a word, a good data scientist already knows, from the moment the dataset arrives, how she/he will implement her/his model. This is the heart of the skill: a multi-faceted profile that combines the different phases in a process approach (Lean par excellence!) and always aims for results as soon as possible. In this sense, a data scientist differs from a "pure" statistician by a vision that is not strictly academic but oriented towards the most appropriate approach, while openly stating the error observed within the available time frame (transparency is therefore essential, even mandatory! For example, in facial recognition, which uses artificial neural networks, the error rate can be as high as 15% depending on skin color; this must be factored into any decision...). Some tools (such as Google AutoML Tables) let you load a dataset, ask how much time you have, run servers with different algorithms in parallel, compare them and finally propose the most "accurate" one.
- These tools give non-coders access to quick machine learning results without writing code (Python or R, for example), which remains the privilege of data scientists. They help democratize AI, especially in industry. A data scientist will also play a key role in reducing the amount of data transiting through servers, out of environmental concern (less power consumption, less cooling required; in a word, moving from "big" data to "bit" data), and will be able to resist the "temptation" to use every existing algorithm in extenso (most have fallen into the public domain and are therefore available off the shelf free of charge), sharpening through practice her/his knack for detecting the ones best suited to each concrete case.
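The preparation, modeling and evaluation phases described above can be sketched in a few lines of Python with Scikit-Learn, one of the libraries cited later in this text. The dataset and model choices here are purely illustrative, not a prescribed method:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data understanding / preparation: load a toy dataset and split it
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Modeling: chain a pre-processing step (feature scaling) and a classifier
model = Pipeline([
    ("scale", StandardScaler()),                 # data preparation
    ("clf", LogisticRegression(max_iter=1000)),  # supervised learning
])
model.fit(X_train, y_train)

# Evaluation: short-loop validation on held-out data before any deployment
print(f"accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

The `Pipeline` object mirrors the process approach described above: each phase is an explicit, named step that can be swapped out without touching the rest.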
As far as know-how is concerned, this is the essence of a data scientist: she/he practices on datasets again and again. Better still, she/he advances with her/his project in all dimensions: prototyping, "tinkering" with solutions, testing in short loops and, above all, training her/himself on the best techniques with tutorials on the internet, MOOCs (FUN-MOOC, OpenClassrooms, etc.) and readily available libraries (the best known being TensorFlow, Scikit-Learn and Plotly), by coding with open-source languages! Moreover, a data scientist applies project-based learning: she/he learns "with" her/his project and has no preconceived ideas before confronting her/his dataset. She/he knows that, after some research, she/he will be able to reuse the work of others already operating in her/his application domain, including the famous "deep learning" (artificial neural networks), which continues to evolve actively. The algorithmic arsenal to master is quite impressive: naive Bayesian classification, linear and logistic regression, the nearest-neighbor method, decision trees (CART, recently integrated into Minitab, random forests, gradient boosting), support vector machines and kernel methods, etc.
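Much of this arsenal is available off the shelf in Scikit-Learn, so the algorithms listed above can be compared side by side in a short loop, exactly the kind of pragmatic benchmark described here. A sketch on a toy dataset (the dataset, hyper-parameters and fold count are illustrative assumptions):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# One representative per algorithm family named in the text
models = {
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=5000),
    "nearest neighbors": KNeighborsClassifier(),
    "decision tree (CART)": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "SVM (kernel method)": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name:22s} mean accuracy = {scores.mean():.3f}")
```

Running such a comparison on each new dataset is precisely how a practitioner sharpens her/his detection of the algorithm best suited to a concrete case.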
An important point is also clustering, which consists of searching for affinities without setting a target; it offers a promising research domain for the future because it does not require a labeled target to train on (a concrete example is the show on the French TV channel M6, "Married at First Sight", based on this type of algorithm).
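A minimal sketch of clustering with Scikit-Learn, using k-means on synthetic data; the number of clusters and the generated data are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data: 3 groups of points, but NO labels are given to the model
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means searches for affinities on its own (unsupervised: no target column)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)  # one centroid per discovered group
print(kmeans.labels_[:10])      # cluster assigned to the first points
```

The model is fitted on the raw points alone; the groups it finds are then interpreted after the fact, which is what "searching for affinities without setting a target" means in practice.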
Concerning soft skills, a data scientist belongs to the millennial generation and is therefore rather "geeky". She/he is not fundamentally looking for hierarchical advancement, or even merit-based recognition, but rather to "live" an experience at every moment. Clearly, what keeps a data scientist going is the intrinsic interest of the mission she/he is carrying out, not promises for tomorrow, much less the prospect of a later promotion. The motivation comes from the dataset itself! For those who have met such profiles in real life, they are easy to spot: they are the ones who eat pasta in a parking lot several days in a row, not wanting to be disturbed until they have finished their mission...
Concerning soft skills, once again, it is the challenge of confronting the data that arouses the interest of a real data scientist. This passion is obviously lived out on data-challenge platforms such as the most widespread one (kaggle.com), or on the Challenge Data website of the Collège de France. Numerous trade shows take place regularly, such as https://aiparis.fr/2020/ or https://www.bigdataparis.com/2020/newsletters/, which bring together most of the companies operating in this sector. Training is being democratized through bootcamps, for example in Lyon or Paris: https://jedha.co/lyon/. All this shows that confrontation with the data remains the driving force of data science! Data scientists are obviously interested in development challenges at the GAFAMs, but they are also very fond of concrete cases drawn from real industrial difficulties, such as reducing scrap or increasing the TRP, which are more and more in demand in industry.
Data manipulation cannot be improvised: it requires real experience and multiple skills to master.
Following the example of Lean Six Sigma certification, the authors recommend a big data certification that would validate dexterity in handling datasets, proven by the success of several projects. More precisely, the authors are particularly interested in developing in France an independent certification in big data, data mining, data analytics, machine learning and deep learning, applied in particular to the industrial world. It would combine IT mastery of server and cloud deployment, a facility for "stitching" together code and algorithms by exploiting ready-made libraries, and a phenomenal capacity to obtain results in a pragmatic and passionate way, working in a team and treating each case as a unique experience of data exploration. In the end, data science will probably be divided between two worlds: those who use software with neat interfaces (Minitab's Salford SPM, Weka, Google AutoML Tables, etc.), industrialists for example, and those who code (the "real" data scientists).
If you want to contribute to this project or "crack data", do not hesitate to join us in this adventure!