Who here hasn't opened a job posting whose title had nothing to do with the role description and requirements? Who else has come across a data job posting asking for 10 years of experience in a specific language or tool that was actually created 3 years ago?
Ring a bell? Yeah, I bet it does.
This is exactly why we wanted to write about it: something that will help differentiate all these jobs and help you better understand who's who around here. We will also go more in-depth in a second section of this article to help you build the right team under specific constraints (budget, size, and so on), so that you can make the right choice when recruiting.
This section aims to clarify the many positions around data in order to:
The three most common jobs in data are hugely complementary, and most big companies have all of them, either performing different tasks on different projects or working together on the same project.
The data trinity is composed of three positions: the data scientist, the data analyst and the data engineer. Let's dig into the differences between them.
For each position, some details on salaries are given: they are entry-level, US-based salaries in US dollars. This means you may want to adjust them for your own country. For example in France, because of social charges, taxes and so on, you can get a rough estimate of the gross salary in euros by multiplying them by 2.
This is the most widespread role of all, but also the one with the highest mismatch rate between the job title "Data Scientist" and the job description, that is, what the recruiting company actually wants you to do.
Among all the job postings I've opened, gone through or helped friends with, the description almost never matches the actual job: solving business problems statistically and mathematically through data, building algorithms or models to answer the problem.
Several terms were used here, like models and solving business problems. This job is the only one on this list that performs those tasks. A data analyst won't use statistical or mathematical models; when you see "models" in a data analyst job description, it means data models. Data analysts won't solve business problems but will interpret the data in order to provide the best data-driven advice. Let's dive more into that.
Unlike the data scientist, an analyst is closer to the business decisions and needs a stronger business understanding than data scientists, who are recruited to solve business needs but not necessarily to locate and identify business levers.
A data analyst is someone who has a strong business understanding and sufficient skills in programming and statistics to dig through shaped (and sometimes raw) data in order to help the company make better decisions through a data-driven approach, using reporting tools, visualizations and other deliverables.
The data analyst will ensure data acquisition from the company's many different sources, as well as its maintenance, and will be responsible for statistical analysis and data interpretation. See the difference from the data scientist? Basically, the analyst will enable themselves or the business team to identify a business problem that can be solved either by other units of the company or by data scientists, who will dig deeper to develop a new feature on a product, fine-tune existing features, make precise analyses, etc.
Also, the data analyst has to deal with all the pre-processing and data gathering beforehand, a job mostly worked out by the data analyst or scientist and then optimized by the data engineer.
We talked a lot about data analysis for the data scientist and data analyst, but somehow they've got to get access to the data, and to be effective they need the easiest and smoothest access to it.
This is where the data engineer comes into play, bringing very specific skills that neither of the two previous jobs has: advanced programming knowledge, and all the associated ramifications.
Indeed, neither the scientist nor the analyst is usually a strong programmer (there are of course some, but only a few, and we're talking on average here). Data warehouses, data lakes, data pipelines, ETL, databases, CI/CD, deployment, automation and all the things I'm sure you've already seen or read about are handled by data engineers.
They're also in charge of data accuracy and maintenance. "But hey, didn't you say that the data analyst is in charge of all that?" You're right: the data engineer maintains the quality of data in terms of extraction and processes, while the data analyst maintains the quality of data in terms of business: is the data available today sufficient to make the analysis?
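To make the two notions of data quality a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration (the function names, the row format as plain dictionaries, the example fields); real teams would typically rely on their own pipeline and reporting tools rather than on code like this.

```python
from typing import Iterable

# Hypothetical row format: one dictionary per record.
Row = dict[str, object]


def pipeline_quality_ok(
    extracted_count: int,
    loaded_rows: list[Row],
    required_fields: Iterable[str],
) -> bool:
    """Engineer-side check: nothing was lost or left empty between extraction and loading."""
    required = list(required_fields)
    # Every extracted record must have been loaded.
    if len(loaded_rows) != extracted_count:
        return False
    # Every required field must be present and non-empty in every row.
    return all(row.get(field) is not None for row in loaded_rows for field in required)


def business_quality_ok(
    loaded_rows: list[Row],
    fields_needed_for_analysis: Iterable[str],
) -> bool:
    """Analyst-side check: does the data contain what today's analysis needs?"""
    available = set().union(*(row.keys() for row in loaded_rows))
    return set(fields_needed_for_analysis) <= available


# Illustrative data only.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
]
engineer_view = pipeline_quality_ok(2, rows, ["order_id", "amount"])
analyst_view = business_quality_ok(rows, ["order_id", "amount", "customer_segment"])
```

In this made-up example the pipeline is healthy from the engineer's point of view, yet the analyst still cannot run an analysis by customer segment because that field was never collected: the same dataset can pass one kind of quality check and fail the other.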
Shaped data used by the analysts to create reports? Model automation and deployment, and building APIs? Yeah, a data engineer can help you a lot with these tasks.
This position is very complementary to the two above. We'll see that in the last section of this article, but the more analysts or scientists you recruit, the more data pipelines will need to be handled and the more data tasks will need to be automated or scripted.
All of this requires a tremendous amount of skill across plenty of tools and services, and now that cloud services keep expanding every day, data engineers must keep pace. You can, however, relieve them of a lot of issues and let them focus on what they do best: creating, optimizing and maintaining pipelines. You can do so with a new position in the team: the data architect.
Besides this data trinity, there are still many other data-related jobs out there, for many reasons:
Data engineers tend to take on data architecture work when there is no data architect available, but it takes them a lot of time that is not spent on their real job.
Data architects tend to handle all the abstract notions about data, such as frameworks, models and database management. For a long time, data architects did the job of data engineers, before data engineering became a separate career field in its own right.
Alongside guiding and supporting the data team, data architects conceptualize and visualize the data frameworks (that data engineers build and maintain) and have much deeper expertise in cloud platforms, in order to recommend specific services to automate, deploy, handle databases, and so on.
Automation and deployment are tasks that can be complex enough to keep a data engineer or architect busy full time when they have other things to do. This is where the DataOps engineer comes into play.
First of all, let's define it:
DataOps is the application of DevOps principles to data projects
Alright, those of you who know what DevOps is might have understood the definition, but let's summarize the DevOps principles: the creation, optimization and maintenance of dedicated infrastructure to support the work of the team.
In other words, DataOps engineers will build the right infrastructure for your data team, within the constraints you give them: budget, performance, accessibility, and so on. They tend to help the team reduce development time and standardize the production cycle.
DataOps is mostly a way of applying the right principles rather than a separate career field, meaning that someone in your team who is skilled and disciplined enough to handle these principles of coordination and standardization of development can do the job. Just let that person work on it full time.
Conventionally called ML engineer, the position is straightforward to understand: it combines the best of a data engineer and a data scientist.
ML engineers can single-handedly develop complex modeling projects, from data collection through modeling to the creation of the needed automated and deployed deliverables.
In big companies, however, ML engineers are not paid to do that, but rather to use all this knowledge to oversee and implement a model designed by the data scientist, knowing the architecture and logic of the system behind all the libraries and the company's infrastructure, in order to deliver a top-notch, fully automated and deployed model that produces specific business results.
The data scientist designs the model, but from such a conceptual, proof-of-concept viewpoint that the need arises for a usable architecture and an adaptation to a ready-to-deploy infrastructure. However, the data scientist is still needed, because data scientists are the most focused on machine learning algorithms and more knowledgeable and skilled at them than any other data workers, simply because they spend the most time thinking about these algorithms. For example, data scientists are needed to verify that ML engineers did not introduce biases in their work.
Conventionally called DL engineer, this part will be quite quick because the job is actually the same as the ML engineer's, with the addition that the DL engineer masters deep learning models and their infrastructure needs: machine learning models and deep learning models use significantly different computational resources and might need to be handled differently.
Now let's dive into more exotic positions around data: jobs that are often done by current data workers but that take so much time as the company grows that you might need to recruit someone specialized in the topic, full time.
Some jobs are often handled by other members of the company: for example, a company with only one data scientist will have that person handle all the analytics and engineering duties, and so on until the company grows enough that you need to recruit.
Well, these jobs are one step further: you need specific people for specific jobs, often covering tasks that the above positions could do but no longer have time for now that the data team is big and split into different units, even different countries, etc.
There are many different definitions of this job; some say data librarians are library professionals with data skills. Let's stick to a more general one: the data librarian will support your company in data management (curation, see the next job, data curator; data issues such as copyright or intellectual property, licensing, etc.) but also in metadata management, with the creation of metadata plans and their maintenance, applying industry standards to define them.
They also create data catalogs and a referencing tool to browse them, raise awareness of them within the company, and help data workers and business operators use them to save a lot of time on data analysis.
Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
Data curators look a lot like data librarians but are more technical and go deeper into the code to help the holy trinity above: they are skilled at curating data, meaning the step where the data is shaped, between the data sources handled by the data engineer and the data analysis handled by data analysts and data scientists.
They need to understand both worlds in order to be useful between the two and build the best-fitting bridge to help them both. They make sure that none of the data gathered by the data engineer loses quality while being sent to the analysis team. And that, my friends, is tremendous work.
This job relates especially to data models and data governance; data stewards are usually the ones you want to contact with any data-related question. A data steward's crucial mission is to ensure data quality and the confidence one has when using the data. They define all the data in use and their respective dependencies to avoid any kind of conflict, and maintain the quality of data and workflows by creating processes and data procedures around monitoring and standards. They can also have compliance and data security tasks.
As the ROI of such a hire is hard to quantify, most companies don't feel the need for a steward. In small to medium companies this job is most of the time taken care of by the data engineer. However, the data modeling and governance of big companies, with different teams located in multiple countries and data stored in different cloud regions, can be really painful to handle for someone who is not working on it full time. Having a dedicated data steward can truly help your company reap huge benefits from up-to-date data models and governance, and avoid expensive months of technical debt.
A data protection officer is needed in many Europe-based companies, but the role is often handled by the CTO or CDO. However, if your company handles sensitive kinds of data, or if the data you collect is broad enough to require legal advice, you may want to recruit a data protection associate, as some companies do.
The main functions of a data protection associate are to recommend data protection improvements to stakeholders, but also to investigate and document any data incidents such as data loss, theft or other related issues. The data protection associate will also assess the risks to the company's data assets and ensure the related compliance. Moreover, evaluating the data protection processes of existing and new projects is a key function of the position, alongside promoting data protection awareness within the company and maintaining the data protection communication plan. Basically, for any data privacy or security issue, this position will ensure your company is doing well.
This is the part where your data team will always depend on your company's philosophy: data-driven companies will tend to have a high percentage of data workers overall, while standard companies will tend to have a small data unit of a few people. Big companies might have multiple data units, heterogeneously composed and spread across different countries, etc.
For small and medium companies, if you're starting out and want to be data-driven, a data analyst, or a data scientist who will do the analyst's job, will always be the best choice. While this person will have a full schedule analyzing your data, they will handle the data engineering process along the way. Hiring a data engineer first makes little sense: you already have developers "trying to" collect and store data, which is nothing a data analyst or scientist can't handle. Moreover, data engineers need other data workers' input and feedback in order to work efficiently. There are some cases where hiring a data engineer first is useful, such as non-data-centric companies willing to scale their operational data collection and hire big data engineers, but not that many.
The higher the quality you want from the analyses, the more time you will want to give your data analysts/scientists to dig into the data, and hence you might want to hire more of them so that they have sufficient time to produce the great analyses that will really improve your business choices.
However, more data analysts and scientists will handle more data and will need a proper workspace. To save time, instead of recruiting another data analyst or scientist, you might want to hire a data engineer: the data engineer will take over all the data engineering time currently borne by the data analysts/scientists, saving them a collective amount of time that will result in a huge increase in the quality of the analyses, because they'll focus only on the analyses and stop spending time on collecting, gathering and shaping data (simply because they're not as skilled as data engineers at data engineering work).
Never forget that data engineers are scarce and expensive, which makes for a long (hence costly) recruitment. You might want to have one data engineer for every three data analysts or scientists. This is a rule of thumb that depends entirely on your business and your needs, but on average it can do the trick.
You will want a data architect when you have one or more data engineers; ideally, before recruiting your second data engineer. At first the data architect will help the data engineer with daily tasks, because if you were about to recruit a second data engineer, it means the first one was overloaded. The architect will then define and consolidate a solid data architecture to smooth the daily work of the whole data team.
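The two rules of thumb above (roughly one data engineer for three analysts or scientists, and a data architect before the second data engineer) can be sketched as a tiny Python helper. The names `TeamPlan` and `suggest_team` are made up for this illustration, and rounding down (so that nobody hires an engineer before the first few analysts) is our own assumption, in line with the advice not to hire a data engineer first. Treat it as a starting point for a discussion, not as a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamPlan:
    """Hypothetical summary of a suggested data team."""
    analysts_and_scientists: int
    data_engineers: int
    data_architect: bool


def suggest_team(analysts_and_scientists: int, analysts_per_engineer: int = 3) -> TeamPlan:
    """Apply the article's rules of thumb to a number of analysts/scientists."""
    if analysts_and_scientists < 0 or analysts_per_engineer <= 0:
        raise ValueError("team sizes must be non-negative and the ratio positive")
    # Assumption: round down, so a small team of analysts has no engineer yet.
    engineers = analysts_and_scientists // analysts_per_engineer
    # Assumption: suggest the architect as soon as the plan calls for a second engineer.
    return TeamPlan(
        analysts_and_scientists=analysts_and_scientists,
        data_engineers=engineers,
        data_architect=engineers >= 2,
    )


small_team = suggest_team(2)   # no engineer yet, no architect
growing_team = suggest_team(6)  # two engineers planned, so an architect is suggested
```

The ratio is a parameter on purpose: as said above, it depends entirely on your business and your needs.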
Having some vision on the trinity and subsequent positions:
This is also why "data scientist" is used in every job description from companies that do not really know which position they need to fill. Basically, a true data scientist can do most of these jobs, and the same goes, to a lesser extent, for data engineers. This also explains the corresponding salaries...
As long as you're not a company with several data units, or collecting data from multiple world regions, you might not need a data steward or librarian. Your current data team will be skilled enough to handle the job correctly.
However, when your company starts to grow and expand worldwide, having offices on multiple continents means different data regulations, data collection methods and so on. You might then start to want legal advice, as well as operational stewardship and a data library, to help all the data teams within your company.