Data Engineer Interview 101

I consulted a number of sites when I was interviewing for my first role as a Data Engineer. Aside from blogging about the things I learned, some of those resources came in really useful, and I would love to share them with you as you prepare for your data journey.

The idea with data is to build a portfolio, just like you would as a photographer or model. This is by no means a standard for interviews, but just as we have the Cracking the Coding Interview book for software roles, I believe it helps to have a resource that covers data engineering and best practices. The list below gives you some ideas as you prepare for data engineering interviews.

  • SQL: I know, good old SQL. It is a vital language to have in one’s toolbox. How can you use it with a large amount of data? You should be able to answer questions such as:
    • What is a transaction? (See the sketch below.)
    • Of the four V’s of big data (volume, velocity, variety, veracity), which is the most important and why?

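To make the transaction question concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the accounts table and the amounts are made up for illustration. A transaction groups statements so they either all commit or all roll back:

```python
# A transaction keeps the database consistent: either both updates below
# become permanent together, or neither does (the "A" in ACID).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Transfer 40 from alice to bob: both updates must succeed together.
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    conn.commit()  # make both changes permanent atomically
except sqlite3.Error:
    conn.rollback()  # undo everything if any statement failed

print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 60, 'bob': 40}
```
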
Here you could also be asked to talk about data transfer, and popular tools like Presto and Apache Sqoop come to mind. Yes! I know you are thinking about it too: the famous join questions. However, with big data come more advanced join questions about strategies such as the hash join, streaming join, or broadcast join.
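
As a quick illustration, here is a toy hash join in plain Python. Engines like Presto or Spark apply the same build-then-probe idea at scale, and a broadcast join simply ships the small build side to every worker; the tables below are made up:

```python
# Toy hash join: build an in-memory hash table on the smaller table,
# then stream the larger table through it, emitting matching rows.
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]        # small (build) side: (user_id, name)
orders = [(101, 1), (102, 2), (103, 1)]   # large (probe) side: (order_id, user_id)

def hash_join(build_rows, probe_rows):
    # Build phase: hash the small side on the join key.
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # Probe phase: stream the large side, looking up matches in O(1).
    for row_id, key in probe_rows:
        for value in table.get(key, []):
            yield (row_id, key, value)

print(list(hash_join(users, orders)))
# [(101, 1, 'alice'), (102, 2, 'bob'), (103, 1, 'alice')]
```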

  • Understand Databases: Here we take into consideration database design and also data warehousing. Every application rests on some form of a database management system. So what is the best model in terms of data cataloging and governance? One common interview question is the difference between a clustered and a non-clustered index (see the index sketch after this list).
  • Data Processing Architecture: With the rise of social media applications that rest on cloud services, and all of our IoT appliances, it is important to decide on a proper architecture when building your system. The two most important data processing architectures are Lambda and Kappa. Which of these do you prefer, and why? Or when should each be used?
  • Data Pipeline and Automation: This is a common requirement in job descriptions. How do you plan to handle the ETL process? Having a proper data pipeline reduces the work by almost 70%. There are different tools such as Airflow, Kafka, or Luigi (see the Airflow sketch after this list). Automation is vital to the ETL process: data is ever-changing, and it would be daunting to run scripts manually every day to pull data.
  • Others: There are other areas one should be familiar with, such as version control (Git), documentation (Confluence), and Agile/Scrum.
  • Cloud: We are in the cloud age, so it helps to know the popular cloud infrastructure providers and how they tie in with different data tools. When we talk about storage, the most popular option is the S3 bucket (see the S3 sketch after this list).
  • Data Governance and Strategy: This is a vital skill to have. How the application is designed and how problems are solved should rest on a solid data strategy foundation. The talk of privacy and security is not going away anytime soon, so it is important that we address it as early as the proof-of-concept (PoC) stage.
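
For the index question in the Understand Databases bullet: a clustered index determines the physical order of the table rows themselves (so a table can have only one), while a non-clustered index is a separate structure that points back at the rows. Here is a minimal sketch with Python’s sqlite3; note this is an analogy, since SQLite has no CLUSTERED keyword, but an ordinary table is effectively clustered on its rowid, and any extra index you create is non-clustered:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table's rows are physically ordered by rowid (id aliases it here),
# which plays the role of the clustered index.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user TEXT, ts TEXT)")
# A secondary (non-clustered) index: a separate B-tree pointing back at rows.
conn.execute("CREATE INDEX idx_events_user ON events (user)")

# EXPLAIN QUERY PLAN shows which structure the optimizer picks for a lookup.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user = 'alice'"
):
    print(row)  # ... SEARCH events USING INDEX idx_events_user (user=?)
```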
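
To make the pipeline and automation bullet concrete, here is a minimal sketch of an Apache Airflow DAG that runs a daily extract-then-load job so nobody has to trigger scripts by hand. The DAG id and the extract/load callables are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")  # placeholder for a real extract step

def load():
    print("loading data into the warehouse")  # placeholder for a real load step

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # the scheduler runs this automatically each day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # extract must finish before load starts
```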
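
And for the cloud bullet, a minimal sketch of talking to an S3 bucket with boto3, the AWS SDK for Python. The bucket name and file paths are hypothetical, and it assumes AWS credentials are already configured in your environment:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file into the bucket under the raw/ prefix.
s3.upload_file("daily_extract.csv", "my-data-lake", "raw/daily_extract.csv")

# List what is stored under that prefix.
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```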

This by no means covers the whole data engineering journey, but it does help to have an idea of what the interview process is all about. Aside from behavioral questions to gauge one’s fit for the team, there is the analytical portion to see how you approach problems. In our next post, we will be tackling some of these areas in depth with personal and, of course, 21st-century-compliant questions 🙂
