Overall, I thoroughly enjoyed working on this project, and I look forward to continuing it. If you’d like to check out the project as it currently stands, please note that, due to its nature, it contains content that is racist, sexist, ableist, homophobic, transphobic, and offensive in numerous other ways. I have censored the figures for this blog post, but my actual project notebook and README are not censored. You’ve been warned, and now you can find my project here.
It can be difficult to find any consensus on what “feature engineering” specifically refers to. Some people consider feature engineering to include the data-scrubbing steps that get your data into a format usable by machine learning (ML) algorithms. These include dealing with missing values and outliers, encoding non-numerical data, and transforming and scaling variables. I tend to think of those as steps in a preprocessing pipeline, separate from feature engineering. To me, feature engineering is mainly about using the variables you already have to create additional features that are (hopefully) better at representing the underlying structure of your data. But before we go any further, we need to step back and answer an important question.
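To make that distinction concrete, here is a minimal sketch of what I mean by creating new features from existing variables. The data and the column names (`sqft_living`, `bedrooms`, `yr_built`, `yr_sold`) are made up for illustration, and the two derived features are just plausible examples, not a recommendation for any particular dataset.

```python
import pandas as pd

# Hypothetical housing data; the values and column names are illustrative assumptions.
houses = pd.DataFrame({
    "sqft_living": [1180, 2570, 1960],
    "bedrooms": [3, 3, 4],
    "yr_built": [1955, 1951, 1965],
    "yr_sold": [2014, 2014, 2015],
})

# Feature engineering in the sense used here: combine existing
# variables into new ones that may describe the data better.
houses["age_at_sale"] = houses["yr_sold"] - houses["yr_built"]
houses["sqft_per_bedroom"] = houses["sqft_living"] / houses["bedrooms"]

print(houses[["age_at_sale", "sqft_per_bedroom"]])
```

Nothing here cleans, encodes, or scales the data. Those steps would sit in the preprocessing pipeline, while the two new columns are the engineered features.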
Outliers are essentially data points that seem like they may not belong with the rest of the data. They are observations that stick out and may cause you to question whether they were generated by the same process or mechanism as everything else. If an observation wasn’t, then it doesn’t rightfully belong in the population represented by the rest of the data.
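One common rule of thumb for flagging points that “stick out” is Tukey’s fences, which mark anything more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below is only one possible approach, using made-up prices (in thousands) and a hypothetical helper named `find_outliers`. A flagged point is a candidate for a closer look, not proof that it came from a different process.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values that fall outside Tukey's fences.

    The fences are Q1 - k * IQR and Q3 + k * IQR, where IQR = Q3 - Q1.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up house prices in thousands of dollars.
prices = [210, 225, 240, 250, 265, 280, 1500]
print(find_outliers(prices))
```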
Bar plots (also known as bar graphs or bar charts) are among the most commonly used ways of visually representing data. They are extremely useful for displaying categorical data, or numerical data that is easily and intuitively binned into a handful of groups. Let’s go ahead and look at an example that I’ll continue to work with throughout the rest of this post. The graph below is a bar plot, created with Python’s Seaborn library. It plots house price on the vertical y-axis against house grade (a measure of the relative condition and quality of the home) on the horizontal x-axis. Note that you can also create horizontal bar graphs; all you need to do is swap the axes.
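A plot along these lines could be produced with something like the following sketch. The `houses` DataFrame and its `grade` and `price` values are invented for illustration and are not the data behind the figure in this post.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: two houses per grade (illustrative values only).
houses = pd.DataFrame({
    "grade": [6, 6, 7, 7, 8, 8],
    "price": [250_000, 290_000, 380_000, 420_000, 510_000, 560_000],
})

fig, ax = plt.subplots()

# By default, each bar's height is the mean price within that grade.
sns.barplot(data=houses, x="grade", y="price", ax=ax)
ax.set_xlabel("House grade")
ax.set_ylabel("Mean price")

# Call plt.show() or fig.savefig(...) to view or save the figure.
```

For a horizontal version, you would pass `grade` as `y` and `price` as `x`. Because both columns are numeric here, you may also need to tell Seaborn which way the bars should run through the `orient` parameter.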
I took a very roundabout way to get started on my journey into learning data science. Now that I’ve officially set off on this path, it feels right. I’ve been trying to sort out what I want and need from a career since I graduated from college in 2013. I went to grad school thinking I might like to be a professor. I have always enjoyed learning, and I knew that to have a fulfilling career I would need to continue learning new things and being exposed to different ways of thinking. Eventually I realized that higher academia was not quite the right path for me, even though there were aspects of it that I really enjoyed.