The more I think about it, the more grateful I am that things worked out for me post-PhD exactly as I had hoped – a job I love in a city I love where I can apply the skills I spent years building to investigate questions I find interesting. All the romantics of it aside, I know I’m incredibly fortunate to have landed a data scientist position immediately following my PhD. Even putting aside my fresh-postgrad/no-industry-experience background, it’s a competitive job market, and for a social science PhD like myself, a lot of these data science positions effectively pit me against people with master’s degrees and PhDs in computer science, math, or other far more quantitative fields. And you just know the industry salivates for those quant-tech PhDs.
I knew I had some of the central skills needed for data science – for example, R, Python, and an understanding of common predictive algorithms and models that can be implemented in such languages. I had been taught such topics in class, and as such, they aren’t covered in this article. But as I browsed countless data scientist positions, I quickly became aware that there were various critical skills whose importance no one had ever broached with me. There were also several more that I had always known were important, just not as important as the job market seemed to make them.
And because of this, while I was on the job market, I found myself scrambling to pick up or brush the mental rust off a few skills as I went, to get just decent enough (again) to get myself further down the interview pipeline. Part of me feels like to some extent that’s normal and even fun – heck, I’m good with R, but I still picked up familiarity with a few more niche packages I thought were useful in the process of doing various take-home assignments for employers. But still, it was a nerve-wracking process, and I know I bombed a few technical tests early on because I had just never been trained in certain skills.
So, in case it saves some dear stranger on the internet – or poor friend of mine I send this link to thereby burdening them with a passive obligation to at least check it out – some grief on the job market, I wrote this blog post! If you studied information science/data science/perhaps computer science and such in school, you might find all this a bit elementary. But hey, some of us don’t have a formal education in information science/computer science/related fields, so, here’s to the rest of us who had to fill in the gaps like the most raggedy-ass (but also somewhat badass, I’d like to think) quilt of skills ever.
The thing that had the most insane Importance (very high) to People Telling Me It’s Important (almost nobody) ratio was SQL. I had always heard of SQL, had some vague notion of its existence and place in enterprise data management or such, but I feel comfortable saying this word was never mentioned in any of my classes. Not even in my data mining or machine learning class. We always just got handed neat and tidy (or not, if the prof expected us to do EDA and cleanup) CSV files with exactly the data we needed in exactly the formats we needed them in.
The realization that struck me head-on in my first days at work was that a huge part of being a data scientist wasn’t just running analyses or generating models – it was pulling down exactly the data you need to run those analyses and make those models. And it’s not just a matter of SELECTing FROM – boy, I wish it were just a matter of selecting from (SELECT and FROM are central keywords in SQL queries). It’s about joining together all the stuff you need, grouping by various variables, and generating aggregates appropriately.
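To make the joining-and-grouping part concrete, here is a minimal sketch using Python’s built-in sqlite3 module, so it runs without any database server. The customers and orders tables, their columns, and every number in them are made up purely for illustration.

```python
import sqlite3

# Hypothetical toy tables; all names and values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 15.0), (13, 3, 50.0);
""")

# More than SELECT ... FROM: join the two tables on their shared key,
# group the joined rows by region, then aggregate within each group.
query = """
    SELECT c.region,
           COUNT(o.order_id) AS n_orders,
           SUM(o.amount)     AS total_amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
region_summary = conn.execute(query).fetchall()
print(region_summary)  # [('East', 3, 100.0), ('West', 1, 15.0)]
```

The result is one row per region rather than one row per order, which is usually the shape an analysis or model actually wants.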
That I need to mention SQL here might blow the minds of some of the more industry- or tech-oriented, but it is actually entirely possible for people to pick up solid R/Python chops without ever having touched SQL! I know legions of people who fall into that camp. Hell, I bet I know people who are absolute stats and R geniuses who have never touched SQL in their entire lives.
Anyway, it should really say something about my experience regarding SQL’s importance and its lack of coverage in at least social science academia that I’m not only mentioning it, but mentioning it first and foremost.
But Really, SQL – And Table-Pushing in General
I know I just harped on about SQL for an entire section, but the importance of SQL generalizes to the importance of overall fluency in “table-pushing”. I use the slang “table-pushing” to describe any and all extraction and transformation of data regardless of language, whether SQL, Python (e.g. Pandas), or R (e.g. dplyr). In layman’s terms: how to take multiple tables of data and transform, manipulate, and aggregate them to produce tables with exactly the data you need for a particular task.
The logic and lingo stay somewhat consistent regardless of which language you use. In fact, there are even packages dedicated to letting you use SQL syntax to manipulate data in Python or R (sqldf in R, for example). And I will say that during my job hunt, these table-pushing skills were what got examined the most in various take-home or live technical tests. I started out terrible at it and got better as I went.
And the reality is, these general table-pushing skills can be clutch even without any big databases from which to query data. I can’t even count the times in years of yore, when I was stupid(er) and young(er), that I wrote some stupid for loop where I could’ve just joined the tables had I known more about joins.
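Here is a rough sketch of that for-loop-versus-join situation, again using only Python’s built-in sqlite3 module. The scores and majors tables and their contents are hypothetical, and the same idea carries over to a merge in pandas or a join in dplyr.

```python
import sqlite3

# Hypothetical data: test scores plus a lookup table of student majors.
scores = [("ana", 91), ("ben", 78), ("cy", 85)]
majors = [("ana", "sociology"), ("ben", "economics"), ("cy", "sociology")]

# The loop approach: for every score row, scan the other table for a match.
looped = []
for name, score in scores:
    for major_name, major in majors:
        if name == major_name:
            looped.append((name, score, major))

# The join approach: state the matching rule once and let the engine do it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.execute("CREATE TABLE majors (name TEXT, major TEXT)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", scores)
conn.executemany("INSERT INTO majors VALUES (?, ?)", majors)
joined = conn.execute("""
    SELECT s.name, s.score, m.major
    FROM scores AS s
    JOIN majors AS m ON m.name = s.name
    ORDER BY s.name
""").fetchall()

print(looped == joined)  # True: same rows either way
```

Both versions produce the same three rows, but the join says what you want (rows matched on name) instead of spelling out how to hunt for each match.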
I was tempted to put Tufte here like the cliche I am… but then I decided I’d rather put nothing.
So, I technically took a data visualization class, which I enjoyed, and I’d like to think I generally have a decent design sense. Visualization’s inclusion here isn’t to say I never learned it – it’s to say that no one really properly drills into you its importance. Being able to generate plots that effectively communicate a substantive depth of information is practically an artform, and one could theoretically sit there all day thinking up ways to embed as much information as reasonably possible into a single visualization.
Like, we think of generating barplots, line graphs, pie graphs (useless), and whatever and think that’s that, but there’s often more beyond that to consider. How can you mess with the colors of the bars? Do you want grouped bars offset or stacked, why, and not to mention, what are you grouping the bars by? How are you distinguishing the different bars? Patterns or colors or alpha or what? What color palette are you going to use? That’s just one example, getting obnoxious with a barplot. But you could do the same with scatter plots, with point size and color and transparency and stroke.
There is indeed such a thing as putting too much up at once though, and it’s worth practicing visualization and bouncing your work off others to constantly iterate on your sensibilities about what looks like “too much”. There is something to be said about the “holistic” nature of design sensibilities, how it’s something you kinda have to get a feel for by looking and trying things.
Sharing Your Work
Get used to simplifying ideas. Sometimes you run some complicated analyses, with certain elements whose somewhat complicated technical underpinnings you actually do understand. You might even be proud of the particular technical tasks you pulled off in doing those analyses.
Unfortunately, nobody cares about the details as much as you do. Well, not literally – some people care! To varying degrees! Obviously when you talk to your team, you can be pretty technical, get into the nitty gritty. Sometimes your team members might be able to go even more into the nitty gritty than you can in various ways. But pretty much everyone outside of your data organization cares more about you being able to take all that nitty gritty and communicate it in a way that emphasizes the practical import and business relevance.
Because realistically speaking, it’s these practical findings – the practical meaning, impact, and consequences that everyone can grasp – that make the world go round. No one cares about interactions, collinearity, mtry, ntree, epochs, k, whatever. They care that you can accurately predict X on Y% of occasions. They care that the data suggests that with an X-unit increase in A, B goes up by Y. And so on. So make sure you hone your ability to keep one eye on the practical implications and explain your super technical work in relation to said implications.