On column names

Most people seem to agree that column names should be singular and contain no white space. Beyond that, there are various stylistic suggestions for how to go about something as simple as naming a column. Here I explore the use of underscores versus mixed case to separate words. One reservation before starting: if a naming convention is already in place, it is better to follow it than to mix in another one.

Shortly after starting to learn Python and pandas, I transitioned from snake case (e.g. column_name) to Pascal case (e.g. ColumnName, also called upper camel case). My principal motivation was that I found mixed case more appealing in table representations of the data in the console and in axis labels when plotting. The image analysis software I worked with at the time also exported its data as CSV files with Pascal-cased column names.

As I am part of the teaching staff for an upcoming class in reproducible research practices, I recently exchanged some opinions regarding the default naming conventions we will use in R. My limited experience with R and the lack of an official style guide drove me to research the conventions used by major R data analysis packages. I appreciated that the tidyverse (a collection of R packages including dplyr, ggplot2, magrittr, and more) recommended underscores over period-delimited variable names, but I noted that there was no official recommendation for the column names of a data frame. However, all of the tidyverse examples used snake case. Likewise, another popular R package for data analysis, data.table, followed the same convention in its tutorials.

In addition to studying R, I have recently started to learn more about SQL. While working through the examples in the PostgreSQL documentation, I noticed that they also follow the snake case convention for column names. Further research showed that the MySQL documentation had made the same choice. Although there is ongoing debate around the appropriate naming conventions for column names in SQL tables, there are good reasons for favoring underscores over mixed case in SQL. PostgreSQL folds unquoted identifiers to lower case, and MySQL treats column names as case insensitive, so the capitalization that separates the words in a mixed case name carries no meaning to the database. In PostgreSQL, preserving the case of a column name requires surrounding it with double quotes every time it is referenced, which amounts to unnecessary typing.
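To make the PostgreSQL behavior concrete, here is a small pure-Python sketch of the folding rule. The function name fold_identifier is my own invention for illustration and is not part of PostgreSQL or of any database driver. It only approximates the rule for plain ASCII names: unquoted identifiers are lower-cased, while double-quoted identifiers keep their case.

```python
def fold_identifier(identifier: str) -> str:
    """Approximate how PostgreSQL resolves an identifier written in a query.

    Hypothetical helper for illustration only; assumes plain ASCII names.
    """
    is_quoted = (
        len(identifier) >= 2
        and identifier.startswith('"')
        and identifier.endswith('"')
    )
    if is_quoted:
        # Quoted identifiers keep their case; a doubled quote is an escaped quote.
        return identifier[1:-1].replace('""', '"')
    # Unquoted identifiers are folded to lower case.
    return identifier.lower()


for written in ["ColumnName", "columnname", '"ColumnName"', "column_name"]:
    print(f"{written!r:16} resolves to {fold_identifier(written)!r}")
```

Under this rule, ColumnName and columnname resolve to the same name, so the word boundary survives only in the quoted form or in the snake case form.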

In addition to the aforementioned packages, pandas and seaborn also use underscores for column names in their documentation, and the sample datasets in seaborn all follow this convention. The prevalence of snake case in these renowned packages has made me reconsider my initial choice of Pascal case. My main reason for switching was only a minor convenience, especially since I relabel axes before publication anyway (to be more descriptive and to separate words with spaces, e.g. Column name (units)). Unless I discover specific major disadvantages of snake case, I am better off developing habits consistent with the default in these packages, since I attach significance to the fact that so many major analytical and database packages adhere to the same standard in their documentation.
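For anyone making the same switch, converting existing Pascal-cased names and building publication labels from snake case names can be sketched as follows. Both to_snake_case and to_axis_label are hypothetical helpers written for this post, not functions from pandas or seaborn, and the regular expressions assume simple ASCII names.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a Pascal case, camel case, or space-separated name to snake case."""
    # Replace runs of white space or periods with a single underscore.
    name = re.sub(r"[\s.]+", "_", name.strip())
    # Insert an underscore between a lowercase letter or digit and a capital.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split a run of capitals from a following capitalized word, e.g. "CSVFile".
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    return name.lower()


def to_axis_label(name: str, units: str = "") -> str:
    """Turn a snake case column name into a label such as 'Column name (units)'."""
    label = name.replace("_", " ").capitalize()
    return f"{label} ({units})" if units else label


print(to_snake_case("ColumnName"))
print(to_axis_label("column_name", "units"))
```

Because the rename method of a pandas data frame accepts a function for its columns argument, something like df.rename(columns=to_snake_case) should convert every column of an existing data frame named df in one step.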

Interestingly, I noticed another pattern in the documentation of these packages: they tend to use plural names for data frames and tables. Based on my reading on StackOverflow and similar sites, this seems to be more of a divisive issue in the SQL community than among R or Python users. I will probably stick to plural for both data frames and database tables, which is what I have been using in pandas so far and is also consistent with how I name directories in my file hierarchy (although I often abbreviate those and drop the trailing 's').
