How to Write Clean R Code - 5 Tips to Leave Your Code Reviewer Commentless

March 15, 2021

Updated: September 27, 2022. <h2>How to Write Clean R Code</h2> Over many years of delivering successful projects, I've found one common element in each implementation: a clean, readable, and concise codebase is the key to effective collaboration and to delivering the highest value to the client. Code review is a crucial part of maintaining high-quality code. It is also a great way to share best practices and distribute knowledge among team members. At Appsilon, we treat code review as a must for every project. Read more about how we organize our work in <a class="editor-rtfLink" href="https://wordpress.appsilon.com/remote-data-science-team-best-practices-scrum-github-and-docker/" target="_blank" rel="noopener noreferrer">Olga's blog post</a> on best practices recommended for all data science teams. Having a well-established code review process does not change the fact that the developer is responsible for writing good, clean code! Pointing out all of the code's basic mistakes is painful and time-consuming, and it distracts reviewers from going deep into the code logic or improving the code's effectiveness. Poorly written code can also harm team morale: code reviewers get frustrated, while code authors might feel offended by a huge number of comments. That is why, before sending code for review, developers need to make sure that it is as clean as possible. Also, note that there is not always a code reviewer who can come to the rescue. Sometimes you are on your own in a project. Even if the code seems fine to you now, consider rereading it in a few months - you want it to be clear to avoid wasting your own time later on. In this article, I summarize the most common mistakes to avoid and outline best practices to follow in programming in general. Follow these tips to speed up the code review iteration process and be a rockstar developer in your reviewer's eyes! 
Navigate to a section: <ul><li><a href="#comments">#1 Comments - First Tip for Clean R Code</a></li><li><a href="#strings">#2 Strings - Don't Overuse Certain Functions</a></li><li><a href="#loops">#3 Loops - Are They too Heavy?</a></li><li><a href="#sharing">#4 Code Sharing - Make Things Easier for Your Peers</a></li><li><a href="#practices">#5 Good Programming Practices - A Must for Writing Clean R Code</a></li><li><a href="#conclusion">Summing up How to Write Clean R Code</a></li></ul> <hr /> <h2 id="comments">Comments - First Tip for Clean R Code</h2> Adding comments to the code is a crucial developer skill. However, a more critical and harder-to-master skill is knowing when not to add comments. Writing good comments is more of an art than a science. It requires a lot of experience, and you can write entire book chapters about it (e.g., <a class="editor-rtfLink" href="https://books.google.pl/books/about/Clean_Code.html?id=hjEFCAAAQBAJ" target="_blank" rel="noopener noreferrer">here</a>). There are a few simple rules that you should follow to, well, avoid comments about your comments: <ul><li>Comments should give the reader external knowledge: if they explain what is happening in the code itself, it is a red flag that the code is not clean and needs to be refactored. If some hack was used, then comments might be used to explain what is going on. Do comment on required business logic or exceptions added on purpose. Try to think of what can be surprising to the future reader and preempt their confusion.</li><li>Write only crucial comments! Your comments should not be a dictionary of easily searchable information. In general, comments are distracting and do not explain logic as well as the code does. For example, I recently saw a comment like this in the code: <code>trimws(.) # this function trims leading/trailing white spaces</code> - which is redundant. 
If the reader does not know what the function <code>trimws</code> does, it can be easily checked. A more informative comment can be helpful here, e.g.: <code>trimws(.) # TODO(Marcin Dubel): Trimming white spaces is crucial here due to database entries inconsistency; data needs to be cleaned.</code></li><li>When writing functions in R, I recommend <a class="editor-rtfLink" href="https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html" target="_blank" rel="noopener noreferrer">using {roxygen2} comments</a> even if you are not writing a package. It is an excellent tool for organizing knowledge about the function's goal, parameters, and output.</li><li>Write comments (as well as all other parts of the code) only in English. This makes the code understandable to all readers and can save you from encoding issues that appear when you use special characters from your native language.</li><li>If some code needs to be refactored or modified in the future, mark it with a <code># TODO</code> comment. Also, add some information that identifies you as the author of the comment (so you can be contacted if details are needed) and a brief explanation of why the code is marked as TODO rather than modified right away.</li><li>Never leave commented-out code without an explanation! It is ok to keep some parts for the future or turn them off for a while, but always state the reason for doing so.</li></ul> Remember that the comments will stay in the code. If there is something that you would like to tell your reviewer, but only once, add a comment to the Pull (Merge) Request and not to the code itself. Example: I recently saw part of the code removed and replaced with a comment like: "Removed as the logic changed." Ok, good to know, but later that comment in the code looks odd and is redundant, as the reader no longer sees the removed code. <h2 id="strings">Strings - Don't Overuse Certain Functions</h2> A common problem related to texts is the readability of string concatenations. 
What I encounter a lot is an overuse of the <code>paste</code> function. Don't get me wrong; it is a great function when your string is simple, e.g.: <pre><code class="language-r">paste("My name is", my_name)</code></pre> But for more complicated forms, it is hard to read: <pre><code class="language-r">paste("My name is", my_name, "and I live in", my_city, "developing in", language, "for over", years_of_coding)</code></pre> A better solution is to use the <code>sprintf</code> function or <code>glue</code>, e.g. <pre><code class="language-r">glue("My name is {my_name} and I live in {my_city} developing in {language} for over {years_of_coding}")</code></pre> Isn't it clearer without all those commas and quotation marks? When dealing with many text blocks, it would be great to extract them to separate locations, e.g., to a .yml file. It makes both code and text blocks easier to read and maintain. The last tip related to texts: one of the debugging techniques, often used in Shiny applications, is adding <code>print()</code> statements. Double-check that no prints are left in the code - this can be quite embarrassing during code review! <h2 id="loops">Loops - Are They too Heavy?</h2> Loops are one of the programming building blocks and are a very powerful tool. Nevertheless, they can be computationally heavy and thus need to be used carefully. The rule of thumb that you should follow is: always double-check if looping is a good option. It is rarely the case that you need to loop over rows in a <code>data.frame</code>: there should be a <code>{dplyr}</code> function to deal with the problem more efficiently. Another common source of issues is looping over elements using the length of the object, e.g. <code>for(i in 1:length(x)) ...</code>. But what if the length of <code>x</code> is zero? Then <code>1:length(x)</code> evaluates to <code>1:0</code>, so the loop still runs, with iterator values 1 and 0. That is probably not your plan. Using the <code>seq_along</code> or <code>seq_len</code> functions is much safer. 
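The zero-length pitfall is easy to reproduce. Below is a minimal sketch in base R; the vector `x` and the counter names are made up for illustration and stand in for any object that may turn out to be empty.

```r
# Illustrative only: `x` stands for any vector that may turn out to be empty.
x <- character(0)

# Count how many times each loop body runs.
runs_colon <- 0
for (i in 1:length(x)) {
  runs_colon <- runs_colon + 1  # i takes the value 1, then 0
}

runs_seq_along <- 0
for (i in seq_along(x)) {
  runs_seq_along <- runs_seq_along + 1  # never reached for an empty vector
}

print(1:length(x))     # 1 0
print(seq_along(x))    # integer(0)
print(runs_colon)      # 2
print(runs_seq_along)  # 0
```

With `seq_along(x)` the loop body is skipped entirely for an empty input, which is almost always the intended behavior.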
Also, remember the <code>apply</code> family of functions for looping. They are great (not to mention <code>{purrr}</code> solutions)! Note that a reviewer might flag <code>sapply</code> as unstable, because this function chooses the type of its output itself! So sometimes it will be a list, sometimes a vector. Using <code>vapply</code> is safer, as the programmer defines the expected output type. <h2 id="sharing">Code Sharing - Make Things Easier for Your Peers</h2> Even if you are working alone, you probably would like your program to run correctly on other machines. This is even more crucial when you are sharing the code with a team! To achieve this, never use absolute paths in your code, e.g. <code>"/home/marcin/my_files/old_projects/september/project_name/file.txt"</code>. Such a path won't be accessible to others, and any change to the folder structure will break the code. Since you should already have a project for all coding work, use paths relative to that project - in this case, it will be <code>"./file.txt"</code>. What is more, I would suggest keeping all the paths as variables in a single place - so that renaming a file requires one change in code, not, e.g., twenty in six different files. Sometimes your software needs to use some credentials or tokens, e.g., to a database or private repositories. You should never commit such secrets to the repository! This holds even if the entries are the same across the team. Usually, the good practice is to keep such values in a <code>.Renviron</code> file as environment variables that are loaded at the start, while the file itself is ignored in the repo. You can read more about it <a class="editor-rtfLink" href="http://www.dartistics.com/renviron.html" target="_blank" rel="noopener noreferrer">here</a>. <h2 id="practices">Good Programming Practices - A Must for Writing Clean R Code</h2> Finally, let's focus on how you can improve your code. 
First of all, your code should be easily understandable and clean - even if you are working alone, when you come back to the code after a while, it will make your life easier! Use specific variable names, even if they seem lengthy - the rule of thumb is that you should be able to guess what is inside just by reading the name, so <code>table_cases_per_country</code> is ok, but <code>tbl1</code> is not. Avoid abbreviations. Lengthy is preferable to vague. Keep a consistent style for object names (like camelCase or snake_case), agreed upon among the team members. Do NOT abbreviate the logical values <code>TRUE</code> and <code>FALSE</code> to <code>T</code> and <code>F</code> - the code will work, but <code>T</code> and <code>F</code> are regular objects that can be overwritten, while <code>TRUE</code> and <code>FALSE</code> are reserved words. Do not compare logical values using equality checks, like <code>if(my_logical == TRUE)</code>. If you can compare to <code>TRUE</code>, it means your value is already logical, so <code>if(my_logical)</code> is enough! If you want to double-check that the value is <code>TRUE</code> indeed (and not, e.g., <code>NA</code>), you can use the <code>isTRUE()</code> function. Make sure that your logic statements are correct. Check if you understand the difference in R between <a class="editor-rtfLink" href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html" target="_blank" rel="noopener noreferrer">single and double logical operators</a>! Good spacing is crucial for readability. Make sure that the rules are the same and agreed upon by the team. It will make it easier to follow each other's code. The simplest solution is to stand on the shoulders of giants and follow the <a class="editor-rtfLink" href="https://style.tidyverse.org/" target="_blank" rel="noopener noreferrer">tidyverse style guide</a>. 
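The points on logical values above are easy to verify in a console. This is a minimal sketch in base R; all variable names are illustrative and not taken from any real project.

```r
# T is an ordinary variable, so it can be reassigned; TRUE cannot.
T <- FALSE
t_is_still_true <- isTRUE(T)        # FALSE: T no longer means TRUE
rm(T)                               # remove the shadowing variable again

# if (NA) throws an error, while isTRUE() only accepts a single TRUE.
maybe_flag <- NA
flag_is_true <- isTRUE(maybe_flag)  # FALSE

# `&` is vectorised; `&&` expects single values and short-circuits.
is_positive <- c(TRUE, FALSE, TRUE)
is_even <- c(TRUE, TRUE, FALSE)
elementwise <- is_positive & is_even       # TRUE FALSE FALSE
short_circuit <- FALSE && stop("not run")  # FALSE, right side is skipped
```

As a rough rule, `&&` and `||` belong in `if` conditions on single values, while `&` and `|` belong in element-wise operations such as filtering.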
However, checking the style in every line during the review is quite inefficient, so make sure to introduce a linter and a styler in your development workflow, as presented in <a class="editor-rtfLink" href="https://wordpress.appsilon.com/remote-data-science-team-best-practices-scrum-github-and-docker/" target="_blank" rel="noopener noreferrer">Olga's blog post</a>. This can be lifesaving! Recently we found an error in some legacy code that a linter would have caught automatically: <pre><code class="language-r">sum_of_values <- first_element
  + second_element</code></pre> This does not return the sum of the elements as the author was expecting: the assignment ends at the line break, and the second line is evaluated as a separate expression. Speaking of variable names - naming is known to be one of the hardest things in programming. Thus avoid it when it is unnecessary. Note that R functions return by default the value of the last evaluated expression, so you can easily replace this: <pre><code class="language-r">sum_elements <- function(first, second) {
  my_redundant_variable_name <- sum(first, second)
  return(my_redundant_variable_name)
}</code></pre> With something shorter (and simpler, as you don't need to think about names): <pre><code class="language-r">sum_elements <- function(first, second) {
  sum(first, second)
}</code></pre> On the other hand, please DO use additional variables anytime you repeat some function call or calculation! It will make the code computationally more efficient and easier to modify in the future. Remember to keep your code <a class="editor-rtfLink" href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" target="_blank" rel="noopener noreferrer">DRY - don't repeat yourself</a>. If you copy-paste some code, think twice whether it shouldn't be saved to a variable, done in a loop, or moved to a function. <hr /> <h2 id="conclusion">Summing up How to Write Clean R Code</h2> And there you have it - five strategies to write clean R code and leave your code reviewer commentless. 
These five alone will ensure you're writing great-quality code that is easy to understand, even years down the road. What are your top tips on how to write clean R code? Please let us know in the comment section below. Also, don't hesitate to reach out to us on Twitter - <a href="http://tter.com/appsilon">@appsilon</a>. We'd love to hear from you. Happy coding!
