How to Write Clean R Code - 5 Tips to Leave Your Code Reviewer Commentless
<em><strong>Updated</strong>: September 27, 2022.</em>
<h2>How to Write Clean R Code</h2>
<span data-preserver-spaces="true">Over many years of experience delivering successful projects, I've found one common element in each implementation. A clean, readable, and concise codebase is the key to effective collaboration and delivers the most value to the client.</span>
<span data-preserver-spaces="true">Code review is a crucial part of maintaining a high-quality code process. It is also a great way to share best practices and distribute knowledge among team members. At Appsilon, we treat code review as a must for every project. Read more about how we organize our work in </span><a class="editor-rtfLink" href="https://wordpress.appsilon.com/remote-data-science-team-best-practices-scrum-github-and-docker/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Olga's blog post</span></a> on best practices<span data-preserver-spaces="true"> recommended for all data science teams.</span>
<span data-preserver-spaces="true">Having a well-established code review process does not change the fact that the developer is responsible for writing good, clean code! Pointing out all of the code's basic mistakes is painful and time-consuming, and it distracts reviewers from going deep into the code logic or improving the code's effectiveness.</span>
<span data-preserver-spaces="true">Poorly written code can also harm team morale - code reviewers are frustrated while code creators might feel offended by a huge number of comments. That is why before sending the code to review, developers need to make sure that the code is as clean as possible. </span><span data-preserver-spaces="true">Also, note that there is not always a code reviewer that can come to the rescue. Sometimes you are on your own in a project. Even though you think the code is ok for you now, consider rereading it in a few months - you want it to be clear to avoid wasting your own time later on.</span>
<span data-preserver-spaces="true">In this article, I summarize the most common mistakes to avoid and outline best practices to follow in programming in general. Follow these tips to speed up the code review iteration process and be a rockstar developer in your reviewer's eyes!</span>
<span data-preserver-spaces="true">Navigate to a section:</span>
<ul><li><a href="#comments">#1 Comments - First Tip for Clean R Code</a></li><li><a href="#strings">#2 Strings - Don't Overuse Certain Functions</a></li><li><a href="#loops">#3 Loops - Are They too Heavy?</a></li><li><a href="#sharing">#4 Code Sharing - Make Things Easier for Your Peers</a></li><li><a href="#practices">#5 Good Programming Practices - A Must for Writing Clean R Code</a></li><li><a href="#conclusion">Summing up How to Write Clean R Code</a></li></ul>
<hr />
<h2 id="comments"><span data-preserver-spaces="true">Comments - First Tip for Clean R Code</span></h2>
<span data-preserver-spaces="true">Adding comments to the code is a crucial developer skill. However, a more critical and harder-to-master skill is knowing when <em>not</em> to add comments. Writing good comments is more of an art than a science. It requires a lot of experience, and you can write entire book chapters about it (e.g., </span><a class="editor-rtfLink" href="https://books.google.pl/books/about/Clean_Code.html?id=hjEFCAAAQBAJ" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">here</span></a><span data-preserver-spaces="true">). </span>
<span data-preserver-spaces="true">There are a few simple rules that you should follow, to, well, avoid comments about your comments:</span>
<ul><li><span data-preserver-spaces="true">The comments should add external knowledge to the reader: if they're explaining what is happening in the code itself, it is a red flag that the code is not clean and needs to be refactored. If some hack was used, then comments might be used to explain what is going on. Comment on required business logic or exceptions added on purpose. Try to think of what can be surprising to the future reader and preempt their confusion.</span></li><li><span data-preserver-spaces="true">Write only crucial comments! Your comments should not be a dictionary of easily searchable information. In general, comments are distracting and do not explain logic as well as the code does. For example, I recently saw a comment like this in the code: <code>trimws(.) # this function trims leading/trailing white spaces</code> - which is redundant. If the reader does not know what the function <code>trimws</code> is doing, it can be easily checked. A more informative comment here can be helpful, e.g.: <code>trimws(.) # TODO(Marcin Dubel): Trimming white spaces is crucial here due to database entries inconsistency; data needs to be cleaned.</code></span></li><li><span data-preserver-spaces="true">When writing functions in R, I recommend </span><a class="editor-rtfLink" href="https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">using {roxygen2} comments</span></a><span data-preserver-spaces="true"> even if you are not writing a package. It is an excellent tool for organizing knowledge about the function goal, parameters, and output.</span></li><li><span data-preserver-spaces="true">Only write comments (as well as all parts of code) in English. This makes the code understandable to all readers and might save you from encoding issues that can appear if you use special characters from your native language.</span></li><li><span data-preserver-spaces="true">In case some code needs to be refactored/modified in the future, mark it with the <code># TODO</code> comment. Also, add some information to identify you as the author of this comment (to contact in case details are needed) and a brief explanation of why the following code is marked as TODO and not modified right away.</span></li><li><span data-preserver-spaces="true">Never leave commented-out code without an explanatory comment! It is ok to keep some parts for the future or turn them off for a while, but always mark the reason for this action.</span></li></ul>
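<span data-preserver-spaces="true">As a minimal sketch of the {roxygen2} and TODO advice above, a documented helper might look like the one below. The function name <code>clean_customer_names</code> and its argument are hypothetical and serve only as an illustration.</span>
<pre><code class="language-r">#' Clean customer names
#'
#' @param customer_names Character vector of raw names read from the database.
#' @return Character vector of the same length without leading or trailing
#'   white spaces.
clean_customer_names <- function(customer_names) {
  # TODO(Marcin Dubel): Trimming is needed due to database entries
  # inconsistency; remove once the data is cleaned at the source.
  trimws(customer_names)
}

cleaned_names <- clean_customer_names(c("  Anna", "Piotr  "))</code></pre>
<span data-preserver-spaces="true">The header says what goes in and what comes out, while the inline comment explains why the step exists, which is the part the code cannot say on its own.</span>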
<span data-preserver-spaces="true">Remember that the comments will stay in the code. If there is something that you would like to tell your reviewer, but only once, add a comment to Pull (Merge) Request and not to the code itself.</span>
<strong><span data-preserver-spaces="true">Example</span></strong><span data-preserver-spaces="true">: I recently saw a part of the code removed and replaced with a comment like: "Removed as the logic changed." Ok, good to know, but later that comment in the code looks odd and is redundant, as the reader no longer sees the removed code.</span>
<h2 id="strings"><span data-preserver-spaces="true">Strings - Don't Overuse Certain Functions</span></h2>
<span data-preserver-spaces="true">A common problem related to texts is the readability of string concatenations. What I encounter a lot is an overuse of the <code>paste</code> function. Don't get me wrong; it is a great function when your string is simple, e.g.:</span>
<pre><code class="language-r">paste("My name is", my_name)</code></pre>
But, for more complicated forms, it is hard to read:
<pre><code class="language-r">paste("My name is", my_name, "and I live in", my_city, "developing in", language, "for over", years_of_coding)</code></pre>
<span data-preserver-spaces="true">A better solution is to use the <code>sprintf</code> or <code>glue</code> functions, e.g.: </span>
<pre><code class="language-r">library(glue)
glue("My name is {my_name} and I live in {my_city} developing in {language} for over {years_of_coding}")</code></pre>
<span data-preserver-spaces="true">Isn't it clearer without all those commas and quotation marks?</span>
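<span data-preserver-spaces="true">For comparison, the <code>sprintf</code> alternative mentioned above could be sketched like this, using only base R. The variable values are made up for the example.</span>
<pre><code class="language-r"># Hypothetical values, only for illustration
my_name <- "Marcin"
my_city <- "Warsaw"
language <- "R"
years_of_coding <- "10 years"

# Each %s placeholder is filled with the next argument, in order
introduction <- sprintf(
  "My name is %s and I live in %s developing in %s for over %s",
  my_name, my_city, language, years_of_coding
)</code></pre>
<span data-preserver-spaces="true">The template stays in one piece, so the sentence can be read without jumping between quotes and variable names.</span>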
<span data-preserver-spaces="true">When dealing with many text blocks, it would be great to extract them to separate locations, e.g., to a </span><strong><span data-preserver-spaces="true">.yml file</span></strong><span data-preserver-spaces="true">. It makes both code and text blocks easier to read and maintain.</span>
<span data-preserver-spaces="true">The last tip related to texts: one of the debugging techniques, often used in Shiny applications, is adding <code>print()</code> statements. Double-check that no prints are left in the code - this can be quite embarrassing during code review!</span>
<h2 id="loops"><span data-preserver-spaces="true">Loops - Are They too Heavy?</span></h2>
<span data-preserver-spaces="true">Loops are one of the programming building blocks and are a very powerful tool. Nevertheless, they can be computationally heavy and thus need to be used carefully. The rule of thumb that you should follow is: always double-check if looping is a good option. It is hardly ever the case that you need to loop over rows in a <code>data.frame</code>: there is usually a vectorized or <code>{dplyr}</code> function that deals with the problem more efficiently. </span>
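<span data-preserver-spaces="true">To illustrate the point about row loops, here is a small base R sketch. The <code>cases</code> table and its columns are hypothetical; with <code>{dplyr}</code>, the same result could be obtained with <code>mutate(cases, rate = confirmed / population)</code>.</span>
<pre><code class="language-r"># Hypothetical data, only for illustration
cases <- data.frame(
  country = c("PL", "DE", "FR"),
  confirmed = c(120, 300, 250),
  population = c(1000, 2000, 5000)
)

# Row-by-row loop: works, but is verbose and slow on large tables
cases$rate_loop <- NA_real_
for (i in seq_len(nrow(cases))) {
  cases$rate_loop[i] <- cases$confirmed[i] / cases$population[i]
}

# Vectorized version: one expression computes the whole column
cases$rate <- cases$confirmed / cases$population</code></pre>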
<span data-preserver-spaces="true">Another common source of issues is looping over elements using the length of the object, e.g. <code>for(i in 1:length(x)) ...</code>. But what if the length of x is zero? The loop will still run, because <code>1:0</code> counts down and yields the iterator values 1 and 0. That is probably not your plan. Using the <code>seq_along</code> or <code>seq_len</code> functions is much safer.</span>
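<span data-preserver-spaces="true">The zero-length case can be checked with a short experiment. The counter names below are arbitrary.</span>
<pre><code class="language-r">x <- character(0)

# 1:length(x) is 1:0, i.e. c(1, 0), so the body runs twice on an empty vector
iterations_colon <- 0
for (i in 1:length(x)) {
  iterations_colon <- iterations_colon + 1
}

# seq_along(x) is integer(0), so the body is skipped entirely
iterations_seq <- 0
for (i in seq_along(x)) {
  iterations_seq <- iterations_seq + 1
}</code></pre>
<span data-preserver-spaces="true">In real code, the loop body would typically index <code>x[i]</code> and silently work with <code>NA</code> values or fail with a confusing error.</span>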
<span data-preserver-spaces="true">Also, remember the <code>apply</code> family of functions for looping. They are great (not to mention <code>{purrr}</code> solutions)! Note that using <code>sapply</code> might be flagged by the reviewer as unstable - because this function chooses the type of the output itself! So sometimes it will be a list, sometimes a vector. Using <code>vapply</code> is safer, as the programmer defines the expected output class.</span>
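<span data-preserver-spaces="true">A brief sketch of the difference, assuming a small hypothetical list of numeric vectors:</span>
<pre><code class="language-r">values <- list(a = 1:3, b = 4:6)

# sapply chooses the output type itself
sapply_means <- sapply(values, mean)   # a named numeric vector
sapply_empty <- sapply(list(), mean)   # an empty list, not numeric(0)

# vapply returns the declared type, or fails loudly if a result does not match
means <- vapply(values, mean, FUN.VALUE = numeric(1))
empty_means <- vapply(list(), mean, FUN.VALUE = numeric(1))</code></pre>
<span data-preserver-spaces="true">With <code>vapply</code>, the empty input still gives a numeric vector, so the code that follows does not need a special case.</span>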
<h2 id="sharing"><span data-preserver-spaces="true">Code Sharing - Make Things Easier for Your Peers</span></h2>
<span data-preserver-spaces="true">Even if you are working alone, you probably would like your program to run correctly on other machines. And how crucial it is when you are sharing the code with the team! To achieve this, never use absolute paths in your code, e.g. <code>"/home/marcin/my_files/old_projects/september/project_name/file.txt"</code>. It won't be accessible to others. Note that any violation of folder structure will crash the code. </span>
<span data-preserver-spaces="true">As you should already have a project for all coding work, you need to use paths relative to the project root - in this case, it will be <code>"./file.txt"</code>. What is more, I would suggest keeping all the paths as variables in a single place - so that renaming a file requires one change in code, not, e.g., twenty in six different files.</span>
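<span data-preserver-spaces="true">One possible way to keep paths in a single place is a small list defined once, for example in a dedicated script. The directory and file names below are hypothetical.</span>
<pre><code class="language-r"># All project paths defined once, relative to the project root.
# The directory and file names are hypothetical.
data_dir <- "data"
paths <- list(
  raw_cases = file.path(data_dir, "raw_cases.csv"),
  report = file.path("output", "report.html")
)

# Elsewhere in the project, refer to paths$raw_cases instead of a literal string</code></pre>
<span data-preserver-spaces="true">Using <code>file.path</code> also avoids pasting separators by hand.</span>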
<span data-preserver-spaces="true">Sometimes your software needs to use some credentials or tokens, e.g., to a database or private repositories. You should never commit such secrets to the repository, even if the entries are the same among the team! Usually, the good practice is to keep such values in a <code>.Renviron</code> file as environment variables that are loaded at the start, while the file itself is ignored in the repo. You can read more about it </span><a class="editor-rtfLink" href="http://www.dartistics.com/renviron.html" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">here</span></a><span data-preserver-spaces="true">.</span>
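<span data-preserver-spaces="true">Reading such a value in code might look like the sketch below. The helper <code>get_secret</code> and the variable name <code>DEMO_DB_PASSWORD</code> are assumptions made for this example; the <code>Sys.setenv</code> call only stands in for an entry that would normally live in <code>.Renviron</code>.</span>
<pre><code class="language-r"># Hypothetical helper: read a secret and fail early if it is missing
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value)) {
    stop("Environment variable ", name, " is not set.")
  }
  value
}

# For this demo only; normally the line DEMO_DB_PASSWORD=... sits in .Renviron
Sys.setenv(DEMO_DB_PASSWORD = "not-a-real-secret")
db_password <- get_secret("DEMO_DB_PASSWORD")</code></pre>
<span data-preserver-spaces="true">Failing early with a clear message is usually easier to debug than a connection error caused by an empty password.</span>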
<h2 id="practices"><span data-preserver-spaces="true">Good Programming Practices - A Must for Writing Clean R Code</span></h2>
<span data-preserver-spaces="true">Finally, let's focus on how you can improve your code. First of all, your code should be easily understandable and clean - even if you are working alone, when you come back to code after a while, it will make your life easier! </span>
<span data-preserver-spaces="true">Use specific variable names, even if they seem to be lengthy - the rule of thumb is that you should be able to guess what is inside just by reading the name, so <code>table_cases_per_country</code></span><em><span data-preserver-spaces="true"> </span></em><span data-preserver-spaces="true">is ok, but <code>tbl1</code> is not. Avoid abbreviations. Lengthy is preferable to vague. Keep consistent style for object names (like camelCase or snake_case) agreed upon among the team members. </span>
<span data-preserver-spaces="true">Do NOT abbreviate the logical values <code>TRUE</code> and <code>FALSE</code> to <code>T</code> and <code>F</code> - the code will work, but <code>T</code> and <code>F</code> are regular objects that can be overwritten, while <code>TRUE</code> and <code>FALSE</code> are reserved words. </span>
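<span data-preserver-spaces="true">A short demonstration of why the abbreviation is risky (the variable names are arbitrary):</span>
<pre><code class="language-r">T <- FALSE            # allowed: T is an ordinary variable name
abbreviated <- T      # now FALSE, probably not what the author meant
spelled_out <- TRUE   # always TRUE

# TRUE <- FALSE would be an error, because TRUE is a reserved word

rm(T)                 # remove the override so that T is TRUE again</code></pre>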
<span data-preserver-spaces="true">Do not compare logical values with <code>==</code>, like <code>if(my_logical == TRUE)</code>. If you can compare to <code>TRUE</code>, it means your value is already logical, so <code>if(my_logical)</code> is enough! If you want to double-check that the value is <code>TRUE</code> indeed (and not, e.g., <code>NA</code>), you can use the <code>isTRUE()</code> function.</span>
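<span data-preserver-spaces="true">The <code>NA</code> case is where <code>isTRUE()</code> pays off, as this small sketch shows:</span>
<pre><code class="language-r">my_logical <- NA

# if (my_logical) ... would stop with "missing value where TRUE/FALSE needed"
# isTRUE() returns FALSE for NA, so the condition is always safe to evaluate
result <- if (isTRUE(my_logical)) "yes" else "no"</code></pre>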
<span data-preserver-spaces="true">Make sure that your logic statements are correct. Check if you understand the difference in R between </span><a class="editor-rtfLink" href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">single and double logical operators</span></a><span data-preserver-spaces="true">!</span>
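<span data-preserver-spaces="true">As a quick reminder of that difference: the single operators work element by element on vectors, while the double operators expect single values and stop evaluating as soon as the result is known. The variable names below are made up for the example.</span>
<pre><code class="language-r">x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Single operator: element-wise comparison of two vectors
elementwise <- x & y

# Double operator: single values, with short-circuit evaluation.
# The right-hand side is never evaluated here, because the left one is FALSE.
values <- NULL
is_positive <- !is.null(values) && values[1] > 0</code></pre>
<span data-preserver-spaces="true">In <code>if</code> conditions, the double form is usually what you want; the single form belongs in vectorized operations such as filtering.</span>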
<span data-preserver-spaces="true">Good spacing is crucial for readability. Make sure that the rules are the same and agreed upon by the team. It will make it easier to follow each other's code. The simplest solution is to stand on the shoulders of giants and follow the </span><a class="editor-rtfLink" href="https://style.tidyverse.org/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">tidyverse style guide</span></a><span data-preserver-spaces="true">. </span>
<span data-preserver-spaces="true">However, checking the style in every line during the review is quite inefficient, so make sure to introduce a <strong>linter</strong> and a <strong>styler</strong> in your development workflow, as presented in </span><a class="editor-rtfLink" href="https://wordpress.appsilon.com/remote-data-science-team-best-practices-scrum-github-and-docker/" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">Olga's blog post</span></a><span data-preserver-spaces="true">. This can be lifesaving! Recently we found an error in some legacy code that would have been automatically recognized by a linter:</span>
<pre><code class="language-r">sum_of_values <- first_element
+ second_element</code></pre>
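<span data-preserver-spaces="true">This does not return the sum of the elements as the author was expecting. R treats the first line as a complete statement, so <code>sum_of_values</code> gets only <code>first_element</code>, while the second line is evaluated separately and its result is discarded.</span>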
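<span data-preserver-spaces="true">A sketch of the broken and the fixed version side by side, with hypothetical values. Ending the line with the operator tells R that the expression continues.</span>
<pre><code class="language-r"># Hypothetical values, only for illustration
first_element <- 2
second_element <- 3

# Broken: the assignment ends on the first line
broken_sum <- first_element
+ second_element

# Fixed: the trailing operator continues the expression on the next line
sum_of_values <- first_element +
  second_element</code></pre>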
<span data-preserver-spaces="true">Speaking of variable names - naming things is known to be one of the hardest things in programming. Thus avoid it when it is unnecessary. Note that R functions return by default the value of the last evaluated expression, so you can easily replace this:</span>
<pre><code class="language-r">sum_elements <- function(first, second) {
  my_redundant_variable_name <- sum(first, second)
  return(my_redundant_variable_name)
}</code></pre>
<span data-preserver-spaces="true">With something shorter (and simpler, you don’t need to think about names):</span>
<pre><code class="language-r">sum_elements <- function(first, second) {
  sum(first, second)
}</code></pre>
<span data-preserver-spaces="true">On the other hand, please DO use additional variables anytime you repeat some function call or calculation! It will make the code computationally more efficient and easier to modify in the future. Remember to keep your code </span><a class="editor-rtfLink" href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" target="_blank" rel="noopener noreferrer"><span data-preserver-spaces="true">DRY - don't repeat yourself</span></a><span data-preserver-spaces="true">. If you copy-paste some code, think twice whether it shouldn't be saved to a variable, done in a loop, or moved to a function. </span>
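<span data-preserver-spaces="true">As a small illustration of the DRY advice, with made-up data and variable names:</span>
<pre><code class="language-r">values <- c(4, 8, 15, 16, 23, 42)

# Repetitive: mean() and sd() are each computed twice
lower_repeated <- mean(values) - sd(values)
upper_repeated <- mean(values) + sd(values)

# DRY: compute once, then reuse
values_mean <- mean(values)
values_sd <- sd(values)
lower <- values_mean - values_sd
upper <- values_mean + values_sd</code></pre>
<span data-preserver-spaces="true">If the center later needs to be the median instead of the mean, the second version needs a change in only one place.</span>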
<hr />
<h2 id="conclusion"><span data-preserver-spaces="true">Summing up How to Write Clean R Code</span></h2>
<span data-preserver-spaces="true">And there you have it - five strategies to write clean R code and leave your code reviewer commentless. These five alone will ensure you're writing great-quality code that is easy to understand, even years down the road. </span>
What are your top tips on how to write clean R code? Please let us know in the comment section below. Also, don't hesitate to reach out to us on Twitter - <a href="http://tter.com/appsilon">@appsilon</a>. We'd love to hear from you.
<span data-preserver-spaces="true">Happy coding!</span>