More of the world’s workforce is working remotely than ever before, and remote work is likely to remain the norm even if the global pandemic situation improves dramatically. Data science teams are no exception. Distributed teams bring unique challenges, and data science team leaders may be looking for new tools. In this article we’ll explain how RStudio Connect helps organizations structure their data science teams and overcome the typical inefficiencies of remote work. We’ll also show you some features of RStudio Connect that you might not have heard about.
Some common problems for distributed teams include:
At Appsilon we’ve grappled with these challenges for years, as we’ve promoted a remote-friendly culture since the early days of our company. Our data scientists and developers collaborate daily from at least three cities in two different countries, and we frequently work with clients around the globe in faraway time zones. We’ve found that RStudio Connect helps every party involved with data science in an organization: producers of artifacts, consumers of artifacts, and IT administrators. It empowers employees to consume and distribute information within an organization and removes much of the unnecessary labor that goes into these processes. Some features of RStudio Connect that we’ll cover in this article include:
One of the first problems that an organization may encounter in a remote work scenario is onboarding new individuals and teams to the data science ecosystem. RStudio Connect shortens the time it takes to get remote teams up and running with sharing and consuming R/Shiny applications. One of the main reasons for this is that much of the infrastructure work is completed for you automatically – there’s no need to design and maintain your own internal solutions for problems like user authentication. We’ve seen organizations spend vast amounts of developer time endlessly replicating features that are included automatically in RStudio Connect.
Maybe an organization does not have IT Administrator support for its data science team and users. In this case, the data scientists themselves may have to deploy and manage RStudio Connect. Connect’s developers had this use case in mind. RStudio has provided a “Jump Start Examples” tutorial within Connect to help Data Scientists adapt to their new environment and quickly learn best practices. This reduces the hands-on work that team leaders have to do to onboard new users and ensures that everyone gets started with the same common knowledge of the ecosystem and its capabilities.
RStudio Connect can help simplify the role of the system administrator by offering tools to manage visitor load:
Then there is the issue of access management. A recent release (1.8.0) makes it even easier to support data science teams with one enhancement in particular: seamless single sign-on (SSO) integration. RStudio Connect can integrate with the SAML Identity Provider (or IdP) of your company’s choice to perform user authentication and, optionally, user/group membership management. In the SAML world, RStudio Connect fulfills the role of service provider (or SP).
In addition, every RStudio Connect user account is configured with a role that controls its default capabilities on the system. Data scientists, analysts, and others working in R will most likely want “publisher” accounts. Other users are likely to need only “viewer” accounts.
One powerful feature of RStudio Connect is the ability to schedule tasks. These tasks can be anything from simple ETL jobs to daily reports. Version 1.8.0 makes it easier for administrators to track these tasks across all publishers in a single place. This new view makes it possible to identify conflicts or times when the server is overbooked.
An important reason to use RStudio Connect is its single source of truth feature. It is built around the “pins” R package and gives R users an easy way to share resources through RStudio Connect. Your resources may be text files (CSV, JSON, etc.), R objects (.Rds, .Rda, etc.), or any other type of file you want to share. Sharing these files can be useful in many situations, such as when multiple pieces of content require the same data. Rather than copying that data, each piece of content references a “single source of truth” hosted on RStudio Connect.
When content depends on processed datasets or model objects that need to be regularly updated, rather than redeploying the content each time the information changes, use a pinned resource and update only the dataset or model. The update can be automated using a scheduled R Markdown document. Other deployed content will read the newest data on each run.
Connect is also helpful when you need to share resources that aren’t structured for traditional tools like databases. Models saved as R objects aren’t easy to store in a database. Rather than using email or file systems to share these R objects, use RStudio Connect to host these resources as pins. This ensures that everyone has easy access to the R objects in a single place.
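The pinning workflow described above can be sketched roughly as follows. This is a minimal illustration, not a definitive setup: it assumes a recent version of the pins package (1.0 or later), uses a temporary local board as a stand-in for RStudio Connect so that it runs anywhere, and the dataset and the pin name `sales_summary` are hypothetical.

```r
library(pins)

# Assumption: a temporary local board stands in for RStudio Connect so the
# sketch runs without a server. Against a real server you would create the
# board with something like board_connect() (pins >= 1.1.0), which needs the
# server URL and an API key.
board <- board_temp()

# Hypothetical processed dataset that several pieces of content rely on.
sales_summary <- data.frame(
  region = c("north", "south"),
  revenue = c(1200, 950)
)

# Publisher side (for example, a scheduled R Markdown document):
pin_write(board, sales_summary, name = "sales_summary", type = "rds")

# Consumer side (for example, a Shiny app or another report):
latest <- pin_read(board, "sales_summary")

# Later, the scheduled job overwrites the pin with fresh data...
sales_summary$revenue <- c(1300, 990)
pin_write(board, sales_summary, name = "sales_summary", type = "rds")

# ...and consumers see the new values on their next read,
# without any content being redeployed.
refreshed <- pin_read(board, "sales_summary")
```

On a Connect board, pin names are typically prefixed with the owner’s username, so the consumer would read something like `"some.user/sales_summary"` instead.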
A single source of truth means time savings for all participants, wherever they may be located. Read more about how data quality and data validation save time and resources here.
So now your data science ecosystem is up and running. Next – sending plots, tables, and results inline in emails is a powerful way for data scientists to make an impact. RStudio Connect allows you to create custom emails to send daily reminders, conditional alerts, and to track key metrics. A recent release of the blastula package makes it even easier for data scientists to specify these emails programmatically:
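As a rough sketch of what such a programmatic email might look like, the example below composes a conditional alert with blastula. The metric, the target, and the helper `build_alert_email()` are hypothetical and exist only for illustration; the hand-off to RStudio Connect is shown in comments because it only makes sense inside an R Markdown document rendered on the server.

```r
library(blastula)

# Hypothetical metric and target, for illustration only.
daily_sales <- 8200
sales_target <- 10000

# Hypothetical helper that builds the alert email from the two numbers.
build_alert_email <- function(value, target) {
  compose_email(
    body = md(sprintf(
      "## Daily sales below target\n\nSales today: **%s** (target: %s).",
      format(value, big.mark = ","),
      format(target, big.mark = ",")
    )),
    footer = md("Sent by a scheduled R Markdown report.")
  )
}

email <- build_alert_email(daily_sales, sales_target)

# Inside an R Markdown document scheduled on RStudio Connect, the email would
# then be handed to Connect, or suppressed when no alert is needed:
#
#   if (daily_sales < sales_target) {
#     attach_connect_email(email, subject = "Daily sales below target")
#   } else {
#     suppress_scheduled_email()
#   }
```

Keeping the email construction in a small function makes it easy to preview the message locally before scheduling the report.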
Imagine sending emails about updates to datasets and dashboards manually for a year or more. Now imagine sharing R Shiny applications (and/or Plumber APIs, Pins, R Markdown docs, etc.) as easily as you share memes on Instagram. Which scenario is more appealing?
With the deployment of a new network – a whole new ecosystem, really – security should be a primary concern. For instance, you need to think about preventing brute-force and dictionary attacks. By default, RStudio Connect allows as many login attempts as it can handle from any source when using the PAM, LDAP, and Password authentication providers, and users log in directly by entering their username and password. Setting the Authentication.ChallengeResponseEnabled flag to true enables a CAPTCHA form on the login screen and requires that the CAPTCHA be solved in order to authenticate. Both visual and audio CAPTCHA challenges are provided for accessibility needs.
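In the server configuration file, the flag mentioned above would look roughly like the fragment below. The file path in the comment is the usual default location and is an assumption about your installation; changes to this file generally take effect only after the RStudio Connect service is restarted.

```ini
; Assumed default location: /etc/rstudio-connect/rstudio-connect.gcfg
[Authentication]
; Require a CAPTCHA to be solved on the login screen
ChallengeResponseEnabled = true
```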
Additionally, we recommend setting up separate instances of RStudio Connect depending on their purpose: one public instance, and a second instance accessible only from your internal infrastructure. This means that you can host publicly accessible demos of Shiny dashboards while keeping your internal RStudio Connect infrastructure out of reach of unauthorized users. This way it’s easy to show off your work to clients or provide public access without compromising on security.
Just as Olga Mierzwa-Sulima points out in her article on Remote Data Science Team Best Practices, distributed and non-distributed Data Science teams alike can benefit from efficient workflows and collaborative tools. We’ve found that RStudio Connect has solved many of our workflow problems with a wide array of available tools and packages. Further, when sharing your data work is as simple as a couple of clicks, you can raise the data literacy of your entire organization by increasing access to meaningful data insights.
We encourage other Data Science teams around the world to consider reaching out to certified RStudio partners for further consultation to make sure that RStudio Connect is the right choice for you. As an RStudio Full Certified Partner, we’re well-positioned to help you make the leap or provide further advice. Reach out to us at [email protected].