03-supp-loops-in-depth.html

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>Software Carpentry: Programming with R</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <meta charset="UTF-8" />
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      <div class="banner">
        <a href="http://software-carpentry.org" title="Software Carpentry">
          <img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
        </a>
      </div>
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
                    <a href="index.html"><h1 class="title">Programming with R</h1></a>
          <h2 class="subtitle">Loops in R</h2>
          <section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Compare loops and vectorized operations</li>
<li>Use the apply family of functions</li>
</ul>
</div>
</section>
<p>In R you have multiple options when repeating calculations: vectorized operations, <code>for</code> loops, and <code>apply</code> functions.</p>
<p>This lesson is an extension of <a href="03-loops-R.html">Analyzing Multiple Data Sets</a>. In that lesson, we introduced how to run a custom function, <code>analyze</code>, over multiple data files:</p>
<pre class="sourceCode r"><code class="sourceCode r">analyze &lt;-<span class="st"> </span>function(filename) {
  <span class="co"># Plots the average, min, and max inflammation over time.</span>
  <span class="co"># Input is character string of a csv file.</span>
  dat &lt;-<span class="st"> </span><span class="kw">read.csv</span>(<span class="dt">file =</span> filename, <span class="dt">header =</span> <span class="ot">FALSE</span>)
  avg_day_inflammation &lt;-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, mean)
  <span class="kw">plot</span>(avg_day_inflammation)
  max_day_inflammation &lt;-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, max)
  <span class="kw">plot</span>(max_day_inflammation)
  min_day_inflammation &lt;-<span class="st"> </span><span class="kw">apply</span>(dat, <span class="dv">2</span>, min)
  <span class="kw">plot</span>(min_day_inflammation)
}</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">filenames &lt;-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="st">&quot;data&quot;</span>, <span class="dt">pattern =</span> <span class="st">&quot;inflammation&quot;</span>, <span class="dt">full.names =</span> <span class="ot">TRUE</span>)</code></pre>
<h3 id="vectorized-operations">Vectorized Operations</h3>
<p>A key difference between R and many other languages is a topic known as vectorization. When you wrote the <code>total</code> function, we mentioned that R already has <code>sum</code> to do this; <code>sum</code> is <em>much</em> faster than the interpreted <code>for</code> loop because <code>sum</code> is coded in C to work with a vector of numbers. Many of R’s functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.</p>
<p>For example, to add pairs of numbers contained in two vectors</p>
<pre class="sourceCode r"><code class="sourceCode r">a &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">10</span>
b &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">10</span></code></pre>
<p>You could loop over the pairs adding each in turn, but that would be very inefficient in R.</p>
<p>Instead of using <code>i in a</code> to make our loop variable, we use the function <code>seq_along</code> to generate indices for each element <code>a</code> contains.</p>
<pre class="sourceCode r"><code class="sourceCode r">res &lt;-<span class="st"> </span><span class="kw">numeric</span>(<span class="dt">length =</span> <span class="kw">length</span>(a))
for (i in <span class="kw">seq_along</span>(a)) {
  res[i] &lt;-<span class="st"> </span>a[i] +<span class="st"> </span>b[i]
}
res</code></pre>
<pre class="output"><code> [1]  2  4  6  8 10 12 14 16 18 20
</code></pre>
<p>Instead, <code>+</code> is a <em>vectorized</em> function which can operate on entire vectors at once</p>
<pre class="sourceCode r"><code class="sourceCode r">res2 &lt;-<span class="st"> </span>a +<span class="st"> </span>b
<span class="kw">all.equal</span>(res, res2)</code></pre>
<pre class="output"><code>[1] TRUE
</code></pre>
<h3 id="vector-recycling">Vector Recycling</h3>
<p>When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:</p>
<pre class="sourceCode r"><code class="sourceCode r">a &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">10</span>
b &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">5</span>
a +<span class="st"> </span>b</code></pre>
<pre class="output"><code> [1]  2  4  6  8 10  7  9 11 13 15
</code></pre>
<p>The elements of <code>a</code> and <code>b</code> are added together starting from the first element of both vectors. When R reaches the end of the shorter vector <code>b</code>, it starts again at the first element of <code>b</code> and contines until it reaches the last element of the longest vector <code>a</code>. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector <code>a</code> by 5:</p>
<pre class="sourceCode r"><code class="sourceCode r">a &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">10</span>
b &lt;-<span class="st"> </span><span class="dv">5</span>
a *<span class="st"> </span>b</code></pre>
<pre class="output"><code> [1]  5 10 15 20 25 30 35 40 45 50
</code></pre>
<p>Remember there are no scalars in R, so <code>b</code> is actually a vector of length 1; in order to add its value to every element of <code>a</code>, it is <em>recycled</em> to match the length of <code>a</code>.</p>
<p>When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:</p>
<pre class="sourceCode r"><code class="sourceCode r">a &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">10</span>
b &lt;-<span class="st"> </span><span class="dv">1</span>:<span class="dv">7</span>
a +<span class="st"> </span>b</code></pre>
<pre class="error"><code>Warning in a + b: longer object length is not a multiple of shorter object
length
</code></pre>
<pre class="output"><code> [1]  2  4  6  8 10 12 14  9 11 13
</code></pre>
<h3 id="for-or-apply"><code>for</code> or <code>apply</code>?</h3>
<p>A <code>for</code> loop is used to apply the same function calls to a collection of objects. R has a family of functions, the <code>apply</code> family, which can be used in much the same way. You’ve already used one of the family, <code>apply</code> in the first <a href="01-starting-with-data.html">lesson</a>. The <code>apply</code> family members include</p>
<ul>
<li><code>apply</code> - apply over the margins of an array (e.g. the rows or columns of a matrix)</li>
<li><code>lapply</code> - apply over an object and return list</li>
<li><code>sapply</code> - apply over an object and return a simplified object (an array) if possible</li>
<li><code>vapply</code> - similar to <code>sapply</code> but you specify the type of object returned by the iterations</li>
</ul>
<p>Each of these has an argument <code>FUN</code> which takes a function to apply to each element of the object. Instead of looping over <code>filenames</code> and calling <code>analyze</code>, as you did earlier, you could <code>sapply</code> over <code>filenames</code> with <code>FUN = analyze</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sapply</span>(filenames, <span class="dt">FUN =</span> analyze)</code></pre>
<p>Deciding whether to use <code>for</code> or one of the <code>apply</code> family is really personal preference. Using an <code>apply</code> family function forces to you encapsulate your operations as a function rather than separate calls with <code>for</code>. <code>for</code> loops are often more natural in some circumstances; for several related operations, a <code>for</code> loop will avoid you having to pass in a lot of extra arguments to your function.</p>
<h3 id="loops-in-r-are-slow">Loops in R Are Slow</h3>
<p>No, they are not! <em>If</em> you follow some golden rules:</p>
<ol style="list-style-type: decimal">
<li>Don’t use a loop when a vectorised alternative exists</li>
<li>Don’t grow objects (via <code>c</code>, <code>cbind</code>, etc) during the loop - R has to create a new object and copy across the information just to add a new element or row/column</li>
<li>Allocate an object to hold the results and fill it in during the loop</li>
</ol>
<p>As an example, we’ll create a new version of <code>analyze</code> that will return the mean inflammation per day (column) of each file.</p>
<pre class="sourceCode r"><code class="sourceCode r">analyze2 &lt;-<span class="st"> </span>function(filenames) {
  for (f in <span class="kw">seq_along</span>(filenames)) {
    fdata &lt;-<span class="st"> </span><span class="kw">read.csv</span>(filenames[f], <span class="dt">header =</span> <span class="ot">FALSE</span>)
    res &lt;-<span class="st"> </span><span class="kw">apply</span>(fdata, <span class="dv">2</span>, mean)
    if (f ==<span class="st"> </span><span class="dv">1</span>) {
      out &lt;-<span class="st"> </span>res
    } else {
      <span class="co"># The loop is slowed by this call to cbind that grows the object</span>
      out &lt;-<span class="st"> </span><span class="kw">cbind</span>(out, res)
    }
  }
  <span class="kw">return</span>(out)
}

<span class="kw">system.time</span>(avg2 &lt;-<span class="st"> </span><span class="kw">analyze2</span>(filenames))</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="output"><code>   user  system elapsed 
  0.101   0.006   0.113 
</code></pre>
<p>Note how we add a new column to <code>out</code> at each iteration? This is a cardinal sin of writing a <code>for</code> loop in R.</p>
<p>Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files but this time we fill in the <code>f</code>th column of our results matrix <code>out</code>. This time there is no copying/growing for R to deal with.</p>
<pre class="sourceCode r"><code class="sourceCode r">analyze3 &lt;-<span class="st"> </span>function(filenames) {
  out &lt;-<span class="st"> </span><span class="kw">matrix</span>(<span class="dt">ncol =</span> <span class="kw">length</span>(filenames), <span class="dt">nrow =</span> <span class="dv">40</span>) ## assuming 40 here from files 
  for (f in <span class="kw">seq_along</span>(filenames)) {
    fdata &lt;-<span class="st"> </span><span class="kw">read.csv</span>(filenames[f], <span class="dt">header =</span> <span class="ot">FALSE</span>)
    out[, f] &lt;-<span class="st"> </span><span class="kw">apply</span>(fdata, <span class="dv">2</span>, mean)
  }
  <span class="kw">return</span>(out)
}

<span class="kw">system.time</span>(avg3 &lt;-<span class="st"> </span><span class="kw">analyze3</span>(filenames))</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="error"><code>Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : EOF within quoted string
</code></pre>
<pre class="error"><code>Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
</code></pre>
<pre class="error"><code>Warning in mean.default(newX[, i], ...): argument is not numeric or
logical: returning NA
</code></pre>
<pre class="output"><code>   user  system elapsed 
  0.110   0.003   0.117 
</code></pre>
<p>In this simple example there is little difference in the compute time of <code>analyze2</code> and <code>analyze3</code>. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.</p>
<p>Note that <code>apply</code> handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to <code>apply</code>. At its heart, <code>apply</code> is just a <code>for</code> loop with extra convenience.</p>
        </div>
      </div>
      </article>
      <div class="footer">
        <a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
        <a class="label swc-blue-bg" href="https://github.com/swcarpentry/r-novice-inflammation">Source</a>
        <a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
        <a class="label swc-blue-bg" href="LICENSE.html">License</a>
      </div>
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
    <script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
    <script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    
      ga('create', 'UA-37305346-2', 'auto');
      ga('send', 'pageview');
    
    </script>
  </body>
</html>