Commit

Merge pull request #6 from metaforr/pkgsearch-analysis
Fix typo
bharxhav authored Nov 30, 2023
2 parents fba16b1 + c2d313a commit ee4ff75
Showing 6 changed files with 36 additions and 59 deletions.
20 changes: 11 additions & 9 deletions data.qmd
@@ -1,15 +1,15 @@
## Data
Our first data, available_pkgs, is collected by the available.packages function. The utils package in R provides this function as part of its set of utility functions. The author of this package is the R core team who contribute to the development and maintenance of the R programming language.
Our dataset, cran_packages.csv, is collected by the available.packages function. The utils package in R provides this function as part of its set of utility functions. The author of this package is in the R core team who contribute to the development and maintenance of the R programming language.

The dataset includes details about packages currently available at one or more repositories. The list of packages is obtained by downloading it over the internet or copying it from a local mirror. The data is given as a matrix, which consists of 20102 rows and 17 columns. The columns include information such as “Package,” “Version,”Depends,” and “Repository.”
The dataset includes details about packages currently available at one or more repositories. The list of packages is obtained by downloading it over the internet or copying it from a local mirror. The data is given as a matrix, which consists of 20113 rows and 17 columns. The columns include information such as “Package,” “Version,”Depends,” and “Repository.”

The frequency of updates is 1 hour. Although there are no particular concerns about the data, we may need to note that the default behavior of the function includes reporting only packages whose version and OS requirements match the running version of R, and it provides information only on the latest versions of packages.

In addition to this, we collected the total number of downloads for each package between January 1st, 2013 and November 29th, 2023 using cranlog API. Collected data is merged with avialble_pkgs.
In addition to this, we collected the total number of downloads for each package between January 1st, 2013 and November 29th, 2023 using cranlog API. Collected data is merged with cran_packages.csv. This update to our main csv is will be done periodically via github actions. For now, we are using data collected as of 29th Nov, 2023.


## Description
Our research aim is to provide a comprehensive understanding of the factors influencing package popularity and characteristics within the ecosystem. We will investigate it through the following 4 subquestions:
Our research aim is to provide a comprehensive understanding of the factors influencing package popularity and characteristics within the ecosystem. We will investigate it through the following 3 sub-questions:

1. Dependency Analysis: How do package dependencies (Depends, Imports, LinkingTo) influence the popularity or adoption of R packages?
2. Licensing Trends: What are the prevailing trends in package licensing within the R ecosystem?
@@ -24,7 +24,9 @@ Lastly, the third research question about maintainability of R packages, we will
## Missing value analysis

```{r}
library(ggplot2)
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(redav))
df = read.csv("./assets/cran_packages.csv")
@@ -35,17 +37,17 @@ ggplot(missing_df, aes(x = reorder(variables, -missing_percentage), y = missing_
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Percentage of Missing Data by Variable", x = "Variables", y = "Missing Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

As you can see, for the first 7 variables more than 99% of rows are missing them so we dropped these columns.

```{r}
library(dplyr)
library(redav)
#| fig-width: 14
#| fig-height: 10
threshold <- 0.95
dropped <- df %>% select(names(which(colMeans(is.na(.)) <= threshold)))
plot_missing(dropped,percent=TRUE)
suppressMessages(plot_missing(dropped,percent=TRUE))
```

After excluding the previously identified columns, it becomes evident that six columns—Package, Versions, License, MD5, NeedsCompilation, and Repository—contain no missing values. Notably, the LinkingTo column, which signifies that the current package links to additional packages necessary for compilation or linking, has over 75% of its rows missing.
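
Reviewer note: the column filter in the second chunk can be illustrated on a toy data frame. This is a minimal sketch in base R, not the project's code; the `toy` data frame and its values are hypothetical stand-ins for `cran_packages.csv`. It also makes one detail explicit: `threshold` is a fraction between 0 and 1, whereas the bar chart in the first chunk plots percentages.

```r
# Toy data frame standing in for cran_packages.csv (hypothetical values).
toy <- data.frame(
  Package   = c("a", "b", "c", "d"),
  License   = c("MIT", "GPL-3", "GPL-2", "MIT"),
  LinkingTo = c(NA, "Rcpp", NA, NA),
  Path      = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column, on a 0 to 1 scale.
missing_fraction <- colMeans(is.na(toy))

# Keep only the columns whose missing fraction is at or below the threshold.
threshold <- 0.95
kept <- toy[, missing_fraction <= threshold, drop = FALSE]

print(round(missing_fraction, 2))
print(names(kept))
```

Here `Path` is entirely missing and is dropped, while `LinkingTo` (75% missing) stays because it is below the 0.95 cutoff. The `drop = FALSE` argument keeps the result a data frame even if only one column survives.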
65 changes: 20 additions & 45 deletions docs/data.html
@@ -201,13 +201,13 @@ <h1 class="title"><span class="chapter-number">2</span>&nbsp; <span class="chapt

</header>

<p>Our first data, available_pkgs, is collected by the available.packages function. The utils package in R provides this function as part of its set of utility functions. The author of this package is the R core team who contribute to the development and maintenance of the R programming language.</p>
<p>The dataset includes details about packages currently available at one or more repositories. The list of packages is obtained by downloading it over the internet or copying it from a local mirror. The data is given as a matrix, which consists of 20102 rows and 17 columns. The columns include information such as “Package,” “Version,”Depends,” and “Repository.”</p>
<p>Our dataset, cran_packages.csv, is collected by the available.packages function. The utils package in R provides this function as part of its set of utility functions. The author of this package is in the R core team who contribute to the development and maintenance of the R programming language.</p>
<p>The dataset includes details about packages currently available at one or more repositories. The list of packages is obtained by downloading it over the internet or copying it from a local mirror. The data is given as a matrix, which consists of 20113 rows and 17 columns. The columns include information such as “Package,” “Version,”Depends,” and “Repository.”</p>
<p>The frequency of updates is 1 hour. Although there are no particular concerns about the data, we may need to note that the default behavior of the function includes reporting only packages whose version and OS requirements match the running version of R, and it provides information only on the latest versions of packages.</p>
<p>In addition to this, we collected the total number of downloads for each package between January 1st, 2013 and November 29th, 2023 using cranlog API. Collected data is merged with avialble_pkgs.</p>
<p>In addition to this, we collected the total number of downloads for each package between January 1st, 2013 and November 29th, 2023 using cranlog API. Collected data is merged with cran_packages.csv. This update to our main csv is will be done periodically via github actions. For now, we are using data collected as of 29th Nov, 2023.</p>
<section id="description" class="level2" data-number="2.1">
<h2 data-number="2.1" class="anchored" data-anchor-id="description"><span class="header-section-number">2.1</span> Description</h2>
<p>Our research aim is to provide a comprehensive understanding of the factors influencing package popularity and characteristics within the ecosystem. We will investigate it through the following 4 subquestions:</p>
<p>Our research aim is to provide a comprehensive understanding of the factors influencing package popularity and characteristics within the ecosystem. We will investigate it through the following 3 sub-questions:</p>
<ol type="1">
<li>Dependency Analysis: How do package dependencies (Depends, Imports, LinkingTo) influence the popularity or adoption of R packages?</li>
<li>Licensing Trends: What are the prevailing trends in package licensing within the R ecosystem?</li>
@@ -222,17 +222,19 @@ <h2 data-number="2.2" class="anchored" data-anchor-id="missing-value-analysis"><
<div class="cell">
<details>
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(ggplot2)</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>df <span class="ot">=</span> <span class="fu">read.csv</span>(<span class="st">"./assets/cran_packages.csv"</span>)</span>
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">suppressMessages</span>(<span class="fu">library</span>(ggplot2))</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">suppressMessages</span>(<span class="fu">library</span>(dplyr))</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">suppressMessages</span>(<span class="fu">library</span>(redav))</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>missing_percentage <span class="ot">&lt;-</span> <span class="fu">colMeans</span>(<span class="fu">is.na</span>(df)) <span class="sc">*</span> <span class="dv">100</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>missing_df <span class="ot">&lt;-</span> <span class="fu">data.frame</span>(<span class="at">variables =</span> <span class="fu">names</span>(missing_percentage), missing_percentage)</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="fu">ggplot</span>(missing_df, <span class="fu">aes</span>(<span class="at">x =</span> <span class="fu">reorder</span>(variables, <span class="sc">-</span>missing_percentage), <span class="at">y =</span> missing_percentage)) <span class="sc">+</span></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">geom_bar</span>(<span class="at">stat =</span> <span class="st">"identity"</span>, <span class="at">fill =</span> <span class="st">"skyblue"</span>) <span class="sc">+</span></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a> <span class="fu">labs</span>(<span class="at">title =</span> <span class="st">"Percentage of Missing Data by Variable"</span>, <span class="at">x =</span> <span class="st">"Variables"</span>, <span class="at">y =</span> <span class="st">"Missing Percentage"</span>) <span class="sc">+</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a> <span class="fu">theme</span>(<span class="at">axis.text.x =</span> <span class="fu">element_text</span>(<span class="at">angle =</span> <span class="dv">45</span>, <span class="at">hjust =</span> <span class="dv">1</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>df <span class="ot">=</span> <span class="fu">read.csv</span>(<span class="st">"./assets/cran_packages.csv"</span>)</span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>missing_percentage <span class="ot">&lt;-</span> <span class="fu">colMeans</span>(<span class="fu">is.na</span>(df)) <span class="sc">*</span> <span class="dv">100</span></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>missing_df <span class="ot">&lt;-</span> <span class="fu">data.frame</span>(<span class="at">variables =</span> <span class="fu">names</span>(missing_percentage), missing_percentage)</span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="fu">ggplot</span>(missing_df, <span class="fu">aes</span>(<span class="at">x =</span> <span class="fu">reorder</span>(variables, <span class="sc">-</span>missing_percentage), <span class="at">y =</span> missing_percentage)) <span class="sc">+</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a> <span class="fu">geom_bar</span>(<span class="at">stat =</span> <span class="st">"identity"</span>, <span class="at">fill =</span> <span class="st">"skyblue"</span>) <span class="sc">+</span></span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a> <span class="fu">labs</span>(<span class="at">title =</span> <span class="st">"Percentage of Missing Data by Variable"</span>, <span class="at">x =</span> <span class="st">"Variables"</span>, <span class="at">y =</span> <span class="st">"Missing Percentage"</span>) <span class="sc">+</span></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a> <span class="fu">theme</span>(<span class="at">axis.text.x =</span> <span class="fu">element_text</span>(<span class="at">angle =</span> <span class="dv">45</span>, <span class="at">hjust =</span> <span class="dv">1</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</details>
<div class="cell-output-display">
<p><img src="data_files/figure-html/unnamed-chunk-1-1.png" class="img-fluid" width="672"></p>
@@ -242,39 +244,12 @@ <h2 data-number="2.2" class="anchored" data-anchor-id="missing-value-analysis"><
<div class="cell">
<details>
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(dplyr)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>threshold <span class="ot">&lt;-</span> <span class="fl">0.95</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>dropped <span class="ot">&lt;-</span> df <span class="sc">%&gt;%</span> <span class="fu">select</span>(<span class="fu">names</span>(<span class="fu">which</span>(<span class="fu">colMeans</span>(<span class="fu">is.na</span>(.)) <span class="sc">&lt;=</span> threshold)))</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="fu">suppressMessages</span>(<span class="fu">plot_missing</span>(dropped,<span class="at">percent=</span><span class="cn">TRUE</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</details>
<div class="cell-output cell-output-stderr">
<pre><code>
Attaching package: 'dplyr'</code></pre>
</div>
<div class="cell-output cell-output-stderr">
<pre><code>The following objects are masked from 'package:stats':

filter, lag</code></pre>
</div>
<div class="cell-output cell-output-stderr">
<pre><code>The following objects are masked from 'package:base':

intersect, setdiff, setequal, union</code></pre>
</div>
<details>
<summary>Code</summary>
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(redav)</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>threshold <span class="ot">&lt;-</span> <span class="fl">0.95</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>dropped <span class="ot">&lt;-</span> df <span class="sc">%&gt;%</span> <span class="fu">select</span>(<span class="fu">names</span>(<span class="fu">which</span>(<span class="fu">colMeans</span>(<span class="fu">is.na</span>(.)) <span class="sc">&lt;=</span> threshold)))</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="fu">plot_missing</span>(dropped,<span class="at">percent=</span><span class="cn">TRUE</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</details>
<div class="cell-output cell-output-stderr">
<pre><code>Scale for y is already present.
Adding another scale for y, which will replace the existing scale.</code></pre>
</div>
<div class="cell-output cell-output-stderr">
<pre><code>Scale for y is already present.
Adding another scale for y, which will replace the existing scale.</code></pre>
</div>
<div class="cell-output-display">
<p><img src="data_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid" width="672"></p>
<p><img src="data_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid" width="1344"></p>
</div>
</div>
<p>After excluding the previously identified columns, it becomes evident that six columns—Package, Versions, License, MD5, NeedsCompilation, and Repository—contain no missing values. Notably, the LinkingTo column, which signifies that the current package links to additional packages necessary for compilation or linking, has over 75% of its rows missing.</p>
Binary file modified docs/data_files/figure-html/unnamed-chunk-2-1.png
2 changes: 1 addition & 1 deletion docs/index.html
@@ -297,7 +297,7 @@ <h2 data-number="1.2" class="anchored" data-anchor-id="outcome"><span class="hea
</section>
<section id="data-sources" class="level2" data-number="1.3">
<h2 data-number="1.3" class="anchored" data-anchor-id="data-sources"><span class="header-section-number">1.3</span> Data Sources</h2>
<p>R packages used:</p>
<p>R packages used (for Results):</p>
<ul>
<li><code>available</code> [<a href="https://cran.r-project.org/web/packages/available/index.html">CRAN</a>]: This package let us “Check if the Title of a Package is Available, Appropriate and Interesting”.</li>
<li><code>pkgsearch</code> [<a href="https://cran.r-project.org/web/packages/pkgsearch/index.html">CRAN</a>]: This package helped us “Search CRAN metadata about packages by keyword, popularity, recent activity, package name and more. Uses the ‘R-hub’ search server, see <a href="https://r-pkg.org" class="uri">https://r-pkg.org</a> and the CRAN metadata database, that contains information about CRAN packages.”</li>