README.Rmd

---
title: "Collatz Sequences"
author: Dan Reznik
date: April 2019
output: github_document
---

A Collatz *Sequence* starts with N, a positive integer. The next term in the sequence is obtained from the current one as follows:

* If current is even, next is half the current: `coll[i+1] = coll[i]/2`
* If current is odd, next is 3 times the current plus 1: `coll[i+1] = 3*coll[i]+1`

The Collatz *Conjecture* states that for all starting N's, the sequence will always reach 1, explained [here](https://en.wikipedia.org/wiki/Collatz_conjecture).

Below we investigate basics of Collatz sequences with starting N from 1 to 10k, namely:

* for each N calculate the sequence and its length
* pick a starting N (e.g, 27) which produces an unusually long sequence
* study the distribution of the ratio of seq_length/N
* show the starting N's with largest seq_length/N ratios

```{r,echo=F}
knitr::opts_chunk$set(
  cache=T,
  collapse=T,
  comment="#>",
  dpi=300,
  fig.align="center",
  out.width='100%'
)
```

Load libraries

```{r,message=F}
library(tidyverse)
library(gtools) # for even()
library(furrr)
library(tictoc)
library(fs)
```

Calculate Collatz sequence with `purrr::accumulate()`

```{r}
collatz <- function(x,y) if (x%in%c(1L,2L)) done(1L) else if (even(x)) x/2L else 3L*x+1L
get_coll_seq <- function(x) (1:1e5) %>% # est. max length
  accumulate(collatz,.init=x)%>%as.integer
```

Show the `N=7` sequence

```{r}
get_coll_seq(7)
```

Plot the unusually long `N=27` sequence

```{r}
tibble(x=get_coll_seq(27)) %>%
  mutate(i=row_number()) %>%
  ggplot(aes(i,x))+
  geom_line(color="blue") + geom_point(color="black") +
  scale_y_log10() + 
  labs(title="N=27",
       y="collatz[i] (log scale)",x="Iteration")
```

Compute (in parallel) large # of Collatz sequences, and show a few of them, e.g., for N=20...30

```{r}
fname_coll <- "data/df_coll.rds"
n_max <- 10e3
tic()
if (file_exists(fname_coll)) { # avoid long calc w/ knitr
  df_coll <- read_rds(fname_coll)
} else {
  plan(multiprocess)
  df_coll <- tibble(n=1:n_max,
                    coll=n%>%future_map(get_coll_seq),
                    seq_s=coll%>%map_chr(str_c,collapse=";"),
                    seq_max=coll%>%map_int(max),
                    seq_l=coll%>%map_int(length)) %>%
    select(n,seq_l,seq_max,seq_s)
  # Save it to RDS
  df_coll %>% write_rds(fname_coll,compress = "bz")
}
toc()
```

```{r,echo=F}
df_coll %>%
  slice(20:30) %>%
  knitr::kable()
```

Plot sequence lengths vs starting N

```{r}
df_coll %>%
  filter(n>1) %>%
  ggplot(aes(n,seq_l)) +
  geom_line(alpha=.2) +
  geom_smooth() +
  labs(x='starting N',y='sequence length')
```

Same with logarithmic N, showing seemingly linear relationship

```{r}
df_coll %>%
  filter(n>1) %>%
  ggplot(aes(n,seq_l)) +
  geom_line(alpha=.2) +
  geom_smooth() +
  scale_x_log10() +
  labs(x='log(N)',y='sequence length')
```

Top ratios of sequence lengths by starting N

```{r}
df_coll %>%
  mutate(ratio=seq_l/n) %>%
  arrange(desc(ratio)) %>% 
  head(10) %>%
  mutate(rank=row_number(),ratio=round(ratio,2)) %>%
  select(rank,n,seq_l,ratio,everything()) %>%
  knitr::kable()
```

Mean and Median of length/n

```{r}
df_coll %>%
  filter(n>1) %>%
  mutate(ratio=seq_l/n,
         ratio_log=seq_l/log(n)) %>%
  summarize_at(vars(starts_with("ratio")),
               list(~mean,~median))
```

Plot density of sequence lengths divided by starting N. Conjecture: peak is caused by seemingly linear trend in the seq_l vs log(N) graph.

```{r,message=F}
df_coll %>%
  filter(n>1) %>%
  ggplot(aes(seq_l/n)) +
  geom_freqpoly(aes(y = ..density..)) +
  scale_x_log10() +
  labs(x='seq_length/N',
       y='density')
```

```{r,include=F}
ggsave("pics/steps_required.png",width=10,height=4)
```

Same distribution for seq_l/log(N). Conjecture: peak(s) are caused by seemingly linear trend in the seq_l vs log(N) graph.

```{r,message=F}
df_coll %>%
  filter(n>1) %>%
  ggplot(aes(seq_l/log(n))) +
  geom_freqpoly(aes(y = ..density..)) +
  scale_x_log10() +
  labs(x='seq_length/log(N)',
       y='density')
```

Happy Collatzing!