\subsubsection{Heterogeneous Single-ISA Processor Architectures}
Modern computing systems are constrained by \emph{dark silicon}; i.e.,
not all transistors can be powered at all times
\cite{DaSi2011,Venkatesh2010}.  The emergence of dark silicon has led
hardware architects to rethink processor design.  Recent designs
incorporate differing cores with a common instruction set and
programming model~\cite{Kumar.2005.heterogeneous,SulemanMQP09}.  Some
cores are complex with deep pipelines and large issue width, while
others are much simpler \cite{bigLittle}.  These \emph{heterogeneous
  single-ISA} designs improve energy efficiency by executing
computation on the most appropriate type of core.

While these processors present a wide range of performance and energy
tradeoffs, they create a difficult scheduling problem.  Not only does
a software scheduler have to map tasks to core types, it also must
ensure that energy is minimized (to increase battery life) and that
critical power thresholds are not violated.  For example, the Exynos
5 processor (based on the ARM big.LITTLE architecture \cite{bigLittle})
has a peak power of 5.5\,W, nearly $2 \times$ the maximum
sustainable heat dissipation, so peak speed can be sustained for less
than 1 second \cite{exynos5}.  In addition, our prior work has shown that
this processor achieves greater energy efficiency as it slows down
\cite{Imes2014,Imes2014a}.  

Such a processor design creates both a challenge and an opportunity
for real-time embedded computing. The challenge is ensuring that
timing constraints are met while operating in the most
energy-efficient configuration.  The opportunity arises because the
processor is over-provisioned --- it has a much higher compute
capacity than it can safely sustain.  This extra compute capacity
could be used in short bursts to prevent timing errors.  The next
section illustrates these concepts with empirical results.

\subsubsection{Preliminary Results on Energy Efficiency and Timeliness
  for big.LITTLE Processors}

\begin{table*}[th]
\caption{Two embedded platforms with different configurable components.}
\label{tbl:machines}
\scriptsize 
\centering
\begin{tabular}{lcccccc}
  \textbf{Type} & 
  \textbf{Processor} &
  \textbf{Cores} & 
  \textbf{Core Types} &
  \textbf{Speeds (GHz)} &
  %\textbf{TurboBoost} &
  %\textbf{HyperThreads} & 
  \textbf{Idle Power (W)} & 
  \textbf{Configurations} \\
  \hline
  \hline
  Homogeneous   & Intel Haswell Mobile       & 2 & 1 & 0.6--1.5  & 2.5 & 45 \\
  big.LITTLE & Samsung Exynos 5 Octa & 8 & 2 (A15, big \& A7, LITTLE) & 0.8--1.6 (A15), 0.5--1.2 (A7) & 0.12 & 69 \\
  \hline 
  \hline
\end{tabular}
\vskip -.7em
\end{table*}

We use an example application -- the x264 video encoder -- to compare
the different tradeoffs on an ARM big.LITTLE processor and a
traditional homogeneous multicore.  The video encoder is composed of
jobs, where each job encodes a frame. We instrument the encoder to
report job latency and measure each processor's energy consumption.
The two processors have different configurable resources, which
Table~\ref{tbl:machines} summarizes and compares.

%\begin{comment}
\begin{figure}[h]
\vskip -1.8em
\subfloat[Energy/Latency Tradeoffs]{
\includegraphics{pdf/rt-nsf-2016-figure0.pdf}
%\input{img/tradeoffs.tex}
\label{fig:x264-motivation-tradeoffs}}
\subfloat[Heuristic Energy Consumption]{
%\input{img/heuristics.tex}
\includegraphics{pdf/rt-nsf-2016-figure1.pdf}
\label{fig:x264-motivation-heuristics}}
%  \subfloat[Intel Haswell]%
%  {\input{img/tradeoffs-v.tex}%
%  \label{fig:pwr-perf-vaio}}
\subfloat[ARM big.LITTLE Power/performance]%
{
%\input{img/tradeoffs-o.tex}%
\includegraphics{pdf/rt-nsf-2016-figure2.pdf}
\label{fig:pwr-perf-odroid}}
\caption{Characterization of ARM big.LITTLE processor for latency and
  energy tradeoffs (a), different resource allocation heuristics (b),
  and power and performance tradeoffs (c).}
\label{fig:x264-motivation}
\end{figure}
%\end{comment}

Figure~\ref{fig:x264-motivation-tradeoffs} shows the latency and energy
tradeoffs of the video encoder on our two test platforms.  The
different shapes of these tradeoff spaces on the ARM big.LITTLE and
the homogeneous system lead to different optimal resource allocation
strategies. Empirical studies show that the \emph{race~to~idle}
heuristic, which makes all resources available and then idles after
completing a job, is near optimal on systems like the homogeneous
processor~\cite{PowerSlope,Hoelzle2009,google2,Imes2014,HotPower}.  On
systems like the big.LITTLE processor, recent approaches save energy
by keeping the system constantly busy and
\emph{never~idle}~\cite{Carroll13,LeSueur11,Imes2014,HotPower}.

To demonstrate how different scheduling decisions affect the different
platforms, we compare the energy consumption of both race-to-idle and
never-idle.  We set a latency target equal to the worst observed
latency for this application on each processor and measure the energy
consumption of encoding 500 video frames using each heuristic.
Figure~\ref{fig:x264-motivation-heuristics} shows the results, normalized to
the optimal energy found by measuring every possible resource
configuration for each frame.  Both heuristics meet the latency
target, but their energy consumptions vary tremendously. On the
homogeneous processor, \emph{race~to~idle} is near optimal, but
\emph{never~idle} consumes 13\% more energy.  Conversely,
\emph{never~idle} is near optimal for the heterogeneous processor, but
\emph{race~to~idle} consumes $2 \times$ more energy.
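
As a rough sketch of why the two heuristics diverge, consider a single
job with latency target $T$ under a simplified two-state power model.
The symbols here are illustrative assumptions rather than measured
quantities: $P_{\max}$ and $t_{\max} \le T$ are the power and job
latency with all resources allocated, $P_{\mathrm{idle}}$ is the idle
power, and $P_{c}$ is the power of a slower configuration $c$ that
finishes the job exactly at $T$.  Ignoring transition overheads,
\[
  E_{\mathrm{race}} = P_{\max}\, t_{\max} + P_{\mathrm{idle}}\,(T - t_{\max}),
  \qquad
  E_{\mathrm{never}} = P_{c}\, T .
\]
Under this model, \emph{race~to~idle} is preferable only when
$E_{\mathrm{race}} < E_{\mathrm{never}}$.  The measurements above are
consistent with this inequality holding on the homogeneous processor
and failing on the big.LITTLE processor.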

Finally, we note that due to the presence of dark silicon in the ARM
big.LITTLE, there is additional compute power that can be brought to
bear temporarily, but cannot be sustained.
Figure~\ref{fig:pwr-perf-odroid} illustrates this property by
showing the power and performance tradeoffs for the video encoder (all
normalized so that 1 is the maximum).  The horizontal dashed line
represents the maximum sustainable power without active cooling.  Note
two important properties of this figure: 1) the maximum performance is
almost 33\% higher than the sustainable performance and 2) the only
configurations that are guaranteed never to violate safe operating
power dissipation are those that use the little cores.
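
To give a sense of scale for the first property, suppose, as a
simplifying assumption, that job latency is inversely proportional to
performance.  Let $t_{\mathrm{sust}}$ denote a job's latency at the
best sustainable performance and $t_{\mathrm{burst}}$ its latency at
maximum performance (both symbols are introduced here only for
illustration).  Then
\[
  \frac{t_{\mathrm{burst}}}{t_{\mathrm{sust}}} \approx \frac{1}{1.33} \approx 0.75 ,
\]
so a short burst could shorten an individual job's latency by roughly
25\%, provided the burst is brief enough to stay within thermal limits.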

On the whole, the results in Figure~\ref{fig:x264-motivation} demonstrate
two key points that will guide our proposed research:
\begin{itemize}\itemsep 0pt \parskip 0pt
\item The conventional approach of allocating all resources and then
transitioning to a low-power idle state (\emph{race~to~idle}) is
extremely inefficient on the big.LITTLE processor.
\item The big.LITTLE processor can briefly reach speeds up to
  $1.33 \times$ those it can safely sustain.
\end{itemize}