- [X] Learn the org mode short cuts
- [X] Figure out bibtex
- [X] Get a LaTeX plugin working nicely in Sublime (LaTeXing isn’t working because you can’t buy it)
- [X] Set up an organisation repo to put all the org mode files and notes
- [ ] Specifics on mathematical induction, seems intuitive, must validate
- [ ] When we refer to “Order”, over and above the cyclic space “depth”, what else does it mean more generally? !!!! Same Cardinality, there is a bijection between them. Be absolutely sure on bijection. “Same number of elements of every order”?
Cardinality == The number of elements == the order
- [ ] So the key question becomes, what is the definition of structure within a group? How do I define equivalence?
- [ ] Find the research and theory behind lingual (https://github.com/julianhyde/optiq)
- [ ] Harvard algebra course: http://www.extension.harvard.edu/open-learning-initiative/abstract-algebra
- [ ] The containment problem is known to be computationally hard? A. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational databases. In STOC, 1977
- [ ] Logic programming
- [ ] Datalog: http://infolab.stanford.edu/~ullman/fcdb/aut07/slides/dlog.pdf
- [ ] Pure functional data types
- [ ] Monad plus, category (monoid laws or otherwise)
- [ ] Monad logic category
- [ ] Extensible effects
- [ ] Free monad, write some code, you clearly don’t understand it yet.
- [ ] Codensity and composing flatmaps
- [ ] Fundamental DB research (calculus), combined with certain algebras should be interesting
- [ ] How does left associativity relate to laziness?
- [ ] Paxos related papers
- [ ] SQL Parsers?
- [ ] Take a look at Staash (https://github.com/Netflix/staash)
- [ ] Grab some papers from https://github.com/rxin/db-readings
- [ ] Go through the spark tutorials, get familiar with Catalyst
- [ ] Read up on the query planning and hooks for same in Hive and Impala.
- [ ] Fundamental question, do you build a query planner into the middleware? Is that where the idea is going?
- [ ] Review datalog contrib in clojure (https://code.google.com/p/clojure-contrib/wiki/DatalogOverview)
- [ ] Add papers to the list from here: http://www.vldb.org/pvldb/vol7.html
- [ ] Add papers from this: http://pdos.csail.mit.edu/dsrg/
- [ ] Add papers from http://www.sigmod.org/2014
- [ ] Find some more survey papers:
- [ ] Query optimisation
- [ ] DB in general
- [ ] DB integration
- [ ] Transaction management, consistency, and related.
- [ ] Papers to read:
- [X] Materialization Optimizations for Feature Selection Workloads
- [X] LINVIEW: Incremental View Maintenance for Complex Analytical Queries
- [ ] Opportunistic Physical Design for Big Data Analytics
- [ ] PIQL – Scale Independent Query Processing
- [ ] Distributivity versus associativity in the homology theory of algebraic structures (http://arxiv.org/pdf/1109.4850.pdf)
- [ ] Distributivity in Quandles and Quasigroups (http://arxiv.org/pdf/1209.6518v1.pdf)
- [ ] Stackless Scala with Free Monads
- [ ] Fast and loose is morally correct
- [ ] SPANStore: Cost-effective Geo-replicated Storage Spanning Multiple Cloud Services
- [ ] Multi-Data Center Consistency (http://mdcc.cs.berkeley.edu/), http://pdos.csail.mit.edu/dsrg/
- [ ] Fast Paxos (http://research.microsoft.com/pubs/64624/tr-2005-112.pdf)
- [ ] ZooKeeper: Wait-free coordination for Internet-scale systems (http://static.usenix.org/event/usenix10/tech/full_papers/Hunt.pdf)
- [ ] Viewstamped Replication Revisited (http://pmg.csail.mit.edu/papers/vr-revisited.pdf)
- [ ] Query Optimisation (http://web.stanford.edu/class/cs346/ioannidis.pdf)
- [ ] An Overview of Query Optimization in Relational Systems (http://dis.unal.edu.co/profesores/eleon/cursos/tebd/articulos/05-chaudhuri.pdf)
- [ ] Out of the tarpit (http://shaffner.us/cs/papers/tarpit.pdf)
- [ ] Scalable SPARQL Querying of Large RDF Graphs (http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf)
- [ ] HadoopSPARQL : A Hadoop-based Engine for Multiple SPARQL Query Answering (http://2012.eswc-conferences.org/sites/default/files/eswc2012_submission_346.pdf)
- [ ] Distributed SPARQL query engine using MapReduce (http://www.inf.ed.ac.uk/publications/thesis/online/IM100832.pdf)
- [ ] Semantic Database Modeling: Survey, Applications, and Research Issues (http://www.cc.gatech.edu/computing/Database/readinggroup/articles/p201-hull.pdf)
- [X] Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing (http://www.vldb.org/pvldb/vol7/p1259-gupta.pdf)
- [X] Summingbird: A Framework for Integrating Batch and Online MapReduce Computations (http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf)
- [ ] A Co-Relational Model of Data for Large Shared Data Banks (http://dl.acm.org/citation.cfm?id=1924436)
- [X] Indexing HDFS Data in PDW: Splitting the data from the index (http://www.vldb.org/pvldb/vol7/p1520-gankidi.pdf)
- [X] The Unified Logging Infrastructure for Data Analytics at Twitter (http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf)
- [ ] Algebraic classifiers: a generic approach to fast cross-validation, online training, and parallel training (http://jmlr.org/proceedings/papers/v28/izbicki13.pdf)
- [X] Shark: SQL and Rich Analytics at Scale (http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf)
- [ ] S. Melnik et al. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3:330–339, Sept 2010
- [ ] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009
- [ ] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM
- [ ] B. Guffler et al. Handling data skew in MapReduce. In CLOSER, 2011
- [ ] X. Feng et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012
- [ ] Time, clocks and the ordering of events in a distributed system: http://web.stanford.edu/class/cs240/readings/lamport.pdf
- [ ] Reading on vector clocks (http://basho.com/why-vector-clocks-are-hard/)
- [ ] Dynamo: Amazon’s Highly Available Key-value Store (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)
- [ ] and related riak write up: http://docs.basho.com/riak/latest/theory/dynamo/
- [ ] Fallacies of distributed computing: http://www.rgoarchitects.com/Files/fallacies.pdf
- [ ] Paxos: http://the-paper-trail.org/blog/consensus-protocols-paxos/
- [ ] Quorum based commit protocol: https://ecommons.library.cornell.edu/bitstream/1813/6323/1/82-483.pdf, http://en.wikipedia.org/wiki/Quorum_(distributed_computing)
- [ ] Impossibility of Distributed Consensus with One Faulty Process (http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf), (http://www.slideshare.net/HenryRobinson/pwl-nonotes)
- [ ] Life beyond Distributed Transactions: an Apostate’s Opinion (http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf)
- [ ] HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (http://www.vldb.org/pvldb/2/vldb09-861.pdf)
- [ ] A Theory of Data Parallel Computing (https://www.dropbox.com/s/bripus159ziaqy9/eijkhout-tr1404-b.pdf), http://lambda-the-ultimate.org/node/5016#comment-81924
- [ ] [Semantic Integration of Heterogeneous Information Sources](http://www.mswi.uni-osnabrueck.de/rieger_2000_03.pdf)
- [ ] [RICES: Reasoning about Information Consistency across Enterprise Solutions](./papers/rices.doc)
- [ ] [Semantic-Integration Research in the Database Community](http://www.aaai.org/ojs/index.php/aimagazine/article/view/1801/1699), [Alternate link](./papers/1801-1797-1-PB.pdf)
- [ ] [Indexing Techniques in Data Warehousing Environment The UB-Tree Algorithm](http://www.aui.ma/personal/~H.Haddouti/UB_Tree_paper.pdf)
- [ ] [The Universal B-Tree for multidimensional indexing: General Concepts](http://citeseerx.ist.psu.edu/showciting?cid=13384)
- [ ] Bayer R (1996) The Universal B-Tree for multidimensional indexing. Technische Universität München Technical Report TUM-I9637. http://mistral.in.tum.de/results/publications/TUM-I9637.pdf
- [ ] Bayer R and McCreight E (1972) Organization and maintenance of Large Ordered Indexes. Acta Informatica 1(3):173-189. http://www.liacs.nl/~graaf/STUDENTENSEMINARIUM/OMLO.pdf
- [ ] Data Structures and Algorithms for Data-Parallel Computing in a Managed Runtime (http://axel22.github.io/resources/docs/my_thesis.pdf)
- [ ] Multidimensional Access Methods (http://mistral.in.tum.de/rwork/gg98.pdf)
- [ ] http://www.dcc.uchile.cl/~gnavarro/ps/cpm12.pdf
- [ ] http://www.ittc.ku.edu/~jsv/Papers/GVX11.WaveletTreeCCP.pdf
- [ ] http://alexbowe.com/wavelet-trees/
- [ ] http://blog.treode.com/minitransaction/
- [ ] Cache-oblivious B-trees
- [ ] https://github.com/analytics/analytics/blob/master/notes/papers.md
- [ ] Cluster install:
- [X] Create a vagrant script that creates a box and installs CM and the impala tpc-ds kit
- [X] Manually create the nodes
- [X] Use the wizard to add these nodes to the cluster
- [ ] Complete the data generation through running a script from the CM box
- [ ] Figure out how to add the TPC-DS data to s3, and download it.
- [ ] Move node to EBS backed
- [X] Figure out why HDFS isn’t picking up the disks first time
As a first piece of research, the idea is to benchmark Spark SQL against Impala, then implement an integration between a relational DB and Catalyst such that certain queries are optimised, and show the performance uplift. There will be two outcomes: first, some numbers on concurrent users for a given cluster; second, a comparison of the performance of certain queries before and after the externalized query index has been created.
- [X] Cluster creation process on AWS
- [X] Local dev environment
- [X] Get the data generated in the small
- [ ] Write some scripts that load the source data into parquet data using hive, not impala.
- [ ] Get the data generated in the large
- [X] Get impala tests working on the local vm
- [ ] Experiment with the performance testing framework for scala
- [ ] Get equivalent spark tests working on the local VM
- [X] Write the performance scripts for impala
- [ ] Get the performance test working on the local VM
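The before-and-after comparison in the research plan above could be sketched roughly as follows. This is only a minimal timing harness under stated assumptions: `median_seconds`, `speedup`, and the two stand-in query functions are hypothetical names, and the stand-ins are plain Python loops rather than real Impala or Spark SQL calls.

```python
import statistics
import time


def median_seconds(run_query, repeats=5):
    """Run a zero-argument callable `repeats` times; return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(before_seconds, after_seconds):
    """A ratio above 1 means the query ran faster after the change."""
    return before_seconds / after_seconds


# Hypothetical stand-ins: replace with calls that submit the same query
# before and after the externalized query index exists.
def query_without_index():
    return sum(i * i for i in range(200_000))


def query_with_index():
    return sum(i * i for i in range(20_000))


before = median_seconds(query_without_index)
after = median_seconds(query_with_index)
print(f"before={before:.4f}s after={after:.4f}s speedup={speedup(before, after):.1f}x")
```

Taking the median over several runs is one way to dampen warm-up and caching noise; a real test would probably also need to control for cluster load.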
./cloudera-manager-installer.bin --i-agree-to-all-licenses --noprompt --noreadme --nooptions

sudo sysctl vm.swappiness=0
To copy from HDFS to S3: hadoop distcp -Dfs.s3.awsAccessKeyId= -Dfs.s3.awsSecretAccessKey= hdfs:///user/hive/warehouse/tpcds_parquet.db/customer s3://tpcds/tpcds-cdh5/customer
7563551141
A functor is your structure: something you can map over. Free is a way of encoding an AST; a generic tree is a free monad.
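To make the "tree is a free monad" remark concrete, here is a minimal sketch in Python. It assumes the simplest interesting functor, a pair of children, so the free monad over it is a binary tree with values at the leaves. The names `Pure`, `Branch`, `bind`, and `leaves` are illustrative, and `Branch` merges the functor layer with the wrapping constructor, which a typed encoding would keep separate.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Pure:
    """A leaf holding a value (the monad's `return`)."""
    value: Any

    def bind(self, f):
        # Binding at a leaf just applies f to the stored value.
        return f(self.value)


@dataclass(frozen=True)
class Branch:
    """One layer of the functor: here, a pair of subtrees."""
    left: Any
    right: Any

    def fmap(self, g):
        # Functor map: apply g to each child position in this layer.
        return Branch(g(self.left), g(self.right))

    def bind(self, f):
        # Push the bind through the functor layer until it reaches the leaves.
        return self.fmap(lambda child: child.bind(f))


def leaves(tree):
    """Interpret the tree by collecting leaf values from left to right."""
    if isinstance(tree, Pure):
        return [tree.value]
    return leaves(tree.left) + leaves(tree.right)


tree = Branch(Pure(1), Branch(Pure(2), Pure(3)))
# bind substitutes a whole subtree for every leaf: tree grafting.
grown = tree.bind(lambda x: Branch(Pure(x), Pure(x * 10)))
print(leaves(grown))  # [1, 10, 2, 20, 3, 30]
```

The tree itself is only the AST; `leaves` is one possible interpreter for it, and others could fold the same structure differently.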
Learning order: 1-Algebra 2-Pure functional data structures 3-Logic and category 4-Fusion and optimisation