This project implements sequence logo generation using d3, given input data in list form. Wikipedia gives a good overview of sequence logos.
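For readers unfamiliar with sequence logos, the following sketch shows the usual way letter heights are derived from a list of sequences: the stack height at each position is its information content in bits, and each letter takes a share proportional to its frequency. The function names here are hypothetical and are not taken from sequence_logo.js, and the small-sample correction described in the literature is omitted.

```javascript
// Assumed alphabet: the four letters this project draws.
const ALPHABET = ['G', 'A', 'C', 'T'];

// Relative frequency of each letter at one position (column) of the data.
function columnFrequencies(sequences, position) {
  const counts = { G: 0, A: 0, C: 0, T: 0 };
  sequences.forEach((seq) => { counts[seq[position]] += 1; });
  const freqs = {};
  ALPHABET.forEach((letter) => {
    freqs[letter] = counts[letter] / sequences.length;
  });
  return freqs;
}

// Information content in bits: log2(4) minus the Shannon entropy of the column.
function informationContent(freqs) {
  let entropy = 0;
  ALPHABET.forEach((letter) => {
    const p = freqs[letter];
    if (p > 0) entropy -= p * Math.log2(p);
  });
  return Math.log2(ALPHABET.length) - entropy;
}

// Height of each letter: its frequency times the column's information content.
function letterHeights(sequences, position) {
  const freqs = columnFrequencies(sequences, position);
  const info = informationContent(freqs);
  const heights = {};
  ALPHABET.forEach((letter) => { heights[letter] = freqs[letter] * info; });
  return heights;
}
```

A fully conserved position yields a single letter 2 bits tall, while a position where all four letters are equally frequent yields a stack of height zero.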
Open index.html in Chrome or Firefox. Alternatively, visit my webpage.
We adopted the Airbnb style guide for JavaScript and the JSDoc format for our function comments. The code was linted with ESLint.
Due to the demo nature of this project, various design decisions were made that may need to be reconsidered if this application were to be deployed in production. I make note of these decisions here. I did not change them in this project because I did not want to prematurely optimize before having an understanding of what type of environments this application would be run in.
This application hardcodes sequence data as a list of strings, for easy processing. In a real application setting, we would probably want to provide a thorough parser for the FASTA format, with useful error reporting to the end-user if their data does not conform to our specifications. Input validation of this kind would ideally be performed server-side rather than client-side.
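As a rough illustration of what such a parser might look like, here is a minimal sketch. It is not part of this project: the function name `parseFasta` is hypothetical, and it assumes only the letters G, A, C, and T are valid, whereas real FASTA files may contain other symbols. Errors are collected with line numbers so they can be reported back to the end-user.

```javascript
// Minimal FASTA parsing sketch (hypothetical helper, not in sequence_logo.js).
// Returns the parsed records together with a list of human-readable errors.
function parseFasta(text) {
  const records = [];
  const errors = [];
  let current = null;

  text.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.trim();
    const lineNumber = index + 1;
    if (line === '') return; // Skip blank lines.

    if (line.startsWith('>')) {
      // A header line starts a new record.
      current = { header: line.slice(1).trim(), sequence: '' };
      records.push(current);
    } else if (current === null) {
      errors.push(`Line ${lineNumber}: sequence data before any '>' header`);
    } else if (!/^[GACT]+$/i.test(line)) {
      // Assumption: only G, A, C, T are accepted.
      errors.push(`Line ${lineNumber}: unexpected characters in "${line}"`);
    } else {
      // Sequence data may span several lines; join them.
      current.sequence += line.toUpperCase();
    }
  });

  records.forEach((record) => {
    if (record.sequence === '') {
      errors.push(`Record "${record.header}" has no sequence data`);
    }
  });

  return { records, errors };
}
```

The `records.map((r) => r.sequence)` result would then have the same list-of-strings shape that the application currently hardcodes.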
Because this is a demo project, we limited the length and number of sequences in the data that our application allows (see index.html). This is not because the application will break if the sequence data is outside this range, but rather because the quality of the visualization will suffer. For instance, sequences hundreds or thousands of positions long cannot be visualized in a single browser window; at that point the visualization would cease to be useful.
How to bound the range of data that this visualization application accepts is not immediately obvious, and we may need to investigate usage patterns and end-user needs before deciding on a final solution.
We also require that all sequences be of identical length. Supporting sequences of varying lengths would be an easy change to make.
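The restrictions described above (a cap on the number of sequences, a cap on their length, and identical lengths throughout) could be checked along the following lines. This is only a sketch: `validateSequences` is a hypothetical name, and the values in `EXAMPLE_LIMITS` are placeholders, not the limits actually set in index.html.

```javascript
// Placeholder limits for illustration; the real values live in index.html.
const EXAMPLE_LIMITS = { maxCount: 50, maxLength: 40 };

// Returns a list of error messages; an empty list means the data is acceptable.
function validateSequences(sequences, limits) {
  if (!Array.isArray(sequences) || sequences.length === 0) {
    return ['Expected a non-empty list of sequence strings.'];
  }
  if (sequences.some((seq) => typeof seq !== 'string')) {
    return ['Every sequence must be a string.'];
  }

  const errors = [];
  if (sequences.length > limits.maxCount) {
    errors.push(`Got ${sequences.length} sequences; at most ${limits.maxCount} are allowed.`);
  }

  // Use the first sequence as the reference length for all others.
  const expectedLength = sequences[0].length;
  if (expectedLength > limits.maxLength) {
    errors.push(`Sequences have length ${expectedLength}; at most ${limits.maxLength} is allowed.`);
  }
  sequences.forEach((seq, index) => {
    if (seq.length !== expectedLength) {
      errors.push(`Sequence ${index} has length ${seq.length}; expected ${expectedLength}.`);
    }
  });
  return errors;
}
```

Returning messages instead of throwing lets the caller decide whether to show all problems to the end-user at once.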
Because the application provides its own data, there is very limited error handling. Ideally, all functions exposed by sequence_logo.js would validate their inputs and provide useful debugging information.
This application was tested on recent versions of Chrome and Firefox. However, there is a vast array of browsers, both desktop and mobile, that this application could be run in. Deciding which browsers to support would probably depend on the end-user constituency. For instance, if many end-users run versions of Internet Explorer, that would need to be considered, and it would certainly require more work. At the very least, we would like to show end-users a message asking them to download a modern browser to use the application, rather than failing silently.
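One possible way to avoid failing silently is to check for basic SVG support before drawing anything. The sketch below uses a common feature-detection idiom; the function names and the message text are hypothetical, and passing this check does not guarantee that d3 itself will run in a given browser.

```javascript
// Returns true if the given document object appears to support inline SVG.
function supportsSvg(doc) {
  return Boolean(
    doc
    && typeof doc.createElementNS === 'function'
    && doc.createElementNS('http://www.w3.org/2000/svg', 'svg').createSVGRect
  );
}

// Returns a message for the end-user, or null if the browser looks usable.
function unsupportedBrowserMessage(doc) {
  if (supportsSvg(doc)) return null;
  return 'This application requires a modern browser such as a recent version of Chrome or Firefox.';
}

// In the page itself, the check would run against the real document.
if (typeof document !== 'undefined') {
  const message = unsupportedBrowserMessage(document);
  if (message !== null) {
    document.body.textContent = message;
  }
}
```

Taking the document as a parameter keeps the check testable outside a browser.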
Further, this project was implemented with d3 v4. If an existing codebase uses v3 or an earlier version, some calls would need to change to be compatible (for example, d3.scaleLinear() becomes d3.scale.linear()). Most of the capabilities are the same, however.
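If both d3 versions had to be supported for a while, the scale example above could be hidden behind a small helper. This is a sketch only: `createLinearScale` is a hypothetical name, and it covers just this one renamed call, not the other differences between v3 and v4.

```javascript
// Returns a linear scale from whichever d3 version is passed in.
function createLinearScale(d3lib) {
  if (typeof d3lib.scaleLinear === 'function') {
    return d3lib.scaleLinear(); // d3 v4 naming
  }
  return d3lib.scale.linear(); // d3 v3 naming
}
```

In the page this would be called as `createLinearScale(d3)`, with the rest of the code using the returned scale as usual.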
We chose to use a locally hosted copy of d3 instead of pointing to a CDN-distributed version. The disadvantage is speed, because our server may serve this file more slowly. The advantage is predictability: if the CDN goes down or the resource is changed/removed, our application could be unexpectedly impacted. Hosting our own copy prevents this.
Little consideration was given to speed, other than to ensure that the computations are primarily linear (iterating through the data, etc.) and that the application refreshes quickly on a modern browser in a typical (laptop) end-user environment with a modest CPU.
If many end-users run in environments with severely limited computing power or limited network bandwidth, that is something to consider. Options include performance monitoring and tuning (to improve performance on the client side once the code is loaded) and minifying the JavaScript file, using CDNs for distribution, etc. (to reduce network latency).
In addition to the design decisions given above, there are also some technical details for which a better solution may exist.
Currently, the various data associated with the letters 'G', 'A', 'C', and 'T' (the character itself, the SVG path data, the base transforms, etc.) are tracked by passing integer identifiers between functions. This works fine for a small application, but it probably makes the code harder for another developer to modify.
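One alternative, sketched below, is to keep everything about a letter in a single record keyed by the character itself, so functions pass the letter rather than an integer. The names `LETTER_DATA` and `getLetterData` are hypothetical, and the path and transform strings are placeholders rather than the values used in this project.

```javascript
// One record per letter; placeholder strings stand in for the real SVG data.
const LETTER_DATA = {
  G: { character: 'G', pathData: 'PLACEHOLDER_PATH_G', baseTransform: 'PLACEHOLDER_TRANSFORM_G' },
  A: { character: 'A', pathData: 'PLACEHOLDER_PATH_A', baseTransform: 'PLACEHOLDER_TRANSFORM_A' },
  C: { character: 'C', pathData: 'PLACEHOLDER_PATH_C', baseTransform: 'PLACEHOLDER_TRANSFORM_C' },
  T: { character: 'T', pathData: 'PLACEHOLDER_PATH_T', baseTransform: 'PLACEHOLDER_TRANSFORM_T' },
};

// Looks up a letter's record and fails loudly on anything unexpected.
function getLetterData(character) {
  if (!Object.prototype.hasOwnProperty.call(LETTER_DATA, character)) {
    throw new Error(`Unknown letter: ${character}`);
  }
  return LETTER_DATA[character];
}
```

Adding a new letter would then mean adding one entry to the table instead of updating several parallel lists.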
So that transforms can be easily applied, I produced predefined SVG path data for the letters 'G', 'C', 'A', and 'T' in a design program and exported them. This created some challenges, because letter paths do not generally have the same width in fonts, whether or not they are monospace. For instance, even in a monospace font, the letter 'T' is narrower than the letter 'A', and the font compensates by padding with whitespace. To my knowledge, we cannot easily do this as part of the SVG path information. To remedy this, I wound up applying a lot of predefined transforms to the letters to get them to render as desired. This is an unsatisfying solution, unfortunately.
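A possible alternative to hand-tuned transforms is to compute one from each letter's bounding box, so that every glyph is stretched to the same target cell. The sketch below is an assumption about how that could work, not what this project does; `fitToBox` is a hypothetical name, and in a browser the bounding box would come from calling `getBBox()` on the rendered path element.

```javascript
// Builds an SVG transform string that maps a glyph's bounding box
// onto a target box whose top-left corner is at the origin.
function fitToBox(bbox, target) {
  const scaleX = target.width / bbox.width;
  const scaleY = target.height / bbox.height;
  // SVG applies a transform list right to left: the glyph is first moved
  // so its bounding box starts at the origin, then scaled to the target size.
  return `scale(${scaleX}, ${scaleY}) translate(${-bbox.x}, ${-bbox.y})`;
}
```

For example, `fitToBox({ x: 10, y: 5, width: 20, height: 40 }, { width: 100, height: 100 })` returns `scale(5, 2.5) translate(-10, -5)`. Note that this stretches narrow letters such as 'T' to full width instead of padding them, which changes their proportions.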
Sam Lichtenberg (splichte@gmail.com)