My work is always covered by a confidentiality agreement. As a result, I do not
publish or post the code underlying the results in this deck on GitHub or elsewhere, and
gene names are redacted throughout.
Life Sciences Project: Use an innovative computational biology approach to identify growth factors and other
ligands useful for ex-vivo (outside the body) red blood cell development/culture.
First, I built an app using random forest
and support vector machine to classify
cellular subtype in single cell genomics
(RNAseq) datasets using data scraped
from the literature and public gene
repositories. Shown below is a composite
dataset I created from data provided on
Gene Expression Omnibus,
corresponding to dozens of bone
marrow aspirate samples, showing the
developmental progression from stem
cells (HSC) to mature red blood cells
(OrthoE-late); developmental time runs
counterclockwise from the left. Each spot is
a full transcriptome for a single cell,
corresponding to roughly 3000 bases for
about 30,000 genes each.
This dataset was used for Regulon Analysis, where I identified all of the activated transcription factors in the developmental
process. This was done by finding factors differentially expressed among the different developmental stages, where the factor
expression was correlated cell-by-cell with that of its known gene targets confirmed bioinformatically to contain canonical
factor binding sites upstream of the transcription start site of each gene. Discovered factors are redacted below.
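The correlation-plus-binding-site filter described above can be sketched roughly as follows. This is a minimal illustration only, not the confidential project code: the function names, the 0.5 correlation cutoff, the placeholder gene names, and the toy expression values are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant gene carries no correlation signal
    return cov / sqrt(vx * vy)

def regulon_targets(tf_expr, target_expr, has_site, min_r=0.5):
    """Keep targets that correlate with the factor cell-by-cell AND carry
    its canonical binding site (has_site is assumed to be precomputed)."""
    hits = {}
    for gene, expr in target_expr.items():
        r = pearson(tf_expr, expr)
        if has_site.get(gene, False) and r >= min_r:
            hits[gene] = round(r, 3)
    return hits

# Toy data: one value per cell; gene names are placeholders.
tf = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = {
    "GENE_A": [0.1, 1.1, 2.0, 2.9, 4.2],  # tracks the factor, has a site
    "GENE_B": [4.0, 3.0, 2.0, 1.0, 0.0],  # anti-correlated with the factor
    "GENE_C": [0.0, 1.0, 2.0, 3.0, 4.0],  # correlated, but no binding site
}
sites = {"GENE_A": True, "GENE_B": True, "GENE_C": False}
hits = regulon_targets(tf, targets, sites)
print(hits)
```

Only GENE_A survives both filters here; in practice the cutoff and the choice of correlation measure would need tuning to the sparsity of single cell data.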
I built a number of custom algorithms for this project, including one for determining whether discovered
regulons were randomly distributed across the HSCs or co-expressed in subsets of HSCs (regulon gene names
redacted).
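One way to ask whether two regulons are co-expressed in the same subset of cells, rather than scattered independently, is a permutation test on their overlap. The sketch below is an assumed approach for illustration, not the custom algorithm itself; the on/off activity calls, the function name, and the toy cell counts are hypothetical.

```python
import random

def coexpression_pvalue(active_a, active_b, n_perm=2000, seed=0):
    """Permutation test: is the number of cells in which both regulons are
    active larger than expected if regulon B were placed at random?

    active_a, active_b: lists of 0/1 activity calls, one entry per cell.
    Returns (observed_overlap, one_sided_p_value).
    """
    rng = random.Random(seed)
    observed = sum(a and b for a, b in zip(active_a, active_b))
    shuffled = list(active_b)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between the two regulons
        overlap = sum(a and b for a, b in zip(active_a, shuffled))
        if overlap >= observed:
            at_least += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Toy example with 40 cells.
regulon_a = [1] * 10 + [0] * 30                            # active in cells 0-9
same_subset = [1] * 10 + [0] * 30                          # active in the same cells
scattered = [1 if i % 4 == 0 else 0 for i in range(40)]    # every fourth cell

obs_same, p_same = coexpression_pvalue(regulon_a, same_subset)
obs_scat, p_scat = coexpression_pvalue(regulon_a, scattered)
print(obs_same, p_same)
print(obs_scat, p_scat)
```

A small p-value suggests the two regulons share a cell subset; a large one is consistent with random placement.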
Another was developed to identify which among the most differentially
expressed genes between cells of progressing developmental stages were
transmembrane proteins. For example, to mark the MEP to BFU transition it
would be helpful to know which transmembrane markers, if any, are available for
antibody staining. This was accomplished by parsing the
STRING and UniProt (Swiss-Prot) databases for transmembrane annotations.
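As a rough sketch of the annotation-parsing step, assume a tab-separated UniProt export with a gene-name column and a transmembrane-feature column. The column names, the `TRANSMEM` marker text, and the placeholder gene names below are assumptions for illustration and should be checked against the actual export.

```python
import csv
import io

def transmembrane_genes(tsv_text, gene_col="Gene Names", tm_col="Transmembrane"):
    """Collect the primary gene symbol of every row whose transmembrane
    column contains a TRANSMEM feature (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    tm = set()
    for row in reader:
        if "TRANSMEM" in (row.get(tm_col) or ""):
            names = (row.get(gene_col) or "").split()
            if names:
                tm.add(names[0])  # first listed name is taken as the symbol
    return tm

def surface_marker_candidates(ranked_de_genes, tm_genes):
    """Keep differentially expressed genes that are transmembrane,
    preserving the original ranking."""
    return [g for g in ranked_de_genes if g in tm_genes]

# Toy export; entries and gene names are placeholders.
rows = [
    ["Entry", "Gene Names", "Transmembrane"],
    ["X00001", "GENE_1 ALIAS_1", "TRANSMEM 24..44"],
    ["X00002", "GENE_2", ""],
    ["X00003", "GENE_3", "TRANSMEM 10..30; TRANSMEM 52..72"],
]
tsv_text = "\n".join("\t".join(r) for r in rows)

tm = transmembrane_genes(tsv_text)
ranked = ["GENE_3", "GENE_2", "GENE_1", "GENE_9"]  # most differential first
candidates = surface_marker_candidates(ranked, tm)
print(candidates)
```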
Another was developed to associate the regulons whose differential expression marks specific developmental cell
stage progressions with specific activated genetic circuits. This was done by using the STRING DB API to retrieve KEGG
diagrams. Scanning these diagrams reveals the cell surface receptors driving the circuit, and hence the activating
ligand. Doing this for all of the developmental stages allowed us to create a defined medium for ex-vivo RBC growth,
which contained both previously known and entirely new, unexpected growth factors and organic ligands.
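One possible route to the KEGG pathways behind a regulon is the STRING enrichment endpoint, sketched below. The URL layout and the response field names (`category`, `term`, `description`, `fdr`) reflect my understanding of the public STRING API and should be verified against its documentation; `caller_identity`, the function names, and the sample rows are made up for the example. No network call is made when the block runs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

STRING_API = "https://string-db.org/api"

def enrichment_url(genes, species=9606, caller="example_app"):
    """Build an enrichment request; identifiers are carriage-return separated."""
    params = urlencode({
        "identifiers": "\r".join(genes),
        "species": species,          # 9606 is the NCBI taxon ID for human
        "caller_identity": caller,
    })
    return f"{STRING_API}/json/enrichment?{params}"

def fetch_enrichment(genes, species=9606):
    """Download enrichment rows (defined for completeness, not called here)."""
    with urlopen(enrichment_url(genes, species)) as response:
        return json.load(response)

def kegg_pathways(rows, max_fdr=0.05):
    """Keep significant KEGG terms, most significant first."""
    hits = [
        (row["term"], row["description"], float(row["fdr"]))
        for row in rows
        if row.get("category") == "KEGG" and float(row["fdr"]) <= max_fdr
    ]
    return sorted(hits, key=lambda h: h[2])

# Hand-written rows in the assumed response shape; terms are placeholders.
sample_rows = [
    {"category": "KEGG", "term": "hsa00002", "description": "Pathway B", "fdr": 0.01},
    {"category": "Process", "term": "GO:0000001", "description": "Process X", "fdr": 0.001},
    {"category": "KEGG", "term": "hsa00001", "description": "Pathway A", "fdr": 0.0004},
    {"category": "KEGG", "term": "hsa00003", "description": "Pathway C", "fdr": 0.2},
]
url = enrichment_url(["GENE_1", "GENE_2"])
pathways = kegg_pathways(sample_rows)
print(url)
print(pathways)
```

The returned pathway identifiers could then be used to look up the corresponding KEGG diagrams and read off the receptors at the top of each circuit.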
Other types of projects for clients have focused on scanning the peer-reviewed literature to document the state of
expression of particular genes in certain patient sets. This was necessary because the literature was incomplete/inadequate
for this particular gene, which served as a target for a potentially lucrative new chemical entity (drug).
Both bulk (older style) and single cell genomics data were used in this project.
In this project, I developed a custom algorithm for conducting a candidate-gene based regulon analysis, involving computing
the canonical binding sites for all known human transcription factors, identifying genes whose cell-by-cell expression in a
single cell RNAseq dataset correlates with a differentially expressed transcription factor, parsing these for the presence of
the relevant canonical binding sites, and identifying activated regulons (genetic circuits). This app found several
important circuits missed by open-source libraries/modules for regulon analysis, such as pySCENIC.
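The binding-site parsing step can be illustrated by scanning a promoter sequence for a consensus motif on both strands. This is a sketch under assumptions: the consensus `WGATAR`, the promoter string, and the function names are purely illustrative and are not taken from the project, and a real pipeline might use position weight matrices rather than IUPAC consensus strings.

```python
import re

# IUPAC nucleotide codes expanded into regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def revcomp(motif):
    """Reverse complement of a motif written with IUPAC codes."""
    return motif.translate(COMPLEMENT)[::-1]

def motif_regex(consensus):
    """Compile a lookahead pattern so that overlapping sites are all found."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")

def find_sites(promoter, consensus):
    """Return sorted (position, strand, matched_sequence) tuples."""
    promoter = promoter.upper()
    hits = set()
    for strand, motif in (("+", consensus), ("-", revcomp(consensus))):
        for m in motif_regex(motif).finditer(promoter):
            hits.add((m.start(), strand, m.group(1)))
    return sorted(hits)

promoter = "CCAGATAAGGTTATCTGC"   # made-up upstream sequence
sites = find_sites(promoter, "WGATAR")
print(sites)
```

Positions are 0-based offsets into the promoter; minus-strand hits are reported as the sequence seen on the plus strand.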
Another client operated in the financial services industry and wanted a proprietary method of selecting value stock
investments. I developed an app to scan through the entire list of S&P500 and QQQ stock symbols for those whose
historical prices indicated present long-term value (6 months to 1 year). The Relative Strength Index (RSI), position
with respect to moving averages (Bollinger Bands), and a number of other features (e.g. MACD) were used to rank
the stocks. When the most attractive were used as an investment portfolio, returns were significantly greater than
general market returns, indicating that we were successful in identifying oversold and overbought value stocks.
Most of the stocks in the portfolio turned into positive investments right off the bat.
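Two of the features named above can be computed from closing prices alone. The sketch below uses a simple-average RSI (not Wilder's smoothed variant) and the Bollinger %B position, then combines them into a naive score; the scoring rule, the default window lengths, and the ticker symbols are assumptions for illustration, not the proprietary ranking.

```python
from statistics import mean, pstdev

def rsi(closes, period=14):
    """Simple-average RSI over the last `period` price changes (0 to 100)."""
    changes = [b - a for a, b in zip(closes, closes[1:])][-period:]
    if len(changes) < period:
        raise ValueError("not enough prices for the requested period")
    gain = sum(c for c in changes if c > 0) / period
    loss = sum(-c for c in changes if c < 0) / period
    if loss == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gain / loss)

def bollinger_percent_b(closes, window=20, k=2.0):
    """Position of the last close inside the bands: 0 = lower, 1 = upper."""
    recent = closes[-window:]
    mid, sd = mean(recent), pstdev(recent)
    lower, upper = mid - k * sd, mid + k * sd
    if upper == lower:
        return 0.5  # flat prices: treat as mid-band
    return (closes[-1] - lower) / (upper - lower)

def oversold_score(closes):
    """Naive combined score; lower means more oversold (assumed rule)."""
    return rsi(closes) / 100.0 + bollinger_percent_b(closes)

def rank_oversold(price_history):
    """Order symbols from most to least oversold."""
    return sorted(price_history, key=lambda s: oversold_score(price_history[s]))

# Toy price series for three hypothetical symbols.
prices = {
    "AAA": [100.0 - i for i in range(30)],        # steady decline
    "BBB": [50.0 + i for i in range(30)],         # steady rise
    "CCC": [50.0 + (i % 2) for i in range(30)],   # sideways oscillation
}
ranking = rank_oversold(prices)
print(ranking)
```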
Another client wanted a proprietary app for identifying stocks that represented exceptional short-term value.
To do this, I incorporated machine learning (GARCH volatility prediction) with various statistical features to
identify stocks poised for mean reversion within the next week. Over time, weekly investing results for stocks
identified in this way proved exceptionally profitable. Cumulative returns for four particular stocks repeatedly
identified in successive weekly analyses are shown below.
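For readers unfamiliar with GARCH, the volatility forecast at the heart of such an approach follows the GARCH(1,1) recursion, in which tomorrow's variance is `omega + alpha * r**2 + beta * variance_today`. The sketch below assumes the three parameters have already been fitted elsewhere (fitting is not shown) and uses made-up values and returns; it illustrates the recursion and its decay toward the long-run variance, not the client app.

```python
from math import sqrt

def long_run_variance(omega, alpha, beta):
    """Unconditional variance of a stationary GARCH(1,1) process."""
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for stationarity")
    return omega / (1.0 - alpha - beta)

def next_variance(returns, omega, alpha, beta):
    """Run the recursion over past returns; returns the one-step forecast."""
    var = long_run_variance(omega, alpha, beta)  # assumed starting value
    for r in returns:
        var = omega + alpha * r * r + beta * var
    return var

def forecast_path(first_var, omega, alpha, beta, horizon=5):
    """Daily variance forecasts; each step decays toward the long-run level."""
    lr = long_run_variance(omega, alpha, beta)
    return [lr + (alpha + beta) ** h * (first_var - lr) for h in range(horizon)]

# Made-up parameters and daily returns ending in a 5% shock.
omega, alpha, beta = 1e-5, 0.10, 0.85
returns = [0.0, 0.0, 0.0, 0.0, 0.0, 0.05]

first = next_variance(returns, omega, alpha, beta)
path = forecast_path(first, omega, alpha, beta, horizon=5)
week_vol = sqrt(sum(path))   # forecast volatility over the next five days
print(path)
print(week_vol)
```

After a large shock the forecast variance starts above its long-run level and shrinks each day, which is one reason such models are paired with mean-reversion signals.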