# Manifold Learning Yields Insight into Complex Biological State Space

Thanks for coming out. I'm super excited to host today Smita and David, who are visiting from Columbia. Yale, sorry, I was jumping ahead. Smita did her postdoc at Columbia with Dana, so she's been at Yale for the last three years now, and she was at [inaudible] before that. They're going to talk a lot about high-dimensional learning, for single cell in particular. As I've said, [inaudible] came from Dana's group, and before that Michigan, or something, I don't know. David was at the University of Amsterdam for his Ph.D., but he actually spent most of his time at the Weizmann, which is, you know, right next door, in Israel, working with Eran Segal, and he is now a postdoc with Smita at either Columbia or one of those. If you want to ask a question, type it in.

I'm going to go ahead and get started. If you have any questions, or if you want to know more about something, you can ask me right away or you can ask me at the end, and David has extra slides on everything if you want to go into more detail about anything specific.

Now I'm just going to talk about, basically, what our group does. We're jointly located in computer science and the medical school at Yale, and we're basically a machine learning or applied math type of group. We develop techniques that are meant to analyze complex biomedical data, and they can often be applied to other types of data, because we believe the bar for analyzing biomedical data is a little bit higher: there can be all kinds of sparsity, missing information, a lot of noise, a lot of nonlinear effects, and things like that. Basically, we do unsupervised machine learning, so we believe that if we're able to detect those sorts of patterns, then hopefully we can detect other sorts of patterns as well.
As I was mentioning, our group develops techniques that deal with data that can be very high dimensional, so you're measuring a lot of different features. Often we apply our techniques to single-cell RNA sequencing data, where the features are mRNA transcript counts, but we've applied them to a variety of other data sets. The data is also high throughput, so there are a lot of different units. We've tended to work with single-cell data, where the units are cells, or data where the units are patients, but there are a lot of them.

There's also a lot of heterogeneity within the data set, which means that we want to learn the different ways that the data itself is stratified, what the different patterns are, and how much information and redundancy there is in the features. The challenges this kind of data can pose are, as I mentioned: sparsity, where even though there are many, many features, a lot of them could be missing for a lot of the data points; noise, which could be extrinsic, background ambient noise or more specific types of noise; nonlinearity in the shape of the data and in how the features relate to each other; and just the scale of it, which can pose computational problems, so the algorithms need to be scalable.

With this kind of data, one question that we often ask is: how do we look at this data? You might have a lot of data, and often you might not even have any idea what your data can tell you, or what you can even predict. In this circumstance, we have unsupervised techniques that will help you just look at and explore the data, to generate hypotheses from it.

Some of the technologies in biology that are generating this scale of data are mass cytometry, or CyTOF, as you mentioned, which generates a lot of protein channels; RNA sequencing; Hi-C; TCR sequencing, which I know a lot of you work with; and the list goes on. We think this kind of thing just keeps exploding, and there are more and more technologies that are capable of generating it.

So how do we deal with these challenges? One of the themes that we've seized upon in our lab is the fact that even though the data is measured in a high-dimensional space, it has intrinsically low-dimensional structure.
For example, in the single-cell RNA sequencing world, there are a lot of genes that you measure, but these genes have a high degree of informational redundancy between them, and as a result the cell types and states are constrained. So there's an intrinsically lower-dimensional shape here: the ambient space is a lot bigger than the intrinsic shape. That's one of the principles we use in our lab: how do we learn this lower-dimensional intrinsic shape? As you can see, the shape can be nonlinear. How do we understand the shape?

One of the fields in math that we use a lot is graph signal processing, where there's a nice theory of manifolds, and we show that we can also implement the same concepts with deep learning. Not only does the data have shape, but there is some smoothness in it, and the smoothness, or the locality, in the data can really help you learn and understand this shape. I'll give you an example here. Smoothness means your data points are occupying the state space in some semi-regular fashion. Even if they're not uniformly sampled across it, you have samples in different parts of it, so that you can learn something about it. It doesn't have to be one smooth shape; it can be a collection of them.

So basically, if you have this kind of highly nonlinear, lower-dimensional structure, how do you learn it? The very basic way is to look for data points that are very similar to each data point. To some level of granularity, even if you have high dimensions and noise, you should be able to tell which data points are very similar to your point, in a very local way. If you're trying to look for points that are very far away, you can get lost, because the data might be structured like a Swiss roll.
Unless you know that, you might not understand that the points on the different arms are actually far away from each other, but you can understand which points are very close to each other. Once you have points that are very close to each other, you can take a walk by taking small steps to local neighbors, and this walk will help you clarify the shape of the data.

The walk finds the highly nonlinear intrinsic dimensions that the data follows. When you walk in small steps, you can define a Markov matrix that says that, starting from any point, you're going to transition to other points based on your measure of similarity. That measure of similarity doesn't necessarily have to be perfect; it can be pretty rough. But as you walk longer distances, the probabilities add up, so the longer you walk, the more you hone in on the main dimensions of the shape.

This concept is called data diffusion, or a random walk, and this data diffusion concept, where you find these main nonlinear eigen-dimensions, turns out to be the basis of graph signal processing.

We used this concept for the first project that David and I did together, which was just published. It's an algorithm called MAGIC. What MAGIC shows is that by learning this manifold of the data you can restore values to cells that were missing, or you can denoise cells so that you can further process them. You're basically restoring the data to the main manifold dimensions.

The motivating problem was in single-cell RNA sequencing, which came out about three years ago. You have a cell-by-gene count matrix, but most of it is missing, just because of the way RNA capture works. It's like reaching into your cell and grabbing five to ten percent of the transcripts, and the rest of them you leave in the cell. Given this, a lot of what I had done with CyTOF, for example, in my postdoc, which was trying to learn relationships, signaling relationships between proteins, could no longer be done. If you're looking per gene now, and you're trying to get information from your 20,000 genes about which gene has a relationship or correlation with another gene, it becomes very hard to discern this.
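The random-walk construction described above (a Markov matrix built from a rough local similarity, then powered to take longer walks) can be sketched in a few lines of NumPy. This is only a minimal illustration under assumptions: the toy helix data, the Gaussian kernel, the bandwidth `sigma`, and the step count `t` are illustrative choices and do not come from the talk.

```python
import numpy as np

def markov_matrix(X, sigma=1.0):
    """Row-stochastic random-walk matrix built from a Gaussian similarity.

    X is (n_points, n_features); sigma is an assumed kernel bandwidth.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))      # near 1 for close points, near 0 for far ones
    return K / K.sum(axis=1, keepdims=True)   # normalize each row into transition probabilities

# Toy data (assumed): noisy points along a helix, a 1-D curve embedded in 3-D.
rng = np.random.default_rng(0)
s = np.sort(rng.uniform(0.0, 3.0 * np.pi, 200))
X = np.column_stack([np.cos(s), np.sin(s), 0.3 * s])
X += 0.02 * rng.normal(size=X.shape)

P = markov_matrix(X, sigma=0.3)        # one small step to local neighbors
t = 8                                  # assumed number of steps
P_t = np.linalg.matrix_power(P, t)     # transition probabilities after t steps
```

Each row of `P_t` is the distribution over where a walk that started at that point ends up after `t` steps. Because only nearby points have appreciable one-step probability, longer walks have to travel along the curve rather than jump across it.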
You're often not seeing both genes in the same cell at all. If you only see one gene in 10 percent of the cells and another gene in another 10 percent of the cells, the likelihood that you're seeing both genes in the same cell is 1 in 100. So a lot of this data is dropped out, and you can't do the type of gene regulatory analysis that seems to be promised by the technology.

The steps of MAGIC are basically these. You start with a cell-by-gene matrix. You calculate distances, but again, you don't trust faraway distances, so you localize the distances by passing them through a Gaussian affinity, which gives you very local neighbors. Then you Markov-normalize it to create a random walk matrix, and then you exponentiate this matrix, raising it to a power, to compute the probabilities of longer random walks. This has then honed in on the main dimensions of the data. If you eigendecomposed it, you'd get what's called a diffusion map, but we don't actually eigendecompose it. You basically project this operator onto the data to learn a weighted average of manifold-intrinsic neighbors and get the imputed data.

So you have the data before MAGIC, and after several steps of imputation, when you project, you get your restored data after MAGIC. You can see that some of the other types of noise in the data, besides missing values, are also corrected. What the subsequent steps of diffusion do is connect the manifold more and more globally, so you get more global directions. As you do this, you take off the low, noise dimensions and hone in on the main manifold dimensions, and you see restored relationships emerging. It can restore nonlinear relationships, because you're not imposing any sort of global constraint on the shape of the relationship or on what you impute.

[Audience question, inaudible.]

Yeah, it's single cell. They're two genes; you can think of them as two random genes. This is single-cell sequencing, so it counts molecules for genes.
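The MAGIC steps listed in the talk (distances, Gaussian affinity, Markov normalization, powering, projection onto the data) can be sketched as follows. This is a hedged sketch and not the published implementation, which may differ in kernel choice, preprocessing, and scaling. The function name `magic_sketch`, the parameters `sigma` and `t`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def magic_sketch(X, sigma=1.0, t=3):
    """Impute a cells-by-genes matrix by diffusing it over a cell-cell graph.

    sigma (kernel bandwidth) and t (number of walk steps) are illustrative.
    """
    # 1. Pairwise squared distances between cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # 2. Gaussian affinity: only nearby cells keep appreciable weight.
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    # 3. Markov normalization: each row becomes a probability distribution.
    M = A / A.sum(axis=1, keepdims=True)
    # 4. Powering: probabilities of t-step random walks.
    M_t = np.linalg.matrix_power(M, t)
    # 5. Projection: every cell becomes a weighted average of its walk neighbors.
    return M_t @ X

# Toy data (assumed): three genes driven by one hidden cell state u,
# then undersampled by keeping each molecule with probability 0.1.
rng = np.random.default_rng(1)
n_cells = 300
u = rng.uniform(0.0, 1.0, n_cells)
true_counts = np.column_stack([50 * u, 50 * (1 - u) ** 2, 30 * u ** 2]).round().astype(int)
observed = rng.binomial(true_counts, 0.1).astype(float)

imputed = magic_sketch(observed, sigma=2.0, t=3)
```

Because each row of the powered matrix is a probability distribution, every imputed value is a weighted average of observed values for that gene, so it stays within the observed range. Note that this sketch does not rescale the averages back toward the pre-dropout count scale.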
These are molecule counts.

I'm trying to understand what's on the x-axis. How do you determine that? Is each one an expression level?

It's just two features. You can think of it this way: we had a matrix with a bunch of missing data, and I'm trying to learn the relationship between two columns of the matrix, just to see if the two columns have correlation. So if you had people, and I'm trying to figure out if there's correlation between their, I don't know, [inaudible] and salary or something like that, I just couldn't do it, because most of it was missing. When I restored it, I could restore the relationship between these two features.

So, I understand: you always say that you have a sparse matrix. [inaudible]

That's right, but actually it's not matrix completion; it's harder. The name for that is matrix completion. Matrix completion assumes that the values you have in your matrix are okay and you just have to compute the missing values. But that's not this situation. In this situation, you've undersampled everything.
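The undersampling described above can be checked with a small simulation. This sketch assumes each transcript is captured independently with probability 0.10 (the talk quotes five to ten percent). The variable names and the independence assumption are illustrative.

```python
import numpy as np

capture = 0.10            # assumed per-molecule capture probability
rng = np.random.default_rng(0)
n_cells = 100_000

# Two genes, each present as a single transcript in every cell.
gene_a = rng.binomial(1, capture, n_cells)
gene_b = rng.binomial(1, capture, n_cells)

# Fraction of cells in which both genes are observed together;
# under independence this is about capture * capture = 0.01.
frac_both = np.mean((gene_a > 0) & (gene_b > 0))

def p_zero(n_transcripts, p=capture):
    """Probability that a gene with n_transcripts copies is observed as zero."""
    return (1.0 - p) ** n_transcripts
```

Under the same assumption, the zero probability `p_zero(n)` is 0.9 for a single-copy transcript and shrinks geometrically with copy number, so zeros fall mostly on lowly expressed genes while every gene is undercounted.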

You have a small set of samples from your cell, and everything is undercounted.

What's a missing value, exactly?

Yeah, there's no exact distinction, and the algorithm doesn't distinguish like that either.

If something's highly expressed, the chance it would be zero is pretty low, right? You've got something that's got thousands of [inaudible].

Well, usually if we see a transcript, it's pretty trustworthy, unless you have a mapping [inaudible].

What I'm saying is that it's not even random, in the sense that if something's got lots and lots of transcripts and you have a 10% chance of taking each one, then with a thousand of them the odds of getting zero are very low.

Yes. [inaudible]

So then the interesting genes are the ones with little expression, like transcription factors. Is there interesting stuff that is highly expressed?

Yeah, unfortunately, actin. So yeah, exactly. Or surface markers, because then you can use them to sort and study cells. Those are exactly the ones that are missing.

You mean before your imputation?

[inaudible] So does that help answer your question?

And I'm not assuming you know what these relationships look like, although if you know about the epithelial-to-mesenchymal transition in breast cancer, you know that this is the correct direction for these relationships, because Snail is a transcription factor and vimentin is a mesenchymal marker, so when Snail goes up, vimentin goes up. And E-cadherin and vimentin are markers of the epithelial versus the mesenchymal state, so they're not on at the same time.

[inaudible] We did try matrix completion techniques.

We always try to find ways to validate it, computationally and otherwise. Here we had data from worms that were developing.
This was over the first, I think, 48 hours, and they had bulk sequencing, but there was a natural manifold structure that we knew about. We artificially dropped this data out, and we re-imputed it with MAGIC and denoised it, as you can see. After the dropout, the original relationships got totally messed up; most of it sunk down to zero. Then you re-impute it with MAGIC, and you see that the original relationship is restored or even further clarified.

Why is this interesting to us? Because you can learn a lot about individual genes now. You can look at their pairwise relationships and their individual probability distributions, and you can distinguish between different populations and how much of a particular gene is represented in each. In addition to looking at individual genes, the overall data is restored. This is before MAGIC, and afterwards you see a very clear structure in this data. This is, again, data mimicking the epithelial-to-mesenchymal transition in breast cancer cells, so we see different structures; the yellow, for example, corresponds to mesenchymal cells that successfully transitioned.

One thing we can do after MAGIC that we completely couldn't do before is something called archetypal analysis. There's the data before MAGIC and after, and the main idea is to fit a polytope to the extreme ends of the data to figure out what's creating the overall shape. When we did that, we were able to identify extreme end points of this shape that correspond to chromatin modification. So, for example, biologists could now look at these intermediate states that people hadn't really considered that much, and see that there is a long program of chromatin modifications going on.
A lot of times people hone in on the first part of this, which is epithelial cells, or the last part, which is mesenchymal cells, and they're not really sure how these cells transition through the intermediate populations. That's something you can start to get a picture of after you've restored all your gene values.

This is comparing MAGIC against matrix completion and low-rank approximation methods. NNMC is nuclear-norm-based matrix completion, the Candès-based method, and it doesn't perform super well here. Low-rank approximation actually does somewhat better, but it's a linear method, so it's not completely denoising in the nonlinear directions. And then you see MAGIC.

MAGIC restores the data shape to something that's identifiable to us.

We measured it against a lot of other methods. People often ask us: do you need this complicated random walk technique? Can't you just do this by imputing with kNN? And you see it's not nearly as good if you impute with just your k nearest neighbors. The global manifold connections actually do make a difference when you're learning the shapes. And you can't use individual diffusion components or anything like that to get the same result. So it can't just be simple smoothing on your graph; the manifold learning actually makes a difference.

After we did this project, where basically what we did was restore the gene values, we thought we could do similar inference on external variables, because we've learned the cell graph. We thought this could really help us, because single-cell RNA sequencing data now is not only from a single condition; it can be generated in multiple conditions, and now people want to compare conditions. So I have this single-cell data, and I have it with a perturbation: what's the real difference? Measuring the difference is actually pretty hard, because for starters there's all this noise that occludes the difference, but also the effect itself could be weak or could be measured weakly.

So we thought we'd treat this experimental variable itself as a signal, similar to how we treated a gene, and use the same sort of graph manifold denoising technique on an external variable. It's configured a little bit differently, because we don't know some things about the experimental variable. This is a technique that's in preparation; David is supervising this project, and Jay and Dan are working on it. The idea is to smooth an external, perhaps experimental, label on the data graph to impute a continuum and infer associations.
The idea here is that you have cells from condition 1 and condition 2, but you've learned your graph from single-cell RNA sequencing data, from the whole data, on the genes. Now you can interpolate the labels to get a continuous variable that gives you some enhanced abilities to tell the difference between the two conditions.

One thing it can do is improve your ability to correlate variables with the experimental condition. Before, you had these discrete labels, and between condition 1 and condition 2, is there really a significant difference? Can you tell? Often you can only tell if it's a monotonic relationship, like in A, but you can't tell a non-monotonic relationship like you have in B.

We call this melding; we've turned it into a verb, so we're melding variables. After we meld the variable, we infer this latent dimension that is much easier to track, to see if it correlates with the experimental condition. So you can think of the MELD dimension as a continuous experimental variable. The other thing it can do is help us identify subtypes: if you have condition A and condition B and you mix samples from them, you can see which cell types are enriched in condition A versus condition B.

As for the math behind it, this is also using graph signal processing. The idea in graph signal processing is that you have a graph, you're treating this variable as a signal on that graph, and what you're actually doing is low-pass filtering on this graph. Like before, when you power the matrix, what you're actually doing is keeping the high eigenvectors of this Markov matrix, whose eigenvalues range from 1 to 0 by the properties of Markov matrices.
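Low-pass filtering a condition label on a cell graph can be written in several ways. The sketch below uses Laplacian (Tikhonov) regularization, which is an assumption chosen for illustration and not necessarily the filter MELD uses. The function name `lowpass_label`, the parameter `beta`, and the toy path graph are hypothetical.

```python
import numpy as np

def lowpass_label(A, label, beta=1.0):
    """Smooth a per-cell label over a graph with symmetric affinity matrix A.

    Solves  min_y ||y - label||^2 + beta * y^T L y,  where L is the graph
    Laplacian; the closed-form minimizer is (I + beta * L)^{-1} label.
    Larger beta (an assumed knob) means more smoothing.
    """
    L = np.diag(A.sum(axis=1)) - A            # combinatorial graph Laplacian
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + beta * L, label)

# Toy graph (assumed): 20 cells on a path, each connected to its neighbors.
n = 20
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = 1.0
A[idx + 1, idx] = 1.0

# Discrete condition label: 0 for the first ten cells, 1 for the last ten,
# with two cells whose raw label disagrees with their neighbors.
label = np.array([0.0] * 10 + [1.0] * 10)
label[[3, 14]] = [1.0, 0.0]

smoothed = lowpass_label(A, label, beta=2.0)
```

The result is a continuous score between 0 and 1 for every cell. High-frequency components of the label (the two isolated disagreements) are damped the most, while the slow change from one end of the graph to the other is largely kept.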
Something between 0 and 1, when you raise it to a power, goes down unless it's close to 1, and that's the property that you're using. That's smoothing on this graph, or smoothing on the Laplacian. You see that before the smoothing the signal is kind of noisy; it has all these high-frequency components. As you smooth more and more, you hone in on the low-frequency components, so you're smoothing the external variable along the main directions of the graph, assuming that the main directions of the graph represent the difference.

What T would be is how much you power the matrix, even though in this setup we actually don't power it; we have it as a convex optimization that trades off smoothing against reconstruction. But in MAGIC we had a t, where you powered it. It's how much you power, or how much you smooth.

For example, if you have time-series data and you're looking at the expression of Sox10 (this is some embryonic stem cell data), you impute continuity to this, and then you can tell there are two cell populations.

One subpopulation rapidly increases the expression of Sox10, which is a marker for some lineage. Do you remember which one right now?

So these are the kinds of inferences that you can make. Maybe an even more useful kind of inference: we're testing it on this HIPC data. HIPC stands for the Human Immunology Project Consortium. We get, basically, CyTOF or single-cell RNA sequencing data from these different infectious diseases. This is a Lyme disease analysis. This is the difference between a patient's first visit and a patient's second visit. We meld the visit label, and in what we're showing here, the right side is colored by the melded time. Populations that are enriched at the second visit are in the lighter colors, and populations that are more common at the first visit are in the purple colors. So you see the T cells: there are certain kinds of T cells that are showing up in yellow, and if you zoom in on them, you see that at the second visit there are certain regulatory T cell populations that are appearing. Regulatory T cells are actually quite rare, and it would be hard to pick this out without MELD smoothing the coloring on the label. There are also memory T cells that are upregulated at visit two. We can generally see how these populations arise and then maybe go away as the patients come back, and they're expected to come back more and more times, so we're excited about using MELD for this.

You have two labels, if I understand correctly. You can have more than two if you have time steps, like four or five, but in this particular case you have, like, zero and one [inaudible]. Now you do the matrix powering, and eventually you get a little bit [inaudible]. It's kind of a little bit like diffusion.

To some extent, that's exactly it.

Okay, so you have a soft probability.
Yeah, you have a soft probability that in some respects you could interpret as a likelihood that the cell was generated by condition two versus condition one. In a specific setting, the score can be interpreted as something else. For example, these are two visits in time, so now you can say [inaudible]: you measured two time points, but effectively you have many time points. But it's still a score of how likely it is that the cell came from sample one or sample two.

So when you generalize to more than two, this one dimension will become [inaudible]?

Yeah, it would be multi-dimensional. And you can do the same thing, right? If you encode your condition with ten bits, you can meld all ten bits.

[Audience question, largely inaudible.] What happens in your imputing [inaudible] analysis: you haven't really changed the situation at all. If the doctor wants to see whether there are regulatory T cells, they would still do the same [inaudible].

So basically, that's not exactly correct. Even if you have two [inaudible], you can see if there's a strong regression with the condition after MELD that you couldn't see before. Imagine computing a p-value to see if the distribution is different in condition A versus condition B: you'd do some sort of statistical test. But I'm saying you can do some kind of stronger inference on the MELD dimension; you can compute the correlation with it. Moreover, in A and C, you can see it in the original data, right? As you can see. But in B you would not.

Can we zoom in on that again? What about the data that you see in that panel would lead you to conclude that?

So you have to understand that each sample is very heterogeneous, right? Let's maybe go in reverse. Let's assume that this is the ground truth, okay? This is the original system, but now you measure it in two samples. So you've got one sample here and one sample here. The result will be this [inaudible], right? But there will be some cells here and some cells here that were actually here, at the peak. You're missing that information now, but with MELD you can recover that.

Which ones are closer to 1 and 2 versus part of the way, exactly? Their genes are changing also?

Yes. You learn the manifold of the gene expression, and now you use that manifold to impute, to interpolate, this external data.

It's actually been shown you can have any p-value show anything, because everything is rare. 20,000 genes, anything. [inaudible] Where they look at this and say, well, maybe this is happening [inaudible] have more samples in between. [inaudible] convinced that this is what's going on, rather than [inaudible]? Have you tried [inaudible]?

Well, one thing you could do is this. You originally measured two conditions, you do MELD, and now you bin this into three bins, and now you have three conditions, and you can do the same things that we've done. We've validated it by doing that: we've left out a time point that was actually measured, we've melded on the two time points that are flanking it, and we can impute back the middle. So we've done validation, right? But we're not super big on statistics, because we were like, you know.

Maybe there should be [inaudible] generate some kind of null distribution, where you randomize things, and you could get a p-value. But it's very easy, overly easy, to get good p-values that way. [inaudible]

Then another project that we did that uses these same ideas: we realized that we have all this high-dimensional structure, but we don't have a very good way of visualizing this structure for exploratory purposes. In some cases we knew genes we were interested in, so we were visualizing biaxial plots; in other cases we visualized PCA. But it came to our attention that all of these methods leave out a lot from the data. Because we learned the whole manifold of the data, there should be some way of visualizing it that preserves structure a bit more, and this led to a project called PHATE, which is under revision now.

The idea of PHATE is that, because we have a natural understanding of the shape of the data, we can squeeze it into lower dimensions in a way that captures both local and global structure, and therefore trajectory structures in biology, but also separate cluster structures. We realized that not a lot of other visualization methods really preserve structure like this.

I'm going to show this both on artificial data and on data from an embryoid body differentiation system, where you start out with human embryonic stem cells and then differentiate them into a variety of different lineages over a 30-day time course.

This is artificial tree data, which my lab sometimes calls the dumb tree, but you'll find out that the dumb tree is harder than it looks, as you see here. If you take this dumb tree, add noise to it, rotate it in high dimensions, and try to visualize it with PCA, you're missing a lot of the structure in the dumb tree. It picked up the global directionality in some way, going from blue to green, but it's losing a lot of the branching structure.

t-SNE seems to be all the rage in this field, and I'm not sure, maybe in other fields too, but t-SNE actually shatters the data in a bad way, because it only cares about keeping stochastic neighbors together. We believe that if you don't keep just neighbors, but keep manifold-intrinsic relationships altogether, it's a better way of preserving the structure in the data. You see PHATE can preserve the data, but it can also denoise it, which is important: you see the lines that PHATE gives are thinner. It denoises in the manifold-intrinsic dimensions.

Here's PHATE shown on [inaudible], and we especially hope that a lot of people start using this, because it can basically work on any type of data. Here you see the embryonic stem cell differentiation system. The red, to the left, is all embryonic stem cells, because that's from days 0 to 3 and we started with embryonic stem cells. As the days go on, you create the lineages that are colored in orange, which we picked up on days 6 to 9, and then finally the blue, which is the latest data we generated, days 24 through 27.
Again, you see PCA kind of picks up at least the time label, and the fact that there are fewer cells, so there's less space occupied by the red than by the green and blue, because they're branching out into lineages. But the internal structure is not clarified.

t-SNE doesn't even do that, so it is very bad for preserving continuum structures. You see the red is shattered out to either side of the green, and the global relationships are all completely messed up. So can I trust this for really understanding the overall global structure of your data?

Diffusion maps is somewhat based on this; it's based on this concept, but it's an older method. It doesn't preserve the structure in a way that you can visualize. Basically, what it's doing is taking each branch, each trajectory or cluster, and putting it out in a different dimension, so you need to look at many, many diffusion dimensions to get at this, even though it has some of this information.

This is how the PHATE algorithm works. The first three steps are similar to the steps that you saw before: you have the data, you learn distances and affinities, and you power the matrix, although usually to a higher power, because you're visualizing in two or three dimensions. But then we use some information-theoretic techniques after that. We realized that we want to preserve the information in this matrix, information-theoretically, as much as possible in low dimensions. So we come up with an information-theoretic distance between these diffusion probabilities.

Then, again, we don't want to lose any information, so instead of just visualizing eigenvector 1 versus eigenvector 2, which is what people do in PCA, we squeeze all the variability into two dimensions by running metric MDS on these informational distances that we came up with. So we're making sure that we don't leave any information out there that we haven't captured, and that we're capturing as much as possible in our distances.

The idea of using an information-theoretic distance is that each point is a probability distribution.
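The PHATE steps described above (affinities, a powered diffusion matrix, a log-damped distance between diffusion probabilities, then metric MDS) can be sketched as follows. This is a minimal illustration under stated assumptions and not the published implementation. The negative log transform with a small `eps`, the Euclidean distance between transformed rows, the SMACOF iterations used for metric MDS, and all names and parameter values are choices made for this sketch.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0))

def stress(D, Y):
    """Raw stress: squared mismatch between target distances D and the embedding Y."""
    return np.sum((D - pairwise_dist(Y)) ** 2) / 2.0

def metric_mds(D, n_components=2, n_iter=50):
    """Metric MDS via SMACOF iterations, initialized with classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                       # double-centered squared distances
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    Y = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))  # classical MDS starting point
    for _ in range(n_iter):
        dist = pairwise_dist(Y)
        ratio = np.divide(D, dist, out=np.zeros_like(D), where=dist > 0)
        Bm = -ratio
        Bm[np.diag_indices(n)] = 0.0
        Bm[np.diag_indices(n)] = -Bm.sum(axis=1)
        Y = Bm @ Y / n                                # Guttman transform; never increases stress
    return Y

def phate_sketch(X, sigma=1.0, t=10, eps=1e-7):
    K = np.exp(-pairwise_dist(X) ** 2 / (2.0 * sigma ** 2))  # local affinities
    P = K / K.sum(axis=1, keepdims=True)                     # Markov normalization
    P_t = np.linalg.matrix_power(P, t)                       # t-step diffusion probabilities
    U = -np.log(P_t + eps)                                   # log damping of small probabilities
    D = pairwise_dist(U)                                     # distance between transformed rows
    return metric_mds(D, n_components=2)

# Toy data (assumed): three noisy branches leaving a common origin in 5-D.
rng = np.random.default_rng(2)
steps = np.linspace(0.0, 1.0, 40)[:, None]
dirs = rng.normal(size=(3, 5))
X = np.vstack([steps * d for d in dirs]) + 0.02 * rng.normal(size=(120, 5))

Y = phate_sketch(X, sigma=0.2, t=10)   # (120, 2) coordinates for plotting
```

The log is what makes small diffusion probabilities matter: two rows that differ only in their low-probability entries are nearly identical in raw probability space but clearly separated after the transform.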
If you just took a straight pointwise distance between these walking probabilities, it would be dominated by what's happening with the nearest neighbors, and it would look a lot more like t-SNE. But information-theoretic distances usually have a damping factor: you have a log, as you do in Shannon entropy. This damping factor makes the fold changes in low probabilities also get picked up by the distance.

And the idea, again, of using metric MDS instead of eigendecomposition is that we only get two or three, or however many dimensions you want, and it's doing the best job it can to preserve all this information in low dimensions. So that gives you PHATE.

Here again, these are ground-truth examples, where we generated the data that's in the first column, and we see what PHATE is giving us versus what the other methods are giving us. Sometimes the other methods do pretty well, on something like the Gaussian mixture model. Sometimes they again behave weirdly, like t-SNE putting the green and red in between the blue and the orangish color. Sometimes they're very noisy: Isomap can be very noisy because it uses geodesic distances, shortest-path distances, and shortest-path distances will take every shortcut they can through every outlier they can. You can also see things like force-directed layout, which just tries to splay things out, and you get what one of my students called "sneeze spots" from it, where it looks like the points were sneezed out. There's another method called Monocle that fits trees to data. It's not quite appropriate, but some reviewers asked us to compare against it.

And then you can imagine that it does the same thing to biological data. There are different types of data we visualize here. The first is our embryonic stem cell differentiation data. The second is retinal bipolar data.
The retinal bipolar data were mainly thought to have a lot of clusters, but we can see that the clusters have some progression within them. Then there's hematopoietic stem cell differentiation data, and there's iPSC reprogramming data. When panels are grayed out, it means those methods don't scale, so we had to subsample the data to begin with, which turns out to be a huge problem for most of these, whereas this kind of diffusion process can be made really fast using some tricks.

So this is a visualization of what PHATE is trying to do. It's trying to keep global structure, like PCA, but it's also clarifying local structure, because it can denoise in nonlinear directions and find nonlinear directions of dimensionality. So you get more fine-grained detail than you otherwise would, and you can zoom in on those fine details. I have a zoom-in slide, but we can show, for example, that the way this orange branch is splitting is actually real and biologically meaningful.

There are some biological conclusions about what the differences between those two retinal cell types are. So what can we do after we run PHATE? You can do a lot of different things. You can cluster on the PHATE dimensions. We compute intrinsic dimensionality to find branching points: if you have a lot of 1D trajectories in your data, branching points have higher dimension locally. We can use these branching points in conjunction with the clusters to come up with lineages, or regions of analysis.

We gave this kind of analysis to our collaborator at the Yale Stem Cell Center, and she was able to play with this tool and identify a lot of different lineages. She came up with this tree of lineages, and there are several intermediates here that she did not know about. So this is truly a tool for people to visualize their own data from their own domain, figure out what's in the data, and use it for further analysis. We were able to sort out some of the populations that she was interested in, and then she can FACS-sort them and figure out what to do with them next.

So the final project that I'll talk about kind of ties all of these things together.

Audience: It seems like there are two big things. One is using the diffusion [inaudible], and the other one is that you use the information divergence. So I'm wondering whether you have tried to ablate the two aspects and see which one matters.

Yeah. So actually all these steps are essential. It's not a random collection: each of these steps is essential for getting the end result. If you remove the diffusion, it doesn't work; if you remove the informational distance, it doesn't work. We've validated that every step is basically necessary. Do we still have that figure in the bioRxiv paper? No? There are always constraints when you submit.
Audience: I was wondering, because people use t-SNE for things like gene expression clustering, and there I consider one big difference to be that the cell count, the transcript count, is not noise. [inaudible]

It can be. I think t-SNE may be okay for some kinds of clustering, because it keeps local neighbors together, even if your data doesn't have that kind of cluster structure. It keeps them together at a very particular granularity, and that granularity is determined by the perplexity you set in t-SNE. So basically, at that level of granularity, it's splitting your data out into clusters, and I think that's an okay use case for t-SNE. But it is not an okay use case if you're trying to find the relationships between your populations, that is, which clusters are actually close to each other. That's also because the KL divergence penalty is very lopsided: it cares if you put close neighbors far apart, but there's no effective penalty for putting far things close, or in any configuration.

But what t-SNE does that PCA doesn't do, for example, is that it does squeeze variability into two dimensions. But it does it with stochastic gradient descent, and so that's a little random: if you run t-SNE many times, you'll get different results.

So, the next project I'll talk about. I've talked a lot about how really understanding the shape of the manifold helps you visualize, helps you figure out the differences between experimental conditions, and helps you restore transcripts. So can we do some of this with deep learning, which adds all the other features that deep learning has, like even more scalability, and the ability to do many things at the same time, because you have some sort of deep model through which you can process your data?

PHATE and MAGIC are both very scalable, and David worked a lot on making them super scalable, so I'm sure he wants me to mention that.
I'm sure you guys know that in deep learning you can farm it out to a bunch of GPUs or something. The more GPUs you have, the easier it is to parallelize this, but if you have a lot of CPUs, that's fine too.

So SAUCIE is an autoencoder-based framework, and that's what the "A" stands for. We think really hard about making these acronyms.

The idea of SAUCIE is to find these kinds of patterns emergently in a neural network, instead of using some kind of diffusion-based algorithm, based on constraints and regularization. What we're trying to do with SAUCIE is go to the realm where there are massive amounts of data. Jonathan was saying today that there are massive amounts of TCR data, for example. So if we have massive amounts of data, like single-cell data collected on many, many samples, we want to be able to run it through this system and read a lot of patterns off of it.

An autoencoder neural network is essentially a nonlinear dimensionality reduction, as long as you have nonlinear activations.

So these are some of the things that we designed SAUCIE to do. From the ways the different layers are regularized, we can read these patterns off the data. People always complain that you don't know what the inside of a neural network is doing, and that's not true in the case of our neural network. On the outside, it's doing one interesting thing, which is denoising data, but other than that it's just recreating its own input; that's what an autoencoder does. The internal layers are good for visualization, we can read clusters off one of the layers, we can batch-align or batch-normalize them, and then we can also recreate the output. So those are the different layers. And finally, it gives you the nonlinear dimensionality reduction on particular patients, which can give you stratifications of the patients themselves.

I guess you don't need an explanation of deep neural networks. The autoencoder was first introduced many, many years ago. I don't know when it was first introduced, but it was popularized in this Science paper in 2006. The idea is very simple: the only penalty that the neural network natively enforces is that the input be recreated at the output. This makes it unsupervised, since you don't need any labels, and it means that if you have reduced dimensions in the middle, there is some kind of dimensionality reduction going on.

So this is, more specifically, what SAUCIE looks like. SAUCIE has an MMD penalty, a penalty to align batches, enforced on its lowest-dimensional layer. The idea here is that you treat the batches as different distributions and align them. There's a layer that's two-dimensional that you can use for visualization.
There's a layer that has a special regularization, which we call information dimension regularization, that binarizes activations, or makes them easily binarizable, and there you can read clusters off the data. And finally, the output layer denoises the data, because it's trying to recreate the data after throwing away all these dimensions in the middle.

Audience: [inaudible] From the stuff that you showed earlier, it seems like the particular [inaudible] makes a lot of difference to the performance. Are you using a simple loss, or trying to use some probabilistic loss?

So SAUCIE is not stochastic at all. We have other stochastic neural networks. It is using a simple reconstruction loss, like mean squared error. But there are penalty terms due to the regularization, and they're all added to the loss. I can go over that; I think I have some slides on how I constrain it. We constrain it so that the layer is giving us...

[inaudible] Actually, also, I don't think the layer sizes are as shown here. I think it's a fan-out layer, and it's sparse.

For the visualization, again, you can compare to these other methods like PCA, t-SNE, and things like that: you just take the two-dimensional layer and visualize it.

Then there is the representation for the clusters. The idea is that if you have some cluster of cells, you want to be able to read off which cluster they're in, and the within-cluster variation is only stored in a small range on top of this binarization. It's still there, because the network recreates it, but the majority of the signal that these nodes are giving is which cluster a cell is in. So this particular cluster is encoded by basically 0 1 1, and a different cluster could be encoded by something different.

The way we do this is that we penalize activation entropy, as in von Neumann entropy. You take all of your activation values, treat them as a probability distribution, and compute Shannon entropy on it. If all your nodes are activated to roughly the same level, then you'll have high von Neumann entropy, if you can call it that. If one of them is one and a lot of them are zeros, or a few of them are ones and a lot of them are zeros, you have low activation entropy. So this gives you binarizable activations. With no regularization, you see a lot of things are activated at middle-to-high values. L1 regularization, which a lot of people use, just avoids the very high values, so they're pushed over, and you still have a lot of entropy there. But our regularization gives you a binarizable shape, so if you round it, or actually binarize it, you get cluster-level information rather than cell-level information from this.

And that's the main idea of the regularization on the layer that reads off the clusters.
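
The activation entropy described here can be illustrated with a small numerical sketch. It assumes non-negative activations (for example after a ReLU or sigmoid) that are normalized across the nodes of one layer; the function name `activation_entropy` and the `eps` guard are hypothetical, and this is not the SAUCIE code itself.

```python
import numpy as np

def activation_entropy(a, eps=1e-12):
    """Shannon entropy of non-negative activations, normalized across the
    nodes (last axis) so that they can be treated as a distribution."""
    a = np.asarray(a, dtype=float)
    p = a / (a.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=-1)

# All eight nodes active at the same level: maximal entropy, about log(8).
uniform = np.full(8, 0.5)
# A binary code with two nodes on (like the "0 1 1" cluster code): about log(2).
binary_code = np.array([0, 1, 1, 0, 0, 0, 0, 0], dtype=float)
# A single node on: entropy about 0.
one_hot = np.array([0, 0, 1, 0, 0, 0, 0, 0], dtype=float)

h_uniform = activation_entropy(uniform)
h_binary = activation_entropy(binary_code)
h_one_hot = activation_entropy(one_hot)
```

Adding this quantity to the loss as a penalty pushes a layer away from the uniform case and towards a few strongly active nodes, which is what makes the layer easy to binarize.
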
For the batch effects, this is an embedding of patient one versus patient two, and this is a dengue patient data set. We use a probabilistic distance that's been used a little bit in neural networks, but not too much, called maximal mean discrepancy (MMD). You can compute this at the batch level: you're constraining batch one to have minimal probabilistic distance to batch two. But the network still has to reconstruct and do these other things, so it's encouraged to minimally align the data and not totally scramble it. The red and the blue are different patients, so the embedding is mainly organized by patient, which we don't want. We want the batch effects, like when the patients were measured and at what time of day, all of that, to go away. After MMD, you see that it's organized by cell type, so now we can see how many CD3-positive cells both of these patients have, and things like that.
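
To make the batch-level distance concrete, here is a minimal sketch of a squared MMD estimate between two batches of embedded points. The Gaussian (RBF) kernel, the bandwidth `sigma`, and the function names are assumptions for illustration; the talk does not say which kernel is used.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel between every row of A and every row of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
batch_1 = rng.normal(size=(200, 2))                # e.g. patient one in the embedding
batch_2_aligned = rng.normal(size=(200, 2))        # same distribution as batch_1
batch_2_shifted = rng.normal(size=(200, 2)) + 2.0  # same shape, offset by a batch effect

small = mmd2(batch_1, batch_2_aligned)   # close to zero
large = mmd2(batch_1, batch_2_shifted)   # clearly larger
```

Used as a penalty on one layer, this value is small only when the two batches overlap as distributions, while the reconstruction term keeps the network from collapsing everything to a point to achieve that.
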

So we tried this on the dengue data. It was from NIMHANS in Bangalore, from a neurobiology department at NIMHANS. They gave us something like sixty patients, who visited two times, and then there's an additional number of people who came with them, and they're the healthy controls, living in the same environment. So there were 180 files, each of which has many, many cells. In total this data was 20 million cells, and you had to batch-normalize between 180 different files. The few batch normalization tools that are out there don't really normalize at this scale; they do two conditions or something like that.

You see what the cells look like if you combine all the acutes and the healthy versus the convalescent. The convalescent are the acute people who came back some weeks later, and their cells are starting to look a lot more like the healthy. It's just kind of a check, but you see that there's no batch effect: you can't see each of the acute patients separately or anything like that. It's a unified manifold, so on this unified manifold you can go ahead and cluster as if it's one data set, which wasn't possible before. There are other methods that will cluster totally separately and then try to match these clusters, and that turns out to be a huge problem. Instead, you can just cluster this data as if it's one data set, and you come up with these populations of cells.

We find that some of the populations of cells are very interesting. For example, there's the TCR gamma delta population, and we saw that a lot in the acutely infected patients. They're sort of like NK cells almost: they show signs of innate activity. This is a rare population, but we can still pick up a systematic difference. So then you can see these signatures of the different cell types.
You can see them for the different patients that are acute, convalescent, or healthy. And then you can embed again with the MMD-based distance. We see that the patients themselves are stratified, where the left side of the embedding is a lot of the acute patients and the right side is convalescent.

And you see people are convalescing to different extents, and things like that, and then you can associate this with the proportions of different clusters that you see in your clustering.

And of course the point that I was making is that you're running all of this through these neural networks that are tricked out, or whatever, so it should be very scalable, and we actually do find that it's very scalable. We've compared against k-means and PhenoGraph, which are clustering methods; t-SNE, which is sort of a clustering but also a visualization method; PHATE; PCA; a neural network completion method; MNN, which is a type of batch normalization; diffusion maps; all of these things. And it's super, super fast, faster than everything besides, I think it was, PCA, and that's because PCA is now done with random projection matrices really quickly. But PCA doesn't give you very good results, because you're randomly, linearly projecting and losing a lot of information. So quality-wise, SAUCIE is way better than any kind of PCA, because it does so many other things and it's a nonlinear method.

Finally, I added one slide just because Jonathan was talking a lot about dynamics. We have a neural network that we're also training to predict dynamics in these things, and this network is generative and stochastic. We're calling this the transcoder. Because PHATE and these other methods allow us to pick up trajectories, we're hoping to take these trajectories and find some deep representation of the logic that's used to make these transitions. But that's an ongoing project. We're hoping to test this on some kinds of evolution systems, protein folding systems, and things like that.

Those are the main projects that I wanted to talk about.
Again, the main idea is that we pick up a lot of structure that could otherwise go missing in data by using very careful nonlinear dimensionality reduction and manifold learning methods, using graph signal processing and also deep learning. We believe that these have wide applicability, much beyond the specific data sets that I was showing them on.

I also have a lot of talented students and postdocs in my lab. [inaudible] They were doing a hackathon then, so I just took pictures of them at the hackathon, and actually I think they were having a lot of fun, you know. It looks funny. And of course we collaborate with a lot of people at Yale, including immunologists, infectious disease people, stem cell people, neurologists, and the Yale Center for Genome Analysis.

All the software I've talked about is on GitHub, maybe besides the transcoder. Papers are on bioRxiv. Some of the papers on related topics are in NIPS and ICML; a lot of them are on bioRxiv also, or on the regular arXiv. You can feel free to contact me.

So SAUCIE is just in Python. Yeah, we noticed that we have to [inaudible], even though nobody in my lab likes or uses R. [inaudible] Let me make someone do it. Are you an R fan? Okay. What do you guys code in? [inaudible] If you have any questions, let me know. We have additional slides on [inaudible].

Audience: So you showed the scalability gain. With things like MAGIC and PHATE, you need to think about the pairwise distances, and that's where [inaudible].

Yeah, that's exactly where [inaudible]. I do have a slide that kind of shows that.

Audience: A related question: can you imagine using a neural net to do the same thing?

A neural net could do the same thing. I'll show you. SAUCIE doesn't do this explicitly. The last layer of SAUCIE actually gives you the same kind of reconstructions as MAGIC, and it most closely compares to MAGIC, more so than to the other imputation methods. This is an actual figure in the SAUCIE manuscript. It shows that this neural network also learns the manifold, without diffusion and without us having to define [inaudible].

Audience: [inaudible] In NLP we have a very similar thing, where you have a knowledge graph, like a large graph of the relations between entities, and traditionally people use random walks to compute the diffusion [inaudible].

You mean like word vectors, something like that?

Audience: Yeah. Nowadays we embed it into an embedding.
And then you can compute any arbitrary thing without doing any random walk, because a random walk is very expensive.

Right, yeah. So here you can talk about this slide.

So instead of computing all paths between points, we constrain the paths to go through specific landmarks. We fix the number of landmarks, but we pick enough of them to accurately describe the graph. So now the diffusion goes from the original points to the landmark points and back to the original points. If you do the matrix [inaudible] correctly, the landmarks are chosen to be sort of uniformly representative of the whole graph, so you lose very little information, but we can still exclude the vast majority of paths.

Audience: So is it literally a random sample there?

We actually use a spectral clustering approach to [inaudible], and a fast randomized SVD, so that's also fast. Essentially the whole thing is linear. It's a little bit like a random projection: we do random projections to pick the landmarks, and then we compute paths that go through them. And it gives almost exactly the same result as the original version.

You can look up PHATE either on bioRxiv or on Cell Sneak Peek, and you'll see these comparisons, and thousands of supplemental figures.

Audience: So the spike in the run time there, is that when you're not using as many landmarks?

Yes, exactly, that's a very good observation. We use, I think by default, something like 2,000 landmarks, so when you have fewer points than that number, there's no point in doing this. Only with really large numbers does it help. [inaudible]

The other motivation for SAUCIE isn't just scalability. It's also that all these tasks are implemented on some kind of unified representation of the data.
We find that some clustering methods might represent the data as a graph or some kind of cloud, and then you have a batch normalization method that represents the data as canonical correlation dimensions or something. That's very disparate, and the different features you're reading off the data are not made from the same representation. If you run t-SNE on your data, there's no guarantee that t-SNE picks up the same clusters as k-means, because they're just different algorithms with different representations. That was one of the reasons. When you do stochastic gradient descent, you pass the gradient all the way back, so all the layers are coordinated. Here we have one representation, so it kind of guarantees that you get the same clusters as your visualization.

Audience: So in that clustering layer, where you use the information measure as a constraint, is that going to effectively push you towards picking a single one?

It pushes the nodes to become binary, sort of digital: either zeros or ones.

Audience: But isn't it an even stronger force than that, towards a single one?

Yeah. Another option would be L1, so sparsity. Sparsity would just try to make as many nodes as possible go to zero. We're not doing that; we want the nodes to be zeros or ones, so you want it binary, digital.

Audience: So the entropy is being computed on each node independently, so each node [inaudible] becomes a probability?

It's Shannon entropy. I'm calling it von Neumann entropy because it's like an entropy of values rather than probabilities. That's why I was calling it that.

Audience: Is it across the nodes?

It's across the nodes, yeah. But we don't see that it gives us a single one, because it's not necessarily a sparsity penalty. Usually the neural network will not make it one-hot, given that it still has to reconstruct, so usually the vectors have, whatever, three or four ones.

But we don't usually do that. With one-hot [inaudible], you have to pick the number of nodes, which will give you the number of clusters, and in this case, actually, we don't know the number of clusters. We don't have to set the number of clusters we get. But we can tweak the regularizer, tweak the granularity, by upping this penalty or downing this penalty. We don't know exactly how many we're going to get; it's not like telling k-means "give us ten clusters" or something.

Audience: So how do you [inaudible] the coefficient of regularization?

That's sort of variable. It depends on whether you want really big clusters, like T cells versus B cells versus monocytes, or whether you want to go inside the T cells and see the different subtypes. It's pretty robust: we see that when we change it, the biology is pretty similar.
Or with whatever test cases we have, like a few mixtures or whatever, this is pretty robust against those.

Cool, thank you guys.

*2018-09-28 13:17*

__Comments:__

*2018-09-30 08:26*

I find these techniques very interesting.

*2018-10-02 06:27*

28:33 "But we're not super big on statistics..." - "Doctors are" "Maybe they shouldn't be" Snap!

*2018-10-03 00:09*

yeah, weird!

*2018-10-03 21:04*

Most viewed of those 1hr videos

*2018-09-29 07:27*

Amazing stuff!