Richard Jerousek and I.
He is in love with Saturn’s Rings
A short Research Biography
A bit of technical background: I am an AI research scientist, and I have worked on a variety of areas, from theory to practice and applications of AI. On the implementation side, my work has ranged from CUDA-level ambitions of changing the Conv layer, to creating order statistics on pooling layers, to implementing probabilistic auto-differentiation platforms. There, that was my self-promotion.
I failed my statistics course in undergrad, but somehow I ended up deep in Probability Theory and Information Theory in my studies. Another one of those academia stories. I didn't choose to study AI; it just grew on me. At some point, I formed an unhealthy obsession with interpreting even viruses as torn pieces of a book called “The celestial condition of planet earth”. But I guess, in hindsight, it was the question of unknownness that really dragged me there. And I will use this word, hindsight, a lot.
In AI, one of my dearest research paths was the compression of bit strings for inference. I researched that for a while. The premise was to look at a collection of bit strings and construct invertible boolean functions that make the bits statistically independent. Doing so decreases the sum of marginal entropies: the entropy of the first bit plus the entropy of the second bit, and so on, goes down as the bits become independent. An invertible function leaves the joint entropy unchanged, and the sum of marginal entropies can never fall below the joint entropy, reaching it exactly when the bits are independent.
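To make the entropy claim concrete, here is a minimal sketch on a toy case of my own choosing, not the method from the research itself. Two perfectly correlated bits are passed through the invertible map (a, b) -> (a, a XOR b). The function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def bit_entropy(p_one):
    """Entropy in bits of a single binary variable with P(bit = 1) = p_one."""
    if p_one in (0.0, 1.0):
        return 0.0
    return -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))


def sum_marginal_entropies(samples):
    """Sum of the empirical entropies of each bit position."""
    n = len(samples)
    width = len(samples[0])
    total = 0.0
    for i in range(width):
        p_one = sum(s[i] for s in samples) / n
        total += bit_entropy(p_one)
    return total


def joint_entropy(samples):
    """Empirical entropy in bits of the whole bit string."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def xor_map(s):
    """Invertible boolean map (a, b) -> (a, a XOR b); it is its own inverse."""
    a, b = s
    return (a, a ^ b)


# Toy data (hypothetical): the second bit always copies the first.
samples = [(0, 0), (1, 1)] * 50
mapped = [xor_map(s) for s in samples]

print("sum of marginals before:", sum_marginal_entropies(samples))
print("sum of marginals after: ", sum_marginal_entropies(mapped))
print("joint entropy before/after:", joint_entropy(samples), joint_entropy(mapped))
```

In this toy case the joint entropy stays the same under the map, while the sum of marginal entropies drops to meet it, because the second bit becomes constant and therefore independent of the first.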
I worked on that, and I ran into the curse of dimensionality in the binary case. I could compress but couldn’t generalize. But in hindsight, all hope is not lost.
I moved on from that project. In parallel, I was working on discreteness in neural networks: not thresholding values, but interpreting values as unnormalized log-probabilities over a finite number of states. It led to this paper, Rediscovering CNNs Through Classification of Finite State Distributions. Again in hindsight, I don’t fully approve of the approach discussed in the paper. I think it is a valuable starting point for getting out of the curse of dimensionality that already exists even in a single real-valued scalar: the uncountable infinity.
Ideologically, I now prefer to work theoretically with exponentially large but finite state spaces rather than get lost in the uncountable infinity of real numbers with the hope of locality and smoothness. I don’t have anything against smoothness, though; I can interpret it in the sense of forgetting. The number 4.2432 is not a point, it’s a region, because we dropped the infinite trailing 0’s. And in calculus class, every teacher with a marker shows a point on the whiteboard as a region as thick as the marker. Anyway.
Working on finite state representations, I started experimenting with using only probabilistic binary variables inside CNNs, and derived a way to optimize the probability distributions governing the binary random variables. As expected, the problem of sampling from conditional distributions makes the optimization very slow, but there are workarounds that I will eventually write about in more detail on this website. At the time, I had so many parallel interests and lines of research, and one of them was the question of priors. Edwin T. Jaynes was someone who influenced me from the beginning of my research. The consistent issue that I faced in optimizing probabilistic systems was that the probability distribution of the hidden variables would tend to degeneracy, and the optimization would stop. This problem could have been caused either by incompleteness of the model family, meaning the model was not a universal approximator, or by the regularization.
So there should have been a way to resolve that. In parallel, “the prior” research helped me.
I remember that in 2015 I was talking to George, whom I call my applied mathematics professor, about rejection sampling, and I was wondering, just to play, what the distribution of the rejected samples would be. I eventually wrote a report that basically derived the distribution of the rejected samples, and I called it the complement distribution. I thought that was the interesting finding, but it was the rate of rejection that was interesting.
So, facing my optimization problem and having that research somewhere in the cabinet of my thoughts, one day I realized that 1 - (rejection rate) is the probability of the model itself. So I revisited it, and with that line of thought the Maximum Probability Theorem, as a name, was born to me.
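One way to picture the relation between the rejection rate and a probability mass is the toy sketch below. It is only an illustration under my own assumptions, not the report or the theorem: the "model" is taken to be a set (a quarter disc), the proposal is uniform on the unit square, and the names in_model_set and rejection_run are hypothetical.

```python
import math
import random


def in_model_set(x, y):
    """Hypothetical 'model as a set': the quarter disc inside the unit square."""
    return x * x + y * y <= 1.0


def rejection_run(n, seed=0):
    """Draw n uniform points on the unit square and split them into
    accepted (inside the set) and rejected (outside the set)."""
    rng = random.Random(seed)
    accepted, rejected = [], []
    for _ in range(n):
        point = (rng.random(), rng.random())
        if in_model_set(*point):
            accepted.append(point)
        else:
            rejected.append(point)
    return accepted, rejected


n_draws = 20000
accepted, rejected = rejection_run(n_draws)
rejection_rate = len(rejected) / n_draws
mass_estimate = 1.0 - rejection_rate  # estimated probability mass of the set

print("rejection rate:", rejection_rate)
print("1 - rejection rate:", mass_estimate)
print("exact mass of the quarter disc:", math.pi / 4)
```

Here the rejected points are samples from the proposal restricted to the outside of the set, which is one reading of a complement distribution, and 1 - (rejection rate) estimates the mass the proposal assigns to the set.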
I remember I was so happy that I ran to the lab of Alireza, my research buddy, and told him about my finding, which was a review of probabilistic machine learning in which models are sets, and they have a probability mass, not a density, no matter what the parameterization is. And at that moment, or maybe a day later, I remember thinking about all the times that I smoked my cigarettes in the staircase behind the department, my head painfully steaming, thinking in loops about what the prior over the parameters of a model would be, and that finally I had found a direction to it. Such a long sentence, right?
That is maybe a flavor of how I romanticized research. In summary, I have worked on generalization, optimization and model building. And to this day, I think there is a way to look at bit strings, and there is hope of using our computation efficiently. Or of no longer deploying AI so fast, before reaching at least some rigor about generalization. Not approximating bounds, but finding the best we could do for a guarantee. And also of appreciating the subjectivity that any algorithm induces, usually on a complex society, and of understanding where to use this machinery and where not to. We should have learnt some lessons from testing The Bomb in the atmosphere. And maybe the solution to our untamed rationality is out there; maybe someone wrote an article that got missed in the academic publication industry. I hope to see that some day.
Now, to lighten the mood, I brought one of my most groundbreaking findings.
Info Theory Entertainment: How to Classify Images, Using Only Mouse and Windows 98
This is a comic demo of my obsession with information theory.
Gather a training set of cat images and dog images. Create a folder called cats, and create a folder called dogs.
Put all the cat images in the cats folder and all the dog images in the dogs folder. Right click on the folders and send both to a compressed format, say ZIP. Right click on the zip files and write down the size of each zipped folder.
When a test image comes in, put the image in the cats folder, zip it, and find how much the size of the zip file increases. Do the same with the dogs folder. Whichever zip file grows the least gives the predicted class.
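For those without a mouse or Windows 98, the recipe above can be roughly sketched in code. This is a sketch under assumptions, not the exact experiment: it uses Python's zlib (DEFLATE, the compression method commonly used inside ZIP files) on in-memory bytes, and it concatenates each class into a single stream, whereas a real ZIP archive compresses each file separately. The function names and the toy text "images" are hypothetical stand-ins.

```python
import zlib


def compressed_size(data):
    """Size in bytes of the DEFLATE-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def size_increase(corpus, sample):
    """How many bytes the compressed class corpus grows when the sample is appended."""
    return compressed_size(corpus + sample) - compressed_size(corpus)


def classify(sample, class_corpora):
    """Predict the class whose compressed size grows the least."""
    return min(class_corpora, key=lambda label: size_increase(class_corpora[label], sample))


# Toy stand-ins for the two folders (hypothetical data, text instead of images).
class_corpora = {
    "cats": b"the cat purrs and meows at the kitten by the window. " * 10,
    "dogs": b"the dog barks and growls at the puppy in the garden. " * 10,
}

test_sample = b"the cat purrs and meows at the kitten"
prediction = classify(test_sample, class_corpora)
print("predicted class:", prediction)
```

The toy strings are far more repetitive than pixel data, so this only shows the mechanics. DEFLATE looks back over a window of 32 KB, so anything in the class corpus farther back than that cannot help compress the test sample.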
And in my spare time, I tried it on CIFAR-10. It has an accuracy of 15 percent, against 10 percent for random guessing over ten classes, which was amazing for me :) If you are familiar with the ZIP algorithm and the window sizes in Lempel-Ziv-like algorithms, you can see why 15 percent is surprisingly good for image data.
To be continued, Author at work ….