On 12/18/2021, 10:24:28 PM
This semester was pretty crazy. It's hard to tell whether it was actually harder than previous semesters; it may be that every semester seems harder than the one before. There were a lot of fun things going on, and only now do I have time to catch my breath and compile them into one place.
These school projects can all be found under the Projects page along with an abstract, but I'll emphasize this semester's projects and the context behind them here [1].
The first class was an independent study on (formal) program analysis. We followed the CMU Program Analysis textbook [2]. The first half of the textbook was focused on (static) dataflow analysis, and the latter half was a survey of high-level topics such as Hoare logic, SMT solvers, symbolic execution, concolic testing, and fuzz testing. We'll revisit dataflow analysis in a later project.
For the final project, we decided to veer off on a completely different track towards interactive theorem provers. When searching for a topic for my master's, formal verification and the Coq interactive theorem prover often came up. In my current master's group, the interactive theorem prover Agda is used instead. I wanted to figure out what "interactive theorem proving" really meant, so I made achieving this the goal of the final project. There are a few resources to learn Agda, none of which are for the faint-of-heart: we mainly used Swansea University's "Interactive theorem proving for Agda users" lecture notes. At a high level, what we found is that Agda can be seen as a normal functional programming language, and can be used as such -- superficially, the major differences are that its type system is more advanced [3] and that we interpret types as propositions in logic. We then end up programming logic, and using the type system to (type-)check our proofs. It is difficult to define Agda satisfactorily in a few lines [4], so I will save it for a later post -- suffice it to say that the correspondence between theorem proving and programming is both straightforward and mind-blowing at the same time.
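To give a small taste of that correspondence, here is a tiny proof written in Lean (a close relative of Agda) -- this is purely my own illustration, not anything from the course:

    -- A proposition is a type, and a proof is a program of that type.
    -- Here "p ∧ q" behaves like a pair, so the proof that "and" commutes is
    -- literally the program that swaps the two components of a pair.
    theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
      fun h => ⟨h.right, h.left⟩

The type checker verifying that this term has the stated type is exactly what "checking the proof" means.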
Originally, for the final project, we intended to implement and prove the correctness of some data structure, such as red-black trees, in Agda. However, this proved too difficult in the limited time frame, and we chose to write up and explain the intuition behind several simple Agda samples.
Everything was very new in this class, and I was fascinated every week. I think I really just like algebraic reasoning, and the freedom of independent studies [5].
For the senior project, we settled on a problem that doesn't seem to attract much interest nowadays: image vectorization. I didn't have much time to work on this throughout the semester, but I was lucky to have two team members who were able to do most of the complex implementation. The original inspiration behind this project was video compression: I had wanted to create a new vector video format, hopefully to encourage compression in highly-geometric videos (for which vector graphics are efficient), variable-resolution videos (large monotone areas like backgrounds can be encoded efficiently with few shapes), or videos with only small changes between frames. Our advisor is interested in machine learning and architecture, so we decided to shape our project more along those lines. We figured that vectorizing highly-geometric images such as architectural photos may allow us to compress the original image into a much higher-information-density form (i.e., one made up of meaningful shapes); the result may be useful simply as a vector format that's easily manipulable, or as the input to a machine learning system (i.e., as a preprocessing step).
We played around with a few different known methods for vectorization, and found that a sampling method was most suitable for complex images. At a very high level, we first convolve pixel-wise with Sobel filters to find the color gradient at each pixel, and interpret a higher gradient as a higher "importance" or detail density. Then, we randomly sample points throughout the image, where the local point density is directly correlated with the importance (i.e., areas with more detail are more highly sampled -- this gives us an automatic variable resolution). The points are then triangulated so that we have a vector format (shapes), and exported to a vector image format (such as SVG). Of course, we still have one more semester to continue this project, so there are still many more details to mess around with and refine, such as optimizing the mesh and evaluation (currently we are using content loss as the evaluation method, and interpreting the vectorization process as effectively a style filter).
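To make the pipeline concrete, here is a rough Python sketch of the sampling and triangulation steps. This is only an illustration using numpy/scipy -- the function names and parameters are made up, and it is not our actual implementation:

    import numpy as np
    from scipy.ndimage import sobel
    from scipy.spatial import Delaunay

    def importance_sample(gray, n_points=2000, seed=0):
        # gray: 2D float array. The gradient magnitude from horizontal and
        # vertical Sobel filters acts as an "importance" (detail-density) map.
        gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
        importance = np.hypot(gx, gy).ravel() + 1e-8   # avoid all-zero weights
        p = importance / importance.sum()
        # Sample pixel indices so that areas with more detail get more points.
        rng = np.random.default_rng(seed)
        idx = rng.choice(gray.size, size=n_points, replace=False, p=p)
        ys, xs = np.unravel_index(idx, gray.shape)
        return np.column_stack([xs, ys])

    def triangulate(points):
        # Delaunay triangulation turns the point cloud into a mesh of triangles,
        # which can then be filled with sampled colors and exported to SVG.
        return points[Delaunay(points).simplices]   # shape: (n_triangles, 3, 2)

Our actual pipeline layers mesh optimization and color assignment on top of steps like these.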
The implementation languages are C++ and Python. There were some annoyances with both languages [6], but things mostly worked out fine. The current implementation can be found here. See also the report and presentation.
I spent far more time doing projects for my databases course than I had anticipated, which is probably mostly due to stupidly going overboard. In hindsight, it was fun and I learned useful things, but it took a lot of time away from other classes at the time. Code and write-ups for all three projects can be found here.
The first project involved using an object-relational mapper (ORM) library to interface with a SQL database, to write and perform several queries on a given toy schema, and to extend the schema with some reasonable extension. I chose to do this project in Haskell (stupidly), using the Haskell beam library. The second project was to scrape some data source from the Internet, store the results in a MongoDB database in some natural representation, and then perform some sample queries on it. I chose to scrape eBay for various information about sellers and listings, and the implementation language was Common Lisp (SBCL). A brief summary of these projects is that I spent far longer learning how to use Haskell and SBCL and the specifics of their libraries than I spent learning the database technologies. What was especially frustrating (and thus especially instructive) is that the various libraries all had some bugs (quirks?) for which I had to write wrapper code, either because the library was incomplete or because it was outdated. For example, the beam library incorrectly restricted some valid subqueries due to its use of extensively-parameterized query typing, and the web scraping libraries I used in SBCL included the contents of comment, style, and script tags as text, which is not the default behavior in modern browsers.
As with every other class this semester, we were pretty much given free rein for the final project. Every other group created a web application using a database -- we decided to create a DBMS instead, to learn about the intricacies of the implementation. Initially, we proposed a relational DBMS (RDBMS), but were persuaded to attempt NoSQL. We decided to imitate MongoDB, and created CUDB ("Cooper Union DataBase," or "CUDA++"). It was a challenging project with many difficult design decisions (especially those related to three-valued logic and to restricting comparisons across values of different types -- gradual typing is hard), but the end result was satisfying. The implementation language was Rust, which made for a very pleasant experience.
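To illustrate the kind of semantics we had to pin down, here is a toy Python sketch of three-valued comparison over untyped, Mongo-style documents. The names and details are made up for illustration; CUDB itself is written in Rust and works differently in the particulars:

    def lt(doc, field, value):
        # Returns True, False, or None ("unknown"): a missing field or a
        # comparison across different types yields unknown rather than an error.
        if field not in doc:
            return None
        x = doc[field]
        if type(x) is not type(value):
            return None
        return x < value

    def find_lt(docs, field, value):
        # A filter keeps only documents whose predicate is definitely True;
        # unknowns are silently dropped, much like SQL's WHERE under NULLs.
        return [d for d in docs if lt(d, field, value) is True]

    print(find_lt([{"qty": 3}, {"qty": "3"}, {"name": "x"}], "qty", 10))
    # -> [{'qty': 3}]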
Databases was unexpectedly time-consuming, but cybersecurity was expectedly so [7].
We had three labs, each of which took approximately one day: MIT 6.858 lab 1 (buffer overflows), 6.858 lab 4 (browser security), and cryptopals set 1. I loved all three of the labs -- each was set up to be just easy enough to "hack" without being artificial. The only qualms I had were with lab 4 and its use of shadowjs, an outdated Javascript engine with poor error reporting. For lab 1, we had to use C; for lab 4, we had to use Javascript; and for the cryptopals challenge, I used Golang.
For the final project, we were motivated by the first lab to write a static analyzer for buffer overflows. The idea is very simple: track all possible buffer origins (fixed-size local arrays) that any pointer may refer to, and warn if there is a possible size mismatch at calls to a list of predefined dangerous functions, most notably strcpy and strcat. Of course, this is a very oversimplified model of buffer overflows, and uses of unsafe functions like strcpy and strcat are already warned about by any modern C compiler, but it was quite instructive to put the ideas from the program analysis course into play and develop a custom dataflow analysis that suits our needs. The source code, report, and presentation can be found on the GitHub repo.
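The core idea can be sketched in a few lines of Python over a toy program representation. Our actual analyzer is written in C++ against LLVM, so everything below (the statement types, names, and structure) is made up purely for illustration:

    from typing import NamedTuple

    class Decl(NamedTuple):    # char name[size];
        name: str
        size: int

    class Assign(NamedTuple):  # dst = src;  (dst may now point to anything src does)
        dst: str
        src: str

    class Call(NamedTuple):    # func(dst, src), e.g. strcpy(dst, src)
        func: str
        dst: str
        src: str

    DANGEROUS = {"strcpy", "strcat"}

    def analyze(stmts):
        points_to = {}   # variable -> set of (buffer name, buffer size) it may refer to
        warnings = []
        for s in stmts:
            if isinstance(s, Decl):
                points_to[s.name] = {(s.name, s.size)}
            elif isinstance(s, Assign):
                points_to.setdefault(s.dst, set()).update(points_to.get(s.src, set()))
            elif isinstance(s, Call) and s.func in DANGEROUS:
                for dst_buf, dst_size in points_to.get(s.dst, set()):
                    for src_buf, src_size in points_to.get(s.src, set()):
                        if dst_size < src_size:
                            warnings.append(f"{s.func}: {dst_buf}[{dst_size}] may "
                                            f"overflow from {src_buf}[{src_size}]")
        return warnings

    stmts = [Decl("small", 8), Decl("big", 64),
             Assign("p", "small"), Call("strcpy", "p", "big")]
    print(analyze(stmts))   # flags the possible overflow of small[8]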
I took a class on Sports Medicine and Biomechanics. It's extremely practical for anyone who exercises. Mostly, it felt like high school wellness classes (at least the physical activity and nutrition parts), but in much more (and much more interesting) detail. We got to visit the NISMAT lab.
The other nontechnical course was Polar Imagination: an exploration of the literature and history of polar exploration. Somehow, I ended up writing about the concept of death among the Inuit, based on a series of anthropological studies -- what might be called the suicidal imagination within this polar imagination. See the paper and the slides.
My master's project was in a state of chaos this entire semester, pushed aside by the more immediate deadlines of regular coursework and job applications. I am severely behind and need to make up a lot of work over winter break to get back on track. One of the major problems was not finding a project or topic to latch onto until mid-semester, by which point things had already gotten too busy to do any real work. Another was overcommitting on coursework this semester. The third was being stupid with the amount of time I put into some relatively-less-important projects (read: databases).
I had begun looking into topics by the beginning of the summer, simultaneously for my master's project and for Ph.D. applications. (I later dropped the latter and switched to job applications.) Originally, I was looking at operating systems and distributed systems, and then compiler optimizations. Looking at research groups at universities, it seemed like there wasn't much recent research in operating systems or distributed systems (perhaps more in cybersecurity or verifiable programmable networks). Eventually I started to look at programming language research, which seemed to be highly focused on safety through language design (i.e., through DSLs) and on formal verification of programs.
This was where I was at the beginning of the semester, at which point I looked at some work in programming education at the suggestion of a professor. For some weeks I was thinking about something called Intentional Programming, which I have an earlier blog post about. Suffice it to say, after some weeks staring at this very abstract problem, I came to the same realization as everyone else who had tackled it: metaprogramming is harder than regular programming (for most programmers).
This led me to re-search for a research topic, and I came across the Hazel project by skimming the POPL conference papers. Hazel is an interactive environment used to demonstrate the Hazelnut bidirectional edit calculus. This edit calculus is a form of structured editing, which enforces that your code is always in a meaningful state, while maintaining a low viscosity. This ties in to intentional programming (both rely on structured editing and play a role in programming education) as well as the user-interface work I did when interning at MathWorks this summer.
I have yet to narrow down my project, and I will definitely write more about it as that progresses.
There's not much to say about computer center work. That work is just regular work: keep order in the computer center and answer calls. Clock in, clock out.
I was quite involved with tutoring activities, and I was forced to think about many things: some purely subject-related (how do I better explain big-omega/big-O as opposed to worst-case/average-case/best-case analysis?); some related to the meta of programming (how should concepts like loops and pointers be taught, intuitively?); and some general advice for freshmen. One of the problems that I'm trying hardest to tackle is that of students who copy code and see no issue with it, or who dismiss code as esoteric and unnecessary because their major is not electrical engineering or computer science. At some point, it appears as though there is no effort to grok code, even at the simplest level -- only questions of memorization are answered correctly, and questions that require an understanding of the fundamentals are met with blankness. I'm afraid to say this, because I'm sure students are trying their best, but that is how it appears for a number of students. How can an educator overcome this mental barrier? Is it natural that this level of understanding only comes around on second exposure, after having had some time away from the material?
Leetcode grinding blah blah blah. Weekly contests are pretty fun. I can usually solve most medium questions in half an hour or less, although hard questions are mostly hit or miss.
I made a repo to store some representative or otherwise significant data structures and algorithms problems. The build system is fun: it uses Makefiles to create a simple build system with dependencies. Each folder acts as a "package" that may require other packages. This allows for fast rebuilding of a package and all its dependencies (recursively [8]). I plan to write about this and the Makefile atrocities it required in some future post. It is also very easy to add examples to extend the repo: creating a new package with dependencies only involves adding a _depends.txt file that lists them.
I wasn't able to do much outside of schoolwork, so the only hobbies had to be related to the act of schoolwork-ing. Mostly, this meant some periodic typing races to let off steam. At some point, I received a Redragon K552 with red switches, and it's the first mechanical keyboard I've been able to mess around with. I'm still not used to the feel of it -- I will probably save these typing shenanigans for another time.
This semester I picked up two computers with broken screens that were to be thrown out from the computer center, a Dell Latitude E6420 and a Lenovo Flex 3 1120. I've already written about the former -- it's fairly fast and the keyboard is comfortable (much like the Latitude E5420, except that this new laptop is a little crustier). The Lenovo Flex is fairly slow (read: 2014 Celeron), and I haven't messed around with it much yet. Its build is solid, and its keys have a very shallow keypress. Disassembling the screen assembly on both has been an invaluable experience.
I wanted to do a lot more reading and writing, but haven't found time until the break. Even with break, I don't think I'll have that much time, so I'll limit it to mostly writing for now.
1. One of the themes you'll see is a wide diversity of programming languages. This was unintentional at first, but then it became a game to use a different programming language for each project. I only failed with the cybersecurity final project, in which we had to revert from using OCaml to C++ due to the outdated LLVM bindings. The full list of languages is: C (cybersecurity), C++ (cybersecurity and senior projects), Python (senior projects), Golang (cybersecurity), Javascript (cybersecurity), Rust (databases), Haskell (databases), SBCL (databases), Agda (program analysis), and OCaml/Reason (master's).
2. A work in progress: the current draft can (currently) be found here.
3. Notably, it has dependent types, which can be parameterized on values as well as other types.
4. Or at least not at my current level of understanding. I hope that any topic is distillable to a few short sentences once you've spent enough time digesting it.
5. Next semester I will be doing an independent study on functional compilers, which I am also looking forward to!
6. We learned that the order of options amongst positional arguments matters in g++ (at least in the version that we were using). And using Python projects by data scientists is bleh -- it's too much of a pain to get other people's reference implementations up and running on your machine when they only run on theirs.
7. Readings every week, quizzes on the readings every week, a nontrivial midterm, a final project, three major labs, and much nontrivial material.
8. Although dependency loops are not detected -- this burden is left to the programmer.