Jonathan Lam

Core Developer @ Hudson River Trading



On code reuse and copying

On 10/3/2021, 7:54:52 PM



tl;dr: cite and understand any copied code

These are my personal views on how to properly reuse (copy and potentially modify) code from external sources. This is not a legal document and may not be precise; the ideas are presented for the typical CS student who does not care for those subtleties. The guidelines may seem fairly obvious if you are accustomed to following them, but too many students still do not (at Cooper and elsewhere). The main issue at hand is academic integrity, but there are also potential legal consequences.

Properly reusing code is one of three skills whose importance is not taught in the introductory CS courses at Cooper, even though each is simple and methodical to carry out. (On the other hand, Cooper is great at teaching very technical skills.) The other two are how to ask a good question (e.g., on a Stack Exchange forum) and the skill of consistent implicit documentation (ranging from comments to error messages to commit descriptions).

I was recently reminded of this while tutoring CS102 students, but it applies to all levels of computer science. Uncited code is understandable at the freshman level, because software is a medium unlike the other media students have worked with before. Because of its ephemerality and the ubiquity of pre-built code snippets online, new CS students may be unaccustomed to the proper methods of citing code. However, I still see uncited code written by juniors and seniors, who (IMO) should be expected to know this skill. Seeing this in my fellow upperclassmen's code is a huge pet peeve of mine.

Understanding code reuse

It is important to know when code reuse is acceptable. The ubiquity of code snippets online (and the advent of the open-source mentality) is a wonderful tool that allows for the exponential growth of computer software. We also reuse code all the time in the form of specialized libraries, which are desirable because they provide a well-defined interface (API) with which to interact. However, sometimes we find isolated code snippets outside of libraries and need to include those in our code, and so we copy them. Reusing (good) code allows people to learn new software technologies much more quickly than before and to combine the best software to produce even better software; but it may also overstep legal bounds and undermine academic goals.

Most of the time, copying a trivial piece of code (e.g., a particular piece of syntax or standard library function usage) is fine. Programming languages are complicated and it's easy to forget their details. These details are part of the official documentation of the language and can be treated like facts that you look up (rather than someone's intellectual work).

Copying non-trivial code is not acceptable when it achieves the main goal of an academic assignment, or when it is not properly cited (or otherwise fails to meet the legal obligations of the software license). Clearly, if the assignment is to write an atoi implementation in C, you cannot use the built-in implementation, or any other implementation whose source code you find online.

Cite the copied code

If it was not clear before, I hope it is not difficult to believe that software is intellectual work, and is protected automatically by copyright like any other medium. While much software is available for anyone to use under the terms of its license, there are often restrictions associated with that use. Different software licenses, even "open source" or "copyleft" licenses, place varying requirements on the software that uses the licensed code, such as proper attribution.

Note that this is true even if you modify the code slightly. Modified code still borrows a large part of the creator's intellectual work. Plus, it is still easy to tell when code is copied, even when modified; the style and skill level will generally be wildly different from your own, so slight modifications are an ineffective way of hiding the fact that code is copied.

As a student, you probably won't be expected to care too much about specific software licenses. If you find something openly available on the web such as on a public GitHub repository or a Stack Overflow post (i.e., you're not hacking into proprietary software codebases like Microsoft Windows), chances are the license allows for code reuse with a simple attribution. In other words, simply state where the code came from.

For example:

// get URI query params; by ArtemBarger
// https://stackoverflow.com/a/901144
function getParameterByName(name, url = window.location.href) {
    name = name.replace(/[\[\]]/g, '\\$&');
    var regex = new RegExp('[?&]' + name + '(=([^&#]*)|&|#|$)'),
        results = regex.exec(url);
    if (!results) return null;
    if (!results[2]) return '';
    return decodeURIComponent(results[2].replace(/\+/g, ' '));
}

It's that easy. A link (and a brief description) will suffice.
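The same habit extends to snippets you modify. As a rough sketch (the wording of the "Changes" note is only one possible convention, and the example URL is made up), an adapted version of the snippet above might keep the link and briefly say what was altered:

```javascript
// Get a URI query parameter from an explicit URL string.
// Adapted from the answer by ArtemBarger: https://stackoverflow.com/a/901144
// Changes from the original: `url` is now a required argument (no
// window.location.href default), and `var` declarations became `const`.
function getParameterByName(name, url) {
    const escapedName = name.replace(/[\[\]]/g, '\\$&');
    const regex = new RegExp('[?&]' + escapedName + '(=([^&#]*)|&|#|$)');
    const results = regex.exec(url);
    if (!results) return null;   // parameter is not present
    if (!results[2]) return '';  // parameter is present but has no value
    return decodeURIComponent(results[2].replace(/\+/g, ' '));
}

// Hypothetical URL, used only to exercise the function.
const exampleUrl = 'https://example.com/?q=code+reuse&page=2';
console.log(getParameterByName('q', exampleUrl));
```

Noting the changes makes it clear which parts are yours and which are borrowed, which is useful both to a grader and to your future self.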

Understand the copied code

If you don't understand your code, you will simply have bad code. There is a good chance that the copied code will not work without modification with your existing codebase, or it may not cover the exact behavior that you need (e.g., it may fail some edge cases that are required by your assignment). It may also follow a different style guideline than your code, making the combined codebase inconsistent and difficult to skim.
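One way to find out whether a pasted helper covers the behavior you need is to probe it with the edge cases your assignment cares about. The sketch below is only illustrative: `copiedParse` is a hypothetical stand-in for a copied number parser (it just wraps the built-in `parseInt`), and the "assignment rules" in `assignmentCases` are assumptions invented for the example.

```javascript
// Hypothetical stand-in for a pasted helper; it only wraps the built-in parseInt.
function copiedParse(text) {
    return parseInt(text, 10);
}

// Run `fn` against [input, expected] pairs and collect every mismatch.
function findMismatches(fn, cases) {
    const found = [];
    for (const [input, expected] of cases) {
        const actual = fn(input);
        // Object.is treats NaN as equal to NaN, unlike ===.
        if (!Object.is(actual, expected)) {
            found.push({ input, expected, actual });
        }
    }
    return found;
}

// Assumed assignment rules (illustrative only): trailing garbage and
// empty input must both be rejected with NaN.
const assignmentCases = [
    ['42', 42],
    ['-7', -7],
    ['12abc', NaN],
    ['', NaN],
];

const failures = findMismatches(copiedParse, assignmentCases);
console.log(failures);
```

Here the probe reports the `'12abc'` case, because `parseInt` stops at the first non-digit and returns 12 rather than rejecting the input. A mismatch like this is exactly what you want to discover before submitting.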

Worse than these practical issues is the main, idealistic one: you will simply not understand your code. The code has gained complexity and become more difficult to maintain. It may contain code that is more complex or more naive than the code that you are expected to write for your class. Problems that stem from quirks or mistakes in the copied code may feel near-impossible to debug. If the professor asks you to explain the piece of copied code and you cannot, then they may ask you to scrap it anyway, and you will have lost time and not gained any knowledge.

The code that is copied may also simply be of poor quality. There are many sites out there that churn out generally bad-quality code snippets in exchange for views and monetization. These snippets are usually poorly commented, use obscure variable names, and contain premature optimizations that obscure intent. They may use insecure or outdated functions that are convenient to use but raise security concerns. Even popular sites like TutorialsPoint, GeeksForGeeks, and Quora are prone to having (IMO) poor-quality answers, code explanations, and code snippets mixed in.

Another issue is the potential for malicious code. Executing unknown code is a big no-no in cybersecurity, for obvious reasons. (There is also the more difficult issue of zero-width characters, discussed below.)

Of course, a combination of not understanding the code and not citing it is a recipe for disaster, with the potential for severe academic penalties. That is a plain case of plagiarism.

Miscellaneous

Even if you inspect a code snippet and find it not to be malicious, it may be hiding some invisible nastiness: zero-width characters. These can break code compilation at the very least, and be a security concern at worst.
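Because these characters cannot be seen, a quick scan of pasted text can help. The following is a minimal sketch, assuming that checking a handful of well-known zero-width code points is enough for your purposes; the list is not exhaustive, and `findZeroWidthChars` is a name made up for this example.

```javascript
// Zero-width code points checked here (not an exhaustive list of invisible characters):
// U+200B zero width space, U+200C zero width non-joiner, U+200D zero width joiner,
// U+2060 word joiner, U+FEFF zero width no-break space (byte order mark).
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;

// Return the position and code point of every zero-width character in `source`.
function findZeroWidthChars(source) {
    const hits = [];
    for (const match of source.matchAll(ZERO_WIDTH)) {
        const hex = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
        hits.push({ index: match.index, codePoint: 'U+' + hex });
    }
    return hits;
}

// This pasted line hides a zero width space right after the "=".
const pasted = 'let total =\u200B 1;';
console.log(findZeroWidthChars(pasted));
```

For the sample line, the scan reports a single U+200B at index 11. Many editors can also be configured to highlight such characters, which may be more convenient than a script.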

I've mentioned three sites that tend to have many code snippets: TutorialsPoint, GeeksForGeeks, and Quora. I don't particularly like these sites, mainly because of their lack of moderation (and thus poor question quality). Stack Overflow has always been my favorite (because of its moderation), even if question quality is falling nowadays. However, while Stack Overflow is great for finding high-quality code snippets, it has a strict question policy; for looser and more open-ended questions, I find that relevant subreddits are increasingly well-moderated and useful.

I have largely not addressed libraries, which are another form of code reuse. However, unlike copying code directly, I believe that using libraries is a much better practice in general. Good libraries will have well-documented APIs, and the citation is implicit (by importing the library). Using libraries leads to additional complexities like package management and versioning, but this is an artifact of managing larger software systems.
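As a small illustration of preferring a documented API over a pasted helper, the query-parameter task from earlier can be handled by the standard `URL` interface available in browsers and Node.js. It is a built-in rather than an imported library, so it only stands in for the general idea here. `getQueryParam` is a hypothetical wrapper name and the URL is made up.

```javascript
// Read a query parameter with the standard URL API instead of a pasted regex helper.
// `getQueryParam` is a hypothetical wrapper name used only for this illustration.
function getQueryParam(name, url) {
    // searchParams.get returns null when the parameter is absent.
    return new URL(url).searchParams.get(name);
}

// Hypothetical URL, used only to exercise the function.
const sampleUrl = 'https://example.com/?q=code+reuse&page=2';
console.log(getQueryParam('q', sampleUrl));
```

The behavior (decoding, handling of missing parameters) is specified in public documentation, so there is less to reverse-engineer and nothing to cite beyond the API you call.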

Also, while copying and reusing code is not always a good practice, reverse-engineering (i.e., understanding) a snippet of code is incredibly good practice. You can learn a lot from looking at other people's code, and if you totally understand what a snippet does and find it clever, then reusing the code (still with proper citation) is all the more justified.

This document is a rough guideline. For specific details, you can check your school's policy on code reuse, if there is one (e.g., MIT's code-writing guidelines); otherwise it will probably fall under the general academic integrity (plagiarism) policy. Most likely, exact guidelines will be left to the professor. You should also consult the licenses of the software that you use (including any libraries you use), which should be easily Google-able. If you reuse software from a question-answer forum, then the license will be up to the hosting site (e.g., user-created content on Stack Exchange falls under a CC BY-SA license, with the version depending on when the content was posted; this is not a software license at all).


© Copyright 2023 Jonathan Lam