Hugging Face and ServiceNow Release StarCoder.
Hugging Face and ServiceNow Research, ServiceNow's R&D division, have released StarCoder, a free alternative to code-generating artificial intelligence systems. Code-generating systems like DeepMind's AlphaCode; Amazon's CodeWhisperer; and OpenAI's Codex, which powers Copilot, provide a glimpse at what's possible with artificial intelligence within the realm of computer programming. If ethical, technical and legal issues are eventually solved, they could cut development costs substantially while allowing coders to focus on more creative tasks.
According to a study from the University of Cambridge, at least half of developers' efforts are spent debugging rather than actively programming, which costs the software industry an estimated $312 billion per year. Yet only a small number of code-generating artificial intelligence systems have been made freely available to the public.
StarCoder, which is licensed to allow for royalty-free use by anyone, including corporations, was trained on over 80 programming languages. It integrates with Microsoft's Visual Studio Code editor and can follow basic instructions, such as a request to create an app UI, and answer questions about code.
Congratulations to all the @BigCodeProject contributors that worked tirelessly over the last 6+ months to bring the vision of releasing a responsibly developed 15B parameter Code LLM to fruition. We cannot thank you enough for the collaboration & contributions to the community. https://t.co/282sCRJq3k
— ServiceNow Research (@ServiceNowRSRCH) May 4, 2023
How to build model
Leandro von Werra is a machine learning engineer at Hugging Face and a co-lead on StarCoder. According to von Werra, Stable Diffusion showed last year how creative and capable the open-source community is: within weeks of its release, the community had built dozens of variants of the model. He expects that releasing a powerful code generation model will let anyone fine-tune and adapt it to their own use cases and will enable countless downstream applications.
BigCode Project
The BigCode project, launched late last year, aims to develop state-of-the-art artificial intelligence systems for code in an open and responsible way. The StarCoder model was trained with an in-house compute cluster of 512 V100 GPUs from Hugging Face. Various BigCode working groups discuss ethical best practices and implement methods for training code models. The Legal, Ethics and Governance working group, for example, explored questions about data licensing, the redaction of personally identifiable information, and the risks of outputting malicious code.
BigCode was inspired by Hugging Face's previous efforts to open source sophisticated text-generating systems. The Software Freedom Conservancy has criticized the use of public source code, not all of which is under a permissive license, to train and monetize Codex. Codex and Copilot are both available through OpenAI and Microsoft. Their makers maintain that Codex and Copilot are protected by the doctrine of fair use, at least in the US.
BigCode, by contrast, sets out not to run afoul of licensing agreements. "Releasing a capable code-generating system can serve as a research platform for institutions that are interested in the topic but don't have the necessary resources or know-how to train such models. In the long run, we believe that this leads to fruitful research on safety, capabilities and limits of code-generating systems."
The 15-billion-parameter StarCoder was trained over the course of several days on an open source dataset called The Stack, which has over 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages. The contents of The Stack dataset are broken down in a graphic. In machine learning, parameters are the parts of a system that are learned from historical training data and that essentially define the system's skill on a problem.
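To make the notion of parameters more concrete, here is a rough Python sketch that counts the learned values in a toy two-layer network and then estimates how much memory 15 billion parameters would occupy. The helper names are illustrative, and the figure of two bytes per parameter (16-bit weights) is an assumption rather than something stated by the BigCode team; the estimate also ignores everything besides the weights themselves.

```python
def dense_layer_params(n_in: int, n_out: int) -> int:
    """Learned values in one fully connected layer: a weight per
    input-output pair plus one bias per output."""
    return n_in * n_out + n_out


def model_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the weights alone, in gigabytes.

    bytes_per_param=2 assumes 16-bit weights (an assumption, not a
    figure from the article).
    """
    return n_params * bytes_per_param / 1e9


# A toy network: 4 inputs -> 3 hidden units -> 1 output.
tiny_model_params = dense_layer_params(4, 3) + dense_layer_params(3, 1)

# The article's figure: 15 billion parameters.
starcoder_params = 15_000_000_000
starcoder_weights_gb = model_memory_gb(starcoder_params)

print(f"toy network parameters: {tiny_model_params}")
print(f"15B parameters at 2 bytes each: about {starcoder_weights_gb:.0f} GB")
```

Under these assumptions the toy network has 19 parameters, while a 15-billion-parameter model needs on the order of 30 GB for its weights alone, which gives a sense of why training and serving such models requires substantial compute.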
Because it is permissively licensed, code from The Stack can be copied, modified and redistributed. The BigCode project also provides a way for developers to "opt out" of The Stack, similar to efforts elsewhere to let artists remove their work from text-to-image training datasets. The BigCode team worked to remove personally identifiable information from The Stack as well. To that end, they created a separate dataset of 12,000 files, which they plan to release to researchers.
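The kind of curation described here (keeping only permissively licensed code, honoring opt-outs and scrubbing personal information) can be sketched in a few lines of Python. This is a simplified illustration, not BigCode's actual pipeline: the license allow-list, the opt-out list, the repository names and the regular expressions below are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical allow-list and opt-out list, for illustration only.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}
OPTED_OUT_REPOS = {"alice/private-tools"}

# Deliberately simple patterns; real PII detection is far more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(
    r"""(?i)\b(api_key|password|secret|token)\b(\s*=\s*)(["'])[^"']+\3"""
)


@dataclass
class SourceFile:
    repo: str
    license: str
    text: str


def redact(text: str) -> str:
    """Replace email addresses and quoted secret values with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Keep the variable name, '=' and quote style; replace only the value.
    return SECRET_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<SECRET>{m.group(3)}",
        text,
    )


def filter_corpus(files: list[SourceFile]) -> list[SourceFile]:
    """Keep permissively licensed, non-opted-out files, with PII redacted."""
    kept = []
    for f in files:
        if f.license.lower() not in PERMISSIVE_LICENSES:
            continue  # drop code under non-permissive licenses
        if f.repo in OPTED_OUT_REPOS:
            continue  # honor developer opt-out requests
        kept.append(SourceFile(f.repo, f.license, redact(f.text)))
    return kept


corpus = [
    SourceFile("bob/app", "MIT", 'API_KEY = "abc123"  # mail bob@example.com'),
    SourceFile("carol/gpl-tool", "GPL-3.0", "x = 1"),
    SourceFile("alice/private-tools", "MIT", "y = 2"),
]
cleaned = filter_corpus(corpus)
for f in cleaned:
    print(f.repo, "->", f.text)
```

In this toy corpus only the first file survives: the second is excluded by its license, the third by the opt-out list, and the surviving file has its key and email address replaced with placeholders.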
The BigCode team also used Hugging Face's malicious code detection tool to remove files from The Stack that might be considered unsafe, such as those with known exploits. The privacy and security issues with generative artificial intelligence systems are well-established. In one reported case, a generative system volunteered a journalist's phone number. Copilot, for its part, may be able to generate keys, credentials and passwords from its training data.
Code is some of the most sensitive intellectual property that most companies hold, and sharing it outside their infrastructure poses immense challenges. Along those lines, some legal experts have argued that code-generating artificial intelligence systems could put companies at risk if they were to inadvertently incorporate copyrighted or sensitive text into their production software. Because systems like Copilot strip code of its licenses, it is difficult to tell which code is permissible to deploy and which might have incompatible terms of use.
In response, Copilot customers can now prevent suggested code that matches public, potentially copyrighted content from being shown. Amazon, for its part, has CodeWhisperer highlight and filter the license associated with functions it suggests that bear a resemblance to snippets found in its training data. ServiceNow, a company that deals mostly in enterprise automation software, will eventually build StarCoder into its products.
The company wouldn't reveal how much it has invested in the BigCode project, saying only that the amount of donated compute was substantial. According to ServiceNow, the Large Language Models Lab at ServiceNow Research is building up expertise on the responsible development of generative artificial intelligence models to ensure the safe and ethical deployment of these powerful models for its customers. The company adds that the open-scientific research approach to BigCode provides ServiceNow developers and customers with full transparency into how everything was developed and demonstrates ServiceNow's commitment to making socially responsible contributions to the community. StarCoder is not open source in the strictest sense.
Commercial drivers
StarCoder is being released under a licensing scheme called OpenRAIL-M, which includes use case restrictions that derivatives of the model are required to comply with. For example, StarCoder users must agree not to use the model to generate or distribute malicious code. While real-world examples are few and far between, researchers have demonstrated how models like StarCoder could be used in malware to evade detection. Whether developers will respect the terms of the license remains to be seen.
There is nothing at the base technical level that can prevent developers from ignoring the terms. Stable Diffusion's restrictive license, for instance, was ignored by developers who used the generative AI model to create celebrity deepfakes. Still, von Werra feels the benefits of releasing StarCoder outweigh the downsides. At launch, StarCoder will not ship as many features as Copilot, but with its open-source nature, the community can help improve it along the way as well as integrate custom models.
As of this week, the StarCoder code repository, model training framework, dataset-filtering methods, code evaluation suite and research analysis notebooks are available on the internet. As the groups look to develop more capable code-generating models, the BigCode project will maintain them. There is work to be done. In the technical paper accompanying StarCoder's release, Hugging Face and ServiceNow say that the model may produce inaccurate, offensive, and misleading content.