[1] Li Y, Choi D, Chung J, et al. Competition-level code generation with AlphaCode[J]. Science, 2022, 378(6624): 1092-1097.
[2] Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code[J]. arXiv preprint arXiv:2107.03374, 2021.
[3] Li R, Allal L B, Zi Y, et al. StarCoder: May the source be with you![J]. arXiv preprint arXiv:2305.06161, 2023.
[4] Guo D, Zhu Q, Yang D, et al. DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence[J]. arXiv preprint arXiv:2401.14196, 2024.
[5] 01-ai. Meet Yi-Coder: A Small but Mighty LLM for Code[EB/OL]. [2024-09-05]. https://01-ai.github.io/blog.html?post=en/2024-09-05-A-Small-but-Mighty-LLM-for-Code.md.
[6] Pinnaparaju N, Adithyan R, Phung D, et al. Stable code technical report[J]. arXiv preprint arXiv:2404.01226, 2024.
[7] Mishra M, Stallone M, Zhang G, et al. Granite code models: A family of open foundation models for code intelligence[J]. arXiv preprint arXiv:2405.04324, 2024.
[8] Hu S, Tu Y, Han X, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies[J]. arXiv preprint arXiv:2404.06395, 2024.
[9] Lozhkov A, Li R, Allal L B, et al. StarCoder 2 and The Stack v2: The next generation[J]. arXiv preprint arXiv:2402.19173, 2024.
[10] Page L, Brin S, Motwani R, et al. The PageRank citation ranking: Bringing order to the web[R]. Stanford InfoLab Technical Report, 1999.
[11] Shen Z, Tao T, Ma L, et al. SlimPajama-DC: Understanding data combinations for LLM training[J]. arXiv preprint arXiv:2309.10818, 2023.
[12] Dubey A, Jauhri A, Pandey A, et al. The Llama 3 herd of models[J]. arXiv preprint arXiv:2407.21783, 2024.
[13] Luo X, Zhu Q, Zhang Z, et al. Semi-Instruct: Bridging Natural-Instruct and Self-Instruct for Code Large Language Models[J]. arXiv preprint arXiv:2403.00338, 2024.
[14] Abbas A, Tirumala K, Simig D, et al. SemDeDup: Data-efficient learning at web-scale through semantic deduplication[J]. arXiv preprint arXiv:2303.09540, 2023.
[15] Zhao H, Du L, Ju Y, et al. Beyond IID: Optimizing instruction learning from the perspective of instruction interaction and dependency[J]. arXiv preprint arXiv:2409.07045, 2024.
[16] Liu J, Xia C S, Wang Y, et al. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation[J]. Advances in Neural Information Processing Systems, 2024, 36.
[17] Austin J, Odena A, Nye M, et al. Program synthesis with large language models[J]. arXiv preprint arXiv:2108.07732, 2021.
[18] Jain N, Han K, Gu A, et al. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code[J]. arXiv preprint arXiv:2403.07974, 2024.
[19] Hendrycks D, Burns C, Basart S, et al. Measuring massive multitask language understanding[J]. arXiv preprint arXiv:2009.03300, 2020.
[20] Zellers R, Holtzman A, Bisk Y, et al. HellaSwag: Can a machine really finish your sentence?[J]. arXiv preprint arXiv:1905.07830, 2019.
[21] Clark P, Cowhey I, Etzioni O, et al. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge[J]. arXiv preprint arXiv:1803.05457, 2018.
[22] Suzgun M, Scales N, Schärli N, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them[J]. arXiv preprint arXiv:2210.09261, 2022.
[23] Huang Y, Bai Y, Zhu Z, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models[J]. arXiv preprint arXiv:2305.08322, 2023.
[24] Li H, Zhang Y, Koto F, et al. CMMLU: Measuring massive multitask language understanding in Chinese[J]. arXiv preprint arXiv:2306.09212, 2023.
[25] Cobbe K, Kosaraju V, Bavarian M, et al. Training verifiers to solve math word problems[J]. arXiv preprint arXiv:2110.14168, 2021.