Addressing Data Contamination in Language Models
Analyzing the risks of data contamination in closed-source language models.
― 5 min read
In recent years, researchers have increasingly relied on Large Language Models (LLMs) for a wide range of natural language processing tasks. However, many of these models are closed-source, meaning that details about their training data and inner workings are not publicly available. This lack of transparency has raised concerns about data contamination among researchers.
What Is Data Contamination?
Data contamination occurs when a model is evaluated on data it has already seen during training. This can inflate performance metrics so that they no longer accurately reflect the model's capabilities. The concern is especially relevant when a model is evaluated on test data it may have been trained on, directly or indirectly.
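When a reference text collection is available, overlap between a test set and that collection can be probed with simple heuristics. Below is a minimal sketch of one such heuristic based on word n-gram matching; the function names, the 8-gram window, and the usage example are illustrative assumptions, not the paper's method. For closed-source models the training data is not accessible, so exactly this kind of check cannot be run, which is the core of the problem discussed here.

```python
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word n-grams in a text (simple whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_example: str, reference_corpus: Iterable[str], n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear in the reference corpus.

    A high ratio suggests the example (or a near-duplicate) may be present in the
    reference text, which is a common heuristic signal of contamination.
    """
    example_grams = ngrams(test_example, n)
    if not example_grams:
        return 0.0
    reference_grams: Set[tuple] = set()
    for doc in reference_corpus:
        reference_grams |= ngrams(doc, n)
    matched = sum(1 for g in example_grams if g in reference_grams)
    return matched / len(example_grams)


# Illustrative usage: flag test items whose 8-gram overlap with a reference dump is high.
if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    example = "the quick brown fox jumps over the lazy dog near the river bank"
    print(overlap_ratio(example, corpus))  # 1.0 -> the example is duplicated in the corpus
```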
Closed Models and Their Problems
Many widely used LLMs are offered through application programming interfaces (APIs), and their inner workings are not accessible to the public. This means researchers cannot easily determine whether a model has been exposed to specific datasets that could bias its evaluations. As a result, many studies may unintentionally rely on contaminated data, leading to unreliable comparisons with other models.
A Systematic Analysis
A systematic review of the research literature reveals troubling figures about data contamination in LLMs such as GPT-3.5 and GPT-4. An examination of 255 academic papers showed that a significant number leaked data that could potentially benefit these models. The analysis also found that many studies failed to adequately consider or report data contamination issues.
Data Leakage
In total, the research indicates that around 4.7 million samples from 263 distinct benchmark datasets were leaked to models like GPT-3.5 and GPT-4 during evaluations. This extensive data leakage raises serious questions about the integrity of performance evaluations and the validity of the findings derived from these studies.
Problematic Evaluation Practices
A review of the literature further reveals several worrying evaluation practices. Many studies suffered from unfair comparisons due to differences in the datasets used for evaluation. For example, some models were evaluated on only a small subset of samples, while others were tested on entire datasets. Such practices can lead to misleading conclusions about a model's effectiveness.
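To make the subsample problem concrete, here is a small sketch (our own illustration, not taken from the paper) of how wide the uncertainty around an accuracy score becomes when only a small subset is evaluated, using a standard Wilson score interval:

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score confidence interval for an observed accuracy."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half


# 80% accuracy measured on a 100-item subsample vs. a full 10,000-item test set:
print(wilson_interval(80, 100))      # roughly (0.71, 0.87) -- too wide to rank close systems
print(wilson_interval(8000, 10000))  # roughly (0.79, 0.81) -- much tighter
```

Under these assumptions, two systems whose true accuracies differ by a few points are indistinguishable when one of them is scored on a small subsample, which is why such comparisons can mislead.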
Consequences for Research
The implications of these findings are significant. When data contamination occurs, it not only distorts the performance evaluation of specific models but also has broader consequences for the research field as a whole. Relying on contaminated data can hinder scientific progress and mislead stakeholders who base decisions on these evaluations.
Suggested Practices Going Forward
To address these problems, researchers should adopt more rigorous practices when evaluating closed models. Here are some suggested practices:
Avoid Data Leakage: When planning evaluations, researchers should consult the data usage policies of model providers. Using API access where applicable can help prevent unintentional data leakage (see the sketch after this list).
Interpret Performance with Care: Be cautious when interpreting performance metrics from closed models, and consider the possibility of data contamination when assessing results.
Compare with Open Models: Researchers should strive to include comparisons with open-source models to provide a fair assessment of closed alternatives. This helps ensure a level playing field when evaluating model capabilities.
Transparency: Reports should include clear details about the datasets used, the evaluation methodology, and the conditions under which the models were tested. Such transparency aids reproducibility and strengthens the credibility of the findings.
Regular Updates: Models are frequently updated, so evaluations should specify the model version used during the research. This helps maintain consistency across studies.
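The sketch below illustrates how several of these practices might be combined in one evaluation loop: pinning an explicit model version, keeping gold labels local so they are never sent to the provider, and writing out a small metadata report for reproducibility. It is a sketch under stated assumptions, not the paper's protocol; in particular, `query_model` is a hypothetical stand-in for whichever API client a researcher actually uses.

```python
import datetime
import json
from typing import Callable, Dict, Iterable, List


def run_evaluation(
    query_model: Callable[[str, str], str],   # hypothetical stand-in for an API client call
    model_version: str,                       # pin an explicit, dated model identifier
    dataset_name: str,
    examples: Iterable[Dict[str, str]],       # each example: {"input": ..., "gold": ...}
) -> Dict:
    """Query a closed model while keeping an audit trail of what was sent and when.

    Only the input text is sent to the API; gold labels stay local, so the reference
    answers are never exposed to the provider.
    """
    predictions: List[Dict[str, str]] = []
    for ex in examples:
        predictions.append({
            "input": ex["input"],
            "prediction": query_model(model_version, ex["input"]),
            "gold": ex["gold"],   # kept locally for scoring, never included in the prompt
        })
    report = {
        "model_version": model_version,
        "dataset": dataset_name,
        "num_examples": len(predictions),
        "evaluation_date": datetime.date.today().isoformat(),
    }
    with open("evaluation_report.json", "w") as f:
        json.dump({"metadata": report, "predictions": predictions}, f, indent=2)
    return report
```

Recording the exact model identifier and dataset details in the report makes it possible for later studies to reproduce or fairly compare against the evaluation, which is the point of the transparency and versioning suggestions above.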
The Importance of Open-Source Models
While proprietary models may appear to offer better performance, researchers should consider using open-source models whenever possible. Open-source models allow for greater transparency and scrutiny, enabling more robust evaluations and comparisons.
Conclusion
Data contamination in closed-source LLMs presents a significant challenge for researchers and practitioners. The systematic analysis of the existing literature indicates widespread problems with data leakage and questionable evaluation practices. Going forward, the research community must adopt better practices to ensure the integrity of evaluations and, ultimately, to foster more reliable advances in natural language processing. By prioritizing transparency, careful interpretation, and open comparisons, researchers can mitigate the effects of data contamination and increase the value of their findings for the broader scientific community.
Title: Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Abstract: Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI's data usage policy, we extensively document the amount of data leaked to these models during the first year after the model's release. We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.
Authors: Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek
Last updated: 2024-02-22
Language: English
Source URL: https://arxiv.org/abs/2402.03927
PDF Source: https://arxiv.org/pdf/2402.03927
License: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was prepared with the assistance of AI and may contain inaccuracies. For accurate information, please refer to the original documents linked here.
Thanks to arXiv for the use of its open access interoperability.
Reference links
- https://www.kaggle.com/datasets/ratthachat/writing-prompts
- https://github.com/facebookresearch/opendialkg
- https://huggingface.co/datasets/allenai/prosocial-dialog
- https://huggingface.co/datasets/multi_woz_v22
- https://github.com/alexa/dstc11-track5
- https://paperswithcode.com/dataset/dstc7-task-2
- https://huggingface.co/datasets/ConvLab/multiwoz21
- https://github.com/smartyfh/MultiWOZ2.4
- https://huggingface.co/datasets/newsroom
- https://github.com/thu-coai/OpenMEVA
- https://github.com/neulab/REALSumm
- https://github.com/Yale-LILY/SummEval
- https://github.com/facebookresearch/flores/tree/main/previous_releases/flores101
- https://paperswithcode.com/dataset/wmt-2020
- https://www.statmt.org/wmt22/translation-task.html
- https://github.com/IndoNLP/nusax
- https://github.com/biomedical-translation-corpora/corpora
- https://paperswithcode.com/dataset/wmt-2014
- https://github.com/facebookresearch/flores/tree/main/flores200
- https://github.com/google/wmt-mqm-human-evaluation/tree/main/generalMT2022
- https://inklab.usc.edu/NumerSense/
- https://www.cs.washington.edu/nlp/arithmetic
- https://huggingface.co/datasets/aqua_rat
- https://www.microsoft.com/en-us/download/details.aspx?id=52628
- https://github.com/friederrr/GHOSTS
- https://github.com/openai/grade-school-math
- https://huggingface.co/datasets/ChilleD/MultiArith
- https://gitlab.cs.washington.edu/ALGES/TACL2015/-/blob/master/questions.json?ref_type=heads
- https://github.com/arkilpatel/SVAMP
- https://github.com/bruzwen/ddxplus
- https://physionet.org/content/mimic-cxr/2.0.0/
- https://www.merckmanuals.com/professional/pages-with-widgets/case-studies?mode=list
- https://github.com/MJ-Jang/BECEL/tree/main
- https://github.com/mcdm/CommitmentBank
- https://huggingface.co/datasets/multi_nli
- https://paperswithcode.com/dataset/qnli
- https://paperswithcode.com/dataset/rte
- https://leaderboard.allenai.org/anli/submissions/get-started
- https://allenai.org/data/entailmentbank
- https://github.com/verypluming/MED
- https://github.com/AI-secure/adversarial-glue/tree/main
- https://github.com/facebookresearch/anli?tab=readme-ov-file
- https://super.gluebenchmark.com/
- https://github.com/swarnaHub/ConjNLI
- https://github.com/csitfun/ConTRoL-dataset
- https://github.com/verypluming/HELP
- https://github.com/HKUST-KnowComp/NLI4CT
- https://github.com/microsoft/TaxiNLI
- https://huggingface.co/datasets/SetFit/wnli
- https://github.com/howl-anderson/ATIS_dataset/tree/master
- https://github.com/sonos/nlu-benchmark
- https://www.microsoft.com/en-us/download/details.aspx?id=52398
- https://gluebenchmark.com/
- https://github.com/HLTCHKUST/Perplexity-FactChecking/tree/main
- https://github.com/chuchun8/PStance
- https://afshinrahimi.github.io/semeval2016-task6/
- https://github.com/cardiffnlp/tweeteval/tree/main/datasets/stance
- https://github.com/jkoppel/QuixBugs
- https://github.com/Kali-Hac/ChatGPT-MBTI
- https://jmir.org/api/download?alt_name=mededu_v9i1e45312_app1.xlsx&filename=3c2adca5ee88328073c589af108a5697.xlsx
- https://github.com/facebookarchive/bAbI-tasks/tree/master
- https://github.com/facebookresearch/clutrr
- https://github.com/Waste-Wood/e-CARE/
- https://github.com/SophonPlus/ChineseNlpCorpus
- https://github.com/kelvin-jiang/FreebaseQA
- https://hotpotqa.github.io/
- https://lc-quad.sda.tech/
- https://github.com/siatnlp/LegalQA
- https://github.com/lgw863/LogiQA-dataset
- https://github.com/CogComp/MCTACO
- https://github.com/UCSD-AI4H/Medical-Dialogue-System
- https://github.com/apple/ml-mkqa
- https://github.com/ianporada/modeling_event_plausibility
- https://github.com/ybisk/ybisk.github.io/tree/master/piqa
- https://whyu.me/reclor/
- https://github.com/davidgolub/SimpleQA/tree/master/datasets/SimpleQuestions
- https://github.com/HLR/SpartQA_generation
- https://github.com/ZhengxiangShi/StepGame
- https://github.com/google-research-datasets/TimeDial
- https://www.microsoft.com/en-us/download/details.aspx?id=52763
- https://github.com/brightmart/nlp_chinese_corpus
- https://aistudio.baidu.com/datasetdetail/38489
- https://facebookresearch.github.io/ELI5/
- https://tcci.ccf.org.cn/conference/2016/pages/page05_evadata.html
- https://allenai.org/data/open-book-qa
- https://allenai.org/data/qasc
- https://www.cs.cmu.edu/~glai1/data/race/
- https://allenai.org/data/socialiqa
- https://huggingface.co/datasets/squad_v2
- https://github.com/sylinrl/TruthfulQA
- https://www.microsoft.com/en-us/download/details.aspx?id=52419
- https://thukeg.gitee.io/kqa-pro/
- https://github.com/zhongwanjun/AR-LSAT
- https://huggingface.co/datasets/google/boolq
- https://github.com/allenai/contrast-sets/tree/main/BoolQ
- https://github.com/ALFA-group/BRON
- https://cve.mitre.org/
- https://allenai.org/data/complexwebquestions
- https://dblp.org/rdf/release/dblp-2022-06-01.nt.gz
- https://efficientqa.github.io/
- https://dki-lab.github.io/GrailQA/
- https://github.com/ysu1989/GraphQuestions
- https://github.com/Hello-SimpleAI/chatgpt-comparison-detection
- https://github.com/yongcaoplus/ProbingChatGPT
- https://github.com/AskNowQA/LC-QuAD2.0
- https://github.com/csitfun/LogiQA2.0
- https://zenodo.org/records/4617285#.YrNszNLMJhH
- https://ott-qa.github.io/
- https://github.com/iesl/protoqa-data
- https://github.com/ag-sc/QALD/tree/master
- https://github.com/sylinrl/TruthfulQA/tree/main
- https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family/tree/main/datasets/WQSP
- https://yago-knowledge.org/downloads/yago-4
- https://www.tau-nlp.sites.tau.ac.il/commonsenseqa
- https://rowanzellers.com/hellaswag/
- https://github.com/taylorwwebb/emergent_analogies_LLM/tree/main/letter_string
- https://allenai.org/data/arc
- https://huggingface.co/datasets/skrishna/coin_flip
- https://people.ict.usc.edu/~gordon/copa.html
- https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html
- https://nyu-mll.github.io/CoLA/
- https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/README.md
- https://github.com/RUCKBReasoning/CoT-KA
- https://github.com/qiangning/MATRES
- https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/object_counting/README.md
- https://allenai.org/data/strategyqa
- https://github.com/aakanksha19/TDDiscourse
- https://www.usna.edu/Users/cs/nchamber/caevo/
- https://adapterhub.ml/explore/sts/sts-b/
- https://github.com/cardiffnlp/tweeteval/tree/main/datasets/emoji
- https://github.com/MJ-Jang/BECEL/tree/main/data/mrpc
- https://lcl.uniroma1.it/wsdeval/
- https://pilehvar.github.io/wic/
- https://github.com/Moradnejad/ColBERT-Using-BERT-Sentence-Embedding-for-Humor-Detection/tree/master/Data
- https://www.kaggle.com/datasets/niraliivaghani/flipkart-product-customer-reviews-dataset
- https://www.cs.cornell.edu/people/pabo/movie-review-data/
- https://github.com/YJiangcm/SST-2-sentiment-analysis
- https://github.com/conversationai/unhealthy-conversations
- https://github.com/ewulczyn/wiki-detox/
- https://github.com/CLARIN-PL/chatgpt-evaluation-01-2023/
- https://github.com/google-research/google-research/tree/master/goemotions
- https://github.com/SALT-NLP/implicit-hate
- https://www.kaggle.com/datasets/rmsharks4/sarcasmania-dataset
- https://codalab.lisn.upsaclay.fr/competitions/7096#learn_the_details
- https://github.com/cardiffnlp/tweeteval/tree/main/datasets/sentiment
- https://github.com/allenai/real-toxicity-prompts
- https://adversarialglue.github.io/instructions/
- https://chalearnlap.cvc.uab.cat/dataset/24/description/
- https://github.com/allenai/contrast-sets/tree/main/IMDb
- https://clarin-pl.eu/dspace/handle/11321/710
- https://huggingface.co/datasets/sentiment140
- https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch
- https://huggingface.co/datasets/cnn_dailymail
- https://github.com/csebuetnlp/CrossSum
- https://github.com/ctr4si/MMN
- https://github.com/esdurmus/Wikilingua
- https://github.com/krystalan/ClidSum/tree/main#2-clidsum-benchmark-dataset
- https://github.com/honglizhan/CovidET
- https://github.com/ali-bahrainian/NEWTS
- https://github.com/armancohan/long-summarization/tree/master
- https://github.com/Yale-LILY/QMSum
- https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset
- https://github.com/nyu-mll/SQuALITY
- https://paperswithcode.com/dataset/samsum-corpus
- https://github.com/inverse-scaling/prize
- https://huggingface.co/datasets/ml4pubmed/pubmed-classification-20k
- https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
- https://www.kaggle.com/datasets/paultimothymooney/medical-speech-transcription-and-intent
- https://mtsamples.com/
- https://www.i2b2.org/NLP/Relations/
- https://paperswithcode.com/dataset/ace-2005
- https://github.com/ZihanWangKi/CrossWeigh
- https://huggingface.co/datasets/conll2003
- https://github.com/zhoujx4/DuEE
- https://github.com/zhoujx4/DuIE
- https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/MSRA
- https://github.com/truthless11/HRL-RE/tree/master/data/NYT11
- https://www.comp.nus.edu.sg/~nlp/conll14st.html
- https://github.com/microsoft/ContextualSP
- https://yale-lily.github.io/cosql
- https://taolusi.github.io/CSpider-explorer/
- https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing
- https://github.com/salesforce/QGen/tree/main/Quiz_Design
- https://github.com/taoyds/sparc
- https://drive.usercontent.google.com/download?id=1TqleXec_OykOYFREKKtschzY29dUcVAQ&export=download&authuser=0
- https://github.com/ygan/SpiderSS-SpiderCG
- https://github.com/ygan/Spider-DK
- https://zenodo.org/record/5205322
- https://github.com/ygan/Spider-Syn
- https://www.latex-project.org/help/documentation/encguide.pdf
- https://leak-llm.github.io/
- https://openai.com/blog/chatgpt
- https://openai.com/gpt-4
- https://blog.google/technology/ai/lamda/
- https://ai.google/discover/palm2/
- https://cohere.com/models/command
- https://claude.ai/
- https://hitz-zentroa.github.io/lm-contamination/
- https://scholar.google.com/
- https://www.semanticscholar.org/
- https://dblp.org/
- https://arxiv.org/
- https://aclanthology.org/
- https://help.openai.com/en/articles/5722486-how-your-data-is-used-to-improve-model-performance
- https://chat.openai.com/
- https://github.com/acheong08/ChatGPT
- https://github.com/rawandahmad698/PyChatGPT
- https://github.com/acheong08/ChatGPT-to-API
- https://openai.com/research/gpt-4
- https://huggingface.co/datasets
- https://www.kaggle.com/
- https://privacy.openai.com/policies
- https://arxiv.org/abs/2303.12528
- https://arxiv.org/abs/2303.12767
- https://doi.org/10.18653/v1/2023.acl-long.427
- https://arxiv.org/abs/2303.03186
- https://arxiv.org/abs/2307.11088
- https://www.sciencedirect.com/science/article/pii/S2666914523000568
- https://arxiv.org/abs/2307.15703
- https://arxiv.org/abs/2308.14508
- https://arxiv.org/abs/2306.04181
- https://arxiv.org/abs/2302.04023
- https://arxiv.org/abs/2212.10474
- https://arxiv.org/abs/2303.16421
- https://arxiv.org/abs/2303.09461
- https://arxiv.org/abs/2302.03494
- https://doi.org/10.18653/v1/2023.sicon-1.2
- https://arxiv.org/abs/2303.12712
- https://arxiv.org/abs/2307.02313
- https://arxiv.org/abs/2306.03024
- https://github.com/zeno-ml/zeno-build/tree/main/examples/chatbot/report
- https://arxiv.org/abs/2303.08014
- https://doi.org/10.18653/v1/2023.c3nlp-1.7
- https://arxiv.org/abs/2202.07646
- https://doi.org/10.18653/v1/2023.acl-long.313
- https://doi.org/10.18653/v1/2023.bionlp-1.8
- https://arxiv.org/abs/2307.03109
- https://arxiv.org/abs/2308.00304
- https://arxiv.org/abs/2307.09009
- https://arxiv.org/abs/2211.06869
- https://arxiv.org/abs/2303.00293
- https://arxiv.org/abs/2212.10522
- https://doi.org/10.18653/v1/2023.acl-long.870
- https://doi.org/10.2139/ssrn.4335905
- https://doi.org/10.18653/v1/2023.clinicalnlp-1.17
- https://doi.org/10.31234/osf.io/c3549
- https://doi.org/10.1021/acs.jchemed.3c00027
- https://arxiv.org/abs/2304.05906
- https://arxiv.org/abs/2302.13007
- https://arxiv.org/abs/2305.02182
- https://arxiv.org/abs/2305.13276
- https://arxiv.org/abs/2302.04752
- https://arxiv.org/abs/2304.06122
- https://arxiv.org/abs/2303.11436
- https://doi.org/10.18653/v1/2023.acl-long.626
- https://doi.org/10.18653/v1/2023.bea-1.30
- https://arxiv.org/abs/2305.12477
- https://arxiv.org/abs/2305.08391
- https://arxiv.org/abs/2304.01746
- https://aclanthology.org/2023.sigdial-1.20
- https://doi.org/10.18653/v1/2023.sicon-1.4
- https://arxiv.org/abs/2301.13867
- https://arxiv.org/abs/2305.07375
- https://arxiv.org/abs/2303.03836
- https://arxiv.org/abs/2304.02554
- https://arxiv.org/abs/2305.14627
- https://arxiv.org/abs/2304.02182
- https://doi.org/10.1177/05694345231169654
- https://doi.org/10.18653/v1/2023.semeval-1.298
- https://arxiv.org/abs/2306.09390
- https://arxiv.org/abs/2303.15056
- https://pubmed.ncbi.nlm.nih.gov/36753318/
- https://github.com/THU-KEG/EvaluationPapers4ChatGPT#evaluation-papers-for-chatgpt
- https://arxiv.org/abs/2308.08493
- https://doi.org/10.18653/v1/2023.dialdoc-1.11
- https://arxiv.org/abs/2303.15587
- https://doi.org/10.18653/v1/2023.starsem-1.4
- https://arxiv.org/abs/2301.07597
- https://doi.org/10.18653/v1/2020.acl-main.740
- https://aclanthology.org/2023.inlg-main.8
- https://doi.org/10.18653/v1/2023.wassa-1.19
- https://arxiv.org/abs/2303.05063
- https://arxiv.org/abs/2309.09150
- https://arxiv.org/abs/2303.14822
- https://doi.org/10.18653/v1/2023.acl-short.81
- https://arxiv.org/abs/2302.09210
- https://www.mdpi.com/1660-4601/20/4/3378
- https://arxiv.org/abs/2305.14020
- https://arxiv.org/abs/2308.00189
- https://doi.org/10.18653/v1/2023.acl-long.218
- https://arxiv.org/abs/2305.10276
- https://arxiv.org/abs/2303.10368
- https://arxiv.org/abs/2303.16416
- https://arxiv.org/abs/2302.07736
- https://arxiv.org/abs/2305.07004
- https://doi.org/10.18653/v1/2023.wassa-1.14
- https://arxiv.org/abs/2307.10236
- https://arxiv.org/abs/2305.08322
- https://aclanthology.org/2023.inlg-main.3
- https://doi.org/10.18653/v1/2023.bionlp-1.30
- https://arxiv.org/abs/2303.06273
- https://doi.org/10.18653/v1/2023.wassa-1.29
- https://arxiv.org/abs/2305.09645
- https://arxiv.org/abs/2301.08745
- https://arxiv.org/abs/2303.14310
- https://arxiv.org/abs/2304.03245
- https://doi.org/10.18653/v1/2023.bionlp-1.37
- https://arxiv.org/abs/2303.18027
- https://doi.org/10.18653/v1/2023.wassa-1.33
- https://arxiv.org/abs/2302.14520
- https://doi.org/10.1016/j.inffus.2023.101861
- https://arxiv.org/abs/2305.10407
- https://arxiv.org/abs/2303.17276
- https://arxiv.org/abs/2301.12127
- https://arxiv.org/abs/2302.02083
- https://doi.org/10.18653/v1/2023.eacl-main.241
- https://doi.org/10.1371/journal.pdig.0000198
- https://arxiv.org/abs/2308.15118
- https://arxiv.org/abs/2305.00050
- https://arxiv.org/abs/2304.05613
- https://arxiv.org/abs/2305.18486
- https://arxiv.org/abs/2302.13795
- https://arxiv.org/abs/2309.06085
- https://arxiv.org/abs/2304.11633
- https://arxiv.org/abs/2308.09597
- https://arxiv.org/abs/2305.03111
- https://arxiv.org/abs/2305.11747
- https://arxiv.org/abs/2304.10619
- https://arxiv.org/abs/2305.13269
- https://arxiv.org/abs/2302.11520
- https://openreview.net/forum?id=iO4LZibEqW
- https://aclanthology.org/2023.finnlp-1.7
- https://arxiv.org/abs/2303.13547
- https://arxiv.org/abs/2304.14399
- https://arxiv.org/abs/2308.11224
- https://arxiv.org/abs/2304.03439
- https://arxiv.org/abs/2305.12147
- https://arxiv.org/abs/2305.01210
- https://doi.org/10.18653/v1/2023.findings-acl.229
- https://arxiv.org/abs/2303.16634
- https://arxiv.org/abs/2304.01852
- https://doi.org/10.18653/v1/2023.acl-short.138
- https://arxiv.org/abs/2303.11032
- https://doi.org/10.18653/v1/2023.bea-1.24
- https://doi.org/10.18653/v1/2023.bea-1.18
- https://arxiv.org/abs/2306.01169
- https://doi.org/10.18653/v1/2023.acl-long.324
- https://arxiv.org/abs/2303.13809
- https://doi.org/10.18653/v1/2023.wassa-1.54
- https://arxiv.org/abs/2303.15621
- https://arxiv.org/abs/2307.15780
- https://arxiv.org/abs/2303.09038
- https://arxiv.org/abs/2302.02094
- https://arxiv.org/abs/2303.08896
- https://arxiv.org/abs/2308.12488
- https://doi.org/10.1101/2023.04.20.23288859
- https://arxiv.org/abs/2303.01194
- https://arxiv.org/abs/2304.11490
- https://doi.org/10.18653/v1/2023.findings-acl.280
- https://doi.org/10.18653/v1/2023.repl4nlp-1.17
- https://aclanthology.org/2023.ccl-2.9
- https://doi.org/10.18653/v1/2023.wassa-1.61
- https://arxiv.org/abs/2303.13375
- https://doi.org/10.18653/v1/2023.findings-acl.396
- https://arxiv.org/abs/2302.06466
- https://doi.org/10.18653/v1/2023.bea-1.62
- https://arxiv.org/abs/2303.08774
- https://arxiv.org/abs/2302.06426
- https://aclanthology.org/2023.sigdial-1.23
- https://doi.org/10.18653/v1/2023.nlp4convai-1.2
- https://arxiv.org/abs/2304.04256
- https://arxiv.org/abs/2305.03423
- https://arxiv.org/abs/2304.01487
- https://arxiv.org/abs/2302.12813
- https://arxiv.org/abs/2304.03277
- https://arxiv.org/abs/2303.13780
- https://doi.org/10.18653/v1/2023.acl-short.37
- https://arxiv.org/abs/2308.11483
- https://doi.org/10.18653/v1/2023.latechclfl-1.2
- https://doi.org/10.18653/v1/2023.acl-srw.1
- https://arxiv.org/abs/2302.06476
- https://doi.org/10.18653/v1/2023.bea-1.58
- https://arxiv.org/abs/2302.03780
- https://arxiv.org/abs/2304.03325
- https://doi.org/10.1101/2023.02.21.23285886
- https://arxiv.org/abs/2303.01248
- https://doi.org/10.18653/v1/2023.findings-acl.529
- https://doi.org/10.18653/v1/2023.acl-demo.51
- https://arxiv.org/abs/2307.11019
- https://arxiv.org/abs/2306.11892
- https://arxiv.org/abs/2309.07423
- https://arxiv.org/abs/2304.07333
- https://hitz-zentroa.github.io/lm-contamination/blog/
- https://arxiv.org/abs/2210.13312
- https://aclanthology.org/2023.clasp-1.12
- https://arxiv.org/abs/2302.13814
- https://doi.org/10.18653/v1/2023.findings-acl.663
- https://arxiv.org/abs/2304.08979
- https://arxiv.org/abs/2305.03513
- https://openreview.net/forum?id=s7xWeJQACI
- https://arxiv.org/abs/2301.08653
- https://arxiv.org/abs/2303.13001
- https://arxiv.org/abs/2303.17650
- https://doi.org/10.18653/v1/2023.americasnlp-1.17
- https://arxiv.org/abs/2307.07697
- https://arxiv.org/abs/2304.09542
- https://arxiv.org/abs/2307.06464
- https://doi.org/10.18653/v1/2023.acl-long.828
- https://arxiv.org/abs/2303.07992
- https://doi.org/10.18653/v1/2023.semeval-1.277
- https://doi.org/10.18653/v1/2023.acl-long.650
- https://arxiv.org/abs/2303.04360
- https://arxiv.org/abs/2301.13819
- https://arxiv.org/abs/2304.14106
- https://doi.org/10.18653/v1/2023.wassa-1.23
- https://doi.org/10.18653/v1/2023.wassa-1.58
- https://arxiv.org/abs/2306.17582
- https://arxiv.org/abs/2305.13160
- https://arxiv.org/abs/2306.11698
- https://arxiv.org/abs/2303.04048
- https://arxiv.org/abs/2302.14229
- https://arxiv.org/abs/2302.12095
- https://doi.org/10.18653/v1/2023.clinicalnlp-1.49
- https://doi.org/10.18653/v1/2023.bea-1.53
- https://arxiv.org/abs/2302.07257
- https://arxiv.org/abs/2307.10635
- https://arxiv.org/abs/2309.10691
- https://arxiv.org/abs/2308.05342
- https://arxiv.org/abs/2304.04339
- https://arxiv.org/abs/2302.10205
- https://arxiv.org/abs/2303.07839
- https://arxiv.org/abs/2303.13648
- https://doi.org/10.18653/v1/2023.acl-long.403
- https://doi.org/10.18653/v1/2023.acl-long.173
- https://doi.org/10.18653/v1/2023.bea-1.52
- https://arxiv.org/abs/2305.13300
- https://arxiv.org/abs/2304.05351
- https://arxiv.org/abs/2306.09841
- https://arxiv.org/abs/2307.15020
- https://doi.org/10.18653/v1/2023.findings-acl.513
- https://arxiv.org/abs/2304.13712
- https://arxiv.org/abs/2307.05779
- https://arxiv.org/abs/2302.08081
- https://arxiv.org/abs/2303.11381
- https://arxiv.org/abs/2305.10601
- https://arxiv.org/abs/2303.10420
- https://arxiv.org/abs/2304.05454
- https://arxiv.org/abs/2304.02015
- https://arxiv.org/abs/2212.14548
- https://arxiv.org/abs/2304.03087
- https://arxiv.org/abs/2304.04193
- https://arxiv.org/abs/2307.10172
- https://arxiv.org/abs/2306.10968
- https://arxiv.org/abs/2305.15005
- https://arxiv.org/abs/2301.03462
- https://arxiv.org/abs/2304.09582
- https://doi.org/10.18653/v1/2023.semeval-1.221
- https://doi.org/10.18653/v1/2023.acl-long.869
- https://arxiv.org/abs/2309.03882
- https://arxiv.org/abs/2304.10513
- https://aclanthology.org/2023.cs4oa-1.5
- https://arxiv.org/abs/2307.02157
- https://arxiv.org/abs/2302.10198
- https://arxiv.org/abs/2304.11107
- https://arxiv.org/abs/2305.13304
- https://arxiv.org/abs/2304.10145
- https://arxiv.org/abs/2301.12867