Simple Science

Cutting edge science explained simply

Computer Science · Computation and Language

Advancing Poetry Generation in Czech

A new model generates Czech poetry with improved rhyme and rhythm.



Czech Poetry Generation: an innovative model enhances Czech poetry creation through refined techniques.

Automated systems that generate poetry are currently available for only a few languages. This article describes a new model for creating poetry in Czech, built on top of an existing pre-trained large language model. We found that explicitly specifying certain characteristics of the stanzas within the poem helps the model perform better. We also discovered that tokenization, the way text is broken into pieces, is very important: methods that split words into syllables or individual characters work better than those that split them into subword units.

Poetry Generation in Czech Language

The main purpose of this project is to build a system that generates Czech poetry according to specified rules of rhyme and rhythm. While large language models like GPT and Llama are widely used, they do not always work well for Czech poetry. Previous experiments showed that even the best models, such as GPT-4, have difficulty adhering to the formal rules of Czech verse, which leads to poor results.

To improve generation, we decided to fine-tune a model specifically for Czech poetry. We took the Czech version of GPT-2 and trained it on a large collection of Czech poems. The Czech language has features that require special attention: a complex grammatical structure, but relatively simple spelling and rhythm.

Importance of Syllables

Instead of focusing on the meaning of the words, our model emphasizes how words sound and how they fit into the rhythm. Modeling syllables helps the model create new words, a common practice in poetry to satisfy rhyme and rhythm. To achieve better results, we used models that do not follow strict rules, allowing greater flexibility in creating new words while ensuring they match the specified characteristics of the stanzas.

Stanza Structure

In poetry, two key elements shape the structure of a stanza: rhyme and meter. The rhyme scheme applies to the whole stanza, while the meter may change between lines, so each line of the stanza is marked for meter individually. The rhyme scheme is represented using capital letters like ABAB, where each letter corresponds to a line in the stanza, and lines sharing the same letter rhyme with each other.
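As an illustration, a rhyme-scheme label like ABAB can be derived by assigning the same letter to lines with matching endings. This is a minimal sketch, not the authors' code: comparing the last couple of characters is a crude stand-in for real phonetic rhyme detection.

```python
def rhyme_scheme(lines, ending_len=2):
    """Assign capital letters (A, B, C, ...) to the lines of a stanza.

    Lines whose endings match get the same letter. Comparing the last
    few characters is a naive heuristic, used here only to illustrate
    how a scheme string such as 'ABAB' is built.
    """
    labels = {}
    scheme = []
    for line in lines:
        ending = line.strip().lower()[-ending_len:]
        if ending not in labels:
            labels[ending] = chr(ord("A") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)

stanza = [
    "The day is done, the night is near",
    "The stars come out to play",
    "A quiet hush falls on the ear",
    "As darkness ends the day",
]
print(rhyme_scheme(stanza))  # ABAB
```

A real system would compare phonetic transcriptions of the line endings rather than raw characters, but the labeling logic stays the same.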

Dataset and Its Features

We worked with a collection of Czech poetry organized by the Institute of Czech Literature. This dataset has over a thousand volumes of poetry, with details about the meter, rhyme, and other linguistic features. Although the annotations may have some errors, they serve as a solid foundation for training our model.

The dataset does not provide direct rhyme schemes, so we developed standardized representations for these, such as AABB or ABAB. We noticed that many poems do not follow strict genres, and as such, we used the year of publication as a way to categorize them.

Tokenization Strategies

One of the challenges we faced was how to break down the text for analysis. Traditional methods often encounter difficulties, especially with the Czech language because of its complex inflections and structures. We explored different ways of tokenization for our poetry model, deciding to look at syllables and individual characters.

With various tokenization methods, we tried both standard models and those focused on syllables. Our goal was to create a setup where the model could better generate poetry according to specific formats.
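The difference between the tokenization strategies can be sketched on a single word. The character-level splitter below is exact; the syllable splitter is a deliberately naive assumption (real Czech syllabification also handles syllabic r/l and the digraph "ch"), shown only to contrast the two granularities.

```python
import re

# Czech vowels, including the long (accented) forms.
VOWELS = "aeiouyáéěíóúůý"

def char_tokens(word):
    # Character-level tokenization: every letter is its own token.
    return list(word)

def naive_syllables(word):
    """Rough syllable split: each group of consonants plus the vowels
    that follow them becomes one token. This is an illustrative
    approximation, not a proper Czech syllabifier.
    """
    return re.findall(rf"[^{VOWELS}]*[{VOWELS}]+", word) or [word]

word = "maminka"
print(char_tokens(word))      # ['m', 'a', 'm', 'i', 'n', 'k', 'a']
print(naive_syllables(word))  # ['ma', 'mi', 'nka']
```

Both splitters keep the sound structure of the word visible to the model, which is exactly what subword tokenizers trained on frequency statistics tend to obscure.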

Model Training

For our foundation, we chose a specific Czech model that is a version of GPT-2. We enhanced this model with our dataset, breaking it down into structured inputs that highlighted rhyme and meter. We tried different training methods, including training the model initially with detailed formats and then refining it with simpler setups.
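The structured inputs can be imagined as plain-text strings that prepend the stanza's parameters to its verses. The exact serialization below, with a header carrying the rhyme scheme and year and a per-line meter tag, is a hypothetical illustration of the idea, not the paper's actual format.

```python
def format_training_example(scheme, year, verses):
    """Serialize a stanza with explicit formal annotations.

    The layout is a guessed illustration: a header line with the rhyme
    scheme and publication year, then one line per verse prefixed with
    a meter code (e.g. 'J' for iamb, 'T' for trochee) and the rhyme
    letter that verse carries.
    """
    header = f"# {scheme} # {year}"
    body = [f"{meter} {letter} # {text}" for meter, letter, text in verses]
    return "\n".join([header] + body)

example = format_training_example(
    "ABAB", 1890,
    [("J", "A", "první verš"), ("T", "B", "druhý verš"),
     ("J", "A", "třetí verš"), ("T", "B", "čtvrtý verš")],
)
print(example)
```

Because the annotations are ordinary tokens in the training text, the fine-tuned model learns to condition its verses on them, and at inference time the same header can be used as a prompt.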

For measuring errors and accuracy during training, we used standard methods. We concentrated on next-word predictions, which is suitable for the model type we used.

Generating Text

To make poetry generation more effective, we created alternative ways to generate text. The basic generating method required input that outlined the stanza's parameters. Each word was generated in sequence until the end was reached.

In our enhanced generating method, we examined previously created lines and applied rhyme and meter rules to guide the new lines. This method proved particularly helpful for the verses meant to rhyme.
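The enhanced method can be sketched as a decoding loop that, before producing each verse, injects the constraints implied by the lines already generated, such as the ending a new line must rhyme with. The `model_generate` call below is a placeholder for the fine-tuned language model, not a real API.

```python
def forced_generation(model_generate, scheme, meters):
    """Generate a stanza line by line, re-injecting formal constraints.

    model_generate(prefix, meter, rhyme_with) is a placeholder for a
    fine-tuned LM call; rhyme_with carries the earlier line whose
    ending the new line must rhyme with, or None for a free line.
    """
    lines = []
    first_line_for = {}  # rhyme letter -> index of first line using it
    for i, (letter, meter) in enumerate(zip(scheme, meters)):
        if letter in first_line_for:
            # A later line must rhyme with an earlier one: pass it along.
            rhyme_with = lines[first_line_for[letter]]
        else:
            first_line_for[letter] = i
            rhyme_with = None
        prefix = "\n".join(lines)
        lines.append(model_generate(prefix, meter, rhyme_with))
    return lines
```

For an ABAB scheme, the third and fourth calls receive the first and second lines respectively as rhyme targets, which is why this helps most on the verses meant to rhyme.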

Evaluation of Quality

Automatically assessing the quality of generated poetry is challenging. To address this, we focused on a specific part of the task, mainly evaluating how well the poetry matched the expected standards of rhyme and meter. We trained classifiers to label stanzas according to their rhyme scheme, meter, and year of publication.

To further refine this process, we experimented with splitting the text into syllables before inputting it into the validator models. This step aimed to help models classify rhyme and meter more accurately.
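The evaluation idea reduces to comparing the scheme requested at generation time with the scheme a trained validator assigns to the output. The sketch below assumes a `classify_scheme` function standing in for such a validator model.

```python
def rhyme_accuracy(examples, classify_scheme):
    """Fraction of generated stanzas whose predicted rhyme scheme
    matches the scheme requested at generation time.

    examples is a list of (requested_scheme, stanza_lines) pairs;
    classify_scheme(stanza_lines) is a placeholder for a trained
    validator model that labels a stanza with its rhyme scheme.
    """
    hits = sum(
        1 for requested, stanza in examples
        if classify_scheme(stanza) == requested
    )
    return hits / len(examples)
```

Analogous accuracies can be computed for meter and year of publication, giving an automatic, if approximate, measure of formal quality.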

Results and Observations

Our findings indicated that incorporating specific details about the verses significantly aided the model's performance. This detail provided better guidance on how to generate poetry that adhered closely to established rhyming and rhythmic rules.

We also compared how different tokenization methods affected generation. Models focusing on character-level tokenization performed better in producing rhymed poetry compared to those using standard subword tokenization.

Future Directions

We plan to enhance our model to generate entire poems rather than just stanzas, ensuring they are thematically and structurally connected. We will continue to refine our techniques in text generation and evaluation, further improving the quality of the generated poems.

Ethical Considerations

There is ongoing discussion about the ethics of using various data types for training language models. In our work, we used only poems that are in the public domain, as their authors passed away over 70 years ago. This approach minimizes ethical concerns.

While our base model was trained on varied data, we ensured that our applications adhere to ethical standards. Furthermore, we will continue to label generated works as automated to prevent any confusion or misinformation.

Computational Challenges

Our approach to generating poetry involves complex models, requiring powerful computing resources for effective training and performance. We acknowledge that multiple tokenization techniques can complicate the scalability of the generation process.

One of the main challenges is the risk of losing important context across verses when using certain tokenization methods. The models may also revert to basic patterns unless prompted with specific instructions, which can limit creativity.

Conclusion

In summary, we have developed a new approach to generating Czech poetry that focuses on its formal qualities. Our results show that adding clear annotations for rhythm and rhyme enhances the model's performance. Additionally, we found that character-level tokenization is advantageous for rhyming tasks.

Moving forward, we intend to extend our research to include full poems, considering thematic coherence and structural integrity. Through this work, we aim to contribute to the field of automated poetry generation and enhance the experience of producing Czech poetry.

Original Source

Title: GPT Czech Poet: Generation of Czech Poetic Strophes with Language Models

Abstract: High-quality automated poetry generation systems are currently only available for a small subset of languages. We introduce a new model for generating poetry in Czech language, based on fine-tuning a pre-trained Large Language Model. We demonstrate that guiding the generation process by explicitly specifying strophe parameters within the poem text strongly improves the effectiveness of the model. We also find that appropriate tokenization is crucial, showing that tokenization methods based on syllables or individual characters instead of subwords prove superior in generating poetic strophes. We further enhance the results by introducing *Forced generation*, adding explicit specifications of meter and verse parameters at inference time based on the already generated text. We evaluate a range of setups, showing that our proposed approach achieves high accuracies in rhyming and metric aspects of formal quality of the generated poems.

Authors: Michal Chudoba, Rudolf Rosa

Last Update: 2024-06-18 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2407.12790

Source PDF: https://arxiv.org/pdf/2407.12790

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
