Audiovisual Poetic Translation

week 1:

With recent innovations in authenticated ownership of digital art, there has been a movement to think about new ways to translate art into the digital realm. For my project I want to create a new, digitally friendly way to experience poetry. The end goal is to develop a website where a block of poetry is uploaded and converted into a purely audiovisual experience. This audiovisual experience will be a continuously morphing animation with overlaid sound, essentially what the machine has dreamed up as an interpretation of the writing it received. There will also be a random-seed slider to introduce entropy into the process, so that entering the same block of text multiple times produces a new creation for each submission.

week 2:

Last week I prioritized articulating a vision of the end product. This week I focused on breaking the larger composite project down into manageable component parts. The breakdown is as follows. First, I will need a website that accepts a block of text input from the user. As the animation and sound generation will be compute-intensive, and this project is focused on translating poetry, input will be character-limited. After the user has submitted text to my website, I will need to parse the submitted text to isolate specific images or concepts to translate into a visual. The next part will be to take each isolated image or concept and project it from text to a visual animation. After doing some research and listening to other class members talk about their projects, I have decided to look into GANs and utilize some large, existing training datasets provided by OpenAI to go from text to animation. Once the animation is complete, I will need to push it through a sonification process -- taking the pixels and translating them to sound.

I thought through multiple hypotheses, including scraping YouTube to produce a frame-to-note training dataset and neural network, but decided to go with a simpler approach and see what I could build. I scoured the web and found some existing solutions to base my small construction on, which essentially reads the visual attributes of a drawn image and maps them to sonic attributes. A linear scan is performed on a black canvas where white pixels represent notes; the higher a pixel sits on the canvas, the higher its note, on a chromatic scale rising from C3 at the bottom row.
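The column-scan mapping can be sketched in a few lines. This is a minimal reconstruction of the idea, not the demo's actual code: the canvas as a 0/1 grid and notes as MIDI numbers are my assumptions.

```python
# Sketch of the linear-scan sonification described above.
# Assumptions: the canvas is a 2D list of 0/1 values (1 = white pixel),
# row 0 is the top, and notes are MIDI numbers on a chromatic scale
# with the bottom row mapped to C3 (MIDI 48).

C3_MIDI = 48  # MIDI number for C3

def column_notes(canvas, col):
    """MIDI notes for the white pixels in one column of the canvas."""
    height = len(canvas)
    return [C3_MIDI + (height - 1 - row)   # higher pixel -> higher note
            for row in range(height)
            if canvas[row][col] == 1]

def linear_scan(canvas):
    """Scan left to right, yielding the chord sounded at each step."""
    width = len(canvas[0])
    return [column_notes(canvas, col) for col in range(width)]
```

On a 2-row canvas, a white pixel in the bottom-left corner sounds C3 (MIDI 48), and an empty column is silent.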




(hello world demo from class, spun up on local server)


Other hurdles for this project will be seamlessly stitching together multiple animations of different images and resolving the sonic disparity that may arise when transitioning from one animation to another.


week 3:

This week I got the GAN up and running using a combination of models as described in the picture below. 


Here we have two models working in concert to produce an animation from a given text prompt. The first model, VQGAN, is generative, producing an array of pixels; the second model, CLIP, evaluates how close the image VQGAN generated is to the given text prompt. VQGAN's next generation is guided by how close CLIP judged the previous output image to be to the text prompt. Through this iterative process, multiple images are generated and layered as frames of an animation, showing an image shifting from the input image toward an interpretation of the input text string.
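The feedback loop can be sketched as follows. Both "models" here are toy numeric stand-ins so the control flow is self-contained and runnable; every name is a hypothetical placeholder for the real VQGAN/CLIP calls, not the actual pipeline.

```python
# Schematic of the VQGAN+CLIP feedback loop, with toy stand-ins.

def vqgan_generate(latent):
    # Stand-in "generator": the image is just the latent value itself.
    return latent

def clip_distance(image, prompt):
    # Stand-in "discriminator": distance between the image and a numeric
    # encoding of the prompt (real CLIP compares learned embeddings).
    return abs(image - len(prompt))

def generate_frames(prompt, latent=0.0, steps=200, lr=0.1):
    """Repeatedly generate, score, and nudge the latent; every
    intermediate image becomes one frame of the animation."""
    frames = []
    for _ in range(steps):
        image = vqgan_generate(latent)
        frames.append(image)
        loss = clip_distance(image, prompt)
        if loss == 0:
            continue
        # Nudge the latent in whichever direction lowers the loss
        # (a real pipeline backpropagates through CLIP instead).
        if clip_distance(vqgan_generate(latent + lr), prompt) < loss:
            latent += lr
        else:
            latent -= lr
    return frames
```

The frames trace the image drifting from its starting point toward whatever CLIP scores as closest to the prompt, which is exactly the morphing effect used in the animations.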


Input { text: the cathedral of body overthrown, petals perfume with notes of inner fire }

Input { text: I wish I was as courageous as the first time I found courage, swinging open the apiary, inputImage: *constellation.png* }


week 4:


This week I focused on building out and hosting the front end for my website, as well as strategic parsing of the text into atomic concepts and/or images. Running random lines of text from some poems I’d previously written through the visual generator was giving indeterminate output, and I realized after a while that I would have to do some linguistic preprocessing. The strategy of repeatedly grabbing the next n words of a poem does not produce decent animation, since there is frequently not much imagistic content in an arbitrary sequence of words; the most salient example is a string of articles coming one after another. After having this realization, I wrote some code in a Python notebook to remove nondescript elements from an input poem. The code does linguistic analysis on the input text to find the part of speech for each word and keeps only nouns, adjectives, verbs, and adverbs.
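The filtering step can be sketched like this. Real tagging would come from a library such as NLTK's `pos_tag` (the notebook's actual tagger is not specified), so the demo input below is hand-tagged with Penn Treebank tags to keep the sketch self-contained.

```python
# Part-of-speech filter: keep only nouns, adjectives, verbs, adverbs.
# Penn Treebank tag prefixes for those four classes:
KEEP_PREFIXES = ("NN", "JJ", "VB", "RB")

def keep_descriptive(tagged_words):
    """Drop articles, conjunctions, and other linguistic fluff,
    keeping words whose tag marks them as descriptive."""
    return [word for word, tag in tagged_words
            if tag.startswith(KEEP_PREFIXES)]

# Hand-tagged line standing in for pos_tag output.
line = [("the", "DT"), ("flowering", "JJ"), ("garden", "NN"),
        ("was", "VBD"), ("flooded", "VBN")]
print(keep_descriptive(line))
```

The determiner "the" is dropped, while the adjective, noun, and verb forms survive as the imagistic core of the line.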

week 5:

In the weeks prior, I’ve made small demos as proofs of concept for each of the component parts of the larger project. This week, I spent time thinking about how to assemble these pieces into a cohesive final product. As a starting point, I revisited the “why” of the project to guide what exactly I am aiming to produce. The central thesis of this project is that “everyone is an artist, but not everyone has the technical skills to be able to express their creativity.” Effectively, I wanted to take a means of expression accessible to many, language, and magnify its nuance into a complex, stimulating work of multimedia art. This magnification makes use of principles in chaos theory: a simple input run through a complex, deterministic system yields unpredictable, complex results. The ultimate goal of this project is to create that complex system, so it was time to blueprint what its schematics would look like (even if the effort is motivated by overambition and the conflation of the 5 weeks remaining in the quarter with infinite time). I put together a blueprint for what the product will look like, from the front end of the website, through precomputation, generative, and synthesizing processes, ultimately to be reposted to the main website’s gallery. In the weeks to come, I will be building out the system, creating the ligament to bind together the component parts of the project.



week 6:

As a continuation of last week’s objectives, I’ve minimally built out the website frontend and backend. If there is time after the generative pipeline is complete, I want to return to the front end to do more aesthetic work and polishing of the look. This week was coding-heavy: in addition to building out the website, I spent a lot of time integrating the Python processes into the pipeline. The first process I integrated takes input text and parses it into its descriptive parts. The second translates descriptive text into the visualizations (animations). One thing I noticed is that on my weak, low-compute laptop, producing the end visualization, which is composed of multiple animations (the text is split at nouns into multiple phrases, and each phrase is turned into an animation), is infeasible from a time perspective. It simply takes too long. I will need a GPU or otherwise better compute to do this efficiently, and because of prohibitive pricing, this may not be a tool open to the public (yet).
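One plausible reading of the noun-based splitting (the exact rule isn't spelled out in the post, so this is an assumption) is to close a phrase after each noun, so every phrase ends on a concrete, visualizable thing:

```python
# Hypothetical sketch: split a tagged poem into phrases, one per noun.
# Input mirrors pos_tag-style (word, Penn Treebank tag) pairs.

def split_at_nouns(tagged_words):
    """Group words into phrases, closing a phrase after each noun."""
    phrases, current = [], []
    for word, tag in tagged_words:
        current.append(word)
        if tag.startswith("NN"):          # NN, NNS, NNP, ...
            phrases.append(" ".join(current))
            current = []
    if current:                            # trailing non-noun words
        phrases.append(" ".join(current))
    return phrases
```

Each resulting phrase then becomes the prompt for one animation segment, which is why a poem yields several stitched-together animations.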

week 7:

After building out the text-to-animation component, I decided to move on to incorporating the sonification of images. There is still some thought to be put into refining the text-to-animation part of the system, as it is extremely slow on my local computer and potentially expensive to run on a cloud service with more compute.

For sonification, I spent the first half of the week building out the pixel-attribute-to-sound-attribute conversion strategy, which led to massive chaos. I tried to correct this by restricting notes to a single scale rather than the full chromatic set, but there was still too much noise; it sounded terrible and distracting. I then decided that the background music needed to make sense with what was on screen while remaining more of an ambiance, and realized there might be a way to go from the text caption to sound. So I proceeded with this idea: find an online database of free sounds. My first thought was YouTube, but there were complications there; I ended up finding Freesound and experimenting with pinging its API in Postman. It worked well, so I decided to move forward with this strategy. Now I am working on taking the sounds for each phrase, trimming them, assembling them, and overlaying them on the animation.
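A minimal sketch of the Freesound text search, assuming an API token; the endpoint follows Freesound's public apiv2 text-search API, but the specific filter and field choices below are my guesses at reasonable parameters, not the project's actual request.

```python
import json
import urllib.parse
import urllib.request

FREESOUND_SEARCH = "https://freesound.org/apiv2/search/text/"

def build_search_url(phrase, token, limit=1):
    """URL for Freesound's text search, filtered (as an assumption)
    to clips long enough to trim down to a 15-second segment."""
    params = {
        "query": phrase,
        "filter": "duration:[15 TO 120]",
        "fields": "name,previews",
        "page_size": str(limit),
        "token": token,
    }
    return FREESOUND_SEARCH + "?" + urllib.parse.urlencode(params)

def search_ambient(phrase, token, limit=1):
    """Fetch search results and return preview-clip URLs."""
    with urllib.request.urlopen(build_search_url(phrase, token, limit)) as r:
        results = json.load(r)["results"]
    return [hit["previews"]["preview-hq-mp3"] for hit in results]
```

Calling `search_ambient("gentle rain", token)` would return a downloadable preview URL for the best match, one per descriptive phrase.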

week 8:

After much, MUCH bug squashing and coding this week, I finished the generative pipeline from poem input to audiovisual output. Figuring out audio processing was an extreme hassle, as I had to try numerous Python libraries before finding one that enabled the editing I needed on the audio files returned by Freesound’s API. This background audio processing entailed retrieving an audio clip for each phrase, trimming each down to 15 seconds, stitching them together into one unified clip, and then synchronizing that clip with the animation. Once that was all working, I provisioned some GPU instances, hosted my website, and tested the pipeline on some basic text inputs. Next week will be focused on cleaning up the presentation of the website and writing poetry!
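The post doesn't name the audio library that finally worked, so rather than guess at its API, here is the same trim / fade / stitch arithmetic on plain lists of mono samples, as a stand-in for real audio buffers:

```python
# Toy model of the audio assembly step: clips are lists of float
# samples at a deliberately tiny sample rate so the math is visible.
SAMPLE_RATE = 8  # toy rate: 8 samples per "second"

def trim(samples, seconds=15):
    """Keep only the first `seconds` of a clip."""
    return samples[: seconds * SAMPLE_RATE]

def fade_out(samples, seconds=1):
    """Linearly ramp the last `seconds` of a clip down to silence."""
    n = min(seconds * SAMPLE_RATE, len(samples))
    out = samples[:]
    for i in range(n):
        out[-n + i] = out[-n + i] * (n - 1 - i) / n
    return out

def stitch(clips, seconds=15):
    """Trim each clip, fade its tail, and concatenate them all into
    one unified track to lay under the animation."""
    merged = []
    for clip in clips:
        merged.extend(fade_out(trim(clip, seconds)))
    return merged
```

With real audio the same three operations (slice, gain ramp, concatenate) are what any of the common Python audio libraries provide; the fade keeps each 15-second segment from cutting off abruptly at the phrase boundary.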

weeks 9 & 10:

Taking the pipeline that worked (slowly) on my local machine and hosting it proved more difficult than I had anticipated. Instead of uploading the whole project as a monorepo, I changed the architecture to microservices. My website and its server are hosted on Google Cloud App Engine. The website fields all incoming requests, and when a user wants to generate their own animation, the Node server forwards the request to a separate Python server hosted on Google Cloud Compute Engine and hardware-accelerated with an Nvidia GPU. This part is pretty expensive, so, being the broke college student I am, once all the free-trial credits are expended the server will probably be shut down. Also on the cost point: the server does not run 24/7, so if a user wants to generate an animation, I have to manually ssh into the server and start the process before it can receive requests. This is suboptimal, and if I continue working on this project, the first thing I will do is investigate cutting costs enough to allow a long-running process. Another disparity between the local and hosted versions is that some sample videos larger than 30 MB are not viewable, so I will upload them all with their reference poems here.

The Python process does the bulk of the work. I don’t want to bore you with all the intricacies, so I’ll explain at a high level how it works and link the 2 open source code repos if you’re interested in delving deeper into the weeds. First, the process takes the input poem and does some natural language processing to extract the descriptive phrases that make up the poem. It removes linguistic fluff like articles and conjunctions in favor of nouns and adjectives, more descriptive units that can be visualized. Next, each descriptive phrase is run through an iterative GAN structure to create one segment of the resultant animation. The iterative GAN structure was described in a previous week’s update, but in short, two models work together to produce the animation: 1) VQGAN generates an image for a phrase, and 2) CLIP, a discriminative model, evaluates how close the generated image is to the interpretation of the phrase and provides a number, a loss value, quantifying that closeness. This back-and-forth goes on for 200 iterations. In the meantime, the process pings the Freesound API to get ambient music for each segment, trims each clip to 15 seconds, applies a fade, and merges them into one large audio file. Lastly, the audio is overlaid on the animation, the result is uploaded to Google Drive, and the link is shared with the provided email. Code is open source:

the flowering garden was flooded,
the plants swept far and away,
their petals lost and decayed,
all that remained were the vacant
greenhouses, a city of glass

I’m listening to you now,
but also the rain
as it particalizes against glass
as it meters and mumbles old philosophy
of how things fallen are often reskied.

I’m listening to you as I slice an apple,
split it cleanly into two perfect halves
that happily yield their seeds

I’m listening to you, wanting
of this symmetry,
as in the divides of pain, symmetry
is never guaranteed

I’m listening to you over
the wanting of the metal earth
and its spinning magnetic core
that breaks up our line

I’m listening to you, down on all luck,
when you tell me that our gazes surely
meet on the face of the same moon

have you looked at the blood moon
long enough to find the sun
after the eclipse?

have you been in the house
while the crook sneaks back out,
stealing nothing but a few
moments from himself?

the basement of memory needs
to be pried open with a crowbar
and fumigated. but if you do this,
not inconsequential labor,
shed a little weight for
the algorithms of maintenance:

it is clean and complex,
how all things, even you,
wind back to themselves.

the bay holds her ships on a blue
tongue like pearls,
port to port they go on plying
beneath the haunt,
the bridge lights

are joined in their reflection
by the molar white mourning
of moon coaxing the sea
out from itself, the fog
into my lungs. my face ripples
where you cast your stones.
we prophesize the end coming as snowfall,
opaline against pastel light

all my runs are smooth now, athletic as
a jazz tune. You hold me the way
white light collects color.
Is color. Gives color

I am a child of hope
a child of the Mojave desert,
the prickly pear, waxy cactus that holds
onto two pearls of rain
to make its living,
the xerophyte fighting,
for years yearning, enduring
the photons streaming
down from the Arizona sun

windfalls of dust pattern my skin in
logo syllables forming your name and
your name and your name

storm though you were,
when will you return?
the wind answers, bringing
in fox bones to kiss
for the loneliness

my spine aches into needles
keeps out the wrens of refuge,
I can’t do anything but survive the heat,
rooted here, my whole biology built
for beating back everything except you

I fork into beauty
an array of orange flowers to charm
the severe and remaining gods
of this wasteland, to say to them:
we’ve been sparing, limited --
spent the warm springs in wait.
why shouldn’t we greet each other
now with our gifts?

I bloom and bust,
launching my demands to
the sparing sky and
wait for you,
rain that may never come

if not now, when?
sure, there is always later
but later always comes as the carcinoma
pent up in the genome, an amassed fury
threatening to overthrow
the cathedral of body.
how do you plan to answer the problem

of body if not by throwing ours
against each other? we roughhouse
playing pain, and then repent by crushing
lavender between our palms to be pretty

again, to hear the psalm of the flower
release its notes of perfume and
pardon us for our minor cruelties.

meanwhile the mist drops low
and hangs unblinking like a bright eye
over Pier 49. an angler snares a fin
larger than himself, reels it in, and
desiccates under the weight
of his lovely Leviathan
while it hammers to return to the ocean

the way that the Mother praying
mantis, now with her child in view,
is the widow and the widow

maker, the Taker of Heads.
her procreation ritual --
her prayer, is a green fluttering,
green wrestling of green wings,

even the original sweetness comes
with a price, but still I open
the door for you –
the door to splendor
must remain ajar at all costs