Open questions
There are numerous avenues for further research. One strand concerns audio tokenization: what are the desirable properties of audio tokens, how can we measure them, and how can we optimize for them? Another concerns evaluation. In comparison to text, the set of established benchmarks for generative text/audio tasks is less well developed. This work has focused on speech recognition and speech translation, for which the benchmarks are more mature. Establishing further benchmarks and metrics for generative audio tasks would help to accelerate research.
Acknowledgements
We would like to thank Nobuyuki Morioka and Yifan Ding for their help in re-creating the TTS-augmented WMT/TED dataset that was also used in Jia et al. [2022a], and Adam Roberts and Ron Weiss for their advice and reviews. We would also like to thank Slav Petrov, Colin Cherry, and the PaLM-2 team for their advice and support.
References
A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank. MusicLM: Generating music
from text. arXiv preprint arXiv:2301.11325, 2023.
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican,
M. Reynolds, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural
Information Processing Systems, 35:23716–23736, 2022.
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. T. Passos, S. Shakeri, E. Taropa, P. Bailey,
Z. Chen, E. Chu, J. Clark, L. E. Shafey, Y. Huang, K. S. Meier-Hellstern, G. Mishra, E. Moreira,
M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Ábrego, J. Ahn,
J. Austin, P. Barham, J. A. Botha, J. Bradbury, S. Brahma, K. M. Brooks, M. Catasta, Y. Cheng,
C. Cherry, C. A. Choquette-Choo, A. Chowdhery, C. Crépy, S. Dave, M. Dehghani, S. Dev,
J. Devlin, M. C. Díaz, N. Du, E. Dyer, V. Feinberg, F. Feng, V. Fienber, M. Freitag, X. García,
S. Gehrmann, L. González, G. Gur-Ari, S. Hand, H. Hashemi, L. Hou, J. Howland, A. R. Hu,
J. Hui, J. Hurwitz, M. Isard, A. Ittycheriah, M. Jagielski, W. H. Jia, K. Kenealy, M. Krikun,
S. Kudugunta, C. Lan, K. Lee, B. Lee, E. Li, M.-L. Li, W. Li, Y. Li, J. Li, H. Lim, H. Lin, Z.-Z.
Liu, F. Liu, M. Maggioni, A. Mahendru, J. Maynez, V. Misra, M. Moussalem, Z. Nado, J. Nham,
E. Ni, A. Nystrom, A. Parrish, M. Pellat, M. Polacek, A. Polozov, R. Pope, S. Qiao, E. Reif,
B. Richter, P. Riley, A. Ros, A. Roy, B. Saeta, R. Samuel, R. M. Shelby, A. Slone, D. Smilkov, D. R.
So, D. Sohn, S. Tokumine, D. Valter, V. Vasudevan, K. Vodrahalli, X. Wang, P. Wang, Z. Wang,
T. Wang, J. Wieting, Y. Wu, K. Xu, Y. Xu, L. W. Xue, P. Yin, J. Yu, Q. Zhang, S. Zheng, C. Zheng,
W. Zhou, D. Zhou, S. Petrov, and Y. Wu. PaLM 2 technical report. arXiv preprint arXiv:2305.10403,
2023.
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders,
F. Tyers, and G. Weber. Common Voice: A massively-multilingual speech corpus. In Proceedings
of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille,
France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL
https://aclanthology.org/2020.lrec-1.520.
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised
learning of speech representations. Advances in Neural Information Processing Systems, 33:
12449–12460, 2020.
A. Bapna, C. Cherry, Y. Zhang, Y. Jia, M. Johnson, Y. Cheng, S. Khanuja, J. Riesa, and A. Conneau.
mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint
arXiv:2202.01374, 2022.
L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck,
P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri. Findings of the
2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference
on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61. Association for
Computational Linguistics, 2019. URL https://aclanthology.org/W19-5301.