
Domestic AI models tackle unique demands of Cantonese

Cultural elements, colloquialisms require local solutions, experts say

By OASIS HU in Hong Kong | China Daily | Updated: 2024-10-21 07:42
LU PING/CHINA DAILY

"Is there a wife in a wife cake?" Heyson He Lixi, an influencer promoting Cantonese, asked.

"There is no actual wife in a wife cake, as it is just a traditional Hong Kong pastry," answered Sensechat, the world's first Cantonese large language model, a ChatGPT-like product, in fluent Cantonese.

He raised a more challenging question, "What is sorrowful rice?"

"Sorrowful rice, a well-known delicacy in Hong Kong, derives its name from the martial arts masterpiece The Return of the Condor Heroes by Hong Kong writer Jin Yong. In the novel, the character Yang Guo experiences deep sorrow after losing his beloved, Xiaolongnyu," Sensechat responded.

He was impressed — it was the first artificial intelligence tool he had encountered that exhibited a profound understanding of Cantonese culture.

Born in 1998, He is fond of Cantonese and has been creating videos promoting the language since college. Over the years, he has amassed a fan base of more than 700,000 across various online platforms, including YouTube and popular Chinese mainland social media platforms like Bilibili, Xiaohongshu, and Douyin. Prompted by the recent AI craze, He started producing videos evaluating AI tools' ability to understand Cantonese.

The outcomes were underwhelming. Many AI large language models (LLMs) and AI software developed on the mainland fail to recognize spoken Cantonese. Some Western-developed AI software can understand spoken Cantonese, but cannot speak it accurately. ChatGPT, for instance, often blends Cantonese with Mandarin. Suno, an AI tool that specializes in generating songs, can pronounce Cantonese to a degree, but its primary focus remains music creation.

In July, the Sensetime Group, an AI developer based in Hong Kong, introduced Sensechat, a Cantonese version of its proprietary LLM, and announced that it would be available for free to Hong Kong users indefinitely.

Upon a friend's recommendation, He downloaded Sensechat.

"I felt 85 percent satisfied with Sensechat," he said. "The application still requires to be further refined, but it is one of the few that can truly understand Cantonese."

The application emphasizes one of the unique traits of Cantonese — its colloquial nature.

Spoken Cantonese makes extensive use of modal particles, which are often placed at the end of sentences to indicate mood. These particles usually go unnoticed by most AI tools, but Sensechat captures them effectively.

In terms of written text, Sensechat can understand and reflect the nuances between the two forms of written Cantonese. The language has a standardized form used in formal situations, similar to Mandarin, and a phonetic style for everyday use. This characteristic, He said, is often overlooked by other large language models.

He recorded his interactions with Sensechat and shared the video online, garnering over 150,000 views. "Cantonese speakers truly need such a tool," He said.

Data size matters

Training an LLM typically involves three stages, said Cao Jiannong, the chair professor in the Department of Computing at Hong Kong Polytechnic University.

The first stage requires pre-training using extensive data, followed by fine-tuning with high-quality data. In the third stage, humans are needed to align the output of the LLM with local culture, ethics, morals, laws, and other rules to reduce the risk of generating inaccurate, biased, or unlawful content.

Developing a Cantonese LLM faces difficulties in all three stages, Cao said.

While Hong Kong's internet infrastructure is relatively well-developed, there is a scarcity of Cantonese content available online. A major factor contributing to this scarcity is that while Cantonese is widely spoken in daily life, it is usually written in standard Chinese rather than in a distinctly Cantonese form.

Moreover, English has long served as an official language in Hong Kong. Consequently, a significant portion of the city's online information, including official archived documents in areas such as law, finance, politics, and medicine, is available mainly in English, Cao said.

LLMs rely heavily on abundant data for their training, said Francis Fong Po-kiu, honorary president of the Hong Kong Information Technology Federation, a local IT-related business association. Without data, there is simply no way to develop a language model, he said.

Literature scarcity

Cantonese web resources suffer not only from a shortage in quantity, but also a lack of quality, said Cao.

When it comes to written material, Hong Kong has not prioritized literature, resulting in a scarcity of quality Cantonese literary works, said Keith Li King-wah, chairman of Hong Kong Wireless Technology Industry Association.

Most available Cantonese texts come from online forums and social media, and often contain low-quality and even offensive language, potentially leading AI models to produce crude content, Li said.

Collecting speech data presents another problem.

Although Cantonese videos such as movies and TV dramas are available online, they cannot be used because of background noise, said Albert Lam Yun-sang, the chief technology officer and chief scientist at Fano Labs, a Hong Kong-based startup focusing on speech and language technologies.

Besides insufficient data, Cantonese's intricate linguistic characteristics are another obstacle in training an AI model.

The Economist magazine analyzed language learning time, and found that mastering Cantonese requires 88 weeks of study, placing it alongside Mandarin, Arabic, Japanese, and Korean in the top five most difficult languages to learn.

Lu Lewei, director of the Sensetime Research Institute, said that Cantonese is highly colloquial with numerous inflections. It has nine tones and even a slight variation in pronunciation can alter a word's meaning.

The language also features a blend of Chinese and English and a mix of old and modern terms.

In language modeling, the simplicity of a language offers advantages. The more complex the language is, the harder it is for an AI model to learn, Lam said.

Furthermore, underlying Cantonese is the local culture, which can be challenging for those tasked with aligning the output of large language models, Cao said.

Urgent need

Despite the difficulties involved in creating Cantonese AI models, demand for them is undeniable, said Fong from the Hong Kong Information Technology Federation.

The global Cantonese-speaking population is nearly 120 million, and 85.2 million of those are native Cantonese speakers.

In Hong Kong, 6.3 million residents, or 88.2 percent of the city's population, use Cantonese as their spoken language. In other cities within the Guangdong-Hong Kong-Macao Greater Bay Area, Cantonese is the predominant dialect, with 67 million residents in Guangdong province conversing in it.

In the future, AI will be akin to today's computers and fundamentally a tool for the general public. Without Cantonese AI tools, Cantonese-only speakers may encounter significant inconvenience and marginalization in both the offline and online world, Cao said.

For a city, lack of AI expertise could result in decreased productivity in sectors such as education, healthcare, finance, and law. These limitations could impede the whole city's development, Cao added.

Fong said AI models from other countries or regions may struggle to grasp Cantonese culture accurately. This could lead to cultural or political misinterpretations, resulting in the spreading of incorrect messages.

Dependence on outside AI models could also leave privacy and security vulnerable, Fong said.

Government officials, for instance, might face national security risks and local companies might leak data if they inadvertently disclose sensitive information to the models developed in foreign jurisdictions, he added.

Fong urged the Hong Kong Special Administrative Region government and local organizations to develop Cantonese LLMs.

In July, Sun Dong, Hong Kong's Secretary for Innovation, Technology, and Industry, announced that the SAR government is cooperating with local universities to develop a Hong Kong-based large language model.

A document co-pilot application for civil servants is now being used on a trial basis.

The model has already been implemented in Sun's department and the system will eventually become available to all Hong Kong residents, the secretary said.

The bureau said plans are underway to expand the pilot application to three other government bureaus, but it gave no indication when Hong Kong residents would gain access to it.

Fong said if it could be launched successfully, the government LLM would have many benefits.

It would be a positive step toward resolving the issue of some Western AI models restricting their use in Hong Kong. Also, implementing a localized AI model could safeguard privacy and provide more convenience to residents, Fong said.

Cao said it's unclear what specific features the government's AI model could offer and how it would distinguish itself from other similar products.

"I don't think the government has done enough research on what they want to do," Cao said.

Local startups

Local technology companies, meanwhile, are actively meeting the needs of the Cantonese-speaking market.

One startup, Votee AI, developed an open-source Cantonese LLM this year.

After years of operating in the local market, Votee AI has gathered substantial amounts of open-source Cantonese data along with primary data.

Taking a community-centered approach, they have also collaborated with local Cantonese linguists and AI researchers, including the team behind the online Cantonese dictionary "words.hk", to capture the nuances of Hong Kong speech.

Sensetime has also accumulated a vast reservoir of internal open-source data.

To build its dataset, the company has synthesized data using advanced technologies and bought supplementary data from external sources.

To combat the shortage of high-quality Cantonese data, Sensetime also collected Cantonese audio data from hundreds of its local employees.

Sensechat's clients include customer service providers, financial institutions, legal firms, healthcare companies, and others.

For Hong Kong residents, the company promises to provide the service for free indefinitely on both the web version and the mobile application.

A local tech industry insider, who chose to stay anonymous, said Sensetime should open-source the technology behind Sensechat to allow more residents and organizations to access it freely, to benefit the city.

After trying the Sensechat platform, he said its understanding of some Hong Kong slang could be more precise. Nonetheless, "it should be recognized that Sensechat filled a void in the local market," he said.

Cultural roots

In addition to developing local AI models, existing mainstream language models should be encouraged to improve their Cantonese functions, said Li from the Hong Kong Wireless Technology Industry Association.

However, mainstream AI language models are primarily developed by commercial entities in the West. Without market demand, they may not be willing to enhance their products' Cantonese capabilities.

Li believes the Hong Kong SAR government and local organizations should take the lead in collecting Cantonese data, digitizing cultural content, and sharing these resources openly to enrich the Cantonese body of information.

Cantonese speakers can also actively use the language to engage with mainstream AI language models.

These actions can demonstrate to AI model developers that there is a market demand for Cantonese, while interaction with these models can also enhance their understanding of Cantonese culture.

The key to encouraging more people to use Cantonese lies in making Cantonese culture appealing, Li said.

Language is not just a communication tool; it encapsulates the cultural essence and identity of its speakers, he said.

The marginalized status of Cantonese in the digital sphere reflects the decline in the region's cultural significance.

In the 1970s and 1980s, Hong Kong, although just a city, was so culturally influential that Cantonese was a popular language around the world, Li said.

"At that time, the whole world watched Hong Kong movies and TVB (television shows), knew Jackie Chan and Bruce Lee, and sang Cantonese songs. However, in the present day, even many students in Hong Kong cannot speak Cantonese," he said.

"The focus of government policies should not only be on technology, but also on culture."

He, the influencer, said he learned Cantonese from his grandparents when he was a child, which later made him more proficient in the language than other school students. The confidence this gave him motivated him to become a Cantonese blogger.

However, as He grew older, Cantonese became so marginalized that even voice-operated devices and software in his home failed to understand Cantonese commands.

While He could communicate with these devices in Mandarin and English, his grandparents, who only speak Cantonese, struggled to keep pace.

He hopes that Cantonese LLMs will one day help his elderly grandparents manage their daily lives through voice-controlled apps capable of understanding Cantonese.
