For corpora other than HKCanCor, pycantonese provides the function read_chat() to read in Cantonese data in the CHAT format. Someone with more skills than me could try running this Python search for 裏 on other corpora and see what the result is.

It seems the frequency lists derived from this corpus might be the most reliable ones currently available. The frequency list has the following features: it uses all sections of the 人民日报 (People's Daily) newspaper, including the sports section.

I've parsed out the vocabulary from these Taiwanese tests and converted it to flashcards in Pleco's format, for seeing term levels, intended part of speech and sometimes definitions/examples. The TOCFL vocab was updated a couple of years ago and I haven't yet seen a processed version of the ...

I would read in the BCC corpus frequency list as a dictionary. Then, having concatenated all the news/magazine articles as plain text, I would build a dictionary of all the words in those articles up to 8 characters long, counting their number of occurrences with the help of the BCC frequency list (which tells us which combinations ...).

The BCC corpus seems to have pretty loose licensing terms. Pleco already seems to be using frequency data to sort the search results. Adding that data meaningfully to dictionary definitions would be even better, I believe.
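
A rough sketch of the counting idea above (frequency list read in as a dictionary, then counting every known word of up to 8 characters in the concatenated article text) might look something like this. The "word, whitespace, count" layout of the list, the function names, and the tiny sample data are all assumptions for illustration, not the real BCC file format or real BCC counts.

```python
from collections import Counter

MAX_WORD_LEN = 8  # the post suggests counting words up to 8 characters long


def load_frequency_list(lines):
    """Parse 'word <whitespace> count' lines into a dict.

    The layout is an assumption; adjust it to the actual list you download.
    """
    freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip headers, blank lines and malformed rows
        freq[parts[0]] = int(parts[1])
    return freq


def count_known_words(text, known_words, max_len=MAX_WORD_LEN):
    """Count every substring of 1..max_len characters that is a known word."""
    counts = Counter()
    for start in range(len(text)):
        for length in range(1, max_len + 1):
            candidate = text[start:start + length]
            if len(candidate) < length:
                break  # ran past the end of the text
            if candidate in known_words:
                counts[candidate] += 1
    return counts


# Toy data, made up for this example (not real corpus counts).
sample_list = ["中国\t1000", "人民\t800", "中国人\t300", "日报\t200"]
bcc = load_frequency_list(sample_list)

# Stand-in for the concatenated news/magazine articles.
articles = "中国人民日报。中国人读日报。"
word_counts = count_known_words(articles, bcc)

for word, n in word_counts.most_common():
    print(word, n)
```

Note that this counts overlapping matches (for example both 中国 and 中国人 at the same position), so it is not a real word segmentation. It only uses the frequency list to decide which character combinations count as words at all.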