Friday, April 7, 2017

Lithuanian gives you good practice with regex

I'm parsing Lithuanian verbs, and in tidying up the raw HTML, I'm compelled to come up with little beauties like this:

<a href="\/[a-z]*\/[a-z]*\W*[a-z]*\W*[a-z]*">

...in order to clear away the HTML formatting.

The problem is that Lithuanian diacritics (č, ž, ė, ę and so on) fall outside the range [a-z] and are instead treated as non-word characters \W, so every spot where one might appear needs its own \W* between the runs of [a-z]*.
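One way around this, sketched here in Python 3 (an assumption, since the regex tool isn't named above), is to put the Lithuanian letters into the character class directly. The sample link and its URL layout are made up for illustration, and `strip_links` is a hypothetical helper name.

```python
import re

# Hypothetical sample link; the real pages' URL layout is an assumption.
html = '<a href="/veiksmazodis/šaukti">šaukti</a>'

# ASCII-only class: matching stops at the first diacritic, so the tag is missed.
ascii_tag = re.compile(r'<a href="/[a-z]*/[a-z]*">')

# Same pattern with the nine lowercase Lithuanian diacritic letters added.
LT = "a-ząčęėįšųūž"
lt_tag = re.compile(rf'<a href="/[{LT}]*/[{LT}]*">')


def strip_links(text):
    """Remove the opening <a ...> tags matched by lt_tag and any closing </a>."""
    return lt_tag.sub("", text).replace("</a>", "")


print(ascii_tag.search(html))  # None: [a-z] cannot get past "š"
print(strip_links(html))       # the bare verb
```

In Python 3, `\w` on a `str` pattern is Unicode-aware by default, so `\w+` would also match a word like "šaukti" whole; passing `re.ASCII` brings back the behaviour described above, where the diacritics count as `\W`.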


