nlp.py.nonbrk.en 754 B

## List of non-breaking tokens containing a period for English
## 20-Sep-2012
##
## Adapted from Moses tokenizer
##
# titles
Adj.
Adm.
Adv.
Asst.
Bart.
Bldg.
Brig.
Bros.
Capt.
Cmdr.
Col.
Comdr.
Con.
Corp.
Cpl.
DR.
#Dr. handled by ignoring case
Drs.
Ens.
Gen.
Gov.
Hon.
Hr.
Hosp.
Insp.
Lt.
MM.
MR.
MRS.
MS.
Maj.
Messrs.
Mlle.
Mme.
#Mr. handled by ignoring case
#Mrs. handled by ignoring case
#Ms. handled by ignoring case
Msgr.
Op.
Ord.
Pfc.
Ph.
Prof.
Pvt.
Rep.
Reps.
Res.
Rev.
Rt.
Sen.
Sens.
Sfc.
Sgt.
Sr.
St.
Supt.
Surg.
# added
U.S
U.S.
U.S.A
U.S.A.
# Misc
v.
vs.
i.e.
rev.
e.g.
# assuming that the last period in p.m. and a.m. is never punctuation
p.m.
a.m.
# Added from Symantec data
Ver.
Misc.
etc.
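
The list above does not say how it is consumed, so the following is only a minimal sketch of one plausible use: load the entries, skip `#` comment lines, and match case-insensitively (as the "handled by ignoring case" comments suggest) to avoid splitting a sentence after a listed token. The names `SAMPLE`, `load_nonbreaking`, and `split_sentences` are hypothetical and are not taken from the Moses tokenizer or from the code that ships with this file; the whitespace-based splitter is a deliberate simplification.

```python
# Hypothetical excerpt of the list; a real caller would read the whole file.
SAMPLE = """\
# titles
Capt.
DR.
#Dr. handled by ignoring case
U.S.
e.g.
p.m.
"""


def load_nonbreaking(text):
    """Return the lower-cased entries, skipping blank lines and # comments."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entries.add(line.lower())
    return entries


def split_sentences(text, nonbreaking):
    """Naive splitter (an assumption, not the original algorithm).

    A sentence ends after a whitespace-delimited token that ends in
    '.', '!' or '?', unless that token is in the non-breaking set.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in nonbreaking:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


nonbreaking = load_nonbreaking(SAMPLE)
result = split_sentences(
    "Dr. Smith met Capt. Jones at 5 p.m. in the U.S. capital. They talked.",
    nonbreaking,
)
for sentence in result:
    print(sentence)
```

Under this sketch, a sentence that genuinely ends in `p.m.` or `a.m.` is not split, which matches the assumption stated in the list's own comment for those two entries.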