Extracting dates in different formats with regular expressions and sorting them - Pandas

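I think this is one of the Coursera text mining assignments. You can solve it with regular expressions and str.extract. Assuming dates.txt is loaded like this: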

import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_sorter():
    # Dates written with a month name, e.g. "24 Jan 2001" or "March 20th, 2009".
    # expand=False makes str.extract return a Series rather than a one-column DataFrame.
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Fully numeric dates, e.g. "04/20/2009" or "4-20-09"
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)
    # Dates with no day, i.e. month and year, or year only
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three, and fix the misspelled month names
    # found in the text file before converting to datetime.
    # On pandas 2.x, mixed formats may additionally need pd.to_datetime(..., format='mixed').
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return pd.Series(dates.sort_values())

date_sorter()
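
Since dates.txt is not reproduced here, the fallback chain can be tried on a few made-up lines. This is only a sketch: the `samples` Series, the `extract_dates` helper and the three constant names are hypothetical, and each string is parsed on its own with `apply` so that the mixed formats do not have to share one inferred format.

```python
import pandas as pd

# Hypothetical sample lines standing in for dates.txt (not taken from the real file).
samples = pd.Series([
    "03/25/93 Total time of visit (in minutes):",
    "Seen on 24 Jan 2001 for follow-up",
    "Since 6/1998 Primary Care Doctor",
    "March 7, 1995 admitted with pain",
])

# The three patterns from the answer, each with a single capturing group.
WORDS = r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
NUMERIC = r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))'
PARTIAL = r'((?:\d{1,2}(?:-|\/))?\d{4})'

def extract_dates(lines):
    """Return the raw date string per line, trying the patterns in order."""
    one = lines.str.extract(WORDS, expand=False)
    two = lines.str.extract(NUMERIC, expand=False)
    three = lines.str.extract(PARTIAL, expand=False)
    # A NaN in `one` is filled from `two`, and what is still NaN from `three`.
    return one.fillna(two).fillna(three)

raw = extract_dates(samples)
# Parse each string individually, then sort chronologically.
ordered = raw.apply(pd.to_datetime).sort_values()

print(raw.tolist())
print(ordered.index.tolist())
```

The order of the `fillna` calls matters: the least specific pattern (month and year, or a bare four-digit year) is tried last, so it cannot shadow a more complete date on the same line.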

Output:

9 1971-04-10
84 1971-05-18
2 1971-07-08
53 1971-07-11
28 1971-09-12
474 1972-01-01
153 1972-01-13
13 1972-01-26
129 1972-05-06
98 1972-05-13
111 1972-06-10
225 1972-06-15
1972-07-31
171 1972-10-04
191 1972-11-30
486 1973-01-01
335 1973-02-01
415 1973-02-01
36 1973-02-14
405 1973-03-01
323 1973-03-01
422 1973-04-01
375 1973-06-01
380 1973-07-01
345 1973-10-01
57 1973-12-01
481 1974-01-01
436 1974-02-01
104 1974-02-24
299 1974-03-01

If you only want to return the index, use return pd.Series(dates.sort_values().index) instead.
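
As a small illustration of what the index variant gives you, here is a sketch on three hand-written dates (the values are made up): the result lists the original row positions in chronological order.

```python
import pandas as pd

# Hypothetical, already-parsed dates; positions 0, 1, 2 are the original rows.
dates = pd.Series(pd.to_datetime(["2001-01-24", "1993-03-25", "1998-06-01"]))

# sort_values() keeps the original index labels, so the index of the
# sorted Series is the chronological ordering of the rows.
order = pd.Series(dates.sort_values().index)

print(order.tolist())  # [1, 2, 0]
```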

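Breaking down the first regular expression

 # (?: ... ) is a non-capturing group

 ((?:\d{,2}\s)?  # Optional day group: up to two digits (possibly none) followed by one whitespace character. The trailing ? applies to the preceding group as a whole, so the group occurs once or not at all.

 (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*  # A month abbreviation followed by any number (*) of lowercase letters [a-z], so both "Jan" and "January" match.

 (?:-|\.|\s|,)  # Exactly one separator: hyphen, period, whitespace, or comma.

 \s?  # An optional whitespace character (the ? here applies only to \s, the preceding token).

 \d{,2}[a-z]*  # Up to two digits followed by any number (*) of letters, e.g. 1st, 13th, 22nd. Both parts may be empty.

 (?:-|,|\s)?  # A hyphen, comma, or whitespace character that occurs at most once; the trailing ? means it may be absent.

 \s?  # A whitespace character that may or may not be present (at most one; the ? applies only to \s).

 \d{2,4})  # The year: two to four digits.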
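
The same breakdown can be written directly into the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows inline comments. The sketch below uses only the standard `re` module; the names `WORD_DATE` and `find_word_date` and the test strings are made up for illustration.

```python
import re

# The first pattern from the answer, laid out with re.VERBOSE.
WORD_DATE = re.compile(r"""
    (                    # capture the whole date
      (?:\d{,2}\s)?      # optional day: up to two digits plus one whitespace
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
      [a-z]*             # rest of the month name, if spelled out
      (?:-|\.|\s|,)      # one separator: hyphen, period, whitespace or comma
      \s?                # optional whitespace
      \d{,2}[a-z]*       # up to two digits with an optional suffix (st, th, nd)
      (?:-|,|\s)?        # optional separator
      \s?                # optional whitespace
      \d{2,4}            # year: two to four digits
    )
""", re.VERBOSE)

def find_word_date(text):
    """Return the first month-name date found in text, or None."""
    match = WORD_DATE.search(text)
    return match.group(1) if match else None

for text in ["Seen on 24 Jan 2001", "Mar-20-2009", "March 20th, 2009", "Feb 2009", "no date here"]:
    print(repr(text), "->", find_word_date(text))
```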

Hope that helps.

Other, 2022/1/1 18:34:12, 274 views
