Heritrix: how to exclude some file formats
I want to exclude certain media files, such as video, audio, and PDF files.
In other words, I only want text and images. How can I configure my job to
do that? I think I can do this by modifying this part of the job's
crawler-beans.cxml:
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <property name="decision" value="REJECT"/>
  <!-- <property name="listLogicalOr" value="true" /> -->
  <!-- <property name="regexList">
    <list>
    </list>
  </property> -->
</bean>
However, I am not sure how to do that.
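For illustration, here is a minimal sketch of how that bean might look once the commented-out properties are enabled and the list is filled in. The extension list is only an example and should be adjusted to the formats you want to skip. The regex assumes that Heritrix matches each pattern against the whole URI, which is why it starts with `.*`; the trailing `(\?.*)?` is there to allow for an optional query string.

```xml
<bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
  <!-- REJECT any URI that matches an entry in regexList -->
  <property name="decision" value="REJECT"/>
  <!-- true: one matching regex is enough; false: every regex must match -->
  <property name="listLogicalOr" value="true"/>
  <property name="regexList">
    <list>
      <!-- Illustrative extension list (video, audio, PDF); (?i) makes it case-insensitive -->
      <value>(?i).*\.(avi|mov|mp4|mpe?g|wmv|flv|mp3|wav|ogg|wma|pdf)(\?.*)?$</value>
    </list>
  </property>
</bean>
```

Note that a rule like this only looks at the URI, so media served from URLs without a recognizable file extension would still get through.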