我想基于Weka j48决策树输出,使用耶拿创建一个本体。但是,在将该输出输入到Jena之前,需要将其映射为RDF格式。有什么办法做这种映射吗?
编辑1:
映射前j48决策树输出的样本部分:
与决策树输出相对应的RDF的样本部分:
这两个屏幕来自本研究论文(幻灯片4):
可能没有内置的方法可以做到这一点。
免责声明:我以前从未与Jena和RDF合作。因此,此答案可能不完整或错过了预期的转换要点。
But nevertheless, and first of all, a short rant:
<rant>
The snippets that are published in the paper (namely, the output of the Weka classifier and the RDF) are incomplete and obviously inconsistent. The process of the conversion is not described at all. Instead, they only mentioned:
The challenge we faced was mainly to make J48 classification outputs to RDF and gave it to Jena
(sic!)
Now, they solved it somehow. They could have provided their conversion code on a public, open-source repository. This would allow others to provide improvements, it would increase the visibility and verifiability of their approach. But instead, they wasted their time and the time of the readers with screenshots of various websites that serve as page-fillers in a pitiful attempt to squeeze yet another publication out of their approach.
</rant>
The following is my best-effort approach to provide some of the building blocks that may be necessary for the conversion. It has to be taken with a grain of salt, because I'm not familiar with the underlying methods and libraries. But I hope that it can be considered as "helpful" nevertheless.
The Weka Classifier
implementations usually do not provide the structures that they are using for their internal workings. So it is not possible to access the internal tree structure directly. However, there is a method prefix()
that returns a string representation of the tree.
The code below contains a very pragmatic (and thus, somewhat brittle...) method that parses this string and builds a tree structure that contains the relevant information. This structure consists of TreeNode
objects:
static class TreeNode
{
String label;
String attribute;
String relation;
String value;
...
}
The label
is the class label that was used for the classifier. This is only non-null
for leaf nodes. For the example from the paper, this would be "0"
or "1"
, indicating whether an email is spam or not.
The attribute
is the attribute that a decision is based on. For the example from the paper, such an attribute may be word_freq_remove
The relation
and value
are the strings representing the decision criteria. These may be "<="
and "0.08"
for example.
After such a tree structure has been created, it can be converted into an Apache Jena Model
instance. The code contains such a conversion method, but due to my lack of familiarity with RDF, I'm not sure whether it makes sense conceptually. Adjustments may be necessary in order to create the "desired" RDF structure out of this tree structure. But naively, the output looks like it could make sense.
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
public class WekaClassifierToRdf
{
public static void main(String[] args) throws Exception
{
String fileName = "./data/iris.arff";
ArffLoader arffLoader = new ArffLoader();
arffLoader.setSource(new FileInputStream(fileName));
Instances instances = arffLoader.getDataSet();
instances.setClassIndex(4);
//System.out.println(instances);
J48 classifier = new J48();
classifier.buildClassifier(instances);
System.out.println(classifier);
String prefixTreeString = classifier.prefix();
TreeNode node = processPrefixTreeString(prefixTreeString);
System.out.println("Tree:");
System.out.println(node.createString());
Model model = createModel(node);
System.out.println("Model:");
model.write(System.out, "RDF/XML-ABBREV");
}
private static TreeNode processPrefixTreeString(String inputString)
{
String string = inputString.replaceAll("\\n", "");
//System.out.println("Input is " + string);
int open = string.indexOf("[");
int close = string.lastIndexOf("]");
String part = string.substring(open + 1, close);
//System.out.println("Part " + part);
int colon = part.indexOf(":");
if (colon == -1)
{
TreeNode node = new TreeNode();
int openAfterLabel = part.lastIndexOf("(");
String label = part.substring(0, openAfterLabel).trim();
node.label = label;
return node;
}
String attributeName = part.substring(0, colon);
//System.out.println("attributeName " + attributeName);
int comma = part.indexOf(",", colon);
int leftOpen = part.indexOf("[", comma);
String leftCondition = part.substring(colon + 1, comma).trim();
String rightCondition = part.substring(comma + 1, leftOpen).trim();
int leftSpace = leftCondition.indexOf(" ");
String leftRelation = leftCondition.substring(0, leftSpace).trim();
String leftValue = leftCondition.substring(leftSpace + 1).trim();
int rightSpace = rightCondition.indexOf(" ");
String rightRelation = rightCondition.substring(0, rightSpace).trim();
String rightValue = rightCondition.substring(rightSpace + 1).trim();
//System.out.println("leftCondition " + leftCondition);
//System.out.println("rightCondition " + rightCondition);
int leftClose = findClosing(part, leftOpen + 1);
String left = part.substring(leftOpen, leftClose + 1);
//System.out.println("left " + left);
int rightOpen = part.indexOf("[", leftClose);
int rightClose = findClosing(part, rightOpen + 1);
String right = part.substring(rightOpen, rightClose + 1);
//System.out.println("right " + right);
TreeNode leftNode = processPrefixTreeString(left);
leftNode.relation = leftRelation;
leftNode.value = leftValue;
TreeNode rightNode = processPrefixTreeString(right);
rightNode.relation = rightRelation;
rightNode.value = rightValue;
TreeNode result = new TreeNode();
result.attribute = attributeName;
result.children.add(leftNode);
result.children.add(rightNode);
return result;
}
private static int findClosing(String string, int startIndex)
{
int stack = 0;
for (int i=startIndex; i<string.length(); i++)
{
char c = string.charAt(i);
if (c == '[')
{
stack++;
}
if (c == ']')
{
if (stack == 0)
{
return i;
}
stack--;
}
}
return -1;
}
static class TreeNode
{
String label;
String attribute;
String relation;
String value;
List<TreeNode> children = new ArrayList<TreeNode>();
String createString()
{
StringBuilder sb = new StringBuilder();
createString("", sb);
return sb.toString();
}
private void createString(String indent, StringBuilder sb)
{
if (children.isEmpty())
{
sb.append(indent + label);
}
sb.append("\n");
for (TreeNode child : children)
{
sb.append(indent + "if " + attribute + " " + child.relation
+ " " + child.value + ": ");
child.createString(indent + " ", sb);
}
}
@Override
public String toString()
{
return "TreeNode [label=" + label + ", attribute=" + attribute
+ ", relation=" + relation + ", value=" + value + "]";
}
}
private static String createPropertyString(TreeNode node)
{
if ("<".equals(node.relation))
{
return "lt_" + node.value;
}
if ("<=".equals(node.relation))
{
return "lte_" + node.value;
}
if (">".equals(node.relation))
{
return "gt_" + node.value;
}
if (">=".equals(node.relation))
{
return "gte_" + node.value;
}
System.err.println("Unknown relation: " + node.relation);
return "UNKNOWN";
}
static Model createModel(TreeNode node)
{
Model model = ModelFactory.createDefaultModel();
String baseUri = "http://www.example.com/example#";
model.createResource(baseUri);
model.setNsPrefix("base", baseUri);
populateModel(model, baseUri, node, node.attribute);
return model;
}
private static void populateModel(Model model, String baseUri,
TreeNode node, String resourceName)
{
//System.out.println("Populate with " + resourceName);
for (TreeNode child : node.children)
{
if (child.label != null)
{
Resource resource =
model.createResource(baseUri + resourceName);
String propertyString = createPropertyString(child);
Property property =
model.createProperty(baseUri, propertyString);
Statement statement = model.createLiteralStatement(resource,
property, child.label);
model.add(statement);
}
else
{
Resource resource =
model.createResource(baseUri + resourceName);
String propertyString = createPropertyString(child);
Property property =
model.createProperty(baseUri, propertyString);
String nextResourceName = resourceName + "_" + child.attribute;
Resource childResource =
model.createResource(baseUri + nextResourceName);
Statement statement =
model.createStatement(resource, property, childResource);
model.add(statement);
}
}
for (TreeNode child : node.children)
{
String nextResourceName = resourceName + "_" + child.attribute;
populateModel(model, baseUri, child, nextResourceName);
}
}
}
该程序从ARFF文件中解析著名的Iris数据集,运行J48分类器,构建树结构,并生成并打印RDF模型。输出显示在这里:
由Weka打印的分类器:
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
内部构建的树结构的字符串表示形式:
Tree:
if petalwidth <= 0.6: Iris-setosa
if petalwidth > 0.6:
if petalwidth <= 1.7:
if petallength <= 4.9: Iris-versicolor
if petallength > 4.9:
if petalwidth <= 1.5: Iris-virginica
if petalwidth > 1.5: Iris-versicolor
if petalwidth > 1.7: Iris-virginica
生成的RDF模型:
Model:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:base="http://www.example.com/example#">
<rdf:Description rdf:about="http://www.example.com/example#petalwidth">
<base:gt_0.6>
<rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth">
<base:gt_1.7>Iris-virginica</base:gt_1.7>
<base:lte_1.7>
<rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength">
<base:gt_4.9>
<rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength_petalwidth">
<base:gt_1.5>Iris-versicolor</base:gt_1.5>
<base:lte_1.5>Iris-virginica</base:lte_1.5>
</rdf:Description>
</base:gt_4.9>
<base:lte_4.9>Iris-versicolor</base:lte_4.9>
</rdf:Description>
</base:lte_1.7>
</rdf:Description>
</base:gt_0.6>
<base:lte_0.6>Iris-setosa</base:lte_0.6>
</rdf:Description>
</rdf:RDF>
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句