Extracting the content you want with regular expressions in Java

The code below reads a.txt line by line and prints each non-empty line together with the pieces extracted from it.

Here is the code:


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note: StockInfo is not shown in this post; it is assumed to be a simple data
// class with setStock_url, setStock_name, and setStock_code setters.
public class ReadTxt {

	public static void main(String[] args) {
		String file_name = "C:/Users/Administrator/Desktop/a.txt";
		readFileByLines(file_name);
	}
	
	public static void readFileByLines(String fileName)
	{
		BufferedReader reader = null;
		
		try{
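			System.out.println("Reading the txt file line by line");
			// The file is assumed to be GBK-encoded; adjust the charset if yours differs.
			reader = new BufferedReader(new InputStreamReader(new FileInputStream(fileName), "gbk"));
			String tempString = null;
			int line = 1;
			// Text between "com/" and ".html" (the URL slug, e.g. sz300750)
			Pattern p = Pattern.compile("(?<=com\\/).*?(?=\\.html)");
			// Text between "\">" and "(" (the stock name)
			Pattern p1 = Pattern.compile("(?<=\">).*?(?=\\()");
			// Text between "(" and ")" (the stock code)
			Pattern p2 = Pattern.compile("(?<=\\().*?(?=\\))");
			// Read one line at a time until readLine() returns null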
			while((tempString = reader.readLine()) != null)
			{
				String temp2 = tempString.replace(" ", "");
				if(temp2.length()!=0)
				{
					StockInfo stockInfo = new StockInfo();
					Matcher matcher = p.matcher(temp2);
					while(matcher.find())
					{
						System.out.println(matcher.group());
						stockInfo.setStock_url(matcher.group());
					}
					Matcher matcher1 = p1.matcher(temp2);
					while(matcher1.find())
					{
						System.out.println(matcher1.group());
						stockInfo.setStock_name(matcher1.group());
					}
					Matcher matcher2 = p2.matcher(temp2);
					while(matcher2.find())
					{
						System.out.println(matcher2.group());
						stockInfo.setStock_code(matcher2.group());
					}
					System.out.println("line " + line + ":" + temp2);
					line++;
				}
			}
			// No explicit close here: the finally block below closes the reader.
		}catch(IOException e)
		{
			e.printStackTrace();
		}finally {
			if(reader != null)
			{
				try
				{
					reader.close();
				}catch(IOException e1)
				{
					e1.printStackTrace();
				}
			}
		}
	}

}
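The listing above uses a StockInfo class that the post never shows. As a minimal sketch, assuming it is nothing more than a holder for three strings, it could look like the class below. The field names and the getters are assumptions inferred from the setter calls in ReadTxt, not the author's actual class.

```java
// Hypothetical StockInfo: the original post does not include this class.
// Only the three setters are implied by ReadTxt; the fields, getters, and
// toString() are illustrative additions.
class StockInfo {
    private String stock_url;   // e.g. "sz300750"
    private String stock_name;  // the text before "(" in the link
    private String stock_code;  // the text between "(" and ")"

    public String getStock_url() { return stock_url; }
    public void setStock_url(String stock_url) { this.stock_url = stock_url; }

    public String getStock_name() { return stock_name; }
    public void setStock_name(String stock_name) { this.stock_name = stock_name; }

    public String getStock_code() { return stock_code; }
    public void setStock_code(String stock_code) { this.stock_code = stock_code; }

    @Override
    public String toString() {
        return "StockInfo{url=" + stock_url + ", name=" + stock_name + ", code=" + stock_code + "}";
    }
}
```

With a class like this on the classpath, ReadTxt should compile. The underscore-style names are kept only because the setter calls in the listing require them.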

 

Running ReadTxt prints output like the following:

sz300750

宁德时代
300750

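line 2891: 宁德时代(300750)

(The HTML markup of that source line appears to have been rendered away when this page was published, so only its visible text survives here.)

The regular expressions above rely on zero-width assertions (lookahead and lookbehind).

For example, (?<=com\\/).*?(?=\\.html) extracts whatever sits between "com/" and ".html". The doubled backslashes are Java string-literal escaping; the regex engine itself sees (?<=com\/).*?(?=\.html).

(?<=com\\/) is a lookbehind: the match must be immediately preceded by "com/", but "com/" is not part of the result.

(?=\\.html) is a lookahead: the match must be immediately followed by ".html", but ".html" is not part of the result. Neither form consumes the text it checks; to include "com/" or ".html" in the result, write them as ordinary pattern text outside the assertion.

In general, (?=exp) matches a position that is followed by exp, and (?<=exp) matches a position that is preceded by exp. In both cases the text matched by exp is left out of the result.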
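To make the effect of the assertions concrete, here is a small sketch that runs the three patterns from ReadTxt against a single made-up line. The host example.com, the ASCII name CATL, and the helper firstMatch are assumptions for illustration; the real a.txt is not shown in the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {

    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample line. example.com is a placeholder host and CATL is an
        // ASCII stand-in for the stock name; spaces are already stripped, as in ReadTxt.
        String line = "<ahref=\"http://example.com/sz300750.html\">CATL(300750)</a>";

        // Lookbehind + lookahead: the delimiters are checked but not returned.
        System.out.println(firstMatch("(?<=com\\/).*?(?=\\.html)", line)); // sz300750
        System.out.println(firstMatch("(?<=\">).*?(?=\\()", line));        // CATL
        System.out.println(firstMatch("(?<=\\().*?(?=\\))", line));        // 300750

        // Same delimiters as ordinary pattern text: now they are part of the match.
        System.out.println(firstMatch("com\\/.*?\\.html", line));          // com/sz300750.html
    }
}
```

The last call shows the difference: once "com/" and ".html" are moved out of the assertions, they are part of what group() returns.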

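The expression in the middle is ".*?":

".*" matches any character (except line terminators, by default) zero or more times.

A "?" placed directly after a quantifier such as "*" does not mean "zero or one"; it makes the quantifier lazy (reluctant), so ".*?" matches as few characters as possible and stops at the first position where the lookahead succeeds. Only when "?" follows an ordinary token (for example "s?") does it mean "match zero or one time".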

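The difference between ".*?" and ".*" is easiest to see on a line with more than one closing delimiter. The sketch below uses a made-up input string with two parenthesized groups; the class name and the firstMatch helper are illustrative, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {

    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up input with two parenthesized groups on one line.
        String text = "(300750)(600519)";

        // Lazy ".*?": stops at the first position followed by ")".
        System.out.println(firstMatch("(?<=\\().*?(?=\\))", text)); // 300750

        // Greedy ".*": runs on to the last ")" on the line.
        System.out.println(firstMatch("(?<=\\().*(?=\\))", text));  // 300750)(600519

        // "?" after an ordinary token means optional (zero or one occurrence).
        System.out.println(Pattern.matches("https?", "http"));   // true
        System.out.println(Pattern.matches("https?", "https"));  // true
    }
}
```

This is why the patterns in ReadTxt use the lazy form: with the greedy form, a line containing several links or several parenthesized codes would be captured as one long match.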